Moving Data From Google Storage to RIS storage via gsutil on Compute1

What is this Documentation?

This documentation will cover doing file transfers with gsutil over our dedicated fiber interconnect in order to download data from sources that use Google storage.

storageN

  • The use of storageN within these documents indicates that any storage platform can be used.

  • Current available storage platforms:
    • storage1

    • storage2

Quick Start

1. Login to the Compute platform

ssh wustlkey@compute1-client-1.ris.wustl.edu

2. Set up Google Account variable

export GOOGLE_ACCOUNT=wustlkey@wustl.edu

3. Login with gcloud

gcloud auth login $GOOGLE_ACCOUNT
  • Follow the URL that you are given.

../../_images/gcloud.auth.login.png
  • It will ask your to sign into Google Cloud SDK. Use the email granted permission by the data owner here, e.g. wustlkey@wustl.edu.

../../_images/gcloud.website.login.png
  • This will take you to a WashU authentication page if it’s your WashU email. Put in your information for the WashU sign in you normally would.

../../_images/washu.website.login.png
  • It will then ask for access to your account, click Allow.

../../_images/gcloud.access.png
  • It will give you a code, you will need to paste this code back into the terminal you are working in.

../../_images/gcloud.auth.code.png ../../_images/gcloud.enter.code.png
  • You can confirm this worked with the following command.

..code:

gcloud auth list
../../_images/gcloud.auth.list.png

Multiple Accounts

  • If this account is the only account listed, it will by default be the “active” one.

  • If there are multiple accounts, you can use the following to set the one being used active.

gcloud config set account $GOOGLE_ACCOUNT

4. Transferring the Data

  • Set the following variables needed for the transfer.

export GOOGLE_STORAGE_PATH=gs://path/from/data/provider
export DESTINATION_DIR=/storageN/fs1/${STORAGE_ALLOCATION}/Active/path/to/directory

//path/from/data/provider

//path/from/data/provider is the location of the data on Google Storage, as provided by the group sharing the data.

  • You will need to set the following variables

export LSB_JOB_REPORT_MAIL=N
export LSF_DOCKER_VOLUMES="/storageN/fs1/${STORAGE_ALLOCATION}/Active/path/to/directory:/data"
export LSF_DOCKER_ADD_HOST=storage.googleapis.com:199.36.153.4
  • You need to launch a bsub job to use google cloud tools

bsub -Is -q general-interactive -a 'docker(google/cloud-sdk)' /bin/bash

You are a member of multiple LSF User Groups

If you are a member of more than one compute group, you will be prompted to specify an LSF User Group with -G group_name or by setting the LSB_SUB_USER_GROUP variable.

  • The following command will run a trial of the transfer to make sure it works.

gsutil rsync -r -n $GOOGLE_STORAGE_PATH /data/
  • Once the test is complete and there are no issues, run the transfer by removing the -n option.

gsutil rsync -r $GOOGLE_STORAGE_PATH /data/
  • If there are problems or the transfer locks up, you can safely restart the transfer without losing progress as it will continue from where it was stopped just like normal rsync.

  • If you are transferring a large number of small files, using parallel transfers may work better.

  • The following command will run a transfer in parallel.

gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M -o GSUtil:parallel_thread_count=16 rsync -r $GOOGLE_STORAGE_PATH /data/

Warning

Using the parallel option is done so with the knowledge that it can be error prone.