.. _`data-between-computes`:

=====================================
Copying Data Between Compute Clusters
=====================================

.. contents::
   :depth: 2
   :local:

Target Audience
---------------

This document describes a procedure for transferring data between the McDonnell Genome Institute's **compute0** cluster and the WUIT RIS **compute1** cluster.

* You must have login credentials for both compute environments.
* You must have read/write permissions to the relevant storage volumes.
* Be mindful of your `$USER` name on both clusters; some users have differing user IDs on the two systems.

Build or find a container with ssh and rsync present
----------------------------------------------------

This Dockerfile constructs an Ubuntu based container with rsync and openssh; the examples below use the resulting `mcallaway/rsync:latest` image from Docker Hub. A minimal version looks like this:

::

  cat >> Dockerfile <<EOF
  # Minimal example image with sshd, ssh, rsync, and pv available
  FROM ubuntu:18.04
  RUN apt-get update && apt-get install -y openssh-server openssh-client rsync pv
  EOF

Create a dedicated SSH key pair and append the public key to your authorized_keys file; the private key (`~/.ssh/ssh_user_rsa_key`) is used by the rsync and scp examples below:

::

  ssh-keygen -t rsa -N '' -f ~/.ssh/ssh_user_rsa_key
  cat ~/.ssh/ssh_user_rsa_key.pub >> ~/.ssh/authorized_keys

Create an sshd_config that refers to the path to the above SSH keys, for example:

::

  mkdir -p ~/etc
  cat > ~/etc/sshd_config <<EOF
  # Example user-level sshd configuration; adjust the port and paths as needed
  Port 8200
  HostKey $HOME/.ssh/ssh_user_rsa_key
  AuthorizedKeysFile $HOME/.ssh/authorized_keys
  PidFile $HOME/etc/sshd.pid
  EOF

Create an entrypoint script that starts sshd with that configuration:

::

  cat > ~/etc/sshd_entrypoint.sh <<EOF
  #!/bin/bash
  # Minimal entrypoint: print a marker, then run sshd in the foreground with our config
  echo "Starting sshd as \$USER"
  exec /usr/sbin/sshd -D -f \$HOME/etc/sshd_config
  EOF
  chmod +x ~/etc/sshd_entrypoint.sh

Launch an sshd job into compute1
--------------------------------

On compute1, launch an interactive job that runs the container with the entrypoint script above, so that an sshd listening on your chosen port is running on a compute1 exec node. Export `LSF_DOCKER_VOLUMES` so that your `/storage1` Active volume is available inside the container, then submit the job with the same `bsub -Is -a 'docker(mcallaway/rsync:latest)'` pattern shown for compute0 below, giving `~/etc/sshd_entrypoint.sh` as the command. The output looks like:

::

  Job <...> is submitted to queue <...>.
  <<Waiting for dispatch ...>>
  <<Starting on compute1-exec-163.ris.wustl.edu>>
  latest: Pulling from mcallaway/rsync
  5c939e3a4d10: Already exists
  c63719cdbe7a: Already exists
  19a861ea6baf: Already exists
  651c9d2d6c4f: Already exists
  bf91b5efbfd8: Pull complete
  bb3a7dd7dc67: Pull complete
  Digest: sha256:2366f9b805855764fa7202aaf3f29b5ced4c7af2463fa7570b9ea73a7eb72e58
  Status: Downloaded newer image for mcallaway/rsync:latest
  docker.io/mcallaway/rsync:latest
  Starting sshd as mcallawa

Note the exec node the job lands on (here `compute1-exec-163.ris.wustl.edu`); that is the host you will push data to.

.. note::

   Interactive jobs will soon have "runtime limits" on the order of 3 days (to be determined).
   Non-interactive (batch) jobs will have much longer runtime limits, likely 6 weeks. Be wary of
   "losing" (forgetting about) running jobs, but note they'll be killed sooner or later. Be
   mindful of this in large (multi-day) transfers. Use log files to keep track of batch jobs,
   e.g. `bsub -eo job.err -oo job.out ...` to capture stderr and stdout.

Launch an rsync job into compute0
---------------------------------

Now launch an rsync job in compute0 to "push" to compute1. Note the use of environment variables here to specify the path your data is coming from, the host and port of the sshd job above, and the use of the "host" network for Docker:

::

  export LSF_DOCKER_VOLUMES="/gscmnt/temp403:/gscmnt/temp403"
  export LSF_DOCKER_NETWORK=host
  bsub -q research-hpc -Is -a 'docker(mcallaway/rsync:latest)' bash
  Job <2665534> is submitted to queue <research-hpc>.
  <<Waiting for dispatch ...>>
  <<Starting on blade18-2-2>>
  latest: Pulling from mcallaway/rsync
  5c939e3a4d10: Already exists
  c63719cdbe7a: Already exists
  19a861ea6baf: Already exists
  651c9d2d6c4f: Already exists
  bf91b5efbfd8: Already exists
  bb3a7dd7dc67: Pull complete
  Digest: sha256:2366f9b805855764fa7202aaf3f29b5ced4c7af2463fa7570b9ea73a7eb72e58
  Status: Downloaded newer image for mcallaway/rsync:latest
  mcallawa@blade18-2-2:~$ HOST=compute1-exec-163.ris.wustl.edu   # The compute1 exec node above
  mcallawa@blade18-2-2:~$ PORT=8200
  mcallawa@blade18-2-2:~$ rsync --archive --whole-file --verbose --stats --progress -e "ssh -p $PORT -i $HOME/.ssh/ssh_user_rsa_key" /gscmnt/temp403/systems/git_srv.tar.gz $USER@$HOST:/storage1/fs1/mcallawa/Active/data/
  sending incremental file list
  git_srv.tar.gz
   32,555,073,536  97%  124.27MB/s  0:00:05

.. note::

   The same warnings apply here regarding job duration, termination at runtime limits, and the
   use of output and error files.

.. note::

   You can also simply use `scp -P $PORT -i $HOME/.ssh/ssh_user_rsa_key $SRC $USER@$HOST:$DEST`
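For transfers long enough to run into the interactive runtime limit, the same rsync can be submitted as a non-interactive batch job whose output and errors are captured in log files. This is a minimal sketch reusing the example paths and hosts above; the log file names are arbitrary:

::

  export LSF_DOCKER_VOLUMES="/gscmnt/temp403:/gscmnt/temp403"
  export LSF_DOCKER_NETWORK=host
  HOST=compute1-exec-163.ris.wustl.edu   # the compute1 exec node running sshd
  PORT=8200

  # Batch submission: stdout is written to rsync.out, stderr to rsync.err (overwritten on resubmission)
  bsub -q research-hpc -oo rsync.out -eo rsync.err -a 'docker(mcallaway/rsync:latest)' \
    "rsync --archive --whole-file --stats -e 'ssh -p $PORT -i $HOME/.ssh/ssh_user_rsa_key' /gscmnt/temp403/systems/git_srv.tar.gz $USER@$HOST:/storage1/fs1/mcallawa/Active/data/"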
Note that rsync has some computational overhead; is tar over ssh faster?
--------------------------------------------------------------------------

Instead of using "rsync", one can also use "tar over ssh". Note here the use of `pv` to measure the rate of data crossing the pipe, which serves as a simple progress indicator:

::

  mcallawa@blade18-2-2:~$ HOST=compute1-exec-163.ris.wustl.edu
  mcallawa@blade18-2-2:~$ PORT=8200
  mcallawa@blade18-2-2:~$ tar cf - /gscmnt/temp403/systems/mcallawa/data/ | pv | ssh -p $PORT -i ~/.ssh/ssh_user_rsa_key $USER@$HOST 'tar xf -'
  tar: Removing leading `/' from member names
  4.27GiB 0:00:29 [ 148MiB/s] [     <=>     ]

Caveats
-------

* We observe a single, single-threaded rsync job transferring at around 150 MB/s.
* Use more than one job across different hosts to increase throughput.
* Cumulative bandwidth between these two clusters is 2x40 Gb/s.
* Launch several jobs across different pairs of hosts to parallelize, but remember this is a
  shared system; be mindful of others. As a community we need to keep our combined transfers
  within a cumulative total of about 80 Gb/s of network consumption, which is hard without QoS
  tools. A sketch of a parallel transfer appears at the end of this document.
* Be careful with your file paths; strip the trailing "/" where needed (rsync copies a source
  directory's contents, rather than the directory itself, when the source path ends in "/").
* Rsync and tar will preserve symbolic links, whereas Globus and Samba do not.
* Many people are likely to use this process, so you will need to pick a network port that is
  not already in use. Check whether a port is open by using `bhosts`:

::

  # Show me all hosts in the "general" host group with port 8200 open
  bhosts -w -R 'select[port8200=1]' general
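As a sketch of the parallelization caveat above, two batch jobs can each push a different directory to a different compute1 exec node and port. The second exec node, the extra port, and the `dir1`/`dir2` paths here are hypothetical; each target node needs its own sshd job set up as described earlier:

::

  export LSF_DOCKER_VOLUMES="/gscmnt/temp403:/gscmnt/temp403"
  export LSF_DOCKER_NETWORK=host

  # Job 1: push dir1 to the sshd listening on compute1-exec-163, port 8200
  bsub -q research-hpc -oo push1.out -eo push1.err -a 'docker(mcallaway/rsync:latest)' \
    "rsync --archive --whole-file --stats -e 'ssh -p 8200 -i $HOME/.ssh/ssh_user_rsa_key' /gscmnt/temp403/systems/dir1/ $USER@compute1-exec-163.ris.wustl.edu:/storage1/fs1/mcallawa/Active/data/dir1/"

  # Job 2: push dir2 to a second sshd on a different exec node and port (hypothetical)
  bsub -q research-hpc -oo push2.out -eo push2.err -a 'docker(mcallaway/rsync:latest)' \
    "rsync --archive --whole-file --stats -e 'ssh -p 8201 -i $HOME/.ssh/ssh_user_rsa_key' /gscmnt/temp403/systems/dir2/ $USER@compute1-exec-164.ris.wustl.edu:/storage1/fs1/mcallawa/Active/data/dir2/"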