Copying Data Between Compute Clusters

Target Audience

This document describes a procedure for transferring data between the McDonnell Genome Institute’s compute0 cluster and the WUIT RIS compute1 cluster.

  • You must have login credentials for both compute environments.

  • You must have read/write permissions to the relevant storage volumes.

  • Be mindful of your $USER name on both clusters; some users have differing numeric user IDs.
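A quick way to check is to run the same id commands on each cluster and compare the output; mismatched numeric IDs can leave transferred files owned by the wrong account on the receiving side:

```shell
# Run on both compute0 and compute1 and compare:
# the login name and the numeric uid/gid should match on both sides.
id -un   # login name
id -u    # numeric user ID
id -g    # numeric primary group ID
```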

Build or find a container with ssh and rsync present

This Dockerfile constructs an Ubuntu-based container with rsync and the OpenSSH client and server.

cat > Dockerfile <<EOF
FROM ubuntu

# Tell apt-get that we're not paying attention
ENV DEBIAN_FRONTEND noninteractive

# First layer: an up to date starting base OS
RUN sed -i 's/^# deb /deb /' /etc/apt/sources.list \
  && apt-get update

# Next: Add desired packages and clean up
RUN apt-get install -y --no-install-recommends \
    libnss-sss \
    openssh-client openssh-server rsync \
  && apt-get clean all \
  && rm -r /var/lib/apt/lists/*
EOF

Build the container and push it to Docker Hub, substituting your own registry, owner, and image name (REGISTRY may be empty for Docker Hub):

docker build . -t ${REGISTRY}${OWNER}/${NAME}:latest
docker push ${REGISTRY}${OWNER}/${NAME}:latest

Note

Feel free to use my container mcallaway/rsync:latest

Prepare SSH keys on the compute cluster client nodes

We need to create two SSH keys: a “user” key for the sending ssh client and a “host” key for the receiving sshd server. You can create both keys in your $HOME directory on compute1, then copy them over to compute0 so they exist on both sides. Strictly, the sender needs only the user key and the server only the host key, but these instructions copy both for uniformity.

So, on compute1:

cd $HOME
mkdir -p ./etc

Create a key for use by sshd server:

ssh-keygen -t rsa -f etc/ssh_host_rsa_key -N ''

Create a key for use by ssh client:

ssh-keygen -t rsa -f etc/ssh_user_rsa_key -N ''

Add this user key to your ~/.ssh/authorized_keys file on the compute1 side:

cat etc/ssh_user_rsa_key.pub >> ~/.ssh/authorized_keys

Create an sshd_config that refers to the path to the above SSH keys:

cat > ~/etc/sshd_config <<EOF
# Override the port by passing PORT to sshd_entrypoint.sh
Port 22
HostKey $HOME/etc/ssh_host_rsa_key
PidFile $HOME/etc/sshd.pid
PasswordAuthentication no
ChallengeResponseAuthentication no
GSSAPICleanupCredentials no
EOF

Verify permissions on your ~/.ssh and ~/etc contents:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/etc
chmod 600 ~/etc/*
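As a sketch, you can audit for stray group/world-accessible entries with GNU find’s -perm /077, which matches any entry with group or other permission bits set. This example uses a throwaway directory; in practice point the find at ~/etc and ~/.ssh and expect empty output:

```shell
# Demo directory standing in for ~/etc; substitute your real paths in practice.
DIR=$(mktemp -d)
touch "$DIR/ssh_host_rsa_key"
chmod 700 "$DIR"
chmod 600 "$DIR/ssh_host_rsa_key"
# -perm /077 matches entries readable/writable/executable by group or other.
leaks=$(find "$DIR" -perm /077)
echo "${leaks:-OK: nothing group/world accessible}"
rm -rf "$DIR"
```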

Have a wrapper to run sshd as you:

cat > ~/etc/sshd_entrypoint.sh <<EOF
#!/bin/bash
PORT=\$1
while true; do
  echo Starting sshd as \$USER
  /usr/sbin/sshd -f \$HOME/etc/sshd_config -D -p \$PORT
  echo sshd exited, restarting in 3 seconds...
  sleep 3
done
EOF

Now copy the keys etc/ssh_user_rsa_key and etc/ssh_host_rsa_key to the compute0 side, for example by displaying them with cat and pasting them into a text editor on compute0.
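Rather than trusting an error-free paste, you can verify the copy by comparing key fingerprints on both sides. This sketch generates a throwaway key just to show the command; in practice, run ssh-keygen -lf against ~/etc/ssh_user_rsa_key and ~/etc/ssh_host_rsa_key on each cluster and confirm the fingerprints match:

```shell
# Throwaway key for demonstration; substitute your real key paths.
KEYDIR=$(mktemp -d)
ssh-keygen -t rsa -f "$KEYDIR/demo_key" -N '' -q
# -lf prints the fingerprint; a private key and its .pub share the same fingerprint.
ssh-keygen -lf "$KEYDIR/demo_key"
ssh-keygen -lf "$KEYDIR/demo_key.pub"
rm -rf "$KEYDIR"
```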

Launch an sshd job into compute1

Select a port between 8000 and 8999, and start your sshd, making note of the execution node it lands on.

Note

Note here that we expect $HOME to have been mounted into the Docker container for the job. This happens automatically if your current working directory is your $HOME; if you are launching these jobs from elsewhere, you will have to add $HOME to the list in LSF_DOCKER_VOLUMES.
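For example, when launching from outside your $HOME, the volume list might look like this (paths follow the examples in this document; LSF_DOCKER_VOLUMES takes space-separated src:dst pairs):

```shell
# Mount both the storage volume and $HOME into the job's container.
export LSF_DOCKER_VOLUMES="/storage1/fs1/mcallawa:/storage1/fs1/mcallawa $HOME:$HOME"
```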

LSF_DOCKER_VOLUMES="/storage1/fs1/mcallawa:/storage1/fs1/mcallawa" LSF_DOCKER_PORTS='8200:8200' bsub -Is -G compute-ris -q general-interactive -R 'select[port8200=1]' -a 'docker(mcallaway/rsync:latest)' bash ./etc/sshd_entrypoint.sh 8200
Job <60223> is submitted to queue <general-interactive>.
<<Waiting for dispatch ...>>
<<Starting on compute1-exec-163.ris.wustl.edu>>
latest: Pulling from mcallaway/rsync
5c939e3a4d10: Already exists
c63719cdbe7a: Already exists
19a861ea6baf: Already exists
651c9d2d6c4f: Already exists
bf91b5efbfd8: Pull complete
bb3a7dd7dc67: Pull complete
Digest: sha256:2366f9b805855764fa7202aaf3f29b5ced4c7af2463fa7570b9ea73a7eb72e58
Status: Downloaded newer image for mcallaway/rsync:latest
docker.io/mcallaway/rsync:latest
Starting sshd as mcallawa

Note

Interactive jobs will soon have “runtime limits” on the order of 3 days (to be determined). Non-interactive (batch) jobs will have much longer runtime limits, likely 6 weeks. Be wary of “losing” (forgetting about) running jobs; they will be killed sooner or later. Be mindful of this in large (multi-day) transfers. Keep track of batch jobs with log files, using bsub -oo job.out -eo job.err … to capture stdout and stderr.

Launch an rsync job into compute0

Now launch an rsync job in compute0 to “push” to compute1. Note the use of environment variables here to specify the path from which your data is coming, the host and port involved, and the use of the “host” network for Docker:

export LSF_DOCKER_VOLUMES="/gscmnt/temp403:/gscmnt/temp403"
export LSF_DOCKER_NETWORK=host
bsub -q research-hpc -Is -a 'docker(mcallaway/rsync:latest)' bash
Job <2665534> is submitted to queue <research-hpc>.
<<Waiting for dispatch ...>>
<<Starting on blade18-2-2.gsc.wustl.edu>>
latest: Pulling from mcallaway/rsync
5c939e3a4d10: Already exists
c63719cdbe7a: Already exists
19a861ea6baf: Already exists
651c9d2d6c4f: Already exists
bf91b5efbfd8: Already exists
bb3a7dd7dc67: Pull complete
Digest: sha256:2366f9b805855764fa7202aaf3f29b5ced4c7af2463fa7570b9ea73a7eb72e58
Status: Downloaded newer image for mcallaway/rsync:latest
mcallawa@blade18-2-2:~$ HOST=compute1-exec-163.ris.wustl.edu # The compute1 exec node above
mcallawa@blade18-2-2:~$ PORT=8200
mcallawa@blade18-2-2:~$ rsync --archive --whole-file --verbose --stats --progress -e "ssh -p $PORT -i $HOME/.ssh/ssh_user_rsa_key" /gscmnt/temp403/systems/git_srv.tar.gz $USER@$HOST:/storage1/fs1/mcallawa/Active/data/
sending incremental file list
git_srv.tar.gz
 32,555,073,536  97%  124.27MB/s    0:00:05

Note

The same warnings apply here with job duration, termination at runtime limits, and the use of output and error files.

Note

You can also simply use scp -P $PORT -i $HOME/.ssh/ssh_user_rsa_key $SRC $USER@$HOST:$DEST

Rsync has some computational overhead: is tar over ssh faster?

Instead of rsync, one can also use “tar over ssh”. Note here the use of pv to measure the rate of data crossing the pipe, serving as a “progress bar”:

mcallawa@blade18-2-2:~$ HOST=compute1-exec-163.ris.wustl.edu
mcallawa@blade18-2-2:~$ PORT=8200
mcallawa@blade18-2-2:~$ tar cf - /gscmnt/temp403/systems/mcallawa/data/ | pv | ssh -p $PORT -i ~/.ssh/ssh_user_rsa_key $USER@$HOST 'tar xf -'
tar: Removing leading `/' from member names
4.27GiB 0:00:29 [ 148MiB/s] [                        <=>          ]
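The tar warning about removing the leading “/” can be avoided by using -C with a relative path on both ends, which also gives you control over where the files land on extraction. This local sketch shows the same pipeline without pv or ssh; in the real transfer, insert | pv | ssh … between the two tar commands:

```shell
# Stand-in source and destination directories.
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/data"
echo hello > "$SRC/data/file.txt"
# -C changes directory first, so tar records "data/..." instead of an absolute path.
tar -C "$SRC" -cf - data | tar -C "$DST" -xf -
cat "$DST/data/file.txt"   # hello
rm -rf "$SRC" "$DST"
```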

Caveats

  • We observe a single, single-threaded rsync job transferring at around 150 MB/s.

  • Use more than one job, spread across different hosts, to increase throughput.

  • Cumulative bandwidth between these two clusters is 2×40 Gb/s.

  • Launch several jobs across different pairs of hosts to parallelize, but remember this is a shared system; be mindful of others. As a community we need to “add up” to a cumulative total of about 80 Gb/s of network consumption, which is hard without QoS tools.

  • Be careful with your file paths; with rsync, a trailing “/” on the source copies the directory’s contents rather than the directory itself, so strip or add the “/” deliberately.

  • Rsync and tar will preserve symbolic links, where Globus and Samba do not.

  • Many people are likely to use this process, you’ll need to pick a network port not in use. Try this out by using bhosts to find out if a port is open:

# Show me all hosts in the "general" host group with port 8200 open
bhosts -w -R 'select[port8200=1]' general
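The trailing “/” caveat above matters for rsync in particular: a source path ending in “/” copies the directory’s contents, while one without copies the directory itself as a subdirectory of the destination. A local sketch:

```shell
# Demonstrate rsync trailing-slash semantics with throwaway directories.
SRC=$(mktemp -d)
DST1=$(mktemp -d)
DST2=$(mktemp -d)
echo data > "$SRC/file.txt"
rsync -a "$SRC/" "$DST1/"   # contents of SRC land directly in DST1
rsync -a "$SRC"  "$DST2/"   # SRC itself becomes a subdirectory of DST2
ls "$DST1"                      # file.txt
ls "$DST2/$(basename "$SRC")"   # file.txt
rm -rf "$SRC" "$DST1" "$DST2"
```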