.. _`parallel-computing`:

==================
Parallel Computing
==================

.. contents::
   :depth: 1
   :local:

.. _`parallel-computing-mpi`:

MPI Parallel Jobs
-----------------

MPI jobs are scheduled largely like any other job. To facilitate communication between parallel processes running on the same host or on different hosts, you'll need to enable host networking and IPC modes with the ``LSF_DOCKER_NETWORK`` and ``LSF_DOCKER_IPC`` environment variables, as well as request enough job slots and resources to run the job.

Below is a `hypothetical` example of how to run an mpi_hello_world MPI application with 2 processes forced to run on separate hosts (``span[ptile=1]`` places one process per host); a batch (non-interactive) variant follows the example output:

.. code::

   > export LSF_DOCKER_NETWORK=host
   > export LSF_DOCKER_IPC=host
   > bsub -n 2 -R 'affinity[core(1)] span[ptile=1]' -I -q general-interactive \
       -a 'docker(joe.user/ubuntu16.04_openmpi_lsf:1.1)' /usr/local/mpi/bin/mpirun /usr/local/bin/mpi_hello_world
   Job <2042> is submitted to queue <general-interactive>.
   <<Waiting for dispatch ...>>
   <<Starting on compute1-exec-3.ris.wustl.edu>>
   1.1: Pulling from joe.user/ubuntu16.04_openmpi_lsf
   Digest: sha256:5661e21c7f20ea1b3b14537d17078cdc288c4d615e82d96bf9f2057366a0fb8f
   Status: Image is up to date for joe.user/ubuntu16.04_openmpi_lsf:1.1
   docker.io/joe.user/ubuntu16.04_openmpi_lsf:1.1
   Hello world from processor compute1-exec-3.ris.wustl.edu, rank 0 out of 2 processors
   Hello world from processor compute1-exec-9.ris.wustl.edu, rank 1 out of 2 processors
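For longer runs you may not want an interactive session at all. Below is a minimal `hypothetical` sketch of the same job submitted in batch mode; the image name (``joe.user/ubuntu16.04_openmpi_lsf:1.1``), the ``general`` queue, and the log file name are placeholders for your own values:

.. code::

   #!/bin/bash
   # Hypothetical batch submission of the same mpi_hello_world job.
   # Host networking/IPC modes are still required for cross-host MPI traffic.
   export LSF_DOCKER_NETWORK=host
   export LSF_DOCKER_IPC=host

   # -oo overwrites the log file on each run; %J expands to the LSF job ID.
   bsub -n 2 \
        -R 'affinity[core(1)] span[ptile=1]' \
        -q general \
        -oo mpi_hello_world.%J.log \
        -a 'docker(joe.user/ubuntu16.04_openmpi_lsf:1.1)' \
        /usr/local/mpi/bin/mpirun /usr/local/bin/mpi_hello_world

You can check progress with ``bjobs`` and read the log file once the job finishes.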
.. _`parallel-computing-gpu`:

GPU Parallel Jobs
-----------------

There are execution nodes in the cluster with GPUs. These are "resources" in the parlance of the cluster manager. Simply add the ``-gpu`` option to your ``bsub`` command to make them visible.

.. Note:: We provide services for the entire WashU campus and, as such, we suggest as a best practice that no single user over-allocate resources. To that point, we suggest that a single user take up no more than 10 GPUs at a time.

Below is a `hypothetical` example using TensorFlow 2:

.. code::

   #! /usr/bin/env python
   # docker run --gpus all -it tensorflow/tensorflow:latest-gpu /usr/local/bin/python tf2.py

   from __future__ import absolute_import, division, print_function, unicode_literals
   import tensorflow as tf

   # Load and normalize the MNIST dataset.
   mnist = tf.keras.datasets.mnist
   (x_train, y_train), (x_test, y_test) = mnist.load_data()
   x_train, x_test = x_train / 255.0, x_test / 255.0

   # A small fully-connected classifier.
   model = tf.keras.models.Sequential([
       tf.keras.layers.Flatten(input_shape=(28, 28)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dropout(0.2),
       tf.keras.layers.Dense(10, activation='softmax')
   ])

   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   model.fit(x_train, y_train, epochs=5)
   model.evaluate(x_test, y_test, verbose=2)

Here we save the above code as a file named `tf2.py` and submit it to the cluster:

.. Note::
   - If you are using other options as part of the ``-R`` option, you will need to list ``gpuhost`` first.
   - If you are using ``select`` for another option, place ``gpuhost`` inside the ``select`` clause, listed first, e.g. ``-R 'select[gpuhost && port8888=1]'``.

.. code::

   > bsub -Is -q general-interactive -R 'gpuhost' -gpu "num=4:gmodel=TeslaV100_SXM2_32GB" \
       -a 'docker(tensorflow/tensorflow:latest-gpu)' /bin/bash -c "export CUDA_VISIBLE_DEVICES=0,1,2,3; python tf2.py"
   ... lots of output ...
   2019-11-01 22:07:25.455050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
   2019-11-01 22:07:25.607212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
   name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   pciBusID: 0000:18:00.0
   2019-11-01 22:07:25.608657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
   name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   pciBusID: 0000:3b:00.0
   2019-11-01 22:07:25.610081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties:
   name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   pciBusID: 0000:86:00.0
   2019-11-01 22:07:25.611511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties:
   name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   ...

Note that in this particular example, the TensorFlow 2 container expects the ``CUDA_VISIBLE_DEVICES`` environment variable to be set; this behavior will likely vary with different software containers.
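Because containers differ in how they expose GPUs, it can be worth confirming device visibility before launching a long run. Below is a minimal `hypothetical` check using TensorFlow's public API, ``tf.config.list_physical_devices`` (on the earliest 2.0 releases the same call lives under ``tf.config.experimental``); run it inside the container the same way as ``tf2.py`` above:

.. code::

   #!/usr/bin/env python
   # Minimal GPU-visibility check; assumes a recent TensorFlow 2.x image.
   import os
   import tensorflow as tf

   # Show what the environment handed to the process.
   print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))

   # Show what TensorFlow can actually use; with the request above
   # (num=4), this should list four devices.
   gpus = tf.config.list_physical_devices("GPU")
   print("TensorFlow sees %d GPU(s):" % len(gpus))
   for gpu in gpus:
       print(" ", gpu.name)

If the count printed here does not match your ``-gpu`` request, check the ``CUDA_VISIBLE_DEVICES`` setting and the resource string before starting a longer job.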