Launch and set up NVIDIA A100 40GB server - by hand

Note: if you set up your NVIDIA A100 server using the python-chi notebook, you will skip this section! This section describes how to set up the server “by hand”, in case you do not have access to the Chameleon Jupyter environment in which to run the python-chi notebook, or in case you prefer to do it “by hand”.

At the beginning of the lease time, we will bring up our GPU server. We will use the Horizon GUI at CHI@TACC to provision our server.

To access this interface, log in to the CHI@TACC Horizon dashboard at https://chi.tacc.chameleoncloud.org, make sure the correct project is selected in the drop-down at the top of the page, then navigate to Compute > Instances and click “Launch Instance”.

You will be prompted to set up your instance step by step using a graphical “wizard”.

Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.
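Alternatively, if you have the OpenStack CLI installed and configured with your CHI@TACC credentials, you could launch an equivalent instance from the command line. The following is only a sketch: the image name, key pair name, and reservation ID below are placeholders that you must replace with your own values.

# run from any machine with the OpenStack CLI configured for CHI@TACC
# placeholders: substitute your own CUDA image, key pair name, and lease reservation ID
openstack server create \
  --image CC-Ubuntu24.04-CUDA \
  --flavor baremetal \
  --key-name my_chameleon_key \
  --network sharednet1 \
  --hint reservation=RESERVATION_ID \
  node-mltrain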

You will see your instance appear in the list of compute instances. Within 10-20 minutes, it should go to the “Running” state.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.
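If you are using the OpenStack CLI instead of the GUI, the equivalent step looks roughly like the sketch below, where A.B.C.D stands for whatever floating IP address is allocated to you:

# allocate a floating IP from the public network, then attach it to the instance
openstack floating ip create public
openstack server add floating ip node-mltrain A.B.C.D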

Now, you should be able to access your instance over SSH! Test it now. From your local terminal, run

ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where A.B.C.D is the floating IP address you associated with your instance, and ~/.ssh/id_rsa_chameleon is the path to the private key corresponding to the key pair you selected when launching the instance.

You will run the rest of the commands in this section inside your SSH session on “node-mltrain”.

Retrieve code and notebooks on the instance

We’ll start by retrieving the code and other materials on the instance.

# run on node-mltrain
git clone --recurse-submodules https://github.com/teaching-on-testbeds/mltrain-chi

Set up Docker

To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLFlow and Ray, we can run containers that already include all of the prerequisite libraries for these frameworks. Here, we will set up the container framework, Docker.

# run on node-mltrain
curl -sSL https://get.docker.com/ | sudo sh

Then, give the cc user permission to run docker commands:

# run on node-mltrain
sudo groupadd -f docker; sudo usermod -aG docker $USER

After running this command, you must open a new SSH session for the change in permissions to take effect - use exit to close the current session, then reconnect. When you do, the output of id should show that you are a member of the docker group.
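As a quick check (assuming the Docker service came up without errors), you can confirm both the group membership and your ability to run containers without sudo:

# run on node-mltrain, in the new SSH session
id                        # the list of groups should now include "docker"
docker run hello-world    # should run a test container without a "permission denied" error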

Set up the NVIDIA container toolkit

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

# run on node-mltrain
# add the NVIDIA container toolkit repository, with its signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# install the toolkit and configure Docker to use the NVIDIA runtime
sudo apt update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
# workaround for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
sudo jq 'if has("exec-opts") then . else . + {"exec-opts": ["native.cgroupdriver=cgroupfs"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json
sudo systemctl restart docker
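To confirm that containers can now access the GPU, one quick check (a sketch, using a minimal base image) is to run nvidia-smi inside a container:

# run on node-mltrain
# should print a table listing the A100 GPU; if not, revisit the toolkit setup above
docker run --rm --gpus all ubuntu nvidia-smi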

We can also install nvtop, to monitor GPU usage:

# run on node-mltrain
sudo apt update
sudo apt -y install nvtop
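Later, when training jobs are running, you can open nvtop in a separate terminal (for example, in a second SSH session) to watch GPU utilization and memory in real time:

# run on node-mltrain, in a separate terminal
nvtop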

Build a container image - for MLFlow section

Finally, we will build a container image in which to work in the MLFlow section - one that has:

- a Jupyter notebook server
- PyTorch with CUDA support
- MLFlow

You can see our Dockerfile for this image at: Dockerfile.jupyter-torch-mlflow-cuda

Building this container may take a bit of time, but that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately.

# run on node-mltrain
docker build -t jupyter-mlflow -f mltrain-chi/docker/Dockerfile.jupyter-torch-mlflow-cuda .

In the meantime, open another SSH session on “node-mltrain”, so that you can continue with the next section.
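Later, to check whether the image build has finished, you can list the image by the tag used in the build command above; it will appear once the build completes:

# run on node-mltrain
docker images jupyter-mlflow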