Launch and set up AMD MI100 server - by hand

Note: if you set up your AMD MI100 server using the python-chi notebook, you will skip this section! This section describes how to set up the server “by hand”, in case you do not have access to the Chameleon Jupyter environment in which to run the python-chi notebook, or in case you prefer to do it “by hand”.

At the beginning of the lease time, we will bring up our GPU server. We will use Horizon GUI at CHI@TACC to provision our server.

To access this interface,

You will be prompted to set up your instance step by step using a graphical “wizard”.

Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

You will see your instance appear in the list of compute instances. Within 10-20 minutes, it should go to the “Running” state.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

Now, you should be able to access your instance over SSH! Test it now. From your local terminal, run

ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where

You will run the rest of the commands in this section inside your SSH session on “node-mltrain”.

Retrieve code and notebooks on the instance

We’ll start by retrieving the code and other materials on the instance.

# run on node-mltrain
git clone --recurse-submodules https://github.com/teaching-on-testbeds/mltrain-chi

Set up Docker

To use common deep learning frameworks like Tensorflow or PyTorch, and ML training platforms like MLFlow and Ray, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.

# run on node-mltrain
curl -sSL https://get.docker.com/ | sudo sh

Then, give the cc user permission to run docker commands:

# run on node-mltrain
sudo groupadd -f docker; sudo usermod -aG docker $USER

After running this command, for the change in permissions to be effective, you must open a new SSH session - use exit and then reconnect. When you do, the output of id should show that you are a member of the docker group..

Set up the AMD GPU

Before we can use the AMD GPUs, we need to set up the driver using the amdgpu-install utility.

Let’s follow AMD’s instructions for setting up amdgpu-install:

# run on node-mltrain
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.3.3/ubuntu/noble/amdgpu-install_6.3.60303-1_all.deb
sudo apt -y install ./amdgpu-install_6.3.60303-1_all.deb
sudo apt update

To run containers using ROCm (Radeon Open Compute Platform), an open-source software stack from AMD that allows users to program AMD GPUs (similar to NVIDIA’s CUDA), we need to install the amdgpu-dkms driver:

# run on node-mltrain
amdgpu-install --usecase=dkms

And, we’ll also install the rocm-smi utility, so that we can monitor the GPU from the host:

# run on node-mltrain
sudo apt -y install rocm-smi

Finally, we will add the cc user to the video and render groups, which are needed for access to the GPU:

# run on node-mltrain
sudo usermod -aG video,render $USER

That’s all we will need on the host - the rest of ROCm will be installed inside the containers.

To apply the changes to the kernel, we need to reboot:

# run on node-mltrain
sudo reboot

wait until the host comes back up, then reconnect using SSH. Run

# run on node-mltrain
rocm-smi

and verify that you can see the GPU(s).

We can also install nvtop to monitor GPU usage - we’ll install from source, because the older version in the Ubuntu package repositories does not support AMD GPUs:

# run on node-mltrain
sudo apt -y install cmake libncurses-dev libsystemd-dev libudev-dev libdrm-dev libgtest-dev
git clone https://github.com/Syllo/nvtop
mkdir -p nvtop/build && cd nvtop/build && cmake .. -DAMDGPU_SUPPORT=ON && sudo make install
cd ~  # return to home directory

Build a container image - for MLFlow section

Finally, we will build a container image in which to work in the MLFlow section, that has:

You can see our Dockerfile for this image at: Dockerfile.jupyter-torch-mlflow-rocm

Building this container will take a very long time (ROCm is huge). But that’s OK: we can get it started and then continue to the next section while it builds in the background, since we don’t need this container immediately. We just need it to finish by the “Start a Jupyter server” subsection of the “Start the tracking server” section.

# run on node-mltrain
docker build -t jupyter-mlflow -f mltrain-chi/docker/Dockerfile.jupyter-torch-mlflow-rocm .

In the meantime, open another SSH session on “node-mltrain”, so that you can continue with the next section.