Large-scale model training on Chameleon
In this tutorial, we will practice fine-tuning a large language model. We will use a selection of techniques to allow us to train models that would not otherwise fit in GPU memory:
- gradient accumulation
- reduced precision
- parameter efficient fine tuning
- distributed training across multiple GPUs and with CPU offload
To run this experiment, you should have already created an account on Chameleon, and become part of a project. You must also have added your SSH key to the CHI@UC site.
Experiment topology
In this experiment, we will deploy a single bare metal instance with specific NVIDIA GPU capabilities.
- In the “Single GPU” section: To use bfloat16 in our experiments with reduced precision, we need a GPU with NVIDIA CUDA compute capability 8.0 or higher. For example, some Chameleon nodes have A100 or A30 GPUs, which have compute capability 8.0. If we use a V100, with compute capability 7.0, the bfloat16 capability is actually emulated in software. (See the quick check just after this list.)
- In the “Multiple GPU” section: To practice distributed training with multiple GPUs, we will request a node with 4 GPUs.
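If you want to verify this on whatever GPU you end up with, a quick check like the following - a minimal sketch, to be run anywhere PyTorch and CUDA are available (for example, in the Jupyter container we set up later) - reports the device's compute capability and whether bfloat16 is supported in hardware:
# minimal check of compute capability and bfloat16 support (run wherever PyTorch + CUDA are available)
import torch
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
    # compute capability >= 8.0 means bfloat16 is supported natively in hardware
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())
else:
    print("no CUDA GPU visible")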
We can browse Chameleon hardware configurations for suitable node types using the Hardware Browser.
For example, to find nodes with 4x GPUs: if we expand “Advanced Filters”, check the “4” box under “GPU count”, and then click “View”, we can identify some suitable node types at CHI@UC: gpu_a100_pcie, gpu_a100_nvlink, gpu_v100, or gpu_v100_nvlink. (The NVLink-type nodes have a high-speed interconnect between GPUs.)
(We will avoid p100-based node types for this experiment, because the P100 has less GPU RAM and less compute capability.)
Create a lease
To use bare metal resources on Chameleon, we must reserve them in advance. We can use the OpenStack graphical user interface, Horizon, to submit a lease for an A100 or V100 node at CHI@UC. To access this interface,
- from the Chameleon website
- click “Experiment” > “CHI@UC”
- log in if prompted to do so
- check the project drop-down menu near the top left (which shows e.g. “CHI-XXXXXX”), and make sure the correct project is selected.
If you plan to do “Single GPU” and “Multiple GPU” together in a 3-hour block:
- On the left side, click on “Reservations” > “Leases”, and then click on “Host Calendar”. In the “Node type” drop down menu, change the type to gpu_a100_pcie to see the schedule of availability. You may change the date range setting to “30 days” to see a longer time scale. Note that the dates and times in this display are in UTC. You can use WolframAlpha or equivalent to convert to your local time zone.
- Once you have identified an available three-hour block in UTC time that works for you in your local time zone, make a note of:
  - the start and end time of the block you will try to reserve. (Note that if you mouse over an existing reservation, a pop up will show you the exact start and end time of that reservation.)
  - and the node type or name of the node you want to reserve.
- Then, on the left side, click on “Reservations” > “Leases”, and then click on “Create Lease”:
- set the “Name” to llm_netID, where in place of netID you substitute your actual net ID
- set the start date and time in UTC
- modify the lease length (in days) until the end date is correct. Then, set the end time. To be mindful of other users, you should limit your lease time as directed.
- Click “Next”. On the “Hosts” tab,
- check the “Reserve hosts” box
- leave the “Minimum number of hosts” and “Maximum number of hosts” at 1
- in “Resource properties”, specify the node type that you identified earlier.
- Click “Next”. Then, click “Create”. (We won’t include any network resources in this lease.)
Your lease status should show as “Pending”. Click on the lease to see an overview. It will show the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease. Make sure that the lease details are correct.
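As an aside, leases can also be created programmatically rather than through Horizon. The following sketch uses the python-chi Lease interface that we will use later to look up the lease; the Lease constructor, add_node_reservation, and submit calls shown here are assumptions about the python-chi 1.0 API rather than tested steps, so treat it as illustrative only and prefer the GUI steps above if in doubt.
# illustrative sketch only - these python-chi Lease calls are assumptions, not tested steps
from datetime import timedelta
from chi import context, lease
context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@UC")
l = lease.Lease("llm_netID", duration=timedelta(hours=3))    # substitute your own net ID
l.add_node_reservation(amount=1, node_type="gpu_a100_pcie")  # or the node type you identified
l.submit(idempotent=True)
l.show()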
If you plan to do “Single GPU” and “Multiple GPU” separately in two 2-hour blocks:
First, make a 2-hour reservation on a node with a single A100 80GB GPU. We will use a compute_gigaio node, but avoid gigaio-compute-06, which has no GPU.
- On the left side, click on “Reservations” > “Leases”, and then click on “Host Calendar”. In the “Node type” drop down menu, change the type to compute_gigaio to see the schedule of availability. You may change the date range setting to “30 days” to see a longer time scale. Note that the dates and times in this display are in UTC. You can use WolframAlpha or equivalent to convert to your local time zone.
- Once you have identified an available two-hour block in UTC time that works for you in your local time zone, make a note of:
  - the start and end time of the block you will try to reserve. (Note that if you mouse over an existing reservation, a pop up will show you the exact start and end time of that reservation.)
  - and the name of the node you want to reserve.
- Then, on the left side, click on “Reservations” > “Leases”, and then click on “Create Lease”:
- set the “Name” to llm_single_netID, where in place of netID you substitute your actual net ID
- set the start date and time in UTC
- modify the lease length (in days) until the end date is correct. Then, set the end time. To be mindful of other users, you should limit your lease time as directed.
- Click “Next”. On the “Hosts” tab,
- check the “Reserve hosts” box
- leave the “Minimum number of hosts” and “Maximum number of hosts” at 1
- in “Resource properties”, specify the node name that you identified earlier.
- Click “Next”. Then, click “Create”. (We won’t include any network resources in this lease.)
Your lease status should show as “Pending”. If you click on the lease, you can see an overview, including the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease.
Next, make a 2-hour reservation on a node with 4x A100 or 4x V100 GPUs. Repeat the steps above, but for a gpu_a100_pcie or gpu_v100 node type, and use the lease name llm_multi_netID (with your own net ID).
Since you will need the full lease time to actually execute your experiment, you should read all of the experiment material ahead of time in preparation, so that you make the best possible use of your GPU time.
At the beginning of your lease time, you will continue with the next step, in which you bring up a bare metal instance!
Before you begin, open this experiment on Trovi:
- Use this link: Large-scale model training on Chameleon on Trovi
- Then, click “Launch on Chameleon”. This will start a new Jupyter server for you, with the experiment materials already in it.
You will see several notebooks inside the llm-chi directory - look for the one titled 1_create_server.ipynb. Open this notebook and continue there.
Bring up a GPU server
At the beginning of the lease time, we will bring up our GPU server. We will use the python-chi Python API to Chameleon to provision our server.
We will execute the cells in this notebook inside the Chameleon Jupyter environment.
Run the following cell, and make sure the correct project is selected:
from chi import server, context, lease
import os
context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@UC")
Change the string in the following cell to reflect the name of your lease (with your own net ID), then run it to get your lease:
l = lease.get_lease("llm_netID") # or llm_single_netID, or llm_multi_netID
l.show()
The status should show as “ACTIVE” now that we are past the lease start time.
The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting Run > Run Selected Cell and All Below from the Jupyter menu.
As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!
We will use the lease to bring up a server with the CC-Ubuntu24.04-CUDA disk image. (Note that the reservation information is passed when we create the instance!) This will take up to 10 minutes.
username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
f"node-llm-{username}",
reservation_id=l.node_reservations[0]["id"],
image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)
Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.
Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.
s.associate_floating_ip()
s.refresh()
s.check_connectivity()
s.refresh()
s.show(type="widget")
Retrieve code and notebooks on the instance
Now, we can use python-chi to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.
s.execute("git clone https://github.com/teaching-on-testbeds/llm-chi")
Set up Docker with NVIDIA container toolkit
To use common deep learning frameworks like Tensorflow or PyTorch, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")
s.execute("docker run hello-world")
We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.
# get NVIDIA container toolkit
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
s.execute("sudo systemctl restart docker")
In the following cell, we will verify that we can see our NVIDIA GPUs from inside a container, by passing --gpus all. (The --rm flag says to clean up the container and remove its filesystem when it finishes running.)
s.execute("docker run --rm --gpus all ubuntu nvidia-smi")
Let’s pull the actual container images that we are going to use,
- For the “Single GPU” section: a Jupyter notebook server with PyTorch and CUDA libraries
- For the “Multiple GPU” section: a PyTorch image with NVIDIA developer tools, which we’ll need in order to install DeepSpeed
Pull and start container for “Single GPU” section
Let’s pull the container:
s.execute("docker pull quay.io/jupyter/pytorch-notebook:cuda12-pytorch-2.5.1")
and get it running:
s.execute("docker run -d -p 8888:8888 --gpus all --name torchnb quay.io/jupyter/pytorch-notebook:cuda12-pytorch-2.5.1")
There’s one more thing we must do before we can start our Jupyter server. Rather than expose the Jupyter server to the Internet, we are going to set up an SSH tunnel from our local terminal to our server, and access the service through that tunnel.
Here’s how it works: In your local terminal, run
ssh -L 8888:127.0.0.1:8888 -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D
where,
- instead of ~/.ssh/id_rsa_chameleon, substitute the path to your key
- and instead of A.B.C.D, substitute the floating IP associated with your server
This will configure the SSH session so that when you connect to port 8888 locally, it will be forwarded over the SSH tunnel to port 8888 on the host at the other end of the SSH connection.
SSH tunneling is a convenient way to access services on a remote machine when you don’t necessarily want to expose those services to the Internet (for example: if they are not secured from unauthorized access).
Finally, run
s.execute("docker logs torchnb")
Look for the line of output in the form:
http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
and copy it for use in the next section.
You will continue working from the next notebook! But, if you are planning to do the “Multiple GPU” section in the same lease, run the following cells too, to let the container image for the “Multiple GPU” section get pulled in the background. This image takes a loooooong time to pull, so it’s important to get it started and leave it running while you are working in your other tab on the “Single GPU” section.
Pull container for “Multiple GPU” section
s.execute("docker pull pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel")
and let’s also install some software on the host that we’ll use in the “Multiple GPU” section:
s.execute("sudo apt update; sudo apt -y install nvtop")
Train a large model on a single GPU - A100 80GB
In this section, we will practice strategies for training a large model on a single GPU. After completing this section, you should understand the effect of
- batch size
- gradient accumulation
- reduced precision/mixed precision
- parameter efficient fine tuning
on a large model training job.
This section requires a host with at least one A100 80GB GPU.
This notebook will be executed inside a Jupyter interface hosted on a GPU server instance on Chameleon, NOT in the Chameleon Jupyter interface from which we launch experiments (provision servers, etc.)
Open the notebook on Colab
We should have already started a notebook server in a container on a Chameleon GPU host, and set up an SSH tunnel to this notebook server. Now, we will open this notebook in Google Colab and connect it to the runtime that you have in Chameleon. This is a convenient way to work, because the notebook and its outputs will be saved automatically in your Google Drive.
- Use this button to open the notebook in Colab:
- Click “File > Save a Copy in Drive” to save it in your own Google Drive. Work in your copy, so that the outputs will be saved automatically.
- Next to the “Connect” button in the top right, there is a ▼ symbol. Click on this symbol to expand the menu, and choose “Connect to a local runtime”.
- Paste the http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX URL you copied earlier into this space, and choose “Connect”.
Alternatively, if you prefer not to use Colab (or can’t, for some reason): just put the http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX URL you copied earlier into your browser to open the Jupyter interface directly. But, then you’ll have to open a terminal in that Jupyter interface and run
wget https://raw.githubusercontent.com/teaching-on-testbeds/llm-chi/refs/heads/main/workspace/2_single_gpu_a100.ipynb
to get a copy of this notebook in that workspace.
Make sure that you can see the GPUs:
!nvidia-smi
Prepare LitGPT
For this tutorial, we will fine-tune a TinyLlama or OpenLLaMA large language model using litgpt. LitGPT is a convenient wrapper around many PyTorch Lightning capabilities that makes it easy to fine-tune an LLM using a “recipe” defined in a YAML file. (We’ll also try the Python API for LitGPT in the “Multiple GPU” section of this tutorial.)
You may browse the “recipes” for this experiment in our Github repository.
Our focus will be exclusively on comparing the time and memory requirements of training jobs under different settings - we will completely ignore the loss of the fine-tuned model, and we will make some choices to reduce the overall time of our experiment (to fit in a short Chameleon lease) that wouldn’t make sense if we really needed the fine-tuned model (e.g. using a very small fraction of the training data).
First, install LitGPT:
!pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'
then, download the foundation models:
!litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
!litgpt download openlm-research/open_llama_3b
!litgpt download openlm-research/open_llama_7b
!litgpt download openlm-research/open_llama_13b
Also, get the “recipes” that we will use for LLM fine-tuning. Using the file browser on the left side, look at the contents of the “config” directory.
!git clone https://github.com/teaching-on-testbeds/llm-chi/
!mv llm-chi/workspace/config .
Experiment: Baseline
As a baseline, let’s try an epoch of fine-tuning the TinyLlama-1.1B, using full precision and a batch size of 32:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 32
This will fail because the training job won’t fit in our 80GB GPU memory.
Experiment: Reduced batch size
But with a smaller batch size, it fits easily:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 8 --train.micro_batch_size 8
Make a note of the training time and memory, which is printed at the end of the training job.
Experiment: Gradient accumulation
By using gradient accumulation to “step” only after a few “micro batches”, we can train with a larger effective “global” batch size, with minimal effect on the memory required:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8
Make a note of the training time and memory, which is printed at the end of the training job.
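To make the mechanism concrete, here is a minimal gradient accumulation sketch in plain PyTorch (a toy model and random data, not the LitGPT implementation): gradients from several micro batches are accumulated before a single optimizer step, so the effective “global” batch size is micro_batch_size x accum_steps while peak memory stays close to that of one micro batch.
# minimal gradient accumulation sketch (toy model and random data, not the LitGPT implementation)
import torch
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)   # stand-in for the LLM
optimizer = torch.optim.AdamW(model.parameters())
accum_steps = 4                                # "global" batch = 8 x 4 = 32
micro_batches = [(torch.randn(8, 512), torch.randn(8, 512)) for _ in range(16)]
for i, (x, y) in enumerate(micro_batches):
    loss = F.mse_loss(model(x.to(device)), y.to(device))
    (loss / accum_steps).backward()            # gradients accumulate across micro batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer step per "global" batch
        optimizer.zero_grad()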
Experiment: Reduced precision
With a “brain float16” format for numbers, instead of “float32”, we can further reduce the memory required, although this representation is less precise:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true
Make a note of the training time and memory, which is printed at the end of the training job.
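As a small illustration of what “less precise” means here (generic PyTorch, unrelated to the recipe): bfloat16 keeps the same exponent range as float32, but only 8 significant bits instead of 24, so values are stored more coarsely:
# illustration of bfloat16's reduced precision
import torch
x = torch.tensor(1/3, dtype=torch.float32)
print(x.item())                          # ~0.33333334
print(x.to(torch.bfloat16).item())       # 0.333984375 - the nearest representable bfloat16 value
print(torch.finfo(torch.float32).eps)    # ~1.19e-07
print(torch.finfo(torch.bfloat16).eps)   # 0.0078125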
Experiment: Mixed precision
With mixed precision, we get back some of the lost precision in the results, at the cost of some additional memory and time:
!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-mixed
Make a note of the training time and memory, which is printed at the end of the training job.
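In plain PyTorch terms, “mixed” precision typically means keeping the parameters in float32 while running forward and backward compute in bfloat16 under an autocast context (no gradient scaler is needed with bfloat16). The sketch below is a generic illustration of that idea, not how LitGPT implements its bf16-mixed option internally:
# generic mixed-precision sketch: float32 parameters, bfloat16 compute via autocast
import torch
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)   # parameters stay in float32
optimizer = torch.optim.AdamW(model.parameters())
x, y = torch.randn(8, 512, device=device), torch.randn(8, 512, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = F.mse_loss(model(x), y)             # forward pass computed in bfloat16
loss.backward()                                # gradients land in float32 (the parameter dtype)
optimizer.step()
optimizer.zero_grad()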
Experiment: Larger model - 3b
We’ve gained so much GPU memory back with these techniques, we can even train a larger model. Let’s switch from the 1.1B to the 3B model:
!litgpt finetune_full --config config/open-llama-3b-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true
Make a note of the training time and memory, which is printed at the end of the training job.
Experiment: Larger model - 7b
If we reduce the batch size again, we can even train a 7b model:
!litgpt finetune_full --config config/open-llama-7b-full.yaml --train.global_batch_size 16 --train.micro_batch_size 4 --precision bf16-true
Make a note of the training time and memory, which is printed at the end of the training job.
Experiment: Larger model - 13b
Even with the smallest possible batch size, we can’t train a 13B model:
!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true
this will fail with an “out of memory” error. But, if we switch from the Adam optimizer (which has two state values per parameter) to SGD, we can train a 13B model. It’s verrrrry slow, though, so we won’t even train it for a full epoch - just 25 “steps”, so we can get an idea of the memory required:
!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true --optimizer SGD --train.max_steps 25
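A rough back-of-the-envelope calculation shows why the Adam run cannot fit but the SGD run can. Assuming bf16 weights and gradients and fp32 Adam moment estimates (implementations vary, and this ignores activations and other overhead):
# rough back-of-the-envelope for persistent training state (ignores activations and overhead)
params = 13e9                # ~13B parameters
weights, grads = 2, 2        # bytes per parameter in bf16
adam_state = 4 + 4           # fp32 first and second moment estimates (assumption)
sgd_state = 0                # plain SGD without momentum keeps no extra state
gib = 1024**3
print(f"Adam: {params * (weights + grads + adam_state) / gib:.0f} GiB")  # ~145 GiB - exceeds 80 GB
print(f"SGD:  {params * (weights + grads + sgd_state) / gib:.0f} GiB")   # ~48 GiB - leaves room on an 80 GB GPU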
Experiment: Parameter efficient fine tuning
If we are only fine-tuning, not training a model from scratch, we can also consider LoRA and QLoRA. Let’s try it first with our 1.1B model:
!litgpt finetune --config config/tiny-llama-lora.yaml
The memory required is shockingly small! We can see it with our 3B and 7B models, too:
!litgpt finetune --config config/open-llama-3b-lora.yaml
!litgpt finetune --config config/open-llama-7b-lora.yaml
We can also further reduce the memory required with quantization:
!litgpt finetune --config config/open-llama-7b-lora.yaml --quantize bnb.nf4
Even the 13B model can be trained quickly with minimal memory required, using LoRA:
!litgpt finetune --config config/open-llama-13b-lora.yaml
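The reason the memory footprint drops so dramatically: the base model’s weights are frozen (so they need no gradients and no optimizer state), and only small low-rank adapter matrices are trained. A minimal, generic sketch of a LoRA-style linear layer (an illustration of the idea, not LitGPT’s implementation) looks like this:
# minimal LoRA-style layer sketch (generic illustration, not LitGPT's implementation)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # frozen base: no gradients, no optimizer state
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # output = base(x) + scale * (x @ A^T) @ B^T, where only A and B are trained
        return self.base(x) + self.scale * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")  # ~0.4%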
Train a large model on multiple GPUs - 4x A100 80GB
In this section, we will practice strategies for training a large model using distributed processes across multiple GPUs. This section requires a host with 4x A100 80GB GPUs.
Note: If you have reserved a 4x V100 GPU instance, skip to the V100 section!
After completing this section, you should understand the effect of
- distributed data parallelism
- and fully sharded data parallelism
on a large model training job.
You may view the Python code we will execute in this experiment in our Github repository.
You will execute the commands in this section either inside an SSH session on the Chameleon “node-llm” server, or inside a container that runs on this server. You will need two terminals arranged side-by-side or vertically, and in both terminals, use SSH to connect to the “node-llm” server.
Start the container
We will run code inside a container that has:
- PyTorch
- NVIDIA CUDA and NVIDIA CUDA developer tools, because these will be needed to install DeepSpeed
First, make sure there are no other containers running, because we will need exclusive access to the GPUs:
# run on node-llm
docker ps
If any containers are still running, stop them with
# run on node-llm
docker stop CONTAINER
(substituting the container name or ID in place of CONTAINER.)
Then, start the PyTorch + NVIDIA CUDA and NVIDIA CUDA developer tools container with
# run on node-llm
docker run -it -v /home/cc/llm-chi/torch:/workspace --gpus all --ipc host pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
Note that:
- -v /home/cc/llm-chi/torch:/workspace sets up a bind mount, so the contents of the /home/cc/llm-chi/torch directory on the “node-llm” host (which has the code we’ll use in this section!) will appear in the /workspace directory of the container.
- --gpus all passes through all of the host’s GPUs to the container.
- --ipc host says to use the host namespace for inter-process communication, which will improve performance. (A slightly more secure alternative would be to set --shm-size to a large value, to increase the memory available for inter-process communication, but for our purposes --ipc host is fine and more convenient.)
Install software in the container
Inside the container, install a few Python libraries:
# run inside pytorch container
pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'
and download the foundation model we are going to fine-tune:
# run inside pytorch container
litgpt download openlm-research/open_llama_7b
Start nvtop on the host
In your second terminal session, start nvtop, which we will use to monitor the resource usage of the NVIDIA GPUs on the host:
# run on node-llm
nvtop
and leave it running throughout all the experiments in this section.
Experiment: OpenLLaMA 7b model on a single A100 80GB
We previously noted that we can train an OpenLLaMA 7b model on a single A100 80GB GPU with bf16 precision and batch size 4, and that this setting would essentially max out the available GPU memory on the A100 80GB.
Now, we’ll repeat this test using the Python API for litgpt instead of its command line interface (and, we won’t use gradient accumulation this time). You may view a100_llama7b_1device.py in our Github repository. Run it inside the container with:
# run inside pytorch container
python3 a100_llama7b_1device.py
As it runs, note in nvtop that only one GPU is used. We will see that for GPU 0, the GPU utilization is close to 100% and the GPU memory utilization is also high, but the other GPUs have zero utilization. Also note that in the list of processes, there is a single process running on device 0.
Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: OpenLLaMA 7b model on 4x A100 80GB with DDP
Now, we’ll repeat the same experiment with DDP across 4 GPUs! You may view a100_llama7b_4ddp.py in our Github repository. Inside the container, run
# run inside pytorch container
python3 a100_llama7b_4ddp.py
In this training script, we’ve exchanged
devices=1,
for
devices=4,
strategy=DDPStrategy(),
Note that it may take a minute or two for the training job to start.
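For orientation, the skeleton below shows where these devices and strategy settings live in a PyTorch Lightning training script. It uses a toy model and random data (it is not the actual a100_llama7b_4ddp.py script), but the Trainer wiring is the same idea:
# generic Lightning skeleton showing where devices/strategy are configured (toy model, not the real script)
import torch
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch.strategies import DDPStrategy

class ToyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(512, 512)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters())

if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 512), torch.randn(256, 512))
    trainer = L.Trainer(
        accelerator="gpu",
        devices=4,                   # one training process per GPU
        strategy=DDPStrategy(),      # swap in FSDPStrategy(sharding_strategy='FULL_SHARD') to shard state
        precision="bf16-true",
        max_epochs=1,
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(ToyModule(), DataLoader(data, batch_size=8))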
As it runs, note in nvtop that four GPUs are used, all with high utilization, and that four processes are listed. Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: OpenLLaMA 7b model on 4x A100 80GB with FSDP
With DDP, we have a larger effective batch size (since 4 GPUs process a batch in parallel), but no memory savings. With FSDP, we can shard optimizer state, gradients, and parameters across GPUs, to also reduce the memory required.
You may view a100_llama7b_4fsdp.py in our Github repository.
Inside the container, run:
# run inside pytorch container
python3 a100_llama7b_4fsdp.py
In this training script, we’ve exchanged
strategy=DDPStrategy(),
for
strategy=FSDPStrategy(sharding_strategy='FULL_SHARD'),
As it runs, note in nvtop that four GPUs are used, with high utilization of the GPU but lower utilization of its memory. Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: OpenLLaMA 7b model on 4x A100 80GB with FSDP and larger batch size
Because of the memory savings achieved by FSDP, we can increase the batch size (and potentially achieve faster training times) without running out of memory.
You may view a100_llama7b_4fsdp_8batch.py in our Github repository.
Inside the container, run:
# run inside pytorch container
python3 a100_llama7b_4fsdp_8batch.py
In this training script, we’ve changed the batch_size to 8.
As it runs, note in nvtop that the GPUs again have high memory utilization. Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
(Optional) Experiment: OpenLLaMA 13b model on 4x A100 80GB with CPU optimizer offload via DeepSpeed
Finally, as an optional experiment, we can try training a much bigger model - the 13B OpenLLaMA model - using a combination of:
- sharding parameters and gradients across GPUs, as before
- and offloading the optimizer state to CPU
You may view a100_llama13b_deepspeed.py in our Github repository.
For this experiment, we’ll install deepspeed:
# run inside pytorch container
DS_BUILD_CPU_ADAM=1 pip install deepspeed
and download the 13b model:
# run inside pytorch container
litgpt download openlm-research/open_llama_13b
Now, we can run
# run inside pytorch container
python3 a100_llama13b_deepspeed.py
In this training script, besides replacing the 7B model with the 13B model:
- We swapped out our previous PyTorch Adam optimizer for DeepSpeedCPUAdam
- We changed the training strategy from FSDP to
strategy=DeepSpeedStrategy(
stage=3, # Similar to FULL_SHARD
offload_optimizer=True # Enable CPU offloading of optimizer
),
As it runs, note in nvtop that especially near the end of the step, the GPUs will be underutilized as they wait for CPU.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Debugging note
Note: if any training job crashes due to OOM, you can ensure all of the distributed processes are stopped by running
# run inside pytorch container
pkill -9 python
Train a large model on multiple GPUs - 4x V100 32GB
In this section, we will practice strategies for training a large model using distributed processes across multiple GPUs. This section requires a host with 4x V100 GPUs with 32GB video RAM.
Note: If you have already done the “Multiple GPU” section on a 4x A100 GPU instance, you will skip this section! This is just an alternative version of the same ideas, executed on different hardware.
After completing this section, you should understand the effect of
- distributed data parallelism
- and fully sharded data parallelism
on a large model training job.
You may view the Python code we will execute in this experiment in our Github repository.
You will execute the commands in this section either inside an SSH session on the Chameleon “node-llm” server, or inside a container that runs on this server. You will need two terminals arranged side-by-side or vertically, and in both terminals, use SSH to connect to the “node-llm” server.
Start the container
We will run code inside a container that has:
- PyTorch
- NVIDIA CUDA and NVIDIA CUDA developer tools, because these will be needed to install DeepSpeed
First, make sure there are no other containers running, because we will need exclusive access to the GPUs:
# run on node-llm
docker ps
If any containers are still running, stop them with
# run on node-llm
docker stop CONTAINER
(substituting the container name or ID in place of CONTAINER.)
Then, start the PyTorch + NVIDIA CUDA and NVIDIA CUDA developer tools container with
# run on node-llm
docker run -it -v /home/cc/llm-chi/torch:/workspace --gpus all --ipc host pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
Note that:
- -v /home/cc/llm-chi/torch:/workspace sets up a bind mount, so the contents of the /home/cc/llm-chi/torch directory on the “node-llm” host (which has the code we’ll use in this section!) will appear in the /workspace directory of the container.
- --gpus all passes through all of the host’s GPUs to the container.
- --ipc host says to use the host namespace for inter-process communication, which will improve performance. (A slightly more secure alternative would be to set --shm-size to a large value, to increase the memory available for inter-process communication, but for our purposes --ipc host is fine and more convenient.)
Install software in the container
Inside the container, install a few Python libraries:
# run inside pytorch container
pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'
and download the foundation model we are going to fine-tune:
# run inside pytorch container
litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
Start nvtop on the host
In your second terminal session, start nvtop, which we will use to monitor the resource usage of the NVIDIA GPUs on the host:
# run on node-llm
nvtop
and leave it running throughout all the experiments in this section.
Experiment: TinyLlama 1.1B model on a single V100 32GB
We previously noted that we can train a TinyLlama 1.1B model on a single GPU with bf16 precision and micro batch size 8, with less than 32GB of RAM (to fit in the memory of our V100).
Now, we’ll repeat this test using the Python API for litgpt instead of its command line interface. You may view v100_llama1b_1device.py in our Github repository. Run it inside the container with:
# run inside pytorch container
python3 v100_llama1b_1device.py
As it runs, note in nvtop that only one GPU is used. We will see that for GPU 0, the GPU utilization is close to 100% and the GPU memory utilization is also high, but the other GPUs have zero utilization. Also note that in the list of processes, there is a single process running on device 0.
Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: TinyLlama 1.1B model on 4x V100 32GB with DDP
Now, we’ll repeat the same experiment with DDP across 4 GPUs! You may view v100_llama1b_4ddp.py in our Github repository. Inside the container, run
# run inside pytorch container
python3 v100_llama1b_4ddp.py
In this training script, we’ve exchanged
devices=1,
for
devices=4,
strategy=DDPStrategy(),
Note that it may take a minute or two for the training job to start.
As it runs, note in nvtop that four GPUs are used, all with high utilization, and that four processes are listed. Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: TinyLlama 1.1B model on 4x V100 32GB with FSDP
With DDP, we have a larger effective batch size (since 4 GPUs process a batch in parallel), but no memory savings. With FSDP, we can shard optimizer state, gradients, and parameters across GPUs, to also reduce the memory required.
You may view v100_llama1b_4fsdp.py in our Github repository.
Inside the container, run:
# run inside pytorch container
python3 v100_llama1b_4fsdp.py
In this training script, we’ve exchanged
strategy=DDPStrategy(),
for
strategy=FSDPStrategy(sharding_strategy='FULL_SHARD'),
As it runs, note in nvtop that four GPUs are used, with high utilization of the GPU but lower utilization of its memory. Take a screenshot of this nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Experiment: OpenLLaMA 3B model on 4x V100 32GB with FSDP
Because of the memory savings achieved by FSDP, we can train a larger model without running out of memory.
You may view v100_llama3b_4fsdp.py in our Github repository.
Inside the container, run:
# run inside pytorch container
python3 v100_llama3b_4fsdp.py
In this training script, we’ve changed the model to openlm-research/open_llama_3b and the batch size to 1, and turned off gradient accumulation. Also, since this is slow, we’re training on a smaller fraction of the data than in our previous experiments.
Take a screenshot of the nvtop display while the script is running, for later reference.
When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.
Note that we cannot train this model on a single V100 without running out of memory - try
# run inside pytorch container
python3 v100_llama3b_1device.py
which runs on one GPU but otherwise has the same configuration, and observe that you get an OOM error.
Debugging note
Note: if any training job crashes due to OOM, you can ensure all of the distributed processes are stopped by running
# run inside pytorch container
pkill -9 python
Questions about this material? Contact Fraida Fund
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.