Large-scale model training on Chameleon

In this tutorial, we will practice fine-tuning a large language model. We will use a selection of techniques - gradient accumulation, reduced and mixed precision, parameter efficient fine tuning, and distributed training across multiple GPUs - to allow us to train models that would not otherwise fit in GPU memory.

To run this experiment, you should have already created an account on Chameleon, and become part of a project. You must also have added your SSH key to the CHI@UC site.

Experiment topology

In this experiment, we will deploy a single bare metal instance with one or more NVIDIA A100 or V100 GPUs.

We can browse Chameleon hardware configurations for suitable node types using the Hardware Browser.

For example, to find nodes with 4x GPUs: if we expand “Advanced Filters”, check the “4” box under “GPU count”, and then click “View”, we can identify some suitable node types: gpu_a100_pcie, gpu_a100_nvlink, gpu_v100, or gpu_v100_nvlink at CHI@UC. (The NVLink-type nodes have a high-speed interconnect between GPUs.)

(We will avoid P100-based node types for this experiment, because the P100 has less GPU RAM and lower compute capability.)

Create a lease

To use bare metal resources on Chameleon, we must reserve them in advance. We can use the OpenStack graphical user interface, Horizon, to submit a lease for an A100 or V100 node at CHI@UC. To access this interface,

If you plan to do “Single GPU” and “Multiple GPU” together in a 3-hour block:

Your lease status should show as “Pending”. Click on the lease to see an overview. It will show the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease. Make sure that the lease details are correct.

If you plan to do “Single GPU” and “Multiple GPU” separately in two 2-hour blocks:

First, make a 2-hour reservation on a node with a single A100 80GB GPU. We will use a compute_gigaio node, but avoid gigaio-compute-06, which has no GPU.

Your lease status should show as “Pending”. If you click on the lease, you can see an overview, including the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease.

Next, make a 2-hour reservation on a node with 4x A100 or 4x V100 GPUs. Repeat the steps above, but for a gpu_a100_pcie or gpu_v100 node type, and use the lease name llm_multi_netID (with your own net ID).

Since you will need the full lease time to actually execute your experiment, you should read all of the experiment material ahead of time in preparation, so that you make the best possible use of your GPU time.

At the beginning of your lease time, you will continue with the next step, in which you bring up a bare metal instance!

Before you begin, open this experiment on Trovi:

You will see several notebooks inside the llm-chi directory - look for the one titled 1_create_server.ipynb. Open this notebook and continue there.

Bring up a GPU server

At the beginning of the lease time, we will bring up our GPU server. We will use python-chi, the Python API for Chameleon, to provision our server.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected:

from chi import server, context, lease
import os

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@UC")

Change the string in the following cell to reflect the name of your lease (with your own net ID), then run it to get your lease:

l = lease.get_lease("llm_netID") # or llm_single_netID, or llm_multi_netID
l.show()

The status should show as “ACTIVE” now that we are past the lease start time.

The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting Run > Run Selected Cell and All Below from the Jupyter menu.

As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!

We will use the lease to bring up a server with the CC-Ubuntu24.04-CUDA disk image. (Note that the reservation information is passed when we create the instance!) This will take up to 10 minutes.

username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
    f"node-llm-{username}", 
    reservation_id=l.node_reservations[0]["id"],
    image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)

Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.

Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.

s.associate_floating_ip()
s.refresh()
s.check_connectivity()
s.refresh()
s.show(type="widget")

Retrieve code and notebooks on the instance

Now, we can use python-chi to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.

s.execute("git clone https://github.com/teaching-on-testbeds/llm-chi")

Set up Docker with NVIDIA container toolkit

To use common deep learning frameworks like TensorFlow or PyTorch, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.

s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")
s.execute("docker run hello-world")

We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.

# get NVIDIA container toolkit 
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
s.execute("sudo systemctl restart docker")

In the following cell, we will verify that we can see our NVIDIA GPUs from inside a container, by passing --gpus all. (The --rm flag says to clean up the container and remove its filesystem when it finishes running.)

s.execute("docker run --rm --gpus all ubuntu nvidia-smi")

Let’s pull the actual container images that we are going to use.

Pull and start container for “Single GPU” section

Let’s pull the container:

s.execute("docker pull quay.io/jupyter/pytorch-notebook:cuda12-pytorch-2.5.1")

and get it running:

s.execute("docker run -d -p 8888:8888 --gpus all --name torchnb quay.io/jupyter/pytorch-notebook:cuda12-pytorch-2.5.1")

There’s one more thing we must do before we can use our Jupyter server. Rather than expose the Jupyter server to the Internet, we are going to set up an SSH tunnel from our local terminal to our server, and access the service through that tunnel.

Here’s how it works: In your local terminal, run

ssh -L 8888:127.0.0.1:8888 -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where A.B.C.D is the floating IP address assigned to your server, and ~/.ssh/id_rsa_chameleon is the path to your SSH private key (the one whose public key you added to the CHI@UC site).

This will configure the SSH session so that when you connect to port 8888 locally, it will be forwarded over the SSH tunnel to port 8888 on the host at the other end of the SSH connection.

SSH tunneling is a convenient way to access services on a remote machine when you don’t necessarily want to expose those services to the Internet (for example: if they are not secured from unauthorized access).

Finally, run

s.execute("docker logs torchnb")

Look for the line of output in the form:

http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

and copy it for use in the next section.

You will continue working from the next notebook! But, if you are planning to do the “Multiple GPU” section in the same lease, run the following cells too, to let the container image for the “Multiple GPU” section get pulled in the background. This image takes a loooooong time to pull, so it’s important to get it started and leave it running while you are working in your other tab on the “Single GPU” section.

Pull container for “Multiple GPU” section

s.execute("docker pull pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel")

and let’s also install some software on the host that we’ll use in the “Multiple GPU” section:

s.execute("sudo apt update; sudo apt -y install nvtop")

Train a large model on a single GPU - A100 80GB

In this section, we will practice strategies for training a large model on a single GPU. After completing this section, you should understand the effect of batch size, gradient accumulation, reduced and mixed precision, model size, optimizer choice, and parameter efficient fine tuning (LoRA and QLoRA) on a large model training job.

This section requires a host with at least one A100 80GB GPU.

This notebook will be executed inside a Jupyter interface hosted on a GPU server instance on Chameleon, NOT in the Chameleon Jupyter interface from which we launch experiments (provision servers, etc.)

Open the notebook on Colab

We should have already started a notebook server in a container on a Chameleon GPU host, and set up an SSH tunnel to this notebook server. Now, we will open this notebook in Google Colab and connect it to the runtime that you have in Chameleon. This is a convenient way to work, because the notebook and its outputs will be saved automatically in your Google Drive.

Alternatively, if you prefer not to use Colab (or can’t, for some reason): just put the http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX URL you copied earlier into your browser to open the Jupyter interface directly. But, then you’ll have to open a terminal in that Jupyter interface and run

wget https://raw.githubusercontent.com/teaching-on-testbeds/llm-chi/refs/heads/main/workspace/2_single_gpu_a100.ipynb

to get a copy of this notebook in that workspace.

Make sure that you can see the GPUs:

!nvidia-smi

Prepare LitGPT

For this tutorial, we will fine-tune a TinyLlama or OpenLLaMA large language model using litgpt. LitGPT is a convenient wrapper around many PyTorch Lightning capabilities that makes it easy to fine-tune a model on a GPU using a “recipe” defined in a YAML file. (We’ll also try the Python API for LitGPT in the “Multiple GPU” section of this tutorial.)

You may browse the “recipes” for this experiment in our Github repository.

Our focus will be exclusively on comparing the time and memory requirements of training jobs under different settings - we will completely ignore the loss of the fine-tuned model, and we will make some choices to reduce the overall time of our experiment (to fit in a short Chameleon lease) that wouldn’t make sense if we really needed the fine-tuned model (e.g. using a very small fraction of the training data).

First, install LitGPT:

!pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'

then, download the foundation models:

!litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
!litgpt download openlm-research/open_llama_3b
!litgpt download openlm-research/open_llama_7b
!litgpt download openlm-research/open_llama_13b

Also, get the “recipes” that we will use for LLM fine-tuning. After running the cell below, you can use the file browser on the left side to look at the contents of the “config” directory.

!git clone https://github.com/teaching-on-testbeds/llm-chi/
!mv llm-chi/workspace/config .

Experiment: Baseline

As a baseline, let’s try an epoch of fine-tuning the TinyLlama 1.1B model, using full precision and a batch size of 32:

!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 32

This will fail because the training job won’t fit in our 80GB GPU memory.
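
To see why, here is a rough back-of-envelope estimate of the memory needed just to hold the model states in full precision (these are approximations; the exact numbers depend on litgpt internals, sequence length, and framework overhead):

# Rough, illustrative estimate for full-precision fine-tuning of a ~1.1B parameter model.
params = 1.1e9

weights    = params * 4       # fp32 weights: 4 bytes per parameter
gradients  = params * 4       # one fp32 gradient per parameter
adam_state = params * 4 * 2   # Adam keeps two fp32 state values per parameter

print(f"Model states: ~{(weights + gradients + adam_state) / 1e9:.0f} GB")   # ~18 GB

# On top of this, activations saved for the backward pass scale with the micro
# batch size and sequence length - with a micro batch size of 32, they can
# easily exhaust the remaining memory on an 80GB GPU.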

Experiment: Reduced batch size

But with a smaller batch size, it fits easily:

!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 8 --train.micro_batch_size 8

Make a note of the training time and memory, which is printed at the end of the training job.

Experiment: Gradient accumulation

By using gradient accumulation to “step” only after a few “micro batches”, we can train with a larger effective “global” batch size, with minimal effect on the memory required:

!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8

Make a note of the training time and memory, which is printed at the end of the training job.
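
Under the hood, gradient accumulation is a simple idea: gradients from several micro batches are summed before each optimizer step, so the effective global batch size grows while only one micro batch of activations is ever held in memory. A minimal sketch in plain PyTorch (an illustration, not litgpt’s implementation):

import torch

model = torch.nn.Linear(512, 512)        # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())

accumulation_steps = 4                   # global batch size = 4 x micro batch size

for step in range(16):
    x = torch.randn(8, 512)              # one micro batch
    y = torch.randn(8, 512)
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps
    loss.backward()                      # gradients accumulate in each parameter's .grad

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                 # one "real" optimizer step per accumulation window
        optimizer.zero_grad()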

Experiment: Reduced precision

With a “brain float16” format for numbers, instead of “float32”, we can further reduce the memory required, although this representation is less precise:

!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true

Make a note of the training time and memory, which is printed at the end of the training job.
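
To make the trade-off concrete, the short (illustrative) snippet below shows that bfloat16 uses half the storage of float32 per element, but keeps far fewer significand bits, so nearby values become indistinguishable:

import torch

print(torch.tensor(0.0, dtype=torch.float32).element_size())   # 4 bytes per element
print(torch.tensor(0.0, dtype=torch.bfloat16).element_size())  # 2 bytes per element

# bfloat16 keeps float32's exponent range but only ~8 bits of significand,
# so small differences are rounded away:
print(torch.tensor(1.0001, dtype=torch.bfloat16))               # prints tensor(1., dtype=torch.bfloat16)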

Experiment: Mixed precision

With mixed precision, we get back some of the lost precision in the results, at the cost of some additional memory and time:

!litgpt finetune_full --config config/tiny-llama-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-mixed

Make a note of the training time and memory, which is printed at the end of the training job.
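
Conceptually, bf16-mixed keeps the model weights and the optimizer step in float32 (hence the extra memory compared to bf16-true), but runs most of the forward and backward compute in bfloat16. A minimal sketch of this pattern in plain PyTorch (an illustration, not litgpt’s implementation):

import torch

model = torch.nn.Linear(512, 512).cuda()     # weights stay in float32
optimizer = torch.optim.AdamW(model.parameters())

x = torch.randn(8, 512, device="cuda")
y = torch.randn(8, 512, device="cuda")

# Matrix multiplies inside the autocast region run in bfloat16, while the
# parameters and optimizer state remain float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)

loss.backward()
optimizer.step()
optimizer.zero_grad()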

Experiment: Larger model - 3b

We’ve gained so much GPU memory back with these techniques that we can even train a larger model. Let’s switch from the 1.1B to the 3B model:

!litgpt finetune_full --config config/open-llama-3b-full.yaml --train.global_batch_size 32 --train.micro_batch_size 8 --precision bf16-true

Make a note of the training time and memory, which is printed at the end of the training job.

Experiment: Larger model - 7b

If we reduce the batch size again, we can even train a 7b model:

!litgpt finetune_full --config config/open-llama-7b-full.yaml --train.global_batch_size 16 --train.micro_batch_size 4 --precision bf16-true

Make a note of the training time and memory, which is printed at the end of the training job.

Experiment: Larger model - 13b

Even with the smallest possible batch size, we can’t train a 13B model:

!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true

this will fail with an “out of memory” error. But, if we switch from the Adam optimizer (which has two state values per parameter) to SGD, we can train a 13B model. It’s verrrrry slow, though, so we won’t even train it for a full epoch - just 25 “steps”, so we can get an idea of the memory required:

!litgpt finetune_full --config config/open-llama-13b-full.yaml --train.global_batch_size 1 --train.micro_batch_size 1 --precision bf16-true --optimizer SGD --train.max_steps 25
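
The difference comes almost entirely from optimizer state. A rough, illustrative estimate for a 13B-parameter model (the exact numbers depend on how litgpt stores optimizer state):

# Rough, illustrative comparison of optimizer state for a ~13B parameter model.
params = 13e9

# Adam(W) keeps two extra state tensors per parameter (exp_avg and exp_avg_sq),
# commonly stored in float32:
adam_state_gb = params * 4 * 2 / 1e9
print(f"Adam optimizer state: ~{adam_state_gb:.0f} GB")   # ~104 GB - more than the whole GPU

# Plain SGD (without momentum) keeps no extra per-parameter state:
print("SGD optimizer state:  ~0 GB")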

Experiment: Parameter efficient fine tuning

If we are only fine-tuning, not training a model from scratch, we can also consider LoRA and QLoRA. Let’s try it first with our 1.1B model:

!litgpt finetune --config config/tiny-llama-lora.yaml

The memory required is shockingly small! We can see it with our 3B and 7B models, too:

!litgpt finetune --config config/open-llama-3b-lora.yaml
!litgpt finetune --config config/open-llama-7b-lora.yaml

We can also further reduce the memory required with quantization:

!litgpt finetune --config config/open-llama-7b-lora.yaml --quantize bnb.nf4

Even the 13B model can be trained quickly with minimal memory required, using LoRA:

!litgpt finetune --config config/open-llama-13b-lora.yaml
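
The reason LoRA is so cheap: the base model’s weights are frozen, and we only train a small low-rank update for selected layers, so only a tiny fraction of parameters need gradients and optimizer state. A minimal sketch of the idea (an illustration, not litgpt’s implementation):

import torch

class LoRALinear(torch.nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative only)."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # base weights are frozen
        self.lora_A = torch.nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.lora_B.weight)     # so the update starts as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x))

layer = LoRALinear(torch.nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")   # ~65K of ~16.8M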

Train a large model on multiple GPUs - 4x A100 80GB

In this section, we will practice strategies for training a large model using distributed processes across multiple GPUs. This section requires a host with 4x A100 80GB GPUs.

Note: If you have reserved a 4x V100 GPU instance, skip to the V100 section!

After completing this section, you should understand the effect of distributed training strategies - DDP, FSDP, and DeepSpeed with CPU offload of optimizer state - on a large model training job.

You may view the Python code we will execute in this experiment in our Github repository.

You will execute the commands in this section either inside an SSH session on the Chameleon “node-llm” server, or inside a container that runs on this server. You will need two terminals arranged side-by-side or vertically, and in both terminals, use SSH to connect to the “node-llm” server.

Start the container

We will run code inside a container that has PyTorch, NVIDIA CUDA, and the NVIDIA CUDA developer tools installed.

First, make sure there are no other containers running, because we will need exclusive access to the GPUs:

# run on node-llm
docker ps

If any containers are still running, stop them with

# run on node-llm
docker stop CONTAINER

(substituting the container name or ID in place of CONTAINER.)

Then, start this container with

# run on node-llm
docker run -it -v /home/cc/llm-chi/torch:/workspace --gpus all --ipc host pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Note that the -v argument mounts the /home/cc/llm-chi/torch directory on the host to /workspace inside the container (so our training scripts are available there), --gpus all gives the container access to all of the host’s GPUs, and --ipc host lets the container share the host’s IPC namespace, which PyTorch needs for shared memory between worker processes.

Install software in the container

Inside the container, install a few Python libraries:

# run inside pytorch container
pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'

and download the foundation model we are going to fine-tune:

# run inside pytorch container
litgpt download openlm-research/open_llama_7b

Start nvtop on the host

In your second terminal session, start nvtop, which we will use to monitor the resource usage of the NVIDIA GPUs on the host:

# run on node-llm
nvtop

and leave it running throughout all the experiments in this section.

Experiment: OpenLLaMA 7b model on a single A100 80GB

We previously noted that we can train an OpenLLaMA 7b model on a single A100 80GB GPU with bf16 precision and batch size 4, and that this setting would essentially max out the available GPU memory on the A100 80GB.

Now, we’ll repeat this test using the Python API for litgpt instead of its command line interface (and, we won’t use gradient accumulation this time). You may view a100_llama7b_1device.py in our Github repository. Run it inside the container with:

# run inside pytorch container
python3 a100_llama7b_1device.py

As it runs, note in nvtop that only one GPU is used. We will see that for GPU 0, the GPU utilization is close to 100% and the GPU memory utilization is also high, but the other GPUs have zero utilization. Also note that in the list of processes, there is a single process running on device 0.

Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: OpenLLaMA 7b model on 4x A100 80GB with DDP

Now, we’ll repeat the same experiment with DDP across 4 GPUs! You may view a100_llama7b_4ddp.py in our Github repository. Inside the container, run

# run inside pytorch container
python3 a100_llama7b_4ddp.py

In this training script, we’ve exchanged

    devices=1,

for

    devices=4,
    strategy=DDPStrategy(),
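
These devices and strategy arguments are ultimately consumed by PyTorch Lightning’s Fabric, which spawns one process per GPU and wraps the model in DistributedDataParallel, so each process computes gradients on its own micro batch and the gradients are averaged across GPUs. A standalone sketch of the same pattern (an illustration, not the tutorial’s script):

import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import DDPStrategy

fabric = Fabric(accelerator="cuda", devices=4, strategy=DDPStrategy())
fabric.launch()                                    # spawns one process per GPU

model = torch.nn.Linear(512, 512)                  # stand-in for the LLM
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = fabric.setup(model, optimizer)  # wraps the model in DDP on this process's GPU

for _ in range(10):
    x = torch.randn(8, 512, device=fabric.device)
    y = torch.randn(8, 512, device=fabric.device)
    loss = torch.nn.functional.mse_loss(model(x), y)
    fabric.backward(loss)                          # gradients are all-reduced across the 4 GPUs
    optimizer.step()
    optimizer.zero_grad()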

Note that it may take a minute or two for the training job to start.

As it runs, note in nvtop that four GPUs are used, all with high utilization, and that four processes are listed. Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: OpenLLaMA 7b model on 4x A100 80GB with FSDP

With DDP, we have a larger effective batch size (since 4 GPUs process a batch in parallel), but no memory savings. With FSDP, we can shard optimizer state, gradients, and parameters across GPUs, to also reduce the memory required.
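
Roughly, the model states that DDP replicates on every GPU are instead divided across the 4 GPUs. An illustrative estimate for a ~7B parameter model trained in bf16 (approximate - activations and framework overhead come on top, and some setups keep optimizer state in fp32, which roughly doubles that term):

# Rough, illustrative per-GPU memory for the model states of a ~7B parameter model in bf16.
params, n_gpus = 7e9, 4

weights   = params * 2       # bf16 weights
gradients = params * 2       # bf16 gradients
opt_state = params * 2 * 2   # two Adam state values per parameter, assumed bf16 here

ddp_gb  = (weights + gradients + opt_state) / 1e9            # replicated in full on each GPU
fsdp_gb = (weights + gradients + opt_state) / n_gpus / 1e9   # FULL_SHARD: divided across the GPUs

print(f"DDP,  per GPU: ~{ddp_gb:.0f} GB")    # ~56 GB of model states, before activations
print(f"FSDP, per GPU: ~{fsdp_gb:.0f} GB")   # ~14 GB, leaving much more room for activations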

You may view a100_llama7b_4fsdp.py in our Github repository.

Inside the container, run:

# run inside pytorch container
python3 a100_llama7b_4fsdp.py

In this training script, we’ve exchanged

    strategy=DDPStrategy(),

for

    strategy=FSDPStrategy(sharding_strategy='FULL_SHARD'),

As it runs, note in nvtop that four GPUs are used, with high utilization of the GPU but lower utilization of its memory. Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: OpenLLaMA 7b model on 4x A100 80GB with FSDP and larger batch size

Because of the memory savings achieved by FSDP, we can increase the batch size (and potentially achieve faster training times) without running out of memory.

You may view a100_llama7b_4fsdp_8batch.py in our Github repository.

Inside the container, run:

# run inside pytorch container
python3 a100_llama7b_4fsdp_8batch.py

In this training script, we’ve changed the batch_size to 8.

As it runs, note in nvtop that the GPUs again have high memory utilization. Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

(Optional) Experiment: OpenLLaMA 13b model on 4x A100 80GB with CPU optimizer offload via DeepSpeed

Finally, as an optional experiment, we can try training a much bigger model - the 13B OpenLLaMA model - using a combination of sharding across GPUs (DeepSpeed ZeRO stage 3, which is similar to FSDP’s FULL_SHARD) and CPU offload of the optimizer state.

You may view a100_llama13b_deepspeed.py in our Github repository.

For this experiment, we’ll install deepspeed:

# run inside pytorch container
DS_BUILD_CPU_ADAM=1 pip install deepspeed

and download the 13b model:

# run inside pytorch container
litgpt download openlm-research/open_llama_13b

Now, we can run

# run inside pytorch container
python3 a100_llama13b_deepspeed.py

In this training script, besides replacing the 7B model with the 13B model, we’ve changed the strategy to:

    strategy=DeepSpeedStrategy(
        stage=3,                 # Similar to FULL_SHARD
        offload_optimizer=True   # Enable CPU offloading of optimizer
    ),
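
To get a feel for why the optimizer state is offloaded (and why the GPUs end up waiting on the CPU), here is a rough, illustrative estimate; the exact split depends on DeepSpeed’s configuration:

# Rough, illustrative estimate for a ~13B parameter model trained in bf16 on 4 GPUs.
params, n_gpus = 13e9, 4

weights_gb   = params * 2 / 1e9       # ~26 GB of bf16 weights
gradients_gb = params * 2 / 1e9       # ~26 GB of bf16 gradients
adam_gb      = params * 4 * 2 / 1e9   # ~104 GB of Adam state if kept in fp32

# ZeRO stage 3 shards the weights and gradients across the GPUs (like FULL_SHARD)...
print(f"Sharded weights + gradients per GPU: ~{(weights_gb + gradients_gb) / n_gpus:.0f} GB")   # ~13 GB

# ...while offload_optimizer=True moves the optimizer state and the optimizer step
# to host (CPU) memory, which is much larger but also much slower - which is why
# the GPUs sit idle near the end of each step, waiting for the CPU.
print(f"Optimizer state offloaded to host RAM: ~{adam_gb:.0f} GB")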

As it runs, note in nvtop that especially near the end of the step, the GPUs will be underutilized as they wait for CPU.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Debugging note

Note: if any training job crashes due to OOM, you can ensure all of the distributed processes are stopped by running

# run inside pytorch container
pkill -9 python

Train a large model on multiple GPUs - 4x V100 32GB

In this section, we will practice strategies for training a large model using distributed processes across multiple GPUs. This section requires a host with 4x V100 GPUs with 32GB video RAM.

Note: If you have already done the “Multiple GPU” section on a 4x A100 GPU instance, you will skip this section! This is just an alternative version of the same ideas, executed on different hardware.

After completing this section, you should understand the effect of distributed training strategies - DDP and FSDP - on a large model training job.

You may view the Python code we will execute in this experiment in our Github repository.

You will execute the commands in this section either inside an SSH session on the Chameleon “node-llm” server, or inside a container that runs on this server. You will need two terminals arranged side-by-side or vertically, and in both terminals, use SSH to connect to the “node-llm” server.

Start the container

We will run code inside a container that has PyTorch, NVIDIA CUDA, and the NVIDIA CUDA developer tools installed.

First, make sure there are no other containers running, because we will need exclusive access to the GPUs:

# run on node-llm
docker ps

If any containers are still running, stop them with

# run on node-llm
docker stop CONTAINER

(substituting the container name or ID in place of CONTAINER.)

Then, start this container with

# run on node-llm
docker run -it -v /home/cc/llm-chi/torch:/workspace --gpus all --ipc host pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

Note that the -v argument mounts the /home/cc/llm-chi/torch directory on the host to /workspace inside the container (so our training scripts are available there), --gpus all gives the container access to all of the host’s GPUs, and --ipc host lets the container share the host’s IPC namespace, which PyTorch needs for shared memory between worker processes.

Install software in the container

Inside the container, install a few Python libraries:

# run inside pytorch container
pip install 'litgpt[all]'==0.5.7 'lightning<2.5.0.post0'

and download the foundation model we are going to fine-tune:

# run inside pytorch container
litgpt download TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T

Start nvtop on the host

In your second terminal session, start nvtop, which we will use to monitor the resource usage of the NVIDIA GPUs on the host:

# run on node-llm
nvtop

and leave it running throughout all the experiments in this section.

Experiment: TinyLlama 1.1B model on a single V100 32GB

We previously noted that we can train a TinyLlama 1.1B model on a single GPU with bf16 precision and micro batch size 8, with less than 32GB of RAM (to fit in the memory of our V100).

Now, we’ll repeat this test using the Python API for litgpt instead of its command line interface. You may view v100_llama1b_1device.py in our Github repository. Run it inside the container with:

# run inside pytorch container
python3 v100_llama1b_1device.py

As it runs, note in nvtop that only one GPU is used. We will see that for GPU 0, the GPU utilization is close to 100% and the GPU memory utilization is also high, but the other GPUs have zero utilization. Also note that in the list of processes, there is a single process running on device 0.

Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: TinyLlama 1.1B model on 4x V100 32GB with DDP

Now, we’ll repeat the same experiment with DDP across 4 GPUs! You may view v100_llama1b_4ddp.py in our Github repository. Inside the container, run

# run inside pytorch container
python3 v100_llama1b_4ddp.py

In this training script, we’ve exchanged

    devices=1,

for

    devices=4,
    strategy=DDPStrategy(),

Note that it may take a minute or two for the training job to start.

As it runs, note in nvtop that four GPUs are used, all with high utilization, and that four processes are listed. Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: TinyLlama 1.1B model on 4x V100 32GB with FSDP

With DDP, we have a larger effective batch size (since 4 GPUs process a batch in parallel), but no memory savings. With FSDP, we can shard optimizer state, gradients, and parameters across GPUs, to also reduce the memory required.

You may view v100_llama1b_4fsdp.py in our Github repository.

Inside the container, run:

# run inside pytorch container
python3 v100_llama1b_4fsdp.py

In this training script, we’ve exchanged

    strategy=DDPStrategy(),

for

    strategy=FSDPStrategy(sharding_strategy='FULL_SHARD'),

As it runs, note in nvtop that four GPUs are used, with high utilization of the GPU but lower utilization of its memory. Take a screenshot of this nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Experiment: OpenLLaMA 3B model on 4x V100 32GB with FSDP

Because of the memory savings achieved by FSDP, we can train a larger model without running out of memory.

You may view v100_llama3b_4fsdp.py in our Github repository.

Inside the container, run:

# run inside pytorch container
python3 v100_llama3b_4fsdp.py

In this training script, we’ve changed the model to openlm-research/open_llama_3b and the batch size to 1, and turned off gradient accumulation. Also, since this is slow, we’re training on a smaller fraction of the data than in our previous experiments.

Take a screenshot of the nvtop display while the script is running, for later reference.

When the python3 command finishes running in the container, note the training time (displayed to the right of the progress bar) and the memory usage reported in the output, and take a screenshot for later reference.

Note that we cannot train this model on a single V100 without running out of memory - try

# run inside pytorch container
python3 v100_llama3b_1device.py

which runs on one GPU but otherwise has the same configuration, and observe that you get an OOM error.

Debugging note

Note: if any training job crashes due to OOM, you can ensure all of the distributed processes are stopped by running

# run inside pytorch container
pkill -9 python

Questions about this material? Contact Fraida Fund


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.