System optimizations for serving
We have previously explored model optimizations for serving, which focus specifically on reducing the inference time of a model. However, the overall prediction latency of a machine learning system includes other delays besides the inference time itself - notably, queuing delay.
In this tutorial, we will explore system-level optimizations to improve those other delay elements. We will:
- learn how to wrap a model in an HTTP endpoint using FastAPI
- and explore system-level optimizations for model serving, including concurrency and batching, in Triton Inference Server
To run this experiment, you should have already created an account on Chameleon, and become part of a project. You must also have added your SSH key to the CHI@TACC site.
Experiment resources
For this experiment, we will provision one bare-metal node with two NVIDIA P100 GPUs, using a gpu_p100 node type.
Continue with 1_create_lease.ipynb.
Create a lease for a GPU server
To use bare metal resources on Chameleon, we must reserve them in advance. For this experiment, we will reserve a 2 hour 50 minute block on a bare metal node with 2x P100 GPU.
We can use the OpenStack graphical user interface, Horizon, to submit a lease. To access this interface,
- from the Chameleon website
- click “Experiment” > “CHI@TACC”
- log in if prompted to do so
- check the project drop-down menu near the top left (which shows e.g. “CHI-XXXXXX”), and make sure the correct project is selected.
Then,
- On the left side, click on “Reservations” > “Leases”, and then click on “Host Calendar”. In the “Node type” drop down menu, change the type to gpu_p100 to see the schedule of availability. (However, you should avoid c11-13, on which one GPU is faulty.) You may change the date range setting to “30 days” to see a longer time scale. Note that the dates and times in this display are in UTC.
- Once you have identified an available three-hour block in UTC time that works for you in your local time zone, on the left side, click on the name of the node you want to reserve.
- Set the “Name” to serve_system_netID, replacing netID with your actual net ID.
- Set the start date and time in UTC. To make scheduling smoother, please start your lease on an hour boundary, e.g. XX:00.
- Modify the lease length (in days) until the end date is correct. Then, set the end time. To be mindful of other users, set your lease to 2 hours 50 minutes. Also, to avoid a potential race condition that occurs when one lease starts immediately after another lease ends, you should end your lease ten minutes before the end of an hour, e.g. at YY:50.
- Click “Next”.
- On the “Hosts” tab, confirm that the node you selected is listed in the “Resource properties” section, and click “Next”.
- Then, click “Create”. (We won’t include any network resources in this lease.)
Your lease status should show as “Pending”. Click on the lease to see an overview. It will show the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease. Make sure that the lease details are correct.
Since you will need the full lease time to actually execute your experiment, you should read all of the experiment material ahead of time in preparation, so that you make the best possible use of your time.
At the beginning of your GPU server lease
At the beginning of your GPU lease time, you will continue with the next step, in which you bring up and configure a bare metal instance. To begin this step, open this experiment on Trovi:
- Use this link: System optimizations for serving on Trovi
- Then, click “Launch on Chameleon”. This will start a new Jupyter server for you, with the experiment materials already in it, including the notebook to bring up the bare metal server.
Inside the serve-system-chi directory, continue with 2_create_server_nvidia.ipynb.
Launch and set up NVIDIA P100 x2 - with python-chi
At the beginning of the lease time for your bare metal server, we will bring up our GPU instance. We will use the python-chi Python API to Chameleon to provision our server.
We will execute the cells in this notebook inside the Chameleon Jupyter environment.
Run the following cell, and make sure the correct project is selected. Also change the site to CHI@TACC or CHI@UC, depending on where your reservation is.
# runs in Chameleon Jupyter environment
from chi import server, context, lease
import os
context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@TACC")
Change the string in the following cell to reflect the name of your lease (with your own net ID), then run it to get your lease:
# runs in Chameleon Jupyter environment
l = lease.get_lease(f"serve_system_netID")
l.show()
The status should show as “ACTIVE” now that we are past the lease start time.
The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting “Run” > “Run Selected Cell and All Below” from the Jupyter menu.
As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!
We will use the lease to bring up a server with the CC-Ubuntu24.04-CUDA disk image.
Note: the following cell brings up a server only if you don’t already have one with the same name, regardless of the state of that server. If you already have a server in the ERROR state, delete it first in the Horizon GUI before you run this cell.
# runs in Chameleon Jupyter environment
username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
f"node-serve-system-{username}",
reservation_id=l.node_reservations[0]["id"],
image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)
Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.
Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.
# runs in Chameleon Jupyter environment
s.associate_floating_ip()
# runs in Chameleon Jupyter environment
s.refresh()
s.check_connectivity()
In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).
# runs in Chameleon Jupyter environment
s.refresh()
s.show(type="widget")
Retrieve code and notebooks on the instance
Now, we can use python-chi to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.
# runs in Chameleon Jupyter environment
s.execute("git clone https://github.com/teaching-on-testbeds/serve-system-chi")
Set up Docker
To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLflow and Ray, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.
# runs in Chameleon Jupyter environment
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")
Set up the NVIDIA container toolkit
We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.
# runs in Chameleon Jupyter environment
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
# for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
s.execute("sudo jq 'if has(\"exec-opts\") then . else . + {\"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json")
s.execute("sudo systemctl restart docker")
and install nvtop:
# runs in Chameleon Jupyter environment
s.execute("sudo apt -y install nvtop")
Open an SSH session
Finally, open an SSH session on your server. From your local terminal, run
ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D
where
- in place of ~/.ssh/id_rsa_chameleon, substitute the path to your own key that you had uploaded to CHI@TACC
- in place of A.B.C.D, use the floating IP address you just associated with your instance.
After connecting over SSH, continue with 3_fastapi_setup.ipynb.
Preparing an endpoint in FastAPI
In this section, we will create a FastAPI “wrapper” for our model, so that it can serve inference requests. Once you have finished this section, you should be able to:
- create a FastAPI endpoint for a PyTorch model
- create a FastAPI endpoint for an ONNX model
and run it on CPU or GPU.
PyTorch version
We have previously seen a Flask app that does inference using a pre-trained PyTorch model, and serves a basic browser-based interface for it.
However, to scale up, we will want to separate the model inference service into its own prediction endpoint - that way, we can optimize and scale it separately from the user interface.
Here is the modified version of the Flask app. Instead of loading a model and making predictions, we send a request to a separate service:
def request_fastapi(image_path):
try:
with open(image_path, 'rb') as f:
image_bytes = f.read()
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
payload = {"image": encoded_str}
response = requests.post(f"{FASTAPI_SERVER_URL}/predict", json=payload)
response.raise_for_status()
result = response.json()
predicted_class = result.get("prediction")
probability = result.get("probability")
return predicted_class, probability
except Exception as e:
print(f"Error during inference: {e}")
return None, None
Meanwhile, the inference service has moved into a separate app:
app = FastAPI(
title="Food Classification API",
description="API for classifying food items from images",
version="1.0.0"
)
# Define the request and response models
class ImageRequest(BaseModel):
image: str # Base64 encoded image
class PredictionResponse(BaseModel):
prediction: str
probability: float = Field(..., ge=0, le=1) # Ensures probability is between 0 and 1
# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the Food11 model
MODEL_PATH = "food11.pth"
model = torch.load(MODEL_PATH, map_location=device, weights_only=False)
model.to(device)
model.eval()
# Define class labels
classes = np.array(["Bread", "Dairy product", "Dessert", "Egg", "Fried food",
"Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup", "Vegetable/Fruit"])
# Define the image preprocessing function
def preprocess_image(img):
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
return transform(img).unsqueeze(0)
@app.post("/predict")
def predict_image(request: ImageRequest):
try:
# Decode base64 image
image_data = base64.b64decode(request.image)
image = Image.open(io.BytesIO(image_data)).convert("RGB")
# Preprocess the image
image = preprocess_image(image).to(device)
# Run inference
with torch.no_grad():
output = model(image)
probabilities = F.softmax(output, dim=1) # Apply softmax to get probabilities
predicted_class = torch.argmax(probabilities, 1).item()
confidence = probabilities[0, predicted_class].item() # Get the probability
return PredictionResponse(prediction=classes[predicted_class], probability=confidence)
except Exception as e:
return {"error": str(e)}
Let’s try it now!
Bring up containers
To start, run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml up -d
This will use a Docker Compose file to bring up three containers:
- one container that will host the Flask application, which will serve the web-based user interface of our system
- one container that will host a FastAPI inference endpoint
- and one Jupyter container, which we’ll use to run some benchmarking experiments
To access the Jupyter service, we will need its randomly generated secret token (which secures it from unauthorized access). We’ll get this token by running jupyter server list inside the jupyter container:
# runs on node-serve-system
docker exec jupyter jupyter server list
Look for a line like
http://localhost:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Paste this into a browser tab, but in place of localhost, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.
Then, in the file browser on the left side, open the “work” directory and then click on the 4_fastapi.ipynb notebook to continue.
Benchmarking FastAPI service
Continue here after opening workspace/4_fastapi.ipynb in the Jupyter container.
Let’s test this service. First, we’ll test the FastAPI endpoint directly. In a browser, open
http://A.B.C.D:8000/docs
but substitute the floating IP assigned to your instance. This will bring up the Swagger UI associated with the FastAPI endpoint.
Click on “predict” and then “Try it out”. Here, we can enter a request to send to the FastAPI endpoint, and see its response.
Our request needs to be in the form of a base64-encoded image. Run
# runs inside the Jupyter container on node-serve-system
import base64
image_path = "test_image.jpeg"
with open(image_path, 'rb') as f:
image_bytes = f.read()
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
print('"' + encoded_str + '"')
to get the encoded image string. Copy the output of that cell. (After you copy it, you can right-click and clear the cell output, so it won’t clutter up the notebook interface.)
Then, in
{
"image": "string"
}
replace “string” with the encoded image string you just copied. Press “Execute”.
You should see that the server returns a response with code 200 (that’s the response code for a successful request) and a response body like:
{
"prediction": "Vegetable/Fruit",
"probability": 0.9940803647041321
}
so we can see that it performed inference successfully on the test input.
Next, let’s check the integration of the FastAPI endpoint in our Flask app. In your browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
Now that we know everything works, let’s get some quick performance numbers from this server. We’ll send some requests directly to the FastAPI endpoint and measure the time to get a response.
# runs inside the Jupyter container on node-serve-system
import requests
import time
import numpy as np
# runs inside the Jupyter container on node-serve-system
FASTAPI_URL = "http://fastapi_server:8000/predict"
payload = {"image": encoded_str}
num_requests = 100
inference_times = []
for _ in range(num_requests):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
inference_times.append(end_time - start_time)
else:
print(f"Error: {response.status_code}, Response: {response.text}")
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / inference_times.sum()
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
ONNX version
We know from our previous experiments that the vanilla PyTorch model may not be optimized for inference speed.
Let’s try porting our FastAPI endpoint to ONNX.
On the “node-serve-system” host, edit the Docker compose file:
# runs on node-serve-system
nano ~/serve-system-chi/docker/docker-compose-fastapi.yaml
and modify
context: /home/cc/serve-system-chi/fastapi_pt
to
context: /home/cc/serve-system-chi/fastapi_onnx
to build the FastAPI container image from the “fastapi_onnx” directory, instead of the “fastapi_pt” directory.
Save your changes (Ctrl+O, Enter, Ctrl+X). Rebuild the container image:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml build fastapi_server
and recreate the container with the new image:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml up fastapi_server --force-recreate -d
Repeat the same steps as before to test the FastAPI endpoint and its integration with Flask.
Then, re-do our quick benchmark.
# runs inside the Jupyter container on node-serve-system
FASTAPI_URL = "http://fastapi_server:8000/predict"
payload = {"image": encoded_str}
num_requests = 100
inference_times = []
for _ in range(num_requests):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
inference_times.append(end_time - start_time)
else:
print(f"Error: {response.status_code}, Response: {response.text}")
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / inference_times.sum()
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
Our FastAPI endpoint can maintain low latency, as long as only one user is sending requests to the service.
However, when there are multiple concurrent requests, it will be much slower. For example, suppose we start 16 “senders” at the same time, each continuously sending a new request as soon as it gets a response for the last one:
# runs inside the Jupyter container on node-serve-system
import concurrent.futures
def send_request(payload):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
return end_time - start_time
else:
print(f"Error: {response.status_code}, Response: {response.text}")
return None
def run_concurrent_tests(num_requests, payload, max_workers=10):
inference_times = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [executor.submit(send_request, payload) for _ in range(num_requests)]
for future in concurrent.futures.as_completed(futures):
result = future.result()
if result is not None:
inference_times.append(result)
return inference_times
num_requests = 1000
start_time = time.time()
inference_times = run_concurrent_tests(num_requests, payload, max_workers=16)
total_time = time.time() - start_time
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / total_time
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
When a request arrives at the server and finds it busy processing another request, it waits in a queue until it can be served. This queuing delay can be a significant part of the overall prediction delay, when there is a high degree of concurrency. We will attempt to address this in the next section!
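As a rough back-of-the-envelope check (an illustrative sketch, not part of the original benchmark): if a single worker serves one request at a time with a fixed inference time s, then with c continuously sending clients each request waits behind roughly c - 1 others, so latency grows like c × s while throughput stays near 1/s.
# runs inside the Jupyter container on node-serve-system
# Illustrative estimate only - substitute your own measured single-client median for s.
s = 0.010  # assumed single-client inference time in seconds
for c in [1, 8, 16]:
    est_latency = c * s        # queue wait plus own service time
    est_throughput = 1 / s     # a single worker serves at most 1/s requests per second
    print(f"concurrency {c}: latency ~ {1000*est_latency:.0f} ms, throughput ~ {est_throughput:.0f} req/s")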
In the meantime, download this entire notebook for later reference.
Then, bring down your current inference service with:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml down
Using a Triton Inference Server
Triton Inference Server is an open-source project by NVIDIA for high-performance ML model deployment. In this section, we will practice deploying models using Triton; after you have finished, you should be able to:
- serve a model using Triton Inference Server with Python backend
- use dynamic batching to improve performance
- scale your model to run on multiple GPUs, and/or with multiple instances on the same GPU
- benchmark the Triton service, and recognize indications of potential problems
- and use optimized backends
Anatomy of a Triton model with Python backend
To start, run
# runs on node-serve-system
mkdir ~/serve-system-chi/models/
cp -r ~/serve-system-chi/models_staging/food_classifier ~/serve-system-chi/models/
to copy our first configuration into the directory from which Triton will load models.
Our initial implementation serves our food image classifier using PyTorch. Here’s how it works.
In the Dockerfile, the Triton server is started with the command
tritonserver --model-repository=/models
where the /models directory is organized as follows:
models/
└── food_classifier
├── 1
│ ├── food11.pth
│ └── model.py
└── config.pbtxt
It includes:
- a top-level directory whose name is the “model name”
- a configuration file config.pbtxt inside that directory. We’ll look at that shortly.
- and a subdirectory for each model version. We have model version 1, so we have a subdirectory 1. Inside this directory is a model.py, which describes how the model will run.
Let’s look at the configuration file first. Here are the contents of config.pbtxt:
name: "food_classifier"
backend: "python"
max_batch_size: 16
input [
{
name: "INPUT_IMAGE"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "FOOD_LABEL"
data_type: TYPE_STRING
dims: [1]
},
{
name: "PROBABILITY"
data_type: TYPE_FP32
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
We have defined:
- a name, which must match the directory name
- a backend - we are using the basic Python backend. This is a highly flexible backend which allows us to define how our model will run by providing Python code in a model.py file.
- a max_batch_size - we have set it to 16, but generally you would set this according to the GPU memory available
- the name, data_type, and dims (dimensions) of each input to the model
- the name, data_type, and dims (dimensions) of each output from the model
- an instance_group with the count (number of copies of the model that we want to serve) and details of the device we want to serve it on (we will serve it on GPU 0). Note that to run the model on CPU instead, we could have used
instance_group [
{
count: 1
kind: KIND_CPU
}
]
Next, let’s look at model.py. For a Triton model with Python backend, model.py must define a class named TritonPythonModel that implements an execute method (initialize and finalize are optional). Ours has:
- An initialize method to load the model, move it to the device specified in the args passed from the Triton server, and put it in inference mode. This will run as soon as Triton starts and loads models from the directory passed to it:
def initialize(self, args):
model_dir = os.path.dirname(__file__)
model_path = os.path.join(model_dir, "food11.pth")
# From args, get info about what device the model is supposed to be on
instance_kind = args.get("model_instance_kind", "cpu").lower()
if instance_kind == "gpu":
device_id = int(args.get("model_instance_device_id", 0))
torch.cuda.set_device(device_id)
self.device = torch.device(f"cuda:{device_id}" if torch.cuda.is_available() else 'cpu')
else:
self.device = torch.device('cpu')
self.model = torch.load(model_path, map_location=self.device, weights_only=False)
self.model.to(self.device)
self.model.eval()
self.transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
self.classes = np.array([
"Bread", "Dairy product", "Dessert", "Egg", "Fried food",
"Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup",
"Vegetable/Fruit"
])
- A preprocess method, which will run on each input image that is passed:
def preprocess(self, image_data):
if isinstance(image_data, str):
image_data = base64.b64decode(image_data)
elif isinstance(image_data, bytes):
image_data = image_data.decode("utf-8")
image_data = base64.b64decode(image_data)
image = Image.open(io.BytesIO(image_data)).convert('RGB')
img_tensor = self.transform(image).unsqueeze(0)
return img_tensor
- and an execute method, which will apply to batches of requests sent to this model:
def execute(self, requests):
# Gather inputs from all requests
batched_inputs = []
for request in requests:
in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_IMAGE")
input_data_array = in_tensor.as_numpy() # each assumed to be shape [1]
# Preprocess each input (resulting in a tensor of shape [1, C, H, W])
batched_inputs.append(self.preprocess(input_data_array[0, 0]))
# Combine inputs along the batch dimension
batched_tensor = torch.cat(batched_inputs, dim=0).to(self.device)
print("BatchSize: ", len(batched_inputs))
# Run inference once on the full batch
with torch.no_grad():
outputs = self.model(batched_tensor)
# Process the outputs and split them for each request
responses = []
for i, request in enumerate(requests):
output = outputs[i:i+1] # select the i-th output
prob, predicted_class = torch.max(output, 1)
predicted_label = self.classes[predicted_class.item()]
probability = torch.sigmoid(prob).item()
# Create numpy arrays with shape [1, 1] for consistency.
out_label_np = np.array([[predicted_label]], dtype=object)
out_prob_np = np.array([[probability]], dtype=np.float32)
out_tensor_label = pb_utils.Tensor("FOOD_LABEL", out_label_np)
out_tensor_prob = pb_utils.Tensor("PROBABILITY", out_prob_np)
inference_response = pb_utils.InferenceResponse(
output_tensors=[out_tensor_label, out_tensor_prob])
responses.append(inference_response)
return responses
Finally, now that we understand how the server works, let’s look at how the Flask app sends requests to it. The Flask app now has a function that is called whenever a new image is uploaded for prediction or testing, and sends the image to the Triton server:
def request_triton(image_path):
try:
# Connect to Triton server
triton_client = httpclient.InferenceServerClient(url=TRITON_SERVER_URL)
# Prepare inputs and outputs
with open(image_path, 'rb') as f:
image_bytes = f.read()
inputs = []
inputs.append(httpclient.InferInput("INPUT_IMAGE", [1, 1], "BYTES"))
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
input_data = np.array([[encoded_str]], dtype=object)
inputs[0].set_data_from_numpy(input_data)
outputs = []
outputs.append(httpclient.InferRequestedOutput("FOOD_LABEL", binary_data=False))
outputs.append(httpclient.InferRequestedOutput("PROBABILITY", binary_data=False))
# Run inference
results = triton_client.infer(model_name=FOOD11_MODEL_NAME, inputs=inputs, outputs=outputs)
predicted_class = results.as_numpy("FOOD_LABEL")[0,0]
probability = results.as_numpy("PROBABILITY")[0,0]
return predicted_class, probability
except Exception as e:
print(f"Error during inference: {e}")
return None, None
Bring up containers
To start, run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up -d
This uses a Docker Compose configuration to bring up three containers:
- one container with NVIDIA Triton Server, with the host’s GPUs passed to the container, and with the models directory (containing the model and its configuration) passed as a bind mount
- one container that hosts the Flask app, which will serve the user interface and send inference requests to the Triton server
- one Jupyter container with the Triton client installed, for us to conduct a performance evaluation of the Triton server
Building the NVIDIA Triton Server container image for the first time normally takes about 20 minutes.
Watch the logs from the Triton server as it starts up:
# runs on node-serve-system
docker logs triton_server -f
Once the Triton server starts up, you should see something like
+--------------------------+---------+--------+
| Model | Version | Status |
+--------------------------+---------+--------+
| food_classifier | 1 | READY |
+--------------------------+---------+--------+
and then some additional output. Near the end, you will see
"Started GRPCInferenceService at 0.0.0.0:8001"
"Started HTTPService at 0.0.0.0:8000"
"Started Metrics Service at 0.0.0.0:8002"
(and then some messages about not getting GPU power consumption, which is fine and not a concern.)
You can use Ctrl+C to stop watching the logs once you see this output.
Let’s test this service. In a browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
To access the Jupyter service, we will need its randomly generated secret token (which secures it from unauthorized access). We’ll get this token by running jupyter server list inside the jupyter container:
# runs on node-serve-system
docker exec jupyter jupyter server list
Look for a line like
http://localhost:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Paste this into a browser tab, but in place of localhost, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.
Then, in the file browser on the left side, open the “work” directory and then click on the 6_triton.ipynb notebook to continue.
Meanwhile, on the host, run
# runs on node-serve-system
nvtop
to monitor GPU usage - we will refer back to this a few times as we run through the rest of this notebook.
Benchmarking Triton service
Continue here after opening workspace/6_triton.ipynb in the Jupyter container.
Serving a PyTorch model
The Triton client comes with a performance analyzer, which we can use to send requests to the server and get some statistics back. Let’s try it:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1
Make a note of the line showing the total average request latency, and the breakdown including:
- queue, the queuing delay
- and compute infer, the inference delay
Let’s further exercise this service. In the command above, a single client sends continuous requests to the server - each time a response is returned, a new request is generated. Now, let’s configure 8 concurrent clients, each sending continuous requests - as soon as any client gets a response, it sends a new request:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
While the inference time (compute infer) is similar to the previous example, the overall system latency is high because of queue delay. Only one sample is processed at a time, and other samples have to wait in a queue for their turn. Here, since there are 8 concurrent clients sending continuous requests, the delay is approximately 8x the inference delay.
With more concurrent requests, the queuing delay would grow even larger:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 16
Although the delay is large (over 100 ms), it’s not because of inadequate compute - if you check the nvtop display on the host while the test above is running, you will note low GPU utilization! Take a screenshot of the nvtop output when this test is running.
We could get more throughput without increasing prediction latency, by batching requests. Here, we have a single client sending requests in batches of 16 at a time:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 16 --concurrency-range 1
We can see that a batch of 16 requests doesn’t have much higher inference time than a single request. The throughput is substantially higher when we can serve in batches.
But, that’s not very helpful in a situation when requests come from individual users, one at a time.
Scaling up PyTorch model
One potential way to improve performance is to scale up! Let’s edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
and change
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
to run two instances on GPU 0 and two instances on GPU 1:
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1 ]
}
]
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
On the host, run
# runs on node-serve-system
nvidia-smi
and note that there are two instances of triton_python_backend processes running on GPU 0, and two on GPU 1.
Then, benchmark this service with increased concurrency:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
There is still some queuing delay (because our degree of concurrency, 8, is still higher than the number of server instances, 4), and furthermore, the inference time is also increased due to sharing the compute resources. However, the prediction delay is on the order of 10s of ms - not over 100ms, like it was previously with concurrency 8!
Also, if you look at the nvtop output on the host while running this test, you will observe higher GPU utilization than before (which is good! We want to use the GPU. Underutilization is bad.) (Take a screenshot!) However, we are still not fully utilizing the GPU.
Let’s try increasing the number of instances again. Edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
and change
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1 ]
}
]
to
instance_group [
{
count: 4
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 4
kind: KIND_GPU
gpus: [ 1 ]
}
]
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Then, re-run our benchmark:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
This makes things worse - our inference time is higher, even though we are still underutilizing the GPU (as seen in nvtop) (take a screenshot!).
Our system is not limited by the GPU - it is underutilized. Instead, we are limited by the overhead of the Python backend and of our model.py implementation.
Serving an ONNX model
The Python backend we have been using is flexible, but not necessarily the most performant. To get better performance, we will use one of the highly optimized backends available in Triton. Since we already have an ONNX model, let’s use the ONNX backend.
To serve a model using the ONNX backend, we will create a directory structure like this:
food_classifier_onnx/
├── 1
│ └── model.onnx
└── config.pbtxt
There is no model.py anymore - Triton serves the model directly; we just have to name the model file model.onnx. In config.pbtxt, we will specify the backend as onnxruntime:
name: "food_classifier_onnx"
backend: "onnxruntime"
max_batch_size: 16
input [
{
name: "input" # has to match ONNX model's input name
data_type: TYPE_FP32
dims: [3, 224, 224] # has to match ONNX input shape
}
]
output [
{
name: "output" # has to match ONNX model output name
data_type: TYPE_FP32 # output is a list of probabilities
dims: [11] #
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
Copy this to Triton’s models directory:
# runs on node-serve-system
cp -r ~/serve-system-chi/models_staging/food_classifier_onnx ~/serve-system-chi/models/
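For context, the input and output names and shapes in config.pbtxt are fixed when the model is exported to ONNX. Here is a sketch of what such an export might look like (hypothetical - the model.onnx used here was produced in the earlier model optimization material, and the export call there may differ):
# sketch only - shows how input/output names and a dynamic batch axis are set at export time
import torch

dummy_input = torch.randn(1, 3, 224, 224)   # one RGB image, 224x224
torch.onnx.export(
    model,                                   # the trained Food11 PyTorch model
    dummy_input,
    "model.onnx",
    input_names=["input"],                   # must match the input "name" in config.pbtxt
    output_names=["output"],                 # must match the output "name" in config.pbtxt
    dynamic_axes={"input": {0: "batch_size"},
                  "output": {0: "batch_size"}},  # variable batch dimension, up to max_batch_size
)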
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready. Note that the server will load two models: the original food_classifier with Python backend, and the food_classifier_onnx model we just added.
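If you prefer to check readiness programmatically rather than watching the logs, Triton also exposes standard health endpoints over HTTP; a quick sketch from the Jupyter container:
# runs inside the Jupyter container on node-serve-system
import requests

# server-level readiness: returns 200 when the server is up
print(requests.get("http://triton_server:8000/v2/health/ready").status_code)
# per-model readiness: returns 200 when that model is loaded and ready
print(requests.get("http://triton_server:8000/v2/models/food_classifier/ready").status_code)
print(requests.get("http://triton_server:8000/v2/models/food_classifier_onnx/ready").status_code)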
Let’s benchmark our service. Our ONNX model won’t accept image bytes directly - it expects images that already have been pre-processed into arrays. So, our benchmark command will be a little bit different:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 1
This model has much better inference performance than our PyTorch model with Python backend did, in a similar test. Also, if we monitor with nvtop, we should see higher GPU utilization while the test is running (which is a good thing!) (Take a screenshot!)
Scaling up ONNX model
Let’s try scaling this model up. Edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and change
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
to
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ]
}
]
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Then, run our benchmark with higher concurrency. (2 instances on each GPU, because we noticed that a single instance used less than half a GPU.)
(Note that in this example and the following one, we limit the number of requests sent by perf_analyzer; this is necessary because of measurement instability under high concurrency.)
Watch the nvtop output as you run this test! (Take a screenshot!)
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 8 --warmup-request-count 500 --request-count 20000
This time, we should see that our model is fully utilizing the GPU (that’s good!) And, our system performance is much better than the PyTorch model with Python backend could achieve with concurrency 8. We still have very little queuing delay.
Let’s see how we do with even higher concurrency.
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 16 --warmup-request-count 500 --request-count 20000
We still have some queuing delay - the average request waits longer in the queue than its actual service time! - since the rate at which requests arrive is greater than the service rate of the models.
But, we can feel good that we are no longer underutilizing the GPUs (as evidenced by nvtop output)!
There’s one more issue we should address: our ONNX model doesn’t directly work with our Flask server now, because the inputs and outputs are different. The ONNX model expects a pre-processed array, and returns a list of class probabilities.
Since the pre-processing and post-processing don’t need the GPU anyway, we’ll move them to the Flask app.
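Roughly, the client-side path will look something like the sketch below (hypothetical - the triton_onnx branch of the Flask app may differ in details): preprocess the image into a [1, 3, 224, 224] float array, send it to food_classifier_onnx, and apply softmax to the returned scores.
# sketch of client-side pre- and post-processing for the ONNX model (illustrative only)
import numpy as np
import tritonclient.http as httpclient
from PIL import Image

def classify_onnx(image_path, url="triton_server:8000"):
    # Preprocess: resize, scale to [0, 1], normalize, convert to NCHW float32
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
    x = x.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)   # shape [1, 3, 224, 224]

    client = httpclient.InferenceServerClient(url=url)
    inp = httpclient.InferInput("input", list(x.shape), "FP32")
    inp.set_data_from_numpy(x)
    out = httpclient.InferRequestedOutput("output", binary_data=False)
    result = client.infer(model_name="food_classifier_onnx", inputs=[inp], outputs=[out])

    # Postprocess: softmax over the 11 class scores, then take the most likely class
    scores = result.as_numpy("output")[0]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), float(probs.max())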
Edit the Docker compose file:
# runs on node-serve-system
nano ~/serve-system-chi/docker/docker-compose-triton.yaml
and change
flask:
build:
context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton
to
flask:
build:
context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton_onnx
to use a version of our Flask app where the pre- and post-processing is built in. Also change
- FOOD11_MODEL_NAME=food_classifier
to
- FOOD11_MODEL_NAME=food_classifier_onnx
so that our Flask app will send requests to the new ONNX model service.
Then run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build flask
to re-build the container image, and
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up flask --force-recreate -d
to restart the Flask container with the new image.
Let’s test this service. In a browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
Dynamic batching with ONNX model
Until now, we have been working to reduce delay when there is a high, but steady, flow of requests arriving at the service.
In most realistic cases, however, the rate at which requests arrive is variable. Some time may pass with only a couple of requests, and then suddenly a burst of requests arrive. This is more challenging, because the same average request rate that is easily served with a constant interarrival pattern can have queuing delay when the arrivals are bursty.
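To build some intuition before benchmarking, here is a tiny single-server queue simulation (an illustrative sketch with an assumed fixed service time, not part of the original notebook) comparing constant and Poisson arrivals at the same average rate:
# runs inside the Jupyter container on node-serve-system
# Illustrative simulation: one server, fixed service time, same average arrival rate,
# constant vs. Poisson (exponential interarrival) request arrivals.
import numpy as np

rng = np.random.default_rng(0)
rate = 120          # average requests per second
service = 0.006     # assumed fixed service time in seconds (i.e. capacity ~166 req/s)
n = 20000

for name, gaps in [("constant", np.full(n, 1 / rate)),
                   ("poisson", rng.exponential(1 / rate, n))]:
    arrivals = np.cumsum(gaps)
    finish = 0.0
    waits = []
    for t in arrivals:
        start = max(t, finish)      # wait if the server is still busy
        waits.append(start - t)
        finish = start + service
    print(f"{name}: mean wait {1000*np.mean(waits):.2f} ms, 99th percentile {1000*np.percentile(waits, 99):.2f} ms")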
Let us explore this further in this section.
First, open the config
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and let’s change back
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ]
}
]
to
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
so we will work with just one model instance again.
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Now we will benchmark with perf_analyzer again. But,
- instead of scaling up load with a higher
--concurrency-range, we will scale with--request-rate-range(which defines the average number of requests per second), - and we can vary the
--request-distributionbetweenconstantinterarrival time andpoisson.
(Note: when we set a request rate, the throughput will never be higher than that rate, since throughput measures requests served per second. We will ignore these throughput measurements, since they reflect the request pattern and not the server capacity.)
Let’s first try sending 120 requests per second with a constant interarrival pattern. We know from our earlier tests that with one model instance, the server is still capable of processing requests at this rate:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution constant
Then, repeat with a Poisson arrival process:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution poisson
With Poisson arrivals at the same average rate, requests sometimes arrive in bursts and sometimes with gaps. The bursts cause queue buildup, leading to much queue delay even though the average rate is the same.
This problem is not as easily addressed by provisioning more instances. Scaling out instances for bursty traffic is expensive and still leaves servers underutilized between spikes. Instead, we will try dynamic batching.
Earlier, we noted that our model can achieve higher throughput with low latency by performing inference on batches of input samples, instead of individual samples. But, our client sends requests with individual samples.
When requests arrive in a burst and are queued, however, we can batch them and then send them to the server as a batch, instead of in sequence. In other words, if the server is ready to handle the next request, and it finds four requests waiting in the queue, it should serve those four as a batch instead of just taking the next request in line. This approach absorbs short-term request bursts without constant overprovisioning.
Let’s edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and at the end, add
dynamic_batching {
preferred_batch_size: [4, 6, 8]
max_queue_delay_microseconds: 100
}
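Roughly speaking, these two settings tell Triton how to form batches out of the queue: it will delay a request by at most max_queue_delay_microseconds if waiting would let it reach one of the preferred_batch_size values; otherwise, it serves whatever is already queued. With a delay budget of only 100 microseconds, we are barely delaying any individual request - we are mostly just batching together requests that are already waiting.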
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Before we benchmark this service again, let’s get some pre-benchmark stats about how many requests have been served, broken down by batch size. (If you’ve just restarted the server, it would be zero!)
# runs inside the Jupyter container on node-serve-system
curl -s http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats | python -m json.tool
Then, run the benchmark again with Poisson arrivals:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution poisson
and get per-batch stats again:
# runs inside the Jupyter container on node-serve-system
curl -s http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats | python -m json.tool
Observe that the stats show that some requests were served in batch sizes greater than 1, even though each client sent a single request at a time.
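If you'd rather see a compact summary than the raw JSON, here is a short sketch that tallies model executions by batch size (assuming the stats payload has the usual model_stats / batch_stats layout):
# runs inside the Jupyter container on node-serve-system
import requests

stats = requests.get("http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats").json()
for b in stats["model_stats"][0].get("batch_stats", []):
    # "count" is the number of model executions that ran at this batch size
    print(f"batch size {b['batch_size']}: {b['compute_infer']['count']} executions")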
When the average queuing delay is still low, we may not see much improvement in overall latency due to dynamic batching. Under these circumstances, even with dynamic batching on, a request that arrives while the server is busy will still have to wait (on average) for half of an inference time. But, watch what happens when we scale up the request rate:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 180 --request-distribution poisson
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 240 --request-distribution poisson
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 300 --request-distribution poisson
Even as we increase the request rate, the average request will still only wait half of a service time, because once the request that is currently in service finishes, every request waiting in the queue is processed as a batch.
(In fact, we may even see less overall latency for higher request rates, because the GPU remains “warm”.)
When you have finished, download this entire notebook for later reference.
Then, bring down your current inference service with:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml down
Questions about this material? Contact Fraida Fund
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.