System optimizations for serving
We have previously explored model optimizations for serving, which focus specifically on reducing the inference time of a model. However, the overall prediction latency of a machine learning system includes other delays besides the inference time itself - notably, queuing delay.
In this tutorial, we will explore system-level optimizations to improve those other delay elements. We will:
- learn how to wrap a model in an HTTP endpoint using FastAPI
- and explore system-level optimizations for model serving, including concurrency and batching, in Triton Inference Server
To run this experiment, you should have already created an account on Chameleon, and become part of a project. You must also have added your SSH key to the CHI@TACC site.
Experiment resources
For this experiment, we will provision one bare-metal node with two NVIDIA P100 GPUs, using a gpu_p100 node type.
Continue with 1_create_lease.ipynb.
Create a lease for a GPU server
To use bare metal resources on Chameleon, we must reserve them in advance. For this experiment, we will reserve a 2 hour 50 minute block on a bare metal node with 2x P100 GPU.
We can use the OpenStack graphical user interface, Horizon, to submit a lease. To access this interface,
- from the Chameleon website
- click “Experiment” > “CHI@TACC”
- log in if prompted to do so
- check the project drop-down menu near the top left (which shows e.g. “CHI-XXXXXX”), and make sure the correct project is selected.
Then,
- On the left side, click on “Reservations” > “Leases”, and then click on “Host Calendar”. In the “Node type” drop down menu, change the type to gpu_p100 to see the schedule of availability. (However, you should avoid c11-13, on which one GPU is faulty.) You may change the date range setting to “30 days” to see a longer time scale. Note that the dates and times in this display are in UTC.
- Once you have identified an available three-hour block in UTC time that works for you in your local time zone, on the left side, click on the name of the node you want to reserve.
- Set the “Name” to serve_system_netID, replacing netID with your actual net ID.
- Set the start date and time in UTC. To make scheduling smoother, please start your lease on an hour boundary, e.g. XX:00.
- Modify the lease length (in days) until the end date is correct. Then, set the end time. To be mindful of other users, set your lease to 2 hours 50 minutes. Also, to avoid a potential race condition that occurs when one lease starts immediately after another lease ends, you should end your lease ten minutes before the end of an hour, e.g. at YY:50.
- Click “Next”.
- On the “Hosts” tab, confirm that the node you selected is listed in the “Resource properties” section, and click “Next”.
- Then, click “Create”. (We won’t include any network resources in this lease.)
Your lease status should show as “Pending”. Click on the lease to see an overview. It will show the start time and end time, and it will show the name of the physical host that is reserved for you as part of your lease. Make sure that the lease details are correct.
Since you will need the full lease time to actually execute your experiment, you should read all of the experiment material ahead of time in preparation, so that you make the best possible use of your time.
At the beginning of your GPU server lease
At the beginning of your GPU lease time, you will continue with the next step, in which you bring up and configure a bare metal instance. To begin this step, open this experiment on Trovi:
- Use this link: System optimizations for serving on Trovi
- Then, click “Launch on Chameleon”. This will start a new Jupyter server for you, with the experiment materials already in it, including the notebook to bring up the bare metal server.
Inside the serve-system-chi directory, continue with 2_create_server_nvidia.ipynb.
Launch and set up NVIDIA P100 x2 - with python-chi
At the beginning of the lease time for your bare metal server, we will bring up our GPU instance. We will use the python-chi Python API to Chameleon to provision our server.
We will execute the cells in this notebook inside the Chameleon Jupyter environment.
Run the following cell, and make sure the correct project is selected. Also change the site to CHI@TACC or CHI@UC, depending on where your reservation is.
# runs in Chameleon Jupyter environment
from chi import server, context, lease
import os
context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@TACC")
Change the string in the following cell to reflect the name of your lease (with your own net ID), then run it to get your lease:
# runs in Chameleon Jupyter environment
l = lease.get_lease(f"serve_system_netID")
l.show()
The status should show as “ACTIVE” now that we are past the lease start time.
The rest of this notebook can be executed without any interactions from you, so at this point, you can save time by clicking on this cell, then selecting “Run” > “Run Selected Cell and All Below” from the Jupyter menu.
As the notebook executes, monitor its progress to make sure it does not get stuck on any execution error, and also to see what it is doing!
We will use the lease to bring up a server with the CC-Ubuntu24.04-CUDA disk image.
Note: the following cell brings up a server only if you don’t already have one with the same name, regardless of the state of that server. If you already have a server in the ERROR state, delete it first in the Horizon GUI before you run this cell.
# runs in Chameleon Jupyter environment
username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
f"node-serve-system-{username}",
reservation_id=l.node_reservations[0]["id"],
image_name="CC-Ubuntu24.04-CUDA"
)
s.submit(idempotent=True)
Note: security groups are not used at Chameleon bare metal sites, so we do not have to configure any security groups on this instance.
Then, we’ll associate a floating IP with the instance, so that we can access it over SSH.
# runs in Chameleon Jupyter environment
s.associate_floating_ip()
# runs in Chameleon Jupyter environment
s.refresh()
s.check_connectivity()
In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).
# runs in Chameleon Jupyter environment
s.refresh()
s.show(type="widget")
Retrieve code and notebooks on the instance
Now, we can use python-chi to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.
# runs in Chameleon Jupyter environment
s.execute("git clone https://github.com/teaching-on-testbeds/serve-system-chi")
Set up Docker
To use common deep learning frameworks like TensorFlow or PyTorch, and ML training platforms like MLflow and Ray, we can run containers that have all the prerequisite libraries necessary for these frameworks. Here, we will set up the container framework.
# runs in Chameleon Jupyter environment
s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")
Set up the NVIDIA container toolkit
We will also install the NVIDIA container toolkit, with which we can access GPUs from inside our containers.
# runs in Chameleon Jupyter environment
s.execute("curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
s.execute("sudo apt update")
s.execute("sudo apt-get install -y nvidia-container-toolkit")
s.execute("sudo nvidia-ctk runtime configure --runtime=docker")
# for https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
s.execute("sudo jq 'if has(\"exec-opts\") then . else . + {\"exec-opts\": [\"native.cgroupdriver=cgroupfs\"]} end' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp > /dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json")
s.execute("sudo systemctl restart docker")
and install nvtop:
# runs in Chameleon Jupyter environment
s.execute("sudo apt -y install nvtop")
Open an SSH session
Finally, open an SSH session on your server. From your local terminal, run
ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D
where
- in place of ~/.ssh/id_rsa_chameleon, substitute the path to your own key that you had uploaded to CHI@TACC
- in place of A.B.C.D, use the floating IP address you just associated with your instance.
After connecting over SSH, continue with 3_fastapi_setup.ipynb.
Preparing an endpoint in FastAPI
In this section, we will create a FastAPI “wrapper” for our model, so that it can serve inference requests. Once you have finished this section, you should be able to:
- create a FastAPI endpoint for a PyTorch model
- create a FastAPI endpoint for an ONNX model
and run it on CPU or GPU.
PyTorch version
We have previously seen a Flask app that does inference using a pre-trained PyTorch model, and serves a basic browser-based interface for it.
However, to scale up, we will want to separate the model inference service into its own prediction endpoint - that way, we can optimize and scale it separately from the user interface.
Here is the modified version of the Flask app. Instead of loading a model and making predictions, we send a request to a separate service:
def request_fastapi(image_path):
try:
with open(image_path, 'rb') as f:
image_bytes = f.read()
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
payload = {"image": encoded_str}
response = requests.post(f"{FASTAPI_SERVER_URL}/predict", json=payload)
response.raise_for_status()
result = response.json()
predicted_class = result.get("prediction")
probability = result.get("probability")
return predicted_class, probability
except Exception as e:
print(f"Error during inference: {e}")
return None, None
Meanwhile, the inference service has moved into a separate app:
app = FastAPI(
title="Food Classification API",
description="API for classifying food items from images",
version="1.0.0"
)
# Define the request and response models
class ImageRequest(BaseModel):
image: str # Base64 encoded image
class PredictionResponse(BaseModel):
prediction: str
probability: float = Field(..., ge=0, le=1) # Ensures probability is between 0 and 1
# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the Food11 model
MODEL_PATH = "food11.pth"
model = torch.load(MODEL_PATH, map_location=device, weights_only=False)
model.to(device)
model.eval()
# Define class labels
classes = np.array(["Bread", "Dairy product", "Dessert", "Egg", "Fried food",
"Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup", "Vegetable/Fruit"])
# Define the image preprocessing function
def preprocess_image(img):
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
return transform(img).unsqueeze(0)
@app.post("/predict")
def predict_image(request: ImageRequest):
try:
# Decode base64 image
image_data = base64.b64decode(request.image)
image = Image.open(io.BytesIO(image_data)).convert("RGB")
# Preprocess the image
image = preprocess_image(image).to(device)
# Run inference
with torch.no_grad():
output = model(image)
probabilities = F.softmax(output, dim=1) # Apply softmax to get probabilities
predicted_class = torch.argmax(probabilities, 1).item()
confidence = probabilities[0, predicted_class].item() # Get the probability
return PredictionResponse(prediction=classes[predicted_class], probability=confidence)
except Exception as e:
return {"error": str(e)}
Let’s try it now!
Bring up containers
To start, run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml up -d
This will use a Docker Compose file to bring up three containers:
- one container that will host the Flask application, which will serve the web-based user interface of our system
- one container that will host a FastAPI inference endpoint
- and one Jupyter container, which we’ll use to run some benchmarking experiments
To access the Jupyter service, we will need its randomly generated secret token (which secures it from unauthorized access). We’ll get this token by running jupyter server list inside the jupyter container:
# runs on node-serve-system
docker exec jupyter jupyter server list
Look for a line like
http://localhost:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Paste this into a browser tab, but in place of localhost, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.
Then, in the file browser on the left side, open the “work” directory and then click on the 4_fastapi.ipynb notebook to continue.
Benchmarking FastAPI service
Continue here after opening workspace/4_fastapi.ipynb in the Jupyter container.
Let’s test this service. First, we’ll test the FastAPI endpoint directly. In a browser, open
http://A.B.C.D:8000/docs
but substitute the floating IP assigned to your instance. This will bring up the Swagger UI associated with the FastAPI endpoint.
Click on “predict” and then “Try it out”. Here, we can enter a request to send to the FastAPI endpoint, and see its response.
Our request needs to be in the form of a base64-encoded image. Run
# runs inside the Jupyter container on node-serve-system
import base64
image_path = "test_image.jpeg"
with open(image_path, 'rb') as f:
image_bytes = f.read()
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
print('"' + encoded_str + '"')
to get the encoded image string. Copy the output of that cell. (After you copy it, you can right-click and clear the cell output, so it won’t clutter up the notebook interface.)
Then, in
{
"image": "string"
}
replace “string” with the encoded image string you just copied. Press “Execute”.
You should see that the server returns a response with code 200 (that’s the response code for a successful request) and a response body like:
{
"prediction": "Vegetable/Fruit",
"probability": 0.9940803647041321
}
so we can see that it performed inference successfully on the test input.
Next, let’s check the integration of the FastAPI endpoint in our Flask app. In your browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
Now that we know everything works, let’s get some quick performance numbers from this server. We’ll send some requests directly to the FastAPI endpoint and measure the time to get a response.
# runs inside the Jupyter container on node-serve-system
import requests
import time
import numpy as np
# runs inside the Jupyter container on node-serve-system
FASTAPI_URL = "http://fastapi_server:8000/predict"
payload = {"image": encoded_str}
num_requests = 100
inference_times = []
for _ in range(num_requests):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
inference_times.append(end_time - start_time)
else:
print(f"Error: {response.status_code}, Response: {response.text}")
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / inference_times.sum()
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
ONNX version
We know from our previous experiments that the vanilla PyTorch model may not be optimized for inference speed.
Let’s try porting our FastAPI endpoint to ONNX.
On the “node-serve-system” host, edit the Docker compose file:
# runs on node-serve-system
nano ~/serve-system-chi/docker/docker-compose-fastapi.yaml
and modify
context: /home/cc/serve-system-chi/fastapi_pt
to
context: /home/cc/serve-system-chi/fastapi_onnx
to build the FastAPI container image from the “fastapi_onnx” directory, instead of the “fastapi_pt” directory.
Save your changes (Ctrl+O, Enter, Ctrl+X). Rebuild the container image:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml build fastapi_server
and recreate the container with the new image:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml up fastapi_server --force-recreate -d
Repeat the same steps as before to test the FastAPI endpoint and its integration with Flask.
Then, re-do our quick benchmark.
# runs inside the Jupyter container on node-serve-system
FASTAPI_URL = "http://fastapi_server:8000/predict"
payload = {"image": encoded_str}
num_requests = 100
inference_times = []
for _ in range(num_requests):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
inference_times.append(end_time - start_time)
else:
print(f"Error: {response.status_code}, Response: {response.text}")
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / inference_times.sum()
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
Our FastAPI endpoint can maintain low latency, as long as only one user is sending requests to the service.
However, when there are multiple concurrent requests, it will be much slower. For example, suppose we start 16 “senders” at the same time, each continuously sending a new request as soon as it gets a response for the last one:
# runs inside the Jupyter container on node-serve-system
import concurrent.futures
def send_request(payload):
start_time = time.time()
response = requests.post(FASTAPI_URL, json=payload)
end_time = time.time()
if response.status_code == 200:
return end_time - start_time
else:
print(f"Error: {response.status_code}, Response: {response.text}")
return None
def run_concurrent_tests(num_requests, payload, max_workers=10):
inference_times = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [executor.submit(send_request, payload) for _ in range(num_requests)]
for future in concurrent.futures.as_completed(futures):
result = future.result()
if result is not None:
inference_times.append(result)
return inference_times
num_requests = 1000
start_time = time.time()
inference_times = run_concurrent_tests(num_requests, payload, max_workers=16)
total_time = time.time() - start_time
# runs inside the Jupyter container on node-serve-system
inference_times = np.array(inference_times)
median_time = np.median(inference_times)
percentile_95 = np.percentile(inference_times, 95)
percentile_99 = np.percentile(inference_times, 99)
throughput = num_requests / total_time
print(f"Median inference time: {1000*median_time:.4f} ms")
print(f"95th percentile: {1000*percentile_95:.4f} ms")
print(f"99th percentile: {1000*percentile_99:.4f} ms")
print(f"Throughput: {throughput:.2f} requests/sec")
When a request arrives at the server and finds it busy processing another request, it waits in a queue until it can be served. This queuing delay can be a significant part of the overall prediction delay, when there is a high degree of concurrency. We will attempt to address this in the next section!
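As a rough back-of-the-envelope check (an illustrative sketch, not part of the original benchmark): if a single worker serves one request at a time with a fixed inference time s, then with c continuously sending clients each request waits behind roughly c - 1 others, so latency grows like c × s while throughput stays near 1/s.
# runs inside the Jupyter container on node-serve-system
# Illustrative estimate only - substitute your own measured single-client median for s.
s = 0.010  # assumed single-client inference time in seconds
for c in [1, 8, 16]:
    est_latency = c * s        # queue wait plus own service time
    est_throughput = 1 / s     # a single worker serves at most 1/s requests per second
    print(f"concurrency {c}: latency ~ {1000*est_latency:.0f} ms, throughput ~ {est_throughput:.0f} req/s")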
In the meantime, download this entire notebook for later reference.
Then, bring down your current inference service with:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-fastapi.yaml down
Using a Triton Inference Server
Triton Inference Server is an open-source project by NVIDIA for high-performance ML model deployment. In this section, we will practice deploying models using Triton; after you have finished, you should be able to:
- serve a model using Triton Inference Server with Python backend
- use dynamic batching to improve performance
- scale your model to run on multiple GPUs, and/or with multiple instances on the same GPU
- benchmark the Triton service, and recognize indications of potential problems
- and use optimized backends
Anatomy of a Triton model with Python backend
To start, run
# runs on node-serve-system
mkdir ~/serve-system-chi/models/
cp -r ~/serve-system-chi/models_staging/food_classifier ~/serve-system-chi/models/
to copy our first configuration into the directory from which Triton will load models.
Our initial implementation serves our food image classifier using PyTorch. Here’s how it works.
In the Dockerfile, the Triton server is started with the command
tritonserver --model-repository=/models
where the /models directory is organized as follows:
models/
└── food_classifier
├── 1
│ ├── food11.pth
│ └── model.py
└── config.pbtxt
It includes:
- a top-level directory whose name is the “model name”
- a configuration file config.pbtxt inside that directory. We’ll look at that shortly.
- and a subdirectory for each model version. We have model version 1, so we have a subdirectory 1. Inside this directory is a model.py, which describes how the model will run.
Let’s look at the configuration file first. Here are the contents of config.pbtxt:
name: "food_classifier"
backend: "python"
max_batch_size: 16
input [
{
name: "INPUT_IMAGE"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "FOOD_LABEL"
data_type: TYPE_STRING
dims: [1]
},
{
name: "PROBABILITY"
data_type: TYPE_FP32
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
We have defined:
- a name, which must match the directory name
- a backend - we are using the basic Python backend. This is a highly flexible backend which allows us to define how our model will run by providing Python code in a model.py file.
- a max_batch_size - we have set it to 16, but generally you would set this according to the GPU memory available
- the name, data_type, and dims (dimensions) of each input to the model
- the name, data_type, and dims (dimensions) of each output from the model
- an instance_group with the count (number of copies of the model that we want to serve) and details of the device we want to serve it on (we will serve it on GPU 0). Note that to run the model on CPU instead, we could have used
instance_group [
{
count: 1
kind: KIND_CPU
}
]
Next, let’s look at model.py. For a Triton model with Python backend, model.py must define a class named TritonPythonModel that implements an execute method (initialize and finalize are optional). Ours has:
- An initialize method to load the model, move it to the device specified in the args passed from the Triton server, and put it in inference mode. This will run as soon as Triton starts and loads models from the directory passed to it:
def initialize(self, args):
model_dir = os.path.dirname(__file__)
model_path = os.path.join(model_dir, "food11.pth")
# From args, get info about what device the model is supposed to be on
instance_kind = args.get("model_instance_kind", "cpu").lower()
if instance_kind == "gpu":
device_id = int(args.get("model_instance_device_id", 0))
torch.cuda.set_device(device_id)
self.device = torch.device(f"cuda:{device_id}" if torch.cuda.is_available() else 'cpu')
else:
self.device = torch.device('cpu')
self.model = torch.load(model_path, map_location=self.device, weights_only=False)
self.model.to(self.device)
self.model.eval()
self.transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
self.classes = np.array([
"Bread", "Dairy product", "Dessert", "Egg", "Fried food",
"Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup",
"Vegetable/Fruit"
])
- A preprocess method, which will run on each input image that is passed:
def preprocess(self, image_data):
if isinstance(image_data, str):
image_data = base64.b64decode(image_data)
elif isinstance(image_data, bytes):
image_data = image_data.decode("utf-8")
image_data = base64.b64decode(image_data)
image = Image.open(io.BytesIO(image_data)).convert('RGB')
img_tensor = self.transform(image).unsqueeze(0)
return img_tensor
- and an execute method, which will apply to batches of requests sent to this model:
def execute(self, requests):
# Gather inputs from all requests
batched_inputs = []
for request in requests:
in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_IMAGE")
input_data_array = in_tensor.as_numpy() # each assumed to be shape [1]
# Preprocess each input (resulting in a tensor of shape [1, C, H, W])
batched_inputs.append(self.preprocess(input_data_array[0, 0]))
# Combine inputs along the batch dimension
batched_tensor = torch.cat(batched_inputs, dim=0).to(self.device)
print("BatchSize: ", len(batched_inputs))
# Run inference once on the full batch
with torch.no_grad():
outputs = self.model(batched_tensor)
# Process the outputs and split them for each request
responses = []
for i, request in enumerate(requests):
output = outputs[i:i+1] # select the i-th output
prob, predicted_class = torch.max(output, 1)
predicted_label = self.classes[predicted_class.item()]
probability = torch.sigmoid(prob).item()
# Create numpy arrays with shape [1, 1] for consistency.
out_label_np = np.array([[predicted_label]], dtype=object)
out_prob_np = np.array([[probability]], dtype=np.float32)
out_tensor_label = pb_utils.Tensor("FOOD_LABEL", out_label_np)
out_tensor_prob = pb_utils.Tensor("PROBABILITY", out_prob_np)
inference_response = pb_utils.InferenceResponse(
output_tensors=[out_tensor_label, out_tensor_prob])
responses.append(inference_response)
return responses
Finally, now that we understand how the server works, let’s look at how the Flask app sends requests to it. The Flask app now has a function that is called whenever a new image is uploaded for prediction or testing, and sends the image to the Triton server:
def request_triton(image_path):
try:
# Connect to Triton server
triton_client = httpclient.InferenceServerClient(url=TRITON_SERVER_URL)
# Prepare inputs and outputs
with open(image_path, 'rb') as f:
image_bytes = f.read()
inputs = []
inputs.append(httpclient.InferInput("INPUT_IMAGE", [1, 1], "BYTES"))
encoded_str = base64.b64encode(image_bytes).decode("utf-8")
input_data = np.array([[encoded_str]], dtype=object)
inputs[0].set_data_from_numpy(input_data)
outputs = []
outputs.append(httpclient.InferRequestedOutput("FOOD_LABEL", binary_data=False))
outputs.append(httpclient.InferRequestedOutput("PROBABILITY", binary_data=False))
# Run inference
results = triton_client.infer(model_name=FOOD11_MODEL_NAME, inputs=inputs, outputs=outputs)
predicted_class = results.as_numpy("FOOD_LABEL")[0,0]
probability = results.as_numpy("PROBABILITY")[0,0]
return predicted_class, probability
except Exception as e:
print(f"Error during inference: {e}")
return None, None
Bring up containers
To start, run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up -d
This uses a Docker Compose configuration to bring up three containers:
- one container with NVIDIA Triton Server, with the host’s GPUs passed to the container, and with the models directory (containing the model and its configuration) passed as a bind mount
- one container that hosts the Flask app, which will serve the user interface and send inference requests to the Triton server
- one Jupyter container with the Triton client installed, for us to conduct a performance evaluation of the Triton server
Building the NVIDIA Triton Server container image for the first time normally takes about 20 minutes.
Watch the logs from the Triton server as it starts up:
# runs on node-serve-system
docker logs triton_server -f
Once the Triton server starts up, you should see something like
+--------------------------+---------+--------+
| Model | Version | Status |
+--------------------------+---------+--------+
| food_classifier | 1 | READY |
+--------------------------+---------+--------+
and then some additional output. Near the end, you will see
"Started GRPCInferenceService at 0.0.0.0:8001"
"Started HTTPService at 0.0.0.0:8000"
"Started Metrics Service at 0.0.0.0:8002"
(and then some messages about not getting GPU power consumption, which is fine and not a concern.)
You can use Ctrl+C to stop watching the logs once you see this output.
Let’s test this service. In a browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
To access the Jupyter service, we will need its randomly generated secret token (which secures it from unauthorized access). We’ll get this token by running jupyter server list inside the jupyter container:
# runs on node-serve-system
docker exec jupyter jupyter server list
Look for a line like
http://localhost:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Paste this into a browser tab, but in place of localhost, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.
Then, in the file browser on the left side, open the “work” directory and then click on the 6_triton.ipynb notebook to continue.
Meanwhile, on the host, run
# runs on node-serve-system
nvtop
to monitor GPU usage - we will refer back to this a few times as we run through the rest of this notebook.
Benchmarking Triton service
Continue here after opening workspace/6_triton.ipynb in the Jupyter container.
Serving a PyTorch model
The Triton client comes with a performance analyzer, which we can use to send requests to the server and get some statistics back. Let’s try it:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1
Make a note of the line showing the total average request latency, and the breakdown including:
- queue, the queuing delay
- and compute infer, the inference delay
Let’s further exercise this service. In the command above, a single client sends continuous requests to the server - each time a response is returned, a new request is generated. Now, let’s configure 8 concurrent clients, each sending continuous requests - as soon as any client gets a response, it sends a new request:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
While the inference time (compute infer) is similar to the previous example, the overall system latency is high because of queue delay. Only one sample is processed at a time, and other samples have to wait in a queue for their turn. Here, since there are 8 concurrent clients sending continuous requests, the delay is approximately 8x the inference delay.
With more concurrent requests, the queuing delay would grow even larger:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 16
Although the delay is large (over 100 ms), it’s not because of inadequate compute - if you check the nvtop display on the host while the test above is running, you will note low GPU utilization! Take a screenshot of the nvtop output when this test is running.
We could get more throughput without increasing prediction latency, by batching requests. Here, we have a single client sending requests in batches of 16 at a time:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 16 --concurrency-range 1
We can see that a batch of 16 requests doesn’t have much higher inference time than a single request. The throughput is substantially higher when we can serve in batches.
But, that’s not very helpful in a situation when requests come from individual users, one at a time.
Scaling up PyTorch model
One potential way to improve performance is to scale up! Let’s edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
and change
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
to run two instances on GPU 0 and two instances on GPU 1:
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1 ]
}
]
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
On the host, run
# runs on node-serve-system
nvidia-smi
and note that there are two instances of triton_python_backend processes running on GPU 0, and two on GPU 1.
Then, benchmark this service with increased concurrency:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
There is still some queuing delay (because our degree of concurrency, 8, is still higher than the number of server instances, 4), and furthermore, the inference time is also increased due to sharing the compute resources. However, the prediction delay is on the order of 10s of ms - not over 100ms, like it was previously with concurrency 8!
Also, if you look at the nvtop output on the host while running this test, you will observe higher GPU utilization than before (which is good! We want to use the GPU. Underutilization is bad.) (Take a screenshot!) However, we are still not fully utilizing the GPU.
Let’s try increasing the number of instances again. Edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier/config.pbtxt
and change
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1 ]
}
]
to
instance_group [
{
count: 4
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 4
kind: KIND_GPU
gpus: [ 1 ]
}
]
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Then, re-run our benchmark:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier --input-data input.json -b 1 --concurrency-range 8
This makes things worse - our inference time is higher, even though we are still underutilizing the GPU (as seen in nvtop) (take a screenshot!).
Our system is not limited by the GPU - it is underutilized. Instead, we are limited by the overhead of the Python backend and of our model.py implementation.
Serving an ONNX model
The Python backend we have been using is flexible, but not necessarily the most performant. To get better performance, we will use one of the highly optimized backends available in Triton. Since we already have an ONNX model, let’s use the ONNX backend.
To serve a model using the ONNX backend, we will create a directory structure like this:
food_classifier_onnx/
├── 1
│ └── model.onnx
└── config.pbtxt
There is no model.py anymore - Triton serves the model directly; we just have to name the model file model.onnx. In config.pbtxt, we will specify the backend as onnxruntime:
name: "food_classifier_onnx"
backend: "onnxruntime"
max_batch_size: 16
input [
{
name: "input" # has to match ONNX model's input name
data_type: TYPE_FP32
dims: [3, 224, 224] # has to match ONNX input shape
}
]
output [
{
name: "output" # has to match ONNX model output name
data_type: TYPE_FP32 # output is a list of probabilities
dims: [11] #
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
Copy this to Triton’s models directory:
# runs on node-serve-system
cp -r ~/serve-system-chi/models_staging/food_classifier_onnx ~/serve-system-chi/models/
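For context, the input and output names and shapes in config.pbtxt are fixed when the model is exported to ONNX. Here is a sketch of what such an export might look like (hypothetical - the model.onnx used here was produced in the earlier model optimization material, and the export call there may differ):
# sketch only - shows how input/output names and a dynamic batch axis are set at export time
import torch

dummy_input = torch.randn(1, 3, 224, 224)   # one RGB image, 224x224
torch.onnx.export(
    model,                                   # the trained Food11 PyTorch model
    dummy_input,
    "model.onnx",
    input_names=["input"],                   # must match the input "name" in config.pbtxt
    output_names=["output"],                 # must match the output "name" in config.pbtxt
    dynamic_axes={"input": {0: "batch_size"},
                  "output": {0: "batch_size"}},  # variable batch dimension, up to max_batch_size
)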
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready. Note that the server will load two models: the original food_classifier with Python backend, and the food_classifier_onnx model we just added.
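If you prefer to check readiness programmatically rather than watching the logs, Triton also exposes standard health endpoints over HTTP; a quick sketch from the Jupyter container:
# runs inside the Jupyter container on node-serve-system
import requests

# server-level readiness: returns 200 when the server is up
print(requests.get("http://triton_server:8000/v2/health/ready").status_code)
# per-model readiness: returns 200 when that model is loaded and ready
print(requests.get("http://triton_server:8000/v2/models/food_classifier/ready").status_code)
print(requests.get("http://triton_server:8000/v2/models/food_classifier_onnx/ready").status_code)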
Let’s benchmark our service. Our ONNX model won’t accept image bytes directly - it expects images that already have been pre-processed into arrays. So, our benchmark command will be a little bit different:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 1
This model has much better inference performance than our PyTorch model with Python backend did, in a similar test. Also, if we monitor with nvtop, we should see higher GPU utilization while the test is running (which is a good thing!) (Take a screenshot!)
Scaling up ONNX model
Let’s try scaling this model up. Edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and change
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
to
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ]
}
]
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Then, run our benchmark with higher concurrency. (2 instances on each GPU, because we noticed that a single instance used less than half a GPU.)
(Note that in this example and the following one, we limit the number of requests sent by perf_analyzer; this is necessary because of measurement instability under high concurrency.)
Watch the nvtop output as you run this test! (Take a screenshot!)
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 8 --warmup-request-count 500 --request-count 20000
This time, we should see that our model is fully utilizing the GPU (that’s good!) And, our system performance is much better than the PyTorch model with Python backend could achieve with concurrency 8. We still have very little queuing delay.
Let’s see how we do with even higher concurrency.
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --concurrency-range 16 --warmup-request-count 500 --request-count 20000
We still have some queuing delay - the average request waits longer in the queue than its actual service time! - since the rate at which requests arrive is greater than the service rate of the models.
But, we can feel good that we are no longer underutilizing the GPUs (as evidenced by nvtop output)!
There’s one more issue we should address: our ONNX model doesn’t directly work with our Flask server now, because the inputs and outputs are different. The ONNX model expects a pre-processed array, and returns a list of class probabilities.
Since the pre-processing and post-processing don’t need the GPU anyway, we’ll move them to the Flask app.
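Roughly, the client-side path will look something like the sketch below (hypothetical - the triton_onnx branch of the Flask app may differ in details): preprocess the image into a [1, 3, 224, 224] float array, send it to food_classifier_onnx, and apply softmax to the returned scores.
# sketch of client-side pre- and post-processing for the ONNX model (illustrative only)
import numpy as np
import tritonclient.http as httpclient
from PIL import Image

def classify_onnx(image_path, url="triton_server:8000"):
    # Preprocess: resize, scale to [0, 1], normalize, convert to NCHW float32
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
    x = x.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)   # shape [1, 3, 224, 224]

    client = httpclient.InferenceServerClient(url=url)
    inp = httpclient.InferInput("input", list(x.shape), "FP32")
    inp.set_data_from_numpy(x)
    out = httpclient.InferRequestedOutput("output", binary_data=False)
    result = client.infer(model_name="food_classifier_onnx", inputs=[inp], outputs=[out])

    # Postprocess: softmax over the 11 class scores, then take the most likely class
    scores = result.as_numpy("output")[0]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), float(probs.max())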
Edit the Docker compose file:
# runs on node-serve-system
nano ~/serve-system-chi/docker/docker-compose-triton.yaml
and change
flask:
build:
context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton
to
flask:
build:
context: https://github.com/teaching-on-testbeds/gourmetgram.git#triton_onnx
to use a version of our Flask app where the pre- and post-processing is built in. Also change
- FOOD11_MODEL_NAME=food_classifier
to
- FOOD11_MODEL_NAME=food_classifier_onnx
so that our Flask app will send requests to the new ONNX model service.
Then run
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build flask
to re-build the container image, and
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up flask --force-recreate -d
to restart the Flask container with the new image.
Let’s test this service. In a browser, open
http://A.B.C.D
but substitute the floating IP assigned to your instance, to access the Flask app. Upload an image and press “Submit” to get its class label.
Dynamic batching with ONNX model
Until now, we have been working to reduce delay when there is a high, but steady, flow of requests arriving at the service.
In most realistic cases, however, the rate at which requests arrive is variable. Some time may pass with only a couple of requests, and then suddenly a burst of requests arrive. This is more challenging, because the same average request rate that is easily served with a constant interarrival pattern can have queuing delay when the arrivals are bursty.
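To build some intuition before benchmarking, here is a tiny single-server queue simulation (an illustrative sketch with an assumed fixed service time, not part of the original notebook) comparing constant and Poisson arrivals at the same average rate:
# runs inside the Jupyter container on node-serve-system
# Illustrative simulation: one server, fixed service time, same average arrival rate,
# constant vs. Poisson (exponential interarrival) request arrivals.
import numpy as np

rng = np.random.default_rng(0)
rate = 120          # average requests per second
service = 0.006     # assumed fixed service time in seconds (i.e. capacity ~166 req/s)
n = 20000

for name, gaps in [("constant", np.full(n, 1 / rate)),
                   ("poisson", rng.exponential(1 / rate, n))]:
    arrivals = np.cumsum(gaps)
    finish = 0.0
    waits = []
    for t in arrivals:
        start = max(t, finish)      # wait if the server is still busy
        waits.append(start - t)
        finish = start + service
    print(f"{name}: mean wait {1000*np.mean(waits):.2f} ms, 99th percentile {1000*np.percentile(waits, 99):.2f} ms")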
Let us explore this further in this section.
First, open the config
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and let’s change back
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ]
}
]
to
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
so we will work with just one model instance again.
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Now we will benchmark with perf_analyzer again. But,
- instead of scaling up load with a higher
--concurrency-range, we will scale with--request-rate-range(which defines the average number of requests per second), - and we can vary the
--request-distributionbetweenconstantinterarrival time andpoisson.
(Note: when we set a request rate, the throughput will never be higher than that rate, since throughput measures requests served per second. We will ignore these throughput measurements, since they reflect the request pattern and not the server capacity.)
Let’s first try sending 120 requests per second with a constant interarrival pattern. We know from our earlier tests that with one model instance, the server is still capable of processing requests at this rate:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution constant
Then, repeat with a Poisson arrival process:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution poisson
With Poisson arrivals at the same average rate, requests sometimes arrive in bursts and sometimes with gaps. The bursts cause queue buildup, leading to much queue delay even though the average rate is the same.
This problem is not as easily addressed by provisioning more instances. Scaling out instances for bursty traffic is expensive and still leaves servers underutilized between spikes. Instead, we will try dynamic batching.
Earlier, we noted that our model can achieve higher throughput with low latency by performing inference on batches of input samples, instead of individual samples. But, our client sends requests with individual samples.
When requests arrive in a burst and are queued, however, we can batch them and then send them to the server as a batch, instead of in sequence. In other words, if the server is ready to handle the next request, and it finds four requests waiting in the queue, it should serve those four as a batch instead of just taking the next request in line. This approach absorbs short-term request bursts without constant overprovisioning.
Let’s edit the model configuration:
# runs on node-serve-system
nano ~/serve-system-chi/models/food_classifier_onnx/config.pbtxt
and at the end, add
dynamic_batching {
preferred_batch_size: [4, 6, 8]
max_queue_delay_microseconds: 100
}
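Roughly speaking, these two settings tell Triton how to form batches out of the queue: it will delay a request by at most max_queue_delay_microseconds if waiting would let it reach one of the preferred_batch_size values; otherwise, it serves whatever is already queued. With a delay budget of only 100 microseconds, we are barely delaying any individual request - we are mostly just batching together requests that are already waiting.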
Save the file (use Ctrl+O then Enter, then Ctrl+X).
Re-build the container image with this change:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml build triton_server
and then bring the server back up:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml up triton_server --force-recreate -d
and use
# runs on node-serve-system
docker logs triton_server
to make sure the server comes up and is ready.
Before we benchmark this service again, let’s get some pre-benchmark stats about how many requests have been served, broken down by batch size. (If you’ve just restarted the server, it would be zero!)
# runs inside the Jupyter container on node-serve-system
curl -s http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats | python -m json.tool
Then, run the benchmark again with Poisson arrivals:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 120 --request-distribution poisson
and get per-batch stats again:
# runs inside the Jupyter container on node-serve-system
curl -s http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats | python -m json.tool
Observe that the stats show that some requests were served in batch sizes greater than 1, even though each client sent a single request at a time.
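If you'd rather see a compact summary than the raw JSON, here is a short sketch that tallies model executions by batch size (assuming the stats payload has the usual model_stats / batch_stats layout):
# runs inside the Jupyter container on node-serve-system
import requests

stats = requests.get("http://triton_server:8000/v2/models/food_classifier_onnx/versions/1/stats").json()
for b in stats["model_stats"][0].get("batch_stats", []):
    # "count" is the number of model executions that ran at this batch size
    print(f"batch size {b['batch_size']}: {b['compute_infer']['count']} executions")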
When the average queuing delay is still low, we may not see much improvement in overall latency due to dynamic batching. Under these circumstances, even with dynamic batching on, a request that arrives while the server is busy will still have to wait (on average) for half of an inference time. But, watch what happens when we scale up the request rate:
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 180 --request-distribution poisson
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 240 --request-distribution poisson
# runs inside the Jupyter container on node-serve-system
perf_analyzer -u triton_server:8000 -m food_classifier_onnx -b 1 --shape IMAGE:3,224,224 --request-rate-range 300 --request-distribution poisson
Even as we increase the request rate, the average request will still only wait half of a service time, because once the request that is currently in service finishes, every request waiting in the queue is processed as a batch.
(In fact, we may even see less overall latency for higher request rates, because the GPU remains “warm”.)
When you have finished, download this entire notebook for later reference.
Then, bring down your current inference service with:
# runs on node-serve-system
docker compose -f ~/serve-system-chi/docker/docker-compose-triton.yaml down
Questions about this material? Contact Fraida Fund
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.