Serving machine learning models on edge devices

In this tutorial, we will benchmark machine learning models on a low-resource edge device (a Raspberry Pi 5 with an Arm Cortex-A76 processor). We will measure the inference time of a baseline ONNX model, a model quantized with dynamic quantization, and a model quantized with static quantization.

To run this experiment, you should have already created an account on Chameleon, and become part of a project.

Context

The premise of this example is as follows: You are working as a machine learning engineer at a small startup company called GourmetGram. They are developing an online photo sharing community focused on food. You have developed a convolutional neural network in Pytorch that automatically classifies photos of food into one of a set of categories: Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit.

Now that you have trained a model, you are preparing to serve predictions using this model. Your manager has advised that since GourmetGram is an early-stage startup, they can’t afford much compute for serving models. Your manager wants you to prepare a few different options, which they will then price out among cloud providers before deciding which to use: serving the model on a server-grade CPU, on a server-grade GPU, and on a low-resource edge device.

You have evaluated your model on server-grade CPU and GPU already; now you are ready to benchmark on a low-resource edge device.

Experiment resources

For this experiment, we will provision one Raspberry Pi 5 at CHI@Edge. Edge devices, like bare metal devices, need to be reserved in advance.

Create a lease for an edge device

For this experiment, we will reserve a 2-hour block on a Raspberry Pi 5.

We can use the OpenStack graphical user interface, Horizon, to submit a lease. To access this interface,

Then,

Raspberry Pi Node   UUID
nyu-rpi5-01         c516acb2-4c88-42be-857f-2f9eb4139f99
nyu-rpi5-02         8334d598-d25d-4dbb-a416-d90f8e93ccc4
nyu-rpi5-03         a755b236-580e-4040-874c-80501f00f954
nyu-rpi5-04         9a7823ea-bf40-4141-b7d7-943bfb389091
nyu-rpi5-05         54a6e248-cbb2-472d-bde7-0b4ac3bd911a
nyu-rpi5-06         52341ad4-ff91-4516-a3ec-88687a8d984b
nyu-rpi5-07         7bfe3fe8-6f1e-41a4-8d9a-aa9668db8805

Your lease status should show as “Pending”. Click on the lease to see an overview; it will show the start time and end time, and the name of the physical host that is reserved for you as part of your lease. Make sure that the lease details are correct.
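
The walkthrough above uses the Horizon GUI. If you would rather script the reservation from the Chameleon Jupyter environment, a sketch along the following lines may also work. This is not part of the original steps, and the lease-builder method names and arguments (for example, add_device_reservation and its device_name parameter) are assumptions about the python-chi lease API, so verify them against the python-chi documentation before using it:

from chi import context, lease
from datetime import timedelta

context.version = "1.0"
context.choose_project()
context.choose_site(default="CHI@Edge")

# Assumed API: build a 2-hour lease with a reservation for one specific device,
# then submit it. Method names and arguments are assumptions - check the docs.
l = lease.Lease("serve_edge_netID", duration=timedelta(hours=2))
l.add_device_reservation(amount=1, device_name="nyu-rpi5-01")  # any device from the table above
l.submit(idempotent=True)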

At the beginning of your edge device lease

At the beginning of your edge device lease time, you will continue with the next step, in which you will launch a container on the device! To begin this step, open this experiment on Trovi:

Launch a container on an edge device - with python-chi

At the beginning of the lease time for your device, we will use python-chi, the Python API for Chameleon, to launch a container on it using OpenStack’s Zun container service.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected. Make sure the site is set to CHI@Edge.

from chi import container, context, lease
import os
import chi

context.version = "1.0" 
context.choose_project()
context.choose_site(default="CHI@Edge")

Change the string in the following cell to reflect the name of your lease (with your own net ID), then run it to get your lease:

l = lease.get_lease(f"serve_edge_netID") 
l.show()

The status should show as “ACTIVE” now that we are past the lease start time.

We will use the lease to launch a Jupyter notebook container on a Raspberry Pi 5 edge device.

Note: the following cell brings up a container only if you don’t already have one with the same name, regardless of the state of that existing container. If you already have a container in the ERROR state, delete it in the Horizon GUI before you run this cell.

username = os.getenv('USER') # exp resources will have this suffix
c = container.Container(
    name = f"node-serve-edge-{username}".replace('_', '-'),
    reservation_id = l.device_reservations[0]["id"],
    image_ref = "quay.io/jupyter/minimal-notebook:latest", 
    exposed_ports = [8888]
)
c.submit(idempotent=True)

Then, we’ll associate a floating IP with the container, so that we can access the Jupyter service running in it.

c.associate_floating_ip()

In the output above, make a note of the floating IP that has been assigned to your container.

Let’s retrieve a copy of these materials on the container:

stdout, code = c.execute("git clone https://github.com/teaching-on-testbeds/serve-edge-chi.git")
print(stdout)
stdout, code = c.execute("mv serve-edge-chi/workspace/models work/")
print(stdout)
stdout, code = c.execute("mv serve-edge-chi/workspace/measure_pi.ipynb work/")
print(stdout)

and, install the ONNX runtime Python module:

stdout, code = c.execute("python3 -m pip install onnxruntime")
print(stdout)
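
Optionally, before moving on, you can verify that the materials are in place and that the onnxruntime package was installed, using the same execute pattern as above:

stdout, code = c.execute("ls work/")
print(stdout)
stdout, code = c.execute("python3 -m pip show onnxruntime")
print(stdout)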

Finally, we will get the container logs. Run:

print(chi.container.get_logs(c.id))

and look for a line like

    http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Paste this into a browser tab, but in place of 127.0.0.1, substitute the floating IP assigned to your container, to open the Jupyter notebook interface that is running on your Raspberry Pi 5.
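
If it is more convenient, you can also assemble the URL in a cell of the Chameleon Jupyter notebook instead of editing it by hand; the values below are placeholders, so substitute your own floating IP and token:

floating_ip = "A.B.C.D"  # placeholder: the floating IP assigned to your container
token = "XXXXXXXX"       # placeholder: the token from the container logs
print(f"http://{floating_ip}:8888/lab?token={token}")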

Then, in the file browser on the left side, open the “work” directory and find the measure_pi.ipynb notebook to continue.

Measure inference performance of ONNX model on low-resource edge device

Now, we’re going to benchmark a couple of previously created ONNX models on our low-resource edge device.

You will execute this notebook in a Jupyter container running on an edge device, not on the general-purpose Chameleon Jupyter environment from which you provision resources.

import os, time
import numpy as np
import onnxruntime as ort

We’ll define a benchmark function. For convenience (since we don’t need real data for benchmarking) we will use random “fake” samples to evaluate our models’ inference performance.

def benchmark_session(ort_session):

    ## Benchmark inference latency for single sample

    num_trials = 100  # Number of trials
    input_shape = ort_session.get_inputs()[0].shape  # Get expected input shape
    input_dtype = np.float32  # Adjust dtype as needed
    fixed_shape = (1, *input_shape[1:])  

    # Generate a single dummy sample with random values
    single_sample = np.random.rand(*fixed_shape).astype(input_dtype)

    # Warm-up run
    ort_session.run(None, {ort_session.get_inputs()[0].name: single_sample})

    latencies = []
    for _ in range(num_trials):
        start_time = time.time()
        _ = ort_session.run(None, {ort_session.get_inputs()[0].name: single_sample})
        latencies.append(time.time() - start_time)

    print(f"Inference Latency (single sample, median): {np.percentile(latencies, 50) * 1000:.2f} ms")
    print(f"Inference Latency (single sample, 95th percentile): {np.percentile(latencies, 95) * 1000:.2f} ms")
    print(f"Inference Latency (single sample, 99th percentile): {np.percentile(latencies, 99) * 1000:.2f} ms")
    print(f"Inference Throughput (single sample): {num_trials/np.sum(latencies):.2f} FPS")

Now, let’s evaluate our “baseline” ONNX model:

onnx_model_path = "models/food11.onnx"
ort_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
benchmark_session(ort_session)

the model quantized with dynamic quantization:

onnx_model_path = "models/food11_quantized_dynamic.onnx"
ort_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
benchmark_session(ort_session)

and the model quantized with static quantization, for which we permit up to a 0.05 decrease in accuracy:

onnx_model_path = "models/food11_quantized_aggressive.onnx"
ort_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
benchmark_session(ort_session)
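
As an optional final point of comparison (not part of the original benchmark), you can also check how much smaller the quantized models are on disk, using the same paths as above:

# Compare on-disk model sizes (uses the os import from earlier in this notebook)
for path in ["models/food11.onnx",
             "models/food11_quantized_dynamic.onnx",
             "models/food11_quantized_aggressive.onnx"]:
    print(f"{path}: {os.path.getsize(path) / 1e6:.2f} MB")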

When you are done, download the fully executed notebook from the Jupyter container environment for later reference. (Note: because it is an executable file, and you are downloading it from a site that is not secured with HTTPS, you may have to explicitly confirm the download in some browsers.)


Questions about this material? Contact Fraida Fund


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.