Offline evaluation of ML systems

In this tutorial, we will practice selected techniques for evaluating machine learning systems, and then monitoring them in production.

The lifecycle of a model may look something like this:

In this particular section, we will evaluate a model in the offline testing stage - when it is not yet deployed as a “live” service accepting requests from real users.

This tutorial focuses on the offline testing stage.

To run this experiment, you should have already created an account on Chameleon, and become part of a project. You should also have added your SSH key to the KVM@TACC site.

Experiment resources

For this experiment, we will provision one virtual machine on KVM@TACC.

Open this experiment on Trovi

When you are ready to begin, you will continue with the next step, in which you bring up and configure a VM instance! To begin this step, open this experiment on Trovi:

Launch and set up a VM instance with python-chi

We will use python-chi, a Python API for Chameleon, to provision our VM server.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected.

from chi import server, context
import chi, os, time, datetime

context.version = "1.0" 
context.choose_project()
context.choose_site(default="KVM@TACC")

We will bring up an m1.medium flavor server with the CC-Ubuntu24.04 disk image.

Note: the following cell brings up a server only if you don’t already have one with the same name, regardless of the state of that server. If you already have a server in the ERROR state, delete it first in the Horizon GUI before you run this cell.

username = os.getenv('USER') # all exp resources will have this prefix
s = server.Server(
    f"node-eval-offline-{username}", 
    image_name="CC-Ubuntu24.04",
    flavor_name="m1.medium"
)
s.submit(idempotent=True)

Then, we’ll associate a floating IP with the instance:

s.associate_floating_ip()
s.refresh()
s.check_connectivity()

In the output below, make a note of the floating IP that has been assigned to your instance (in the “Addresses” row).

s.refresh()
s.show(type="widget")

By default, all connections to VM resources are blocked, as a security measure. We need to attach one or more “security groups” to our VM resource, to permit access over the Internet to specified ports.

The following security groups will be created (if they do not already exist in our project) and then added to our server:

security_groups = [
  {'name': "allow-ssh", 'port': 22, 'description': "Enable SSH traffic on TCP port 22"},
  {'name': "allow-8888", 'port': 8888, 'description': "Enable TCP port 8888 (used by Jupyter)"}
]

# configure openstacksdk for actions unsupported by python-chi
os_conn = chi.clients.connection()
nova_server = chi.nova().servers.get(s.id)

for sg in security_groups:

  if not os_conn.get_security_group(sg['name']):
      os_conn.create_security_group(sg['name'], sg['description'])
      os_conn.create_security_group_rule(sg['name'], port_range_min=sg['port'], port_range_max=sg['port'], protocol='tcp', remote_ip_prefix='0.0.0.0/0')

  nova_server.add_security_group(sg['name'])

print(f"updated security groups: {[group.name for group in nova_server.list_security_group()]}")

Retrieve code and notebooks on the instance

Now, we can use python-chi to execute commands on the instance, to set it up. We’ll start by retrieving the code and other materials on the instance.

s.execute("git clone https://github.com/teaching-on-testbeds/eval-offline-chi")

Set up Docker

Here, we will set up the container framework.

s.execute("curl -sSL https://get.docker.com/ | sudo sh")
s.execute("sudo groupadd -f docker; sudo usermod -aG docker $USER")

Open an SSH session

Finally, open an SSH session on your server. From your local terminal, run

ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

where, in place of A.B.C.D, you substitute the floating IP address assigned to your instance. (If your Chameleon key is stored at a different path, adjust the ~/.ssh/id_rsa_chameleon part accordingly.)

Prepare data

For the rest of this tutorial, we’ll be evaluating models on the Food-11 dataset. We’re going to set up a Docker volume with this dataset on it, so that the containers we create later can attach to this volume and access the data.

First, create the volume:

# runs on node-eval-offline
docker volume create food11

Then, to populate it with data, run

# runs on node-eval-offline
docker compose -f eval-offline-chi/docker/docker-compose-data.yaml up -d

This will run a temporary container that downloads the Food-11 dataset, organizes it in the volume, and then stops. It may take a minute or two. You can verify with

# runs on node-eval-offline
docker ps

that it is done; the data preparation is complete when there are no running containers.

Finally, verify that the data looks as it should. Start a shell in a temporary container with this volume attached, and ls the contents of the volume:

# runs on node-eval-offline
docker run --rm -it -v food11:/mnt alpine ls -l /mnt/Food-11/

it should show “evaluation”, “validation”, and “training” subfolders.

Launch a Jupyter container

Inside the SSH session, start a Jupyter container:

# run on node-eval-offline
docker run  -d --rm  -p 8888:8888 \
    -v ~/eval-offline-chi/workspace:/home/jovyan/work/ \
    -v food11:/mnt/ \
    -e FOOD11_DATA_DIR=/mnt/Food-11 \
    --name jupyter \
    quay.io/jupyter/pytorch-notebook:pytorch-2.5.1

Run

# run on node-eval-offline
docker logs jupyter

and look for a line like

http://127.0.0.1:8888/lab?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Paste this into a browser tab, but in place of 127.0.0.1, substitute the floating IP assigned to your instance, to open the Jupyter notebook interface that is running on your compute instance.

Then, in the file browser on the left side, open the “work” directory and then click on the eval_offline.ipynb notebook to continue.

Evaluate a model offline

In this section, we will practice offline evaluation of a model! After you finish this section, you should understand how to evaluate a model on general metrics for its domain, use human judgement and explainable AI techniques to “sanity check” a model, evaluate a model with template-based tests, evaluate it on slices of interest and on known failure modes, and assemble these evaluations into an automated test suite.

Let’s start by loading our trained model and our test data.

# runs in jupyter container on node-eval-offline
import os
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from PIL import Image
# runs in jupyter container on node-eval-offline
model_path = "models/food11.pth"  
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.load(model_path, map_location=device, weights_only=False)
_ = model.eval()  
# runs in jupyter container on node-eval-offline
food_11_data_dir = os.getenv("FOOD11_DATA_DIR", "Food-11")
val_test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
test_dataset = datasets.ImageFolder(root=os.path.join(food_11_data_dir, 'evaluation'), transform=val_test_transform)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

Evaluate model on general metrics for its domain

The most basic evaluation will involve general metrics that are relevant to the specific domain.

For example, in this case, our model is a classification model, so we will compute accuracy. But, for other models we would consider other metrics, such as mean squared error for a regression model, or BLEU score for a machine translation model.

To start, let’s get the predictions of the model on the held-out test set:

# runs in jupyter container on node-eval-offline
dataset_size = len(test_loader.dataset)
all_predictions = np.empty(dataset_size, dtype=np.int64)
all_labels = np.empty(dataset_size, dtype=np.int64)

current_index = 0

with torch.no_grad():
    for images, labels in test_loader:
        batch_size = labels.size(0)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)

        all_predictions[current_index:current_index + batch_size] = predicted.cpu().numpy()
        all_labels[current_index:current_index + batch_size] = labels.cpu().numpy()
        current_index += batch_size

We can use these to compute the overall accuracy:

# runs in jupyter container on node-eval-offline
overall_accuracy = (all_predictions == all_labels).sum() / all_labels.shape[0] * 100
print(f'Overall Accuracy: {overall_accuracy:.2f}%')
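
Since this is a classification model, we might also want per-class precision, recall, and F1 score alongside accuracy. A quick way to get these, assuming scikit-learn is available in the container image, is:

# runs in jupyter container on node-eval-offline
# optional: per-class precision/recall/F1 (assumes scikit-learn is installed in the container)
from sklearn.metrics import classification_report
# labels 0-10 correspond to the Food-11 class names defined later in this notebook
print(classification_report(all_labels, all_predictions, digits=3))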

We can also compute the per-class accuracy. It would be concerning if our classifier had very low accuracy for some classes, even if it has high accuracy for others. We might set a criterion that e.g. the model we deploy must have a minimum overall accuracy, and a different minimum per-class accuracy for all classes.

# runs in jupyter container on node-eval-offline
classes = np.array(["Bread", "Dairy product", "Dessert", "Egg", "Fried food",
    "Meat", "Noodles/Pasta", "Rice", "Seafood", "Soup", "Vegetable/Fruit"])
num_classes = classes.shape[0]

# runs in jupyter container on node-eval-offline
per_class_correct = np.zeros(num_classes, dtype=np.int32)
per_class_total = np.zeros(num_classes, dtype=np.int32)

for true_label, pred_label in zip(all_labels, all_predictions):
    per_class_total[true_label] += 1
    per_class_correct[true_label] += int(true_label == pred_label)

for i in range(num_classes):
    if per_class_total[i] > 0:
        acc = per_class_correct[i] / per_class_total[i] * 100
        correct_str = f"{per_class_correct[i]}/{per_class_total[i]}"
        print(f"{classes[i]:<20} {acc:10.2f}% {correct_str:>20}")

And, we can use a confusion matrix to see which classes are most often confused with one another:

# runs in jupyter container on node-eval-offline
conf_matrix = np.zeros((num_classes, num_classes), dtype=np.int32)
for true_label, pred_label in zip(all_labels, all_predictions):
    conf_matrix[true_label, pred_label] += 1

plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=classes, yticklabels=classes, cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

This can help us get some insight into our model. For example, we see in the confusion matrix above that “Dessert” and “Dairy product” samples are often confused for one another.
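
If the heatmap is hard to read at a glance, we can also rank the most frequently confused class pairs directly from the confusion matrix - a small sketch using the conf_matrix and classes arrays defined above:

# runs in jupyter container on node-eval-offline
# rank off-diagonal entries of the confusion matrix (i.e., the errors) by count
errors = conf_matrix.copy()
np.fill_diagonal(errors, 0)
top_pairs = np.dstack(np.unravel_index(np.argsort(errors, axis=None)[::-1], errors.shape))[0][:5]
for true_idx, pred_idx in top_pairs:
    print(f"True: {classes[true_idx]:<15} Predicted: {classes[pred_idx]:<15} Count: {errors[true_idx, pred_idx]}")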

Use human judgement and explainable AI techniques to “sanity check” a model

Let’s use our human judgement to better understand some of these errors.

# runs in jupyter container on node-eval-offline

# Get random sample of Dessert and Dairy product samples that are confused for one another
dessert_index = np.where(classes == "Dessert")[0][0]
dairy_index = np.where(classes == "Dairy product")[0][0]

confused_indices = [i for i, (t, p) in enumerate(zip(all_labels, all_predictions))
                    if (t == dessert_index and p == dairy_index) or (t == dairy_index and p == dessert_index)]

sample_indices = np.random.choice(confused_indices, size=min(5, len(confused_indices)), replace=False)

# Actually, to make it easier to discuss - we will select specific samples and get those samples from the test loader
sample_indices = np.array([404, 927, 496, 435, 667])
# runs in jupyter container on node-eval-offline

sample_images = []
start_idx = 0
for images, _ in test_loader:
    batch_size = images.size(0)
    end_idx = start_idx + batch_size
    for idx in sample_indices:
        if start_idx <= idx < end_idx:
            image = images[idx - start_idx].cpu()
            sample_images.append((idx, image))
    start_idx = end_idx
    if len(sample_images) == len(sample_indices):
        break
# runs in jupyter container on node-eval-offline
mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])
# Visualize those samples (undo the normalization first)
plt.figure(figsize=(12, 3))
for i, (idx, image) in enumerate(sample_images):
    image = image * std[:, None, None] + mean[:, None, None]  # unnormalize
    image = torch.clamp(image, 0, 1)
    image = image.permute(1, 2, 0)  # go from "channels, height, width" format to "height, width, channels"
    plt.subplot(1, len(sample_images), i + 1)
    plt.imshow(image)
    plt.title(f"True: {classes[all_labels[idx]]}\nPred: {classes[all_predictions[idx]]}\nIndex: {idx}")
    plt.axis('off')
plt.tight_layout()
plt.show()

Now we can better understand some of these errors.

For example, sample 496 appears to be ice cream, which is both a dessert and a dairy product. Similarly, sample 927 is cheesecake, which can also be considered both dessert and a dairy product. It is not clear what the instructions to human annotators were, and whether “ice cream” and “cheesecake” are consistently labeled in the training set, or whether some annotators might have labeled these as dessert and others as dairy products.

Having identified this problem, we can investigate further and perhaps improve our labeling instructions. We can also re-consider our choice to model this problem as a one-label classification problem, instead of a multi-label classification problem.

We can further “sanity check” a model with explainable AI techniques, to better understand which features in an input sample are most influential for its prediction. This can help us find problems such as a model that relies on spurious features of the input, rather than on the features that are genuinely relevant.

Depending on the type of data and model, we can use different explainable AI techniques. For example, for an image classification model like this one, we can use saliency-based methods such as GradCAM; and more generally, techniques such as SHAP and LIME are applicable to many types of models.

Let’s try using GradCAM to highlight the parts of the image that are most influential:

# runs in jupyter container on node-eval-offline
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# GradCAM setup 
target_layer = model.features[-1]  
cam = GradCAM(model=model, target_layers=[target_layer])

# runs in jupyter container on node-eval-offline

mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])

plt.figure(figsize=(12, 3))
for i, (idx, image) in enumerate(sample_images):
    input_tensor = image.clone().unsqueeze(0)  # image from the test loader is already normalized; just add batch dim

    target_category = int(all_predictions[idx])
    grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(target_category)])
    grayscale_cam = grayscale_cam[0, :]

    image_disp = image * std[:, None, None] + mean[:, None, None]  # unnormalize
    image_disp = torch.clamp(image_disp, 0, 1).permute(1, 2, 0).numpy()

    visualization = show_cam_on_image(image_disp, grayscale_cam, use_rgb=True)
    plt.subplot(1, len(sample_images), i + 1)
    plt.imshow(visualization)
    plt.title(f"True: {classes[all_labels[idx]]}\nPred: {classes[all_predictions[idx]]}\nIndex: {idx}")
    plt.axis('off')
plt.tight_layout()
plt.show()

Now we have gained additional insight: for some of these samples, the model appears to focus on spurious parts of the image, such as the fork or the background, rather than on the food item itself.

Evaluate a model with template-based tests

In the previous part, we saw indications that the model may focus on spurious features of the image - like the fork - when we would prefer for it to focus on the actual food item.

Let’s design a template-based test to help evaluate the extent to which the model’s predictions depend on the food item itself, rather than on the background or on “extra” non-food items that appear alongside the food.

Inside the “templates” directory, we have prepared some images: food items organized into class subdirectories of “food”, background images in “background”, and “extra” non-food items (such as cutlery) in “extras”.

Let’s look at these now.

# runs in jupyter container on node-eval-offline

TEMPLATE_DIR = "templates"

fig, axes = plt.subplots(3, 3, figsize=(8, 6))

# Random food class
food_dir = os.path.join(TEMPLATE_DIR, "food")
food_classes = [d for d in os.listdir(food_dir) if os.listdir(os.path.join(food_dir, d))]
random_class = random.choice(food_classes)
food_images = random.sample(os.listdir(os.path.join(food_dir, random_class)), 3)
food_paths = [os.path.join(food_dir, random_class, f) for f in food_images]

for i, path in enumerate(food_paths):
    axes[0, i].imshow(Image.open(path))
    axes[0, i].set_title(f"Food ({random_class})")
    axes[0, i].axis("off")

# Backgrounds
bg_dir = os.path.join(TEMPLATE_DIR, "background")
bg_images = random.sample(os.listdir(bg_dir), 3)
bg_paths = [os.path.join(bg_dir, f) for f in bg_images]

for i, path in enumerate(bg_paths):
    axes[1, i].imshow(Image.open(path))
    axes[1, i].set_title("Background")
    axes[1, i].axis("off")

# Extras
extra_dir = os.path.join(TEMPLATE_DIR, "extras")
extra_images = random.sample(os.listdir(extra_dir), 3)
extra_paths = [os.path.join(extra_dir, f) for f in extra_images]

for i, path in enumerate(extra_paths):
    axes[2, i].imshow(Image.open(path))
    axes[2, i].set_title("Extra")
    axes[2, i].axis("off")

plt.tight_layout()
plt.show()

and, here is a function that can compose an image out of a background, a food item, and (optionally) an extra:

# runs in jupyter container on node-eval-offline

def compose_image(food_path, bg_path=None, extra_path=None):

    food = Image.open(food_path).convert("RGBA")

    if bg_path:
        bg = Image.open(bg_path).convert("RGBA")
    else:
        bg = Image.new("RGBA", food.size, (255, 255, 255, 255))

    bg_w, bg_h = bg.size
    y_offset = int(bg_h * 0.05)
            
    food_scale = 0.5
    food = food.resize((int(bg_w * food_scale), int(bg_h * food_scale)))
    
    fd_w, fd_h = food.size

    if extra_path:
        
        extra_scale = 0.35
        extra = Image.open(extra_path).convert("RGBA")
        extra = extra.resize((int(bg_w * extra_scale), int(bg_h * extra_scale)))
        ex_w, ex_h = extra.size
        bg.paste(extra, (bg_w - ex_w, bg_h - ex_h - y_offset), extra)
        
    bg.paste(food, ((bg_w - fd_w) // 2, bg_h - fd_h - y_offset), food)

    return bg.convert("RGB")

Let’s try composing the same food item with different backgrounds and different “extra” items. These should all have the same prediction. Notice that this is a test we could even run on an unlabeled food item - we don’t need to know the actual class (although we do, in this case).

We can also try composing a different food item from the same class (which should get the same class prediction), and a food item from a different class (which should not), on one of the same backgrounds:

# runs in jupyter container on node-eval-offline
imgs = {
    'original_image': compose_image('templates/food/09/001.png'),
    'composed_bg1_extra1': compose_image('templates/food/09/001.png', 'templates/background/001.jpg', 'templates/extras/spoon.png'),
    'composed_bg2_extra2': compose_image('templates/food/09/001.png', 'templates/background/002.jpg', 'templates/extras/fork.png'),
    'composed_same_class': compose_image('templates/food/09/002.png', 'templates/background/001.jpg'),
    'composed_diff_class': compose_image('templates/food/05/002.png', 'templates/background/001.jpg')
}

and, let’s look at these examples:

# runs in jupyter container on node-eval-offline
fig, axes = plt.subplots(1, 5, figsize=(14, 3))

for ax, key in zip(axes, imgs.keys()):
    ax.imshow(imgs[key].resize((224,224)).crop((16, 16, 224, 224)))
    ax.set_title(f"{key}")
    ax.axis("off")

plt.tight_layout()
plt.show()

We can get the predictions of the model for these samples, as well as the GradCAM output -

# runs in jupyter container on node-eval-offline

def predict(model, image, device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')):
    model.eval()
    image_tensor = val_test_transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model(image_tensor)
        return output.argmax(dim=1).item()

# runs in jupyter container on node-eval-offline

fig, axes = plt.subplots(2, 5, figsize=(14, 6))

for i, key in enumerate(imgs.keys()):
    image_np = np.array(imgs[key].resize((224, 224))).astype(dtype=np.float32) / 255.0
    pred = predict(model, imgs[key])

    input_tensor = val_test_transform(imgs[key]).unsqueeze(0)
    grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(pred)])[0]
    vis = show_cam_on_image(image_np, grayscale_cam, use_rgb=True)

    axes[0, i].imshow(imgs[key].resize((224, 224)))
    axes[0, i].set_title(f"{key}\nPredicted: {pred} ({classes[pred]})")
    axes[0, i].axis("off")

    axes[1, i].imshow(vis)
    axes[1, i].axis("off")

plt.tight_layout()
plt.show()

That seemed OK - but let’s try it for a different combination of food item, background, and “extra” item:

# runs in jupyter container on node-eval-offline

imgs = {
    'original_image': compose_image('templates/food/10/001.png'),
    'composed_bg1_extra1': compose_image('templates/food/10/001.png', 'templates/background/003.jpg', 'templates/extras/plastic_fork.png'),
    'composed_bg2_extra2': compose_image('templates/food/10/001.png', 'templates/background/002.jpg', 'templates/extras/fork.png'),
    'composed_same_class': compose_image('templates/food/10/002.png', 'templates/background/003.jpg'),
    'composed_diff_class': compose_image('templates/food/05/003.png', 'templates/background/003.jpg')
}

and repeat the visualization:

# runs in jupyter container on node-eval-offline

fig, axes = plt.subplots(2, 5, figsize=(14, 6))

for i, key in enumerate(imgs.keys()):
    image_np = np.array(imgs[key].resize((224, 224))).astype(dtype=np.float32) / 255.0
    pred = predict(model, imgs[key])

    input_tensor = val_test_transform(imgs[key]).unsqueeze(0)
    grayscale_cam = cam(input_tensor=input_tensor, targets=[ClassifierOutputTarget(pred)])[0]
    vis = show_cam_on_image(image_np, grayscale_cam, use_rgb=True)

    axes[0, i].imshow(imgs[key].resize((224, 224)))
    axes[0, i].set_title(f"{key}\nPredicted: {pred} ({classes[pred]})")
    axes[0, i].axis("off")

    axes[1, i].imshow(vis)
    axes[1, i].axis("off")

plt.tight_layout()
plt.show()

Now we can see that our model is not nearly as robust as we might want it to be. We can see that some perturbations that should not change the model output do, and the model sometimes focuses on spurious items like the background or extra items that happen to be alongside the food in the image.

This type of test can be automated - we can collect a large number of food items, backgrounds, and “extras” and systematically evaluate this robustness as part of a test suite. We’ll get to that a little bit later, though.
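
As a preview, here is a rough sketch of what that automation might look like, reusing the compose_image and predict functions and the TEMPLATE_DIR layout from above (details such as the consistency metric are assumptions; a real test suite would also record which combinations fail):

# runs in jupyter container on node-eval-offline
# sketch: measure how often predictions stay consistent under background/extra perturbations
import itertools

food_dir = os.path.join(TEMPLATE_DIR, "food")
bg_dir = os.path.join(TEMPLATE_DIR, "background")
extra_dir = os.path.join(TEMPLATE_DIR, "extras")

consistent, total = 0, 0
for food_class in os.listdir(food_dir):
    for food_img in os.listdir(os.path.join(food_dir, food_class)):
        food_path = os.path.join(food_dir, food_class, food_img)
        baseline_pred = predict(model, compose_image(food_path))  # plain white background
        for bg_img, extra_img in itertools.product(os.listdir(bg_dir), os.listdir(extra_dir)):
            composed = compose_image(food_path, os.path.join(bg_dir, bg_img), os.path.join(extra_dir, extra_img))
            consistent += int(predict(model, composed) == baseline_pred)
            total += 1

print(f"Prediction consistency under perturbation: {consistent}/{total} ({consistent/total*100:.1f}%)")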

Evaluate a model on slices of interest

Next, let’s evaluate our model on slices of interest. This will help us understand whether the model performs well on specific subsets of the data that matter to us, even when its overall accuracy is high.

For example, our model is reasonably accurate on “Dessert” samples. However, looking at some samples from this class:

# runs in jupyter container on node-eval-offline
dessert_index = np.where(classes == "Dessert")[0][0]
dessert_images = []

for images, labels in test_loader:
    for img, label in zip(images, labels):
        if label.item() == dessert_index:
            dessert_images.append(img)
        if len(dessert_images) == 20:
            break
    if len(dessert_images) == 20:
        break

fig, axes = plt.subplots(4, 5, figsize=(10, 10))
for ax, img in zip(axes.flat, dessert_images):
    img = img * torch.tensor([0.229, 0.224, 0.225]).view(3,1,1) + torch.tensor([0.485, 0.456, 0.406]).view(3,1,1)
    img = torch.clamp(img, 0, 1)
    ax.imshow(img.permute(1, 2, 0).numpy())
    ax.set_title("Dessert")
    ax.axis("off")

plt.tight_layout()
plt.show()

we can see that these are mostly Western-style desserts.

To evaluate whether our model is similarly effective at classifying food items from other cuisines, we might compile one or more test suites with non-Western food items. The images in the “indian_dessert” directory are also desserts -

# runs in jupyter container on node-eval-offline

dessert_dir = "indian_dessert"
dessert_images = random.sample(os.listdir(dessert_dir), 5)
fig, axes = plt.subplots(1, 5, figsize=(10, 3))

for ax, img_name in zip(axes, dessert_images):
    path = os.path.join(dessert_dir, img_name)
    image = Image.open(path).convert("RGB")
    pred = predict(model, image)
    ax.imshow(image.resize((224, 224)).crop((16, 16, 224, 224)))
    ax.set_title(f"Predicted: {pred}\n({classes[pred]})")
    ax.axis("off")

plt.tight_layout()
plt.show()

but our model has much less predictive accuracy on these dessert samples.
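
To put a number on this, we could compute accuracy over the entire slice - a short sketch, assuming (as described above) that every image in the indian_dessert directory should be classified as “Dessert”:

# runs in jupyter container on node-eval-offline
# sketch: accuracy on the "Indian desserts" slice, where the correct label is always "Dessert"
slice_correct, slice_total = 0, 0
for img_name in os.listdir(dessert_dir):
    image = Image.open(os.path.join(dessert_dir, img_name)).convert("RGB")
    slice_correct += int(predict(model, image) == dessert_index)
    slice_total += 1

print(f"Accuracy on Indian dessert slice: {slice_correct/slice_total*100:.2f}% ({slice_correct}/{slice_total})")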

Evaluate a model on known failure modes

We might also consider evaluating a model on “known” failure modes - if a model has previously failed on a specific type of image, especially in a high-profile way, we will want to set up a test so that future versions of the model will be evaluated against this type of failure.

For example: Suppose that there is a trend on social media of making cakes that look like other items. This has been a high-profile failure for GourmetGram in the past, when users upload e.g. a photo of a cake that looks like (for example…) a stick of butter, and it is tagged as “Dairy product” instead of “Dessert”.

We could compile a set of test images related to this failure mode, and set it up as a separate test.

All of the photos in the “cake_looks_like” directory are actually photos of cake:

# runs in jupyter container on node-eval-offline
cake_dir = "cake_looks_like"
cake_images = random.sample(os.listdir(cake_dir), 5)
fig, axes = plt.subplots(1, 5, figsize=(10, 3))

for ax, img_name in zip(axes, cake_images):
    path = os.path.join(cake_dir, img_name)
    image = Image.open(path).convert("RGB")
    pred = predict(model, image)
    ax.imshow(image.resize((224, 224)).crop((16, 16, 224, 224)))
    ax.set_title(f"{img_name}, Predicted: {pred}\n({classes[pred]})")
    ax.axis("off")

plt.tight_layout()
plt.show()

but, our model has a much lower accuracy on these samples than it did overall or on the general “Dessert” category.

Create a test suite

Finally, let’s create a non-interactive test suite that we can run each time we re-train a model. We’ll use pytest as a unit test framework. Our test scripts are in “tests”. Open these Python files to review them.

Inside our “conftest.py”, we have some functions that are “fixtures”. These provide some shared context for the tests: we have defined fixtures named model, test_data, predictions, and predict.

Then, we can pass model, test_data, and/or predictions to any of our test functions - these fixture functions will run only once and then provide their values to all test cases in a session. Similarly, we can pass predict to a test function and it will then be able to call our already-defined predict function.
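
For reference, here is a simplified sketch of what such fixtures might look like. The actual definitions are in tests/conftest.py in the repository; details below, such as the session scope and the models/food11.pth path, are assumptions for illustration, and the predict fixture would be defined along the same lines:

# simplified sketch of tests/conftest.py - see the actual file in the repository
import os
import pytest
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

@pytest.fixture(scope="session")
def model():
    # load the trained model once per test session
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    m = torch.load("models/food11.pth", map_location=device, weights_only=False)
    m.eval()
    return m

@pytest.fixture(scope="session")
def test_data():
    # held-out "evaluation" split of Food-11, with the same transforms used in this notebook
    food_11_data_dir = os.getenv("FOOD11_DATA_DIR", "Food-11")
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    dataset = datasets.ImageFolder(root=os.path.join(food_11_data_dir, "evaluation"), transform=transform)
    return DataLoader(dataset, batch_size=32, shuffle=False)

@pytest.fixture(scope="session")
def predictions(model, test_data):
    # model predictions and true labels over the whole test set, computed once per session
    preds, labels = [], []
    with torch.no_grad():
        for images, targets in test_data:
            preds.append(model(images).argmax(dim=1).cpu())
            labels.append(targets)
    return torch.cat(preds).numpy(), torch.cat(labels).numpy()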

Next, we have the test functions, which mirror what we have done in this notebook - but now, we have defined criteria for passing. A test function raises an exception if the criteria for passing are not met.
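
For example, a test of the overall accuracy criterion might look something like this sketch - the threshold is an illustrative assumption, and the real criteria are in the files under tests/:

# sketch of a test case that uses the predictions fixture (threshold is illustrative)
def test_overall_accuracy(predictions):
    preds, labels = predictions
    accuracy = (preds == labels).mean() * 100
    assert accuracy >= 85.0, f"Overall accuracy {accuracy:.2f}% is below the 85% threshold"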

Of course, in a realistic setting, we would have a more comprehensive series of tests (e.g. we would have many more templates; we would evaluate the model on other forms of non-Western cuisine and not only Indian desserts, etc.). This is just a demonstration of how these tests would be automated.

Once we have defined these tests, we can run our test suite with:

# runs in jupyter container on node-eval-offline
!pytest --verbose --tb=no tests/

where --verbose prints the result of each individual test case, and --tb=no suppresses the traceback output for failed tests, so that the report stays concise.

We have only scratched the surface with pytest. For example, we can re-run the tests that failed in the last run with --lf:

# runs in jupyter container on node-eval-offline
!pytest --verbose --lf --tb=no tests/

or we can run only the tests from a particular test file:

# runs in jupyter container on node-eval-offline
!pytest --verbose --tb=no tests/test_food11_test_cases.py

You can learn more about its capabilities in its documentation.


When you are finished with this section - save and then download the fully executed notebook from the Jupyter container environment for later reference. (Note: because it is an executable file, and you are downloading it from a site that is not secured with HTTPS, you may have to explicitly confirm the download in some browsers.)

Delete resources

When we are finished, we must delete the VM server instance to make the resources available to other users.

We will execute the cells in this notebook inside the Chameleon Jupyter environment.

Run the following cell, and make sure the correct project is selected.

from chi import server, context
import chi, os, time, datetime

context.version = "1.0" 
context.choose_project()
context.choose_site(default="KVM@TACC")
username = os.getenv('USER') # all exp resources will have this prefix
s = server.get_server(f"node-eval-offline-{username}")
s.delete()

Questions about this material? Contact Fraida Fund


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

