Build an MLOps Pipeline

In Cloud Computing on Chameleon, following the premise:

You are working as a machine learning engineer at a small startup company called GourmetGram. They are developing an online photo sharing community focused on food. You are testing a new model you have developed that automatically classifies photos of food into one of a set of categories: Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. You have built a simple web application with which to test your model and get feedback from others.

we deployed a basic machine learning service to an OpenStack cloud. However, that deployment involved a lot of manual steps (“ClickOps”), and any updates to it would similarly involve lots of manual effort, be difficult to track, etc.

In this tutorial, we will learn how to automate both the initial deployment and subsequent updates during the lifecycle of the application.

Our experiment will use the following automated deployment and lifecycle management tools: Terraform and Ansible, to provision and configure our infrastructure, and Argo CD and Argo Workflows (together with Helm), to manage the application and model lifecycle.

Note: we use Argo CD and Argo Workflows, which are tightly integrated with Kubernetes, because we are working in the context of a Kubernetes deployment. If our service were not deployed in Kubernetes (for example: if it were deployed using “plain” Docker containers), we would use other tools for managing the application and model lifecycle.

The expected hands-on duration of this experiment is 5-6 hours. However, there is an unattended installation step in the middle (Kubernetes setup) that you may need to leave running for 0.5-1 hours. You should plan accordingly, to e.g. leave that stage running while you do something else, then return to finish.

To run this experiment, you should have already created an account on Chameleon, and become part of a project. You should also have added your SSH key to the KVM@TACC site.

Experiment topology

In this experiment, we will deploy a 3-node Kubernetes cluster on Chameleon instances. The Kubernetes cluster will be self-managed, which means that the infrastructure provider is not responsible for setting up and maintaining our cluster; we are.

However, the cloud infrastructure provider will provide the compute resources and network resources that we need. We will provision the following resources for this experiment:

Experiment topology.

Provision a key

Before you begin, open this experiment on Trovi:

You will see several notebooks inside the mlops-chi directory - look for the one titled 00_intro.ipynb. Open this notebook and execute the following cell (and make sure the correct project is selected):

# runs in Chameleon Jupyter environment
from chi import server, context

context.version = "1.0" 
context.choose_project()
context.choose_site(default="KVM@TACC")
# runs in Chameleon Jupyter environment
server.update_keypair()

One more note: the rest of these materials assume that the required security groups are already set up within the project! If you are a student working on these materials as part of an assignment, your instructor will have done this already on behalf of the entire class (it only needs to be run once within a project!), so you can move on to the next step. Otherwise, use x_setup_sg.ipynb to make sure you have all the security groups you will need.

Then, you may continue following along at Build an MLOps Pipeline.

Prepare the environment

In keeping with good DevOps practices, we will deploy our infrastructure - starting with the Kubernetes cluster - using infrastructure-as-code (IaC) and configuration-as-code (CaC) principles.

We will use two IaC/CaC tools to prepare our Kubernetes cluster: Terraform, to provision the infrastructure, and Ansible, to configure it and install software on it - both of which are aligned with the principles above.

In this notebook, which will run in the Chameleon Jupyter environment, we will install and configure Terraform in our environment, and in a later notebook, we will install and configure Ansible in this same environment. This is a one-time step that an engineer would ordinarily do just once, when “onboarding”, on their own computer.

Note: This is a Bash notebook, so you will run it with a Bash kernel. You can change the kernel (if needed) by clicking the kernel name in the top right of the Jupyter interface.

Get infrastructure configuration

Following IaC principles, our infrastructure configuration is all in version control! We have organized all of the materials that “describe” the deployment in our “IaC repository”: https://github.com/teaching-on-testbeds/gourmetgram-iac.git.

This repository has the following structure:

├── tf
│   └── kvm
├── ansible
│   ├── general
│   ├── pre_k8s
│   ├── k8s
│   ├── post_k8s
│   └── argocd
├── k8s
│   ├── platform
│   ├── staging
│   ├── canary
│   └── production
└── workflows

In the next cell, we get a copy of the GourmetGram infrastructure repository:

# runs in Chameleon Jupyter environment
git clone --recurse-submodules https://github.com/teaching-on-testbeds/gourmetgram-iac.git /work/gourmetgram-iac

Note that we use the --recurse-submodules argument to git clone - we are including Kubespray, an Ansible-based project for deploying Kubernetes, inside our IaC repository as a submodule.

Among the automation and CI/CD tools mentioned above, Terraform and Ansible are the ones that an engineer runs from their own computer, rather than on the cluster itself. So, a necessary prerequisite for this workflow is to download, install, and configure Terraform and Ansible on “our own computer” - except in this case, we will use the Chameleon Jupyter environment as “our computer”.

Install and configure Terraform

Before we can use Terraform, we’ll need to download a Terraform client. The following cell will download the Terraform client and “install” it in this environment:

# runs in Chameleon Jupyter environment
mkdir -p /work/.local/bin
wget https://releases.hashicorp.com/terraform/1.14.4/terraform_1.14.4_linux_amd64.zip
unzip -o -q terraform_1.14.4_linux_amd64.zip
mv terraform /work/.local/bin
rm terraform_1.14.4_linux_amd64.zip

The Terraform client has been installed to: /work/.local/bin. In order to run terraform commands, we will have to add this directory to our PATH, which tells the system where to look for executable files.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH

Let’s make sure we can now run terraform commands. The following cell should print usage information for the terraform command, since we run it without any subcommands:

# runs in Chameleon Jupyter environment
terraform

Terraform works by communicating with a cloud provider (either a commercial cloud, like AWS or GCP, or a private cloud, like an on-premises OpenStack cloud, or a hybrid cloud with both types of resources). We will need to prepare credentials with which it can act on our behalf on the Chameleon OpenStack cloud. This is a one-time procedure.

To get credentials, open the Horizon GUI:

On the left side, expand the “Identity” section and click on “Application Credentials”. Then, click “Create Application Credential”. Give the credential a name, create it, and download the resulting clouds.yaml file when prompted.

The clouds.yaml file will look something like this (except with an alphanumeric string in place of REDACTED_UNIQUE_ID and REDACTED_SECRET):

clouds:
  openstack:
    auth:
      auth_url: https://kvm.tacc.chameleoncloud.org:5000
      application_credential_id: "REDACTED_UNIQUE_ID"
      application_credential_secret: "REDACTED_SECRET"
    region_name: "KVM@TACC"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"

It lists one or more clouds - in this case, a single cloud named “openstack”, and then for each cloud, specifies how to connect and authenticate to that cloud. In particular, the application_credential_id and application_credential_secret allow an application like Terraform to interact with the Chameleon cloud on your behalf, without having to use your personal Chameleon login.

Then, in our Terraform configuration, we will have a block like

provider "openstack" {
  cloud = "openstack"
}

where the value assigned to cloud tells Terraform which cloud in the clouds.yaml file to authenticate to.

One nice feature of Terraform is that we can use it to provision resources on multiple clouds. For example, if we wanted to provision resources on both KVM@TACC and CHI@UC (e.g. the training resources on CHI@UC and everything else on KVM@TACC), we might generate application credentials on both sites, and combine them into a clouds.yaml like this:

clouds:
  kvm:
    auth:
      auth_url: https://kvm.tacc.chameleoncloud.org:5000
      application_credential_id: "REDACTED_UNIQUE_ID_KVM"
      application_credential_secret: "REDACTED_SECRET_KVM"
    region_name: "KVM@TACC"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"
  uc:
    auth:
      auth_url: https://chi.uc.chameleoncloud.org:5000
      application_credential_id: "REDACTED_UNIQUE_ID_UC"
      application_credential_secret: "REDACTED_SECRET_UC"
    region_name: "CHI@UC"
    interface: "public"
    identity_api_version: 3
    auth_type: "v3applicationcredential"

and then in our Terraform configuration, we could specify multiple OpenStack clouds to use, e.g. have

provider "openstack" {
  alias = "kvm"
  cloud = "kvm"
}

provider "openstack" {
  alias = "uc"
  cloud = "uc"
}

and in resource definitions, either

provider = openstack.kvm

or

provider = openstack.uc

to indicate which cloud to provision on.

For now, since we are just using one cloud, we will leave our clouds.yaml as is.

In the file browser in the Chameleon Jupyter environment, you will see a template clouds.yaml. Use the file browser to open it, and paste in the

      application_credential_id: "REDACTED_UNIQUE_ID"
      application_credential_secret: "REDACTED_SECRET"

lines from the clouds.yaml that you just downloaded from the KVM@TACC GUI (so that it has the “real” credentials in it). Save the file.

Terraform will look for the clouds.yaml in either ~/.config/openstack or the directory from which we run terraform - we will copy it to the latter directory:

# runs in Chameleon Jupyter environment
cp clouds.yaml /work/gourmetgram-iac/tf/kvm/clouds.yaml
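
Optionally, you can sanity-check the credential before Terraform uses it. The cell below is an extra check, not a required step: it uses the openstack CLI (assuming it is available in the Chameleon Jupyter environment, as it typically is) to request a token using only the clouds.yaml in the current directory. If the credential is valid, a token is printed; if not, you will see an authentication error.

# runs in Chameleon Jupyter environment (optional check)
# un-set session credentials so that only clouds.yaml is used
# (we will do this again before running Terraform)
unset $(set | grep -o "^OS_[A-Za-z0-9_]*")
cd /work/gourmetgram-iac/tf/kvm
openstack --os-cloud openstack token issue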

The Terraform executable has been installed to a location that is not the system-wide location for executable files: /work/.local/bin. In order to run terraform commands, we will have to add this directory to our PATH, which tells the system where to look for executable files.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH

and, we’ll have to do that in each new Bash session.

Provision infrastructure with Terraform

Now that everything is set up, we are ready to provision our VM resources with Terraform! We will use Terraform to provision 3 VM instances and associated network resources on the OpenStack cloud.

Using Terraform to provision resources.

Create a server lease

While Terraform is able to provision most kinds of resources, it cannot create or manage a reservation. The reservation feature of OpenStack is not used very widely (outside of Chameleon), and the Terraform provider for OpenStack does not yet support it. We will separately create a lease for three server instances outside of Terraform.

Authentication

In the cell below, replace CHI-XXXXXX with the name of your Chameleon project, then run the cell.

# runs in Chameleon Jupyter environment
export OS_AUTH_URL=https://kvm.tacc.chameleoncloud.org:5000/v3
export OS_PROJECT_NAME="CHI-XXXXXX"
export OS_REGION_NAME="KVM@TACC"

and in BOTH cells below, replace netID with your own net ID, then run to request a lease:

# runs in Chameleon Jupyter environment
# replace netID in this line
openstack reservation lease create lease_mlops_netID \
  --start-date "$(date -u -d '+10 seconds' '+%Y-%m-%d %H:%M')" \
  --end-date "$(date -u -d '+12 hours' '+%Y-%m-%d %H:%M')" \
  --reservation "resource_type=flavor:instance,flavor_id=$(openstack flavor show m1.large -f value -c id),amount=3"

and print the UUID of the reserved “flavor” (again, replace netID with your own):

# runs in Chameleon Jupyter environment
# also replace netID in this line
flavor_id=$(openstack reservation lease show lease_mlops_netID -f json -c reservations \
      | jq -r '.reservations[0].flavor_id')
echo $flavor_id

Make a note of this ID - you will need it later, to provision resources.
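
You can also confirm that the lease has become active before moving on - the status should be ACTIVE (it may take a moment after the start time):

# runs in Chameleon Jupyter environment
# also replace netID in this line
openstack reservation lease show lease_mlops_netID -f value -c status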

Preliminaries

Let’s navigate to the directory with the Terraform configuration for our KVM deployment:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/tf/kvm

and make sure we’ll be able to run the terraform executable by adding the directory in which it is located to our PATH:

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH

We also need to un-set some OpenStack-related environment variables that are set automatically in the Chameleon Jupyter environment, since these will override some Terraform settings that we don’t want to override:

# runs in Chameleon Jupyter environment
unset $(set | grep -o "^OS_[A-Za-z0-9_]*")

We should also check that our clouds.yaml is set up:

# runs in Chameleon Jupyter environment
cat  clouds.yaml

Understanding our Terraform configuration

The tf/kvm directory in our IaC repository includes the following files, which we’ll briefly discuss now.

├── data.tf
├── main.tf
├── outputs.tf
├── provider.tf
├── variables.tf
└── versions.tf

A Terraform configuration defines infrastructure elements using stanzas, which include different components such as providers, resources, data sources, variables, and outputs.

We’ll focus especially on data sources, resources, outputs, and variables. Here’s an example of a Terraform configuration that includes all four:

resource "openstack_compute_instance_v2" "my_vm" {
  name            = "${var.instance_hostname}"
  flavor_name     = "m1.small"
  image_id        = data.openstack_images_image_v2.ubuntu.id
  key_pair        = "my-keypair"
  network {
    name = "private-network"
  }
}

data "openstack_images_image_v2" "ubuntu" {
  name = "CC-Ubuntu24.04"
}

variable "instance_hostname" {
  description = "Hostname to use for the instance"
  type        = string
  default     = "example-vm"
}

output "instance_ip" {
  value = openstack_compute_instance_v2.my_vm.access_ip_v4
}

Each item is in a stanza which has a block type, an identifier, and a body enclosed in curly braces {}. For example, the resource stanza for the OpenStack instance above has the block type resource, the resource type openstack_compute_instance_v2, and the name my_vm. (This name can be anything you want - it is used to refer to the resource elsewhere in the configuration.) Inside the body, we would specify attributes such as flavor_name, image_id, and network (you can see a complete list in the documentation).

In this example, the data source looks up the ID of an existing image (CC-Ubuntu24.04), the variable supplies the instance hostname, the resource describes the VM instance to create, and the output reports its IP address after deployment.

You may notice the use of for_each in main.tf. This is used to iterate over a collection, such as a map variable, to create multiple instances of a resource. Since for_each assigns unique keys to each element, that makes it easier to reference specific resources. For example, we provision a port on sharednet1 for each instance, but when we assign a floating IP, we can specifically refer to the port for “node1” with openstack_networking_port_v2.sharednet1_ports["node1"].id.

Terraform also supports outputs, which provide information about the infrastructure after deployment. For example, if we want to print a dynamically assigned floating IP after the infrastructure is deployed, we might put it in an output. This will save us from having to look it up in the Horizon GUI. You can see in outputs.tf that we do exactly this.
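
Once the configuration has been applied (later in this notebook), you can read these outputs back at any time with the terraform output command, without re-running an apply. For example - the specific output name below is just illustrative, and should match whatever is defined in outputs.tf:

# runs in Chameleon Jupyter environment (after "terraform apply")
cd /work/gourmetgram-iac/tf/kvm
terraform output              # print all outputs
terraform output floating_ip  # print a single output by name (illustrative name)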

Terraform is declarative, not imperative, so we don’t need to write the exact steps needed to provision this infrastructure - Terraform will examine our configuration and figure out a plan to realize it.

Applying our Terraform configuration

First, we need Terraform to set up our working directory, make sure it has “provider” plugins to interact with our infrastructure provider (it will read in provider.tf to check), and set up storage for keeping track of the infrastructure state:

# runs in Chameleon Jupyter environment
terraform init

We need to set some variables. In our Terraform configuration, we define a variable named suffix that we will substitute with our own net ID, and then we use that variable inside the hostnames of instances and the names of networks and other resources in main.tf, e.g. we name our subnet private-subnet-mlops-${var.suffix}. We’ll also use a variable to specify a key pair to install.

In the following cell, replace netID with your actual net ID, replace id_rsa_chameleon with the name of your personal key that you use to access Chameleon resources, and replace the all-zero ID with the reservation ID you printed above.

# runs in Chameleon Jupyter environment
export TF_VAR_suffix=netID
export TF_VAR_key=id_rsa_chameleon
export TF_VAR_reservation=00000000-0000-0000-0000-000000000000

You’ll use Terraform again at the end of this experiment to delete your Terraform-managed resources, and you’ll need these variables again then. Open the last notebook in the series and copy/paste the values in the cell above into the equivalent cell there.

We should confirm that our planned configuration is valid:

# runs in Chameleon Jupyter environment
terraform validate

Then, let’s preview the changes that Terraform will make to our infrastructure. In this stage, Terraform communicates with the cloud infrastructure provider to see what we have already deployed, and to determine what it needs to do to realize the requested configuration:

# runs in Chameleon Jupyter environment
terraform plan

Notice that at this stage, Terraform has e.g. read in the IDs of the security groups we defined in data blocks. If we asked it to read in a security group named “allow-XXX” and there was no such security group in the project, it would warn us in the plan output.

Finally, we will apply those changes. (We need to add an -auto-approve argument because ordinarily, Terraform prompts the user to type “yes” to approve the changes it will make.)

# runs in Chameleon Jupyter environment
terraform apply -auto-approve

Make a note of the floating IP assigned to your instance, from the Terraform output.

From the KVM@TACC Horizon GUI, check the list of compute instances and find yours. Take a screenshot for later reference.

Changing our infrastructure

One especially nice thing about Terraform is that if we change our infrastructure definition, it can apply those changes without having to re-provision everything from scratch.

For example, suppose the physical node on which our “node3” VM is running becomes non-functional. To replace our “node3”, we can simply run

# runs in Chameleon Jupyter environment
terraform apply -replace='openstack_compute_instance_v2.nodes["node3"]' -auto-approve

Similarly, we could make changes to the infrastructure description in the main.tf file and then use terraform apply to update our cloud infrastructure. Terraform would determine which resources can be updated in place, which should be destroyed and recreated, and which should be left alone.

This declarative approach - where we define the desired end state and let the tool get there - is much more robust than imperative-style tools for deploying infrastructure (openstack CLI, python-chi Python API) (and certainly more robust than ClickOps!).

Install and configure Ansible

Next, we’ll set up Ansible! We will need to get the Ansible client, which we install in the following cell:

# runs in Chameleon Jupyter environment
PYTHONUSERBASE=/work/.local pip install --user ansible-core==2.16.9 ansible==9.8.0

The Ansible client has been installed to: /work/.local/bin. In order to run ansible-playbook commands, we will have to add this directory to our PATH, which tells the system where to look for executable files. We also need to let it know where to find the corresponding Python packages.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local

Let’s make sure we can now run ansible-playbook commands. The following cell should print usage information for the ansible-playbook command, since we run it with --help:

# runs in Chameleon Jupyter environment
ansible-playbook --help

Now, we’ll configure Ansible. The ansible.cfg configuration file modifies the default behavior of the Ansible commands we’re going to run. Open this file using the file browser on the left side.

Our configuration will include:

[defaults]
stdout_callback = yaml
inventory = /work/gourmetgram-iac/ansible/inventory.yaml

The first line is just a matter of preference, and directs the Ansible client to display output from commands in a more structured, readable way. The second line specifies the location of a default inventory file - the list of hosts that Ansible will configure.

It will also include:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s \
           -o StrictHostKeyChecking=off -o UserKnownHostsFile=/dev/null \
           -o ForwardAgent=yes \
           -o ProxyCommand="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -W %h:%p cc@A.B.C.D"
pipelining = True

which says that when Ansible uses SSH to connect to the resources it is managing, it should “jump” through A.B.C.D and forward the keys from this environment, through A.B.C.D, to the final destination. (Also, we disable host key checking when using SSH, and configure it to minimize the number of SSH sessions and the number of network operations wherever possible.)
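
To see what this does in practice, you could reproduce one such connection by hand. The command below is just an illustration of the “jump” pattern, mirroring the ProxyCommand in the configuration above: it connects to node2 on the private network (assuming its address matches the inventory, 192.168.1.12) by jumping through the head node at A.B.C.D (substitute your floating IP), and assumes your Chameleon key is available in this environment (e.g. loaded in an SSH agent, or passed with -i):

# runs in Chameleon Jupyter environment (optional check)
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -o ProxyCommand="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -W %h:%p cc@A.B.C.D" \
    cc@192.168.1.12 hostname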

Now that you have provisioned resources, edit A.B.C.D and replace it with the floating IP assigned to your experiment. Then, save the updated config file.

Ansible will look for its configuration in either ~/.ansible.cfg or the directory that we run Ansible commands from; we will use the latter:

# runs in Chameleon Jupyter environment
cp ansible.cfg /work/gourmetgram-iac/ansible/ansible.cfg

Configure the PATH

Both Terraform and Ansible executables have been installed to a location that is not the system-wide location for executable files: /work/.local/bin. In order to run terraform or ansible-playbook commands, we will have to add this directory to our PATH, which tells the system where to look for executable files.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local

and, we’ll have to do that in each new Bash session.

Prepare Kubespray

To install Kubernetes, we’ll use Kubespray, which is a set of Ansible playbooks for deploying Kubernetes. We’ll also make sure we have its dependencies now:

# runs in Chameleon Jupyter environment
PYTHONUSERBASE=/work/.local pip install --user -r /work/gourmetgram-iac/ansible/k8s/kubespray/requirements.txt

Practice using Ansible

Now that we have provisioned some infrastructure, we can configure and install software on it using Ansible!

Ansible is a tool for configuring systems by accessing them over SSH and running commands on them. The commands to run will be defined in advance in a series of playbooks, so that instead of using SSH directly and then running commands ourselves interactively, we can just execute a playbook to set up our systems.

First, let’s just practice using Ansible.

Preliminaries

As before, let’s make sure we’ll be able to use the Ansible executables. We need to put the install directory in the PATH inside each new Bash session.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local

If you haven’t already, make sure to put your floating IP (which you can see in the output of the Terraform command!) in the ansible.cfg configuration file, and move it to the specified location.

The following cell will show the contents of this file, so you can double check - make sure your real floating IP is visible in this output!

# runs in Chameleon Jupyter environment
cat /work/gourmetgram-iac/ansible/ansible.cfg

Finally, we’ll cd to that directory -

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible

Verify connectivity

First, we’ll run a simple task to check connectivity with all hosts listed in the inventory.yml file:

all:
  vars:
    ansible_python_interpreter: /usr/bin/python3
  hosts:
    node1:
      ansible_host: 192.168.1.11
      ansible_user: cc
    node2:
      ansible_host: 192.168.1.12
      ansible_user: cc
    node3:
      ansible_host: 192.168.1.13
      ansible_user: cc

It uses the ping module, which checks if Ansible can connect to each host via SSH and run Python code there.

# runs in Chameleon Jupyter environment
ansible -i inventory.yml all -m ping
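
Ad-hoc commands like this are not limited to the ping module. For example, you can run an arbitrary command on every host in the inventory with the command module:

# runs in Chameleon Jupyter environment
ansible -i inventory.yml all -m command -a "uptime"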

Run a “Hello, World” playbook

Once we have verified connectivity to the nodes in our “inventory”, we can run a playbook, which is a sequence of tasks organized in plays, and defined in a YAML file. Here we will run the following playbook with one “Hello world” play:

---
- name: Hello, world - use Ansible to run a command on each host
  hosts: all
  gather_facts: no

  tasks:
    - name: Run hostname command
      command: hostname
      register: hostname_output

    - name: Show hostname output
      debug:
        msg: "The hostname of {{ inventory_hostname }} is {{ hostname_output.stdout }}"

The playbook connects to all hosts listed in the inventory, and performs two tasks: first, it runs the hostname command on each host and saves the result in hostname_output, then it prints a message showing the value of hostname_output (using the debug module).

# runs in Chameleon Jupyter environment
ansible-playbook -i inventory.yml general/hello_host.yml
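
If a playbook does not behave as expected, it can help to re-run it against a single host with more verbose output, e.g.:

# runs in Chameleon Jupyter environment
ansible-playbook -i inventory.yml general/hello_host.yml --limit node1 -v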

Deploy Kubernetes using Ansible

Now that we understand a little bit about how Ansible works, we will use it to deploy Kubernetes on our three-node cluster!

We will use Kubespray, an Ansible-based tool, to automate this deployment.

Using Ansible for software installation and system configuration.

Preliminaries

As before, let’s make sure we’ll be able to use the Ansible executables. We need to put the install directory in the PATH inside each new Bash session.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local

Run a preliminary playbook

Before we set up Kubernetes, we will run a preliminary playbook to prepare the hosts for the Kubernetes installation:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml pre_k8s/pre_k8s_configure.yml

Run the Kubespray play

Then, we can run the Kubespray playbook, from inside the kubespray directory (under ansible/k8s):

The following cell will run for a long time - potentially up to an hour! - and install Kubernetes on the three-node cluster.

When it is finished the “PLAY RECAP” should indicate that none of the tasks failed.

# runs in Chameleon Jupyter environment
export ANSIBLE_CONFIG=/work/gourmetgram-iac/ansible/ansible.cfg
export ANSIBLE_ROLES_PATH=roles
# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible/k8s/kubespray
ansible-playbook -i ../inventory/mycluster --become --become-user=root ./cluster.yml

Run a post-install playbook

After our Kubernetes install is complete, we run some additional tasks to further configure and customize our Kubernetes deployment. Among other things, our post-install playbook will deploy the Kubernetes dashboard and set up Argo CD, Argo Workflows, and Argo Events on the cluster:

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local

In the output below, make a note of the Kubernetes dashboard token and the Argo admin password, both of which we will need in the next steps.

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml post_k8s/post_k8s_configure.yml
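
As a quick check that the cluster is healthy, you can SSH to the head node and list the Kubernetes nodes; all three should report a Ready status. This assumes the playbooks above have configured kubectl for the cc user on node1 (substitute your own floating IP and key name):

# runs in your **local** terminal
ssh -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D kubectl get nodes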

Access the Kubernetes dashboard

To check on our Kubernetes deployment, let’s keep an eye on the dashboard.

First, since we did not configure security group rules to permit any ports besides SSH, we need to use SSH port forwarding to open a tunnel between our local device and the remote cluster. Then, since the service is configured only for internal access within the cluster, we need to use port forwarding to also make it available on the host.

Run the command below in your local terminal (not the terminal in the Chameleon Jupyter environment!), substituting the floating IP assigned to your experiment in place of A.B.C.D, and the path to your own Chameleon key in place of ~/.ssh/id_rsa_chameleon:

# runs in your **local** terminal
ssh -L 8443:127.0.0.1:8443 -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

then, inside that terminal, run

# runs on node1 
kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443

and leave it running.

Now, in a browser, you may open

https://127.0.0.1:8443/

You will see a warning about an invalid certificate; choose the “Advanced” option to override the warning and proceed. Then, you will be prompted to log in.

From the output of the post-install playbook above, find the “Dashboard token” and paste it into the token space, then log in. You will see the Kubernetes dashboard.

(Note: if your token expires, you can generate a new one with kubectl -n kube-system create token admin-user.)

For now, there is not much of interest in the dashboard. You can see some Kubernetes system services in the “kube-system” namespace, and Argo-related services in the “argo”, “argocd”, and “argo-events” namespaces. We have not yet deployed our GourmetGram services, but we’ll do that in the next step!

Access the ArgoCD dashboard

Similarly, we may access the Argo CD dashboard. In the following command, substitute the floating IP assigned to your experiment in place of A.B.C.D (and the path to your own Chameleon key, if it is different):

# runs in your **local** terminal
ssh -L 8888:127.0.0.1:8888 -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

then, inside that terminal, run

# runs on node1 
kubectl port-forward svc/argocd-server -n argocd 8888:443

and leave it running.

Now, in a browser, you may open

https://127.0.0.1:8888/

You will see a warning about an invalid certificate; choose the “Advanced” option to override the warning and proceed. Then, you will be prompted to log in.

From the output of the post-install playbook above, find the “ArgoCD Password” and paste it into the password space, use admin for the username, then log in.

For now, there is not much of interest in Argo CD. We have not yet configured Argo CD for any deployments, but we’ll do that in the next step!

Access the Argo Workflows dashboard

Finally, we may access the Argo Workflows dashboard. In the following command, substitute the floating IP assigned to your experiment in place of A.B.C.D (and the path to your own Chameleon key, if it is different):

# runs in your **local** terminal
ssh -L 2746:127.0.0.1:2746 -i ~/.ssh/id_rsa_chameleon cc@A.B.C.D

then, inside that terminal, run

# runs on node1 
kubectl -n argo port-forward svc/argo-server 2746:2746

and leave it running.

Now, in a browser, you may open

https://127.0.0.1:2746/

You will see a warning about an invalid certificate; choose the “Advanced” option to override the warning and proceed. Then, you will be able to see the Argo Workflows dashboard.

Again, there is not much of interest - but there will be, soon.

Use ArgoCD to manage applications on the Kubernetes cluster

With our Kubernetes cluster up and running, we are ready to deploy applications on it!

We are going to use ArgoCD to manage applications on our cluster. ArgoCD monitors “applications” that are defined as Kubernetes manifests in Git repositories. When the application manifest changes (for example, if we increase the number of replicas, change a container image to a different version, or give a pod more memory), ArgoCD will automatically apply these changes to our deployment.

ArgoCD itself will manage the application lifecycle once started. But to set up our applications in ArgoCD in the first place, we are going to use Ansible as a configuration tool. So, in this notebook we run a series of Ansible playbooks to set up ArgoCD applications.

Using ArgoCD for apps and services.

# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local
export ANSIBLE_CONFIG=/work/gourmetgram-iac/ansible/ansible.cfg
export ANSIBLE_ROLES_PATH=roles

First, we will deploy our GourmetGram “platform”. This has all the “accessory” services we need to support our machine learning application.

In our example, it includes an MLFlow model registry, along with the MinIO object store and Postgres database that back it.

More generally, “platform” may include other shared services used by multiple teams, including experiment tracking, evaluation and monitoring, and similar related services.

There are a couple of “complications” we need to manage as part of this deployment:

Dynamic environment-specific customization: as in Cloud Computing on Chameleon, we want to specify the externalIPs on which our ClusterIP services should be available. However, we only know the IP address of the “head” node on the Internet-facing network after the infrastructure is deployed.

Furthermore, Argo CD gets our service definitions from a Git repository, and we don’t want to modify the externalIPs in GitHub each time we deploy our services.

To address this, we deploy our services using Helm, a tool that automates the creation, packaging, configuration, and deployment of Kubernetes applications. With Helm, we can include something like this in our Kubernetes manifest/Helm chart:

  externalIPs:
    - {{ .Values.minio.externalIP }}

and then when we add the application to ArgoCD, we pass the value that should be filled in there:

        --helm-set-string minio.externalIP={{ external_ip }}

where Ansible finds out the value of external_ip for us in a separate task:

    - name: Detect external IP starting with 10.56
      set_fact:
        external_ip: ""

This general pattern - using Ansible to discover an environment-specific value at deploy time, and passing it into the Kubernetes manifests via ArgoCD and Helm - can be applied to a wide variety of environment-specific configurations. It can also be used for anything that shouldn’t be included in a Git repository. For example: if your deployment needs a secret application credential, you can store it in a separate .env file that is available to your Ansible client (not in a Git repository), get Ansible to read it into a variable, and then use ArgoCD + Helm to substitute that secret where needed in your Kubernetes application definition.

Deployment with secrets: our deployment includes some services that require authentication, e.g. the object store (MinIO) and database (Postgres) used by the model registry. We don’t want to include passwords or other secrets in our Git repository, either! To address this, we will have Ansible generate a secret password and register it with Kubernetes, e.g.:

- name: Generate MinIO secret key
  when: minio_secret_check.rc != 0
  set_fact:
    minio_secret_key: ""

- name: Create MinIO credentials secret
  when: minio_secret_check.rc != 0
  command: >
    kubectl create secret generic minio-credentials
    --namespace gourmetgram-platform
    --from-literal=accesskey=
    --from-literal=secretkey=
  register: minio_secret_create

and then in our Kubernetes manifests, we can use this secret without explicitly specifying its value, e.g.:

env:
  - name: MINIO_ROOT_USER
    valueFrom:
      secretKeyRef:
        name: minio-credentials
        key: accesskey
  - name: MINIO_ROOT_PASSWORD
    valueFrom:
      secretKeyRef:
        name: minio-credentials
        key: secretkey

This general pattern can similarly be applied more broadly to any applications and services that require a secret.

Let’s add the gourmetgram-platform application now.

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/argocd_add_platform.yml
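
When the playbook finishes, you can optionally confirm that the generated MinIO credentials secret exists in the cluster, without revealing its value. From a terminal on node1 (for example, the SSH session you used earlier for port forwarding):

# runs on node1
kubectl get secret minio-credentials -n gourmetgram-platform
kubectl describe secret minio-credentials -n gourmetgram-platform  # lists the keys, but not their values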

Once the platform is deployed, we can open the MLFlow model registry on http://A.B.C.D:8000 (substitute your own floating IP), and click on the “Models” tab.

We haven’t “trained” any models yet, but when we do, they will appear here.

Next, we need to deploy the GourmetGram application.

During regular operation, CI will build the container image for the application. But to bootstrap the deployment, we will build it ourselves. We will run a one-time workflow in Argo Workflows to build the initial container images for the “staging”, “canary”, and “production” environments:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/workflow_build_init.yml

Look at the workflow YAML in the workflows directory of the IaC repository, which defines each step of the container image build job.

Follow along in the Argo Workflows dashboard as it runs - you can see each stage as a node in a DAG, and you can click on a node to see its logs.

Also build the training container image, which will be used as part of the pipeline when we run a training job later:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/workflow_build_training_init.yml

Now that we have a container image, we can deploy our application to three environments - staging, canary, and production:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/argocd_add_staging.yml
# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/argocd_add_canary.yml
# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/argocd_add_prod.yml

At this point, you can also revisit the dashboards you opened earlier - for example, in the Argo CD dashboard, you should now see the applications you have just added.

Take a screenshot of the ArgoCD dashboard for your reference.

Test your staging, canary, and production deployments - in the Kubernetes Service definition, we have put them on different ports. For now, they are all running exactly the same model!

Now, Argo CD is constantly comparing the state of our application according to the Kubernetes manifests/Helm charts in GitHub, vs. the actual state on the cluster, and trying to reconcile them.

In the next section, we will manage our application lifecycle with Argo Workflows. To help with that, we’ll apply some workflow templates from Ansible, so that they are ready to go in the Argo Workflows UI:

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/ansible
ansible-playbook -i inventory.yml argocd/workflow_templates_apply.yml

Argo will manage the lifecycle from here on out: Argo CD for CD:

Using ArgoCD for apps and services.

and Argo Workflows for CI.

Model and application lifecycle - Part 1

With all of the pieces in place, we are ready to follow a GourmetGram model through its lifecycle!

We will start with the first stage, in which a model is trained and tested, and - if it passes its tests - registered and packaged into a new container image.

Part 1 of the ML model lifecycle: from training to new container image.

The training procedure

When triggered, model training runs as a Kubernetes pod managed by Argo Workflows. The workflow first checks if the training code has changed (by comparing git commits) or if there is not already a training container image, and builds the training container image if needed.

The training script (flow.py) inside the training container performs several steps:

  1. Emulate training: For this demo, it loads a pre-trained model checkpoint as a fake “training” step
  2. Run pytest tests: Evaluates the model using automated tests. The script uses pytest to evaluate the model. Tests are organized in a tests/ directory, and pytest runs them, capturing the output which is logged to MLflow as an artifact alongside the model.
  3. Register model: If tests pass, registers the model in the MLFlow model registry, and records the version number
  4. Handle failures: If tests fail, prints detailed output to logs and exits with an error code

Note that our “test suite” has tests organized into two files in the tests/ directory.

Run a training job

We have already set up an Argo workflow template to run the training job. If you have the Argo Workflows dashboard open, you can see it by opening the “Workflow Templates” tab and clicking on train-model.

We will use this as an example to understand how an Argo Workflow template is developed. An Argo Workflow is defined as a sequence of steps in a graph.

At the top, we have some basic metadata about the workflow:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: train-model

then, the name of the first “node” in the graph (training-and-build in this example). This workflow accepts a branch parameter to specify which branch of the training repository to use (defaults to mlops):

spec:
  entrypoint: training-and-build
  arguments:
    parameters:
    - name: branch
      value: mlops

Now, we have a sequence of steps that run in order:

  templates:
  - name: training-and-build
    steps:
      # Step 1: Clone repo and check if code has changed
      - - name: check-code-changes
          template: check-and-clone
          arguments:
            parameters:
            - name: branch
              value: ""
      # Step 2: Conditionally rebuild training image if code changed
      - - name: rebuild-training-image
          template: buildkit-rootless
          arguments:
            parameters:
            - name: git-commit
              value: ""
          when: "'' == 'true'"
      # Step 3: Run training
      - - name: run-training
          template: run-training
      # Step 4: Tag model as development in MLflow (latest trained)
      - - name: tag-model-development
          template: set-development-alias
          arguments:
            parameters:
            - name: model-version
              value: ""
          when: "'' != ''"
      # Step 5: Trigger app container build if model was registered
      - - name: build-container
          template: trigger-build
          arguments:
            parameters:
            - name: model-version
              value: ""
          when: "'' != ''"

The workflow has five steps:

  1. check-code-changes: Clones the training repo and checks if the code has changed by comparing git commit hashes with a label on the existing image in the registry. If using a non-default branch, it always rebuilds.

  2. rebuild-training-image: Conditionally rebuilds the training container image if code changed (or if using a non-default branch). This step is skipped if the image is up-to-date.

  3. run-training: Runs the training script in the (possibly just-built) training container.

  4. tag-model-development: Updates the MLflow registered-model alias development to point at the newly-registered model version.

  5. build-container: Triggers the app container build workflow, but only if a model version was successfully registered (i.e., tests passed).

This design ensures the training image is always current without requiring a separate manual build step.

We can look more closely at the run-training step, which runs the training container as a Kubernetes pod:

  - name: run-training
    outputs:
      parameters:
      - name: modelversion
        valueFrom:
          path: /var/run/argo/outputs/parameters/modelversion
    container:
      image: registry.kube-system.svc.cluster.local:5000/gourmetgram-train:latest
      command: [sh, -c]
      args:
        - |
          set -eu
          python flow.py

          # Argo requires output parameters to be written
          # under /var/run/argo/outputs/parameters/
          mkdir -p /var/run/argo/outputs/parameters
          if [ -f /tmp/model_version ]; then
            cp /tmp/model_version /var/run/argo/outputs/parameters/modelversion
          else
            # Avoid hard failure if training didn't emit a version.
            : > /var/run/argo/outputs/parameters/modelversion
          fi
      env:
        - name: MLFLOW_TRACKING_URI
          value: "http://mlflow.gourmetgram-platform.svc.cluster.local:8000"

This step runs flow.py inside the training container (using the gourmetgram-train image from the cluster-local registry), and exposes the registered model version as an Argo output parameter.

If tests pass and a model is successfully registered, the training script writes the model version to /tmp/model_version. Otherwise, it writes an empty file. Either way, subsequent steps can access it.

Note that if pytest tests fail (for example, if the random “accuracy” test returns a value below the threshold), the run-training step will fail with a non-zero exit code. When this happens, the subsequent steps - tagging the model and triggering the container build - do not run.

The pipeline gates model registration and deployment to “staging” on passing tests.

Finally, we can see the trigger-build part:

  - name: trigger-build
    inputs:
      parameters:
      - name: model-version
    resource:
      action: create
      manifest: |
        apiVersion: argoproj.io/v1alpha1
        kind: Workflow
        metadata:
          generateName: build-container-image-
        spec:
          workflowTemplateRef:
            name: build-container-image
          arguments:
            parameters:
            - name: model-version
              value: ""

This template uses a resource with action: create to trigger a new workflow - our “build-container-image” workflow! (You’ll see that one shortly.)

Note that we pass along the model-version parameter from the training step to the container build step, so that the container build step knows which model version to use.

Now, we can submit this workflow! In Argo, open the “Workflow Templates” tab, click on train-model, and click “Submit” (keeping the default branch parameter).

This will start the training workflow.

In Argo, you can watch the workflow progress in real time.

You can click on any step to see its logs, inputs, outputs, etc. For example, click on the “run-training” node to see the training logs. You should see pytest output showing which tests passed or failed.

Wait for it to finish. (It may take 10-15 minutes for the entire pipeline to complete, including the container build.)

If your training run fails because its “accuracy” is not good enough, resubmit until you have a passing run! (If it passes, you can also try a few more runs to get it to fail, so you can see what happens.)

Check the model registry

After training completes successfully (and tests pass), you should see a new model version registered in MLflow. Open the MLFlow UI at http://A.B.C.D:8000 (substituting your floating IP address).

Take a screenshot for your reference.

Next: Container build

When training completes successfully, the workflow automatically triggers the process to build a new container image for the GourmetGram application, with the updated model baked in. In the next section, we’ll examine how that container build workflow:

  1. Clones the application repository
  2. Downloads the model from MLflow model registry
  3. Builds a new container image with the updated model
  4. Deploys to the staging environment

This completes Part 1 of the model lifecycle!

Model and application lifecycle - Part 2

Once we have a container image, the progression through the model/application lifecycle continues as the new version is promoted through different environments:

Part 2 of the ML model lifecycle: from staging to production.

Verify that the new model is deployed to staging

Our build-container-image workflow automatically triggers two workflows if successful:

  1. deploy-container-image: Updates the staging deployment via ArgoCD
  2. test-staging: Runs automated tests against the staging deployment

In Argo Workflows, you should see both of these workflows appear and run after the container build completes.

Then, open the staging service: this version of the gourmetgram app has a version endpoint, so you can visit http://A.B.C.D:8082/version (substituting your own floating IP), and you should see the model version you just promoted to staging.
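
You can check the same thing from the command line. For example, with your own floating IP in place of A.B.C.D (at this point, staging should report the new version, while canary and production still report the previous one):

# runs in your local terminal, or the Chameleon Jupyter environment
curl http://A.B.C.D:8082/version   # staging
curl http://A.B.C.D:8081/version   # canary
curl http://A.B.C.D:8080/version   # production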

Note on our deployment approach: In usual GitOps workflows, the deploy-container-image workflow would:

  1. Update the Helm chart or Kubernetes manifest in Git to specify the new container image tag
  2. Commit and push the change to the Git repository
  3. ArgoCD would detect the Git change and automatically sync the deployment

This makes Git the “single source of truth” for infrastructure state. However, for this lab environment - to avoid requiring every student to maintain their own writable copy of the IaC repository, along with Git credentials for pushing to it - we instead use a simplified approach where the workflow directly calls ArgoCD’s API to update the deployment. This bypasses Git and directly modifies the ArgoCD application’s Helm values. For demos and learning environments this is fine, but real systems should use the Git-based approach.
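
For reference, a rough sketch of what the Git-based version of this step might look like is below. The file path, value key, and tag are purely illustrative (this is not the actual layout of the gourmetgram-iac repository), and it assumes you have a writable clone of the repository:

# hypothetical sketch of the Git-based approach (not used in this lab)
cd /path/to/your/clone/of/gourmetgram-iac
# update the image tag in the staging environment's values (illustrative path and key)
sed -i 's/^  tag: .*/  tag: staging-1.0.42/' k8s/staging/values.yaml
git commit -am "Deploy staging-1.0.42"
git push
# Argo CD notices the new commit and syncs the staging application to match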

Automated testing in staging

Before promoting a model to the canary or production environment - where real users will interact with it! - we should validate that:

  1. The model works correctly with the application code (integration testing)

That’s exactly what the test-staging workflow does! You can check the logs to see the results.

After running the integration test, the workflow branches based on results. This is a key concept in MLOps: automated decision-making based on test outcomes.

# From test-staging.yaml
steps:
  # ... tests run sequentially ...

  # Step 2: Mark as approved if integration test passes
  - - name: mark-staging-approved
      template: set-staging-approved
      when: "'' == 'pass'"

  # Step 3: Branching based on test results
  - - name: promote-on-success
      template: trigger-promote
      when: "'' == 'pass'"

There are two possible outcomes:

  1. All tests pass:
    • Model gets staging-approved alias in MLflow. In case we need to revert to this model after testing a later version, we know that it is “known good” (in staging, at least).
    • Automatically trigger promote-model workflow to deploy the successful container image to the canary environment
  2. Integration test fails: The workflow fails and no promotion happens

This branching is implemented using Argo Workflows’ when conditions. Each branch is evaluated independently, and only the matching branch executes.

Observing automated promotion (happy path)

In the Argo Workflows UI, watch the test-staging workflow after a successful staging deployment:

  1. integration-test step runs - logs should show ✓ PASSED
  2. promote-on-success step triggers - creates a new promote-model workflow

Click on the new promote-model workflow to watch it execute:

  1. Retags the container image from staging-1.0.X to canary-1.0.X
  2. Updates the MLFlow alias from “staging” to “canary”. The “staging”, “canary”, or “production” alias reflects the environment in which the model is currently deployed (if any)
  3. Triggers ArgoCD to sync the canary deployment

After the workflow completes, verify the promotion: visit the canary version endpoint at http://A.B.C.D:8081/version (substituting your own floating IP) and confirm that it reports the newly promoted version.

In the MLFlow UI, confirm that this model version now has the “canary” alias.

Take screenshots of:

  1. The completed test-staging workflow showing all tests passed
  2. The triggered promote-model workflow
  3. The canary /version endpoint showing the new version
  4. The MLFlow UI showing the “canary” alias

Promotion to production

Until now, we have directly accessed different versions of our service in different stages by changing the port number; we put each service on a different port. Users, however, will access our service on the standard port (port 80 for HTTP service) and, as part of our “platform”, we have a service that routes 10% of requests to the canary service, and the remaining 90% to the production service.

Try this for yourself - visit http://A.B.C.D/version (using your own public IP) repeatedly, and observe that sometimes you get the production service; sometimes you get the canary service.
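
Rather than refreshing the browser, you can also request the version endpoint repeatedly from the command line and tally the responses - with roughly a 90/10 split, most responses should show the production version and a few the canary version:

# runs in your local terminal, or the Chameleon Jupyter environment
for i in $(seq 1 20); do curl -s http://A.B.C.D/version; echo; done | sort | uniq -c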

After some careful monitoring in canary with real users, the model may be promoted to a “production” environment. Let’s do that, too. From the Argo Workflows UI, find the promote-model workflow template and click “Submit”.

Then, run the workflow. Check the version that is deployed to the “production” environment (http://A.B.C.D:8080/version) to verify.

Take a screenshot, with both the address bar showing the URL and the response showing the version number visible in the screenshot. Also, take a screenshot of the updated list of model versions in the MLFlow UI (the alias list will have changed!).

Model and application lifecycle - Part 3

So far, you mostly saw the pipeline working when everything goes well (with the exception of some random “accuracy” failures). Now you’re going to deliberately push a “bad” model through and watch staging protect you.

Failing an integration test

We’re going to train the model using a different branch of the “gourmetgram-train” repo. In this branch, only the model state dictionary is saved to the .pth file, whereas previously we were saving the full model object. The training tests will pass - they have been updated to reflect the new type of model artifact - but our integration test in the staging environment will fail, because this model artifact is not compatible with the GourmetGram app code that expects a full model object.

Start the training run like you did before, but change the branch:

  1. In the Argo Workflows UI, open “Workflow Templates” and click train-model.
  2. Click “Submit”, set the branch parameter to mlops-bad, and submit.
  3. Wait for the run to finish and for the downstream workflows to run. You’re looking for the staging test workflow (usually named something like test-staging) that runs after the staging deployment.

Once the staging tests run, open the test-staging workflow and click into the integration test step. Read the logs and confirm that the integration check failed. The pod running the new model will crash each time it is loaded. The integration test checks to confirm (among other things) that the service is running the expected model version; this will fail because there will not be a running pod.

Verify that the broken model is not promoted to “canary” by visiting http://A.B.C.D:8081/version (replace A.B.C.D with your floating IP). You should see that canary is still using the old “working” model.

Scheduled training

Until now, we have been manually “triggering” each training run. A scheduled training job (cron-train) was already set up in Argo when you applied the other workflow templates. Now, you’ll verify it’s present and working.

  1. In the Argo Workflows UI, go to “Cron Workflows” tab and open cron-train.
  2. Confirm it references the existing train-model workflow template and uses the default training branch (either by relying on the template’s default parameter value, or by explicitly setting the branch parameter).
  3. It is currently set up to train once daily, at 2:00AM UTC. Click on the “Cron” tab and set the schedule to */15 * * * *, which means “Every 15 minutes”. Click the “Update” button.
  4. Go back to the main Workflows view. Wait for the scheduled run to kick off, and confirm it creates new train-model workflows automatically.
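
You can also inspect the scheduled workflow from the command line. For example, from a terminal on node1 (assuming the Argo resources are in the argo namespace, as set up earlier):

# runs on node1
kubectl get cronworkflows -n argo
kubectl get workflows -n argo --sort-by=.metadata.creationTimestamp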

Delete infrastructure with Terraform

Since we provisioned our infrastructure with Terraform, we can also delete all the associated resources using Terraform.

# runs in Chameleon Jupyter environment
cd /work/gourmetgram-iac/tf/kvm
# runs in Chameleon Jupyter environment
export PATH=/work/.local/bin:$PATH
# runs in Chameleon Jupyter environment
unset $(set | grep -o "^OS_[A-Za-z0-9_]*")

In the following cell, replace netID with your actual net ID, replace id_rsa_chameleon with the name of your personal key that you use to access Chameleon resources, and replace the all-zero ID with the reservation ID you found earlier.

# runs in Chameleon Jupyter environment
export TF_VAR_suffix=netID
export TF_VAR_key=id_rsa_chameleon
export TF_VAR_reservation=00000000-0000-0000-0000-000000000000
# runs in Chameleon Jupyter environment
terraform destroy -auto-approve
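
After the destroy completes, you can confirm that nothing is left in the Terraform state - the following should report that no resources are represented. It is also worth double-checking the instance list in the KVM@TACC Horizon GUI.

# runs in Chameleon Jupyter environment
terraform show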