Tutorial Notebook 2: Vetting a Model With Production Experiments
So far, we’ve discussed practices and methods for transitioning an ML model and related artifacts from development to production. However, pushing a model into production is not the only consideration. In many situations, it’s important to vet a model’s performance in the real world before fully activating it. Real-world vetting can surface issues that did not arise during the development stage, when models are only checked against hold-out data.
In this notebook, you will learn about two kinds of production ML model validation methods: A/B testing and Shadow Deployments. A/B tests and other types of experimentation are part of the ML lifecycle. The ability to quickly experiment and test new models in the real world helps data scientists to continually learn, innovate, and improve AI-driven decision processes.
Preliminaries
In the blocks below we will preload some required libraries; we will also redefine some of the convenience functions that you saw in the previous notebook.
After that, you should log into Wallaroo and set your working environment to the workspace that you created in the previous notebook.
# preload needed libraries
import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework
from IPython.display import display
import pandas as pd
# used to display DataFrame information without truncating
pd.set_option('display.max_colwidth', None)
import json
import datetime
import time
# used for unique connection names
import string
import random
import pyarrow as pa
import sys
# setting path - only needed when running this from the `with-code` folder.
sys.path.append('../')
from CVDemoUtils import CVDemo
cvDemo = CVDemo()
cvDemo.COCO_CLASSES_PATH = "../models/coco_classes.pickle"
## convenience functions from the previous notebook
# return the workspace called <name>, or create it if it does not exist.
# this function assumes your connection to wallaroo is called wl
def get_workspace(name):
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace = ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace
# pull a single datum from a data frame
# and convert it to the format the model expects
def get_singleton(df, i):
    singleton = df.iloc[i,:].to_numpy().tolist()
    sdict = {'tensor': [singleton]}
    return pd.DataFrame.from_dict(sdict)
# pull a batch of data from a data frame
# and convert to the format the model expects
def get_batch(df, first=0, nrows=1):
    last = first + nrows
    batch = df.iloc[first:last, :].to_numpy().tolist()
    return pd.DataFrame.from_dict({'tensor': batch})
# Translate a column from a dataframe into a single array
# used for the Statsmodel forecast model
def get_singleton_forecast(df, field):
    singleton = pd.DataFrame({field: [df[field].values.tolist()]})
    return singleton
Pre-exercise
If needed, log into Wallaroo and go to the workspace that you created in the previous notebook. Please refer to Notebook 1 to refresh yourself on how to log in and set your working environment to the appropriate workspace.
## blank space to log in and go to the appropriate workspace
wl = wallaroo.Client()
workspace_name = 'computer-vision-tutorial'
workspace = get_workspace(workspace_name)
wl.set_current_workspace(workspace)
{'name': 'computer-vision-tutorialjohn', 'id': 20, 'archived': False, 'created_by': '0a36fba2-ad42-441b-9a8c-bac8c68d13fa', 'created_at': '2023-08-04T19:16:04.283819+00:00', 'models': [{'name': 'resnet', 'versions': 2, 'owner_id': '""', 'last_update_time': datetime.datetime(2023, 8, 5, 17, 25, 35, 579362, tzinfo=tzutc()), 'created_at': datetime.datetime(2023, 8, 5, 17, 20, 21, 886290, tzinfo=tzutc())}, {'name': 'mobilenet', 'versions': 5, 'owner_id': '""', 'last_update_time': datetime.datetime(2023, 8, 5, 18, 2, 2, 812811, tzinfo=tzutc()), 'created_at': datetime.datetime(2023, 8, 4, 19, 19, 46, 286247, tzinfo=tzutc())}, {'name': 'cv-post-process-drift-detection', 'versions': 4, 'owner_id': '""', 'last_update_time': datetime.datetime(2023, 8, 5, 18, 2, 4, 862817, tzinfo=tzutc()), 'created_at': datetime.datetime(2023, 8, 4, 19, 23, 10, 189958, tzinfo=tzutc())}], 'pipelines': [{'name': 'cv-retail-pipeline', 'create_time': datetime.datetime(2023, 8, 4, 19, 23, 11, 179176, tzinfo=tzutc()), 'definition': '[]'}]}
A/B Testing
An A/B test, also called a controlled experiment or a randomized control trial, is a statistical method of determining which of a set of variants is the best. A/B tests allow organizations and policy-makers to make smarter, data-driven decisions that are less dependent on guesswork.
In the simplest version of an A/B test, subjects are randomly assigned to either the control group (group A) or the treatment group (group B). Subjects in the treatment group receive the treatment (such as a new medicine, a special offer, or a new web page design) while the control group proceeds as normal without the treatment. Data is then collected on the outcomes and used to study the effects of the treatment.
In data science, A/B tests are often used to choose between two or more candidate models in production, by measuring which model performs best in the real world. In this formulation, the control is often an existing model that is currently in production, sometimes called the champion. The treatment is a new model being considered to replace the old one. This new model is sometimes called the challenger. In our discussion, we’ll use the terms champion and challenger, rather than control and treatment.
When data is sent to a Wallaroo A/B test pipeline for inference, each datum is randomly sent to either the champion or challenger. After enough data has been sent to collect statistics on all the models in the A/B test pipeline, then those outcomes can be analyzed to determine the difference (if any) in the performance of the champion and challenger. Usually, the purpose of an A/B test is to decide whether or not to replace the champion with the challenger.
Keep in mind that in machine learning, the terms experiments and trials also often refer to the process of finding a training configuration that works best for the problem at hand (this is sometimes called hyperparameter optimization). In this guide, we will use the term experiment to refer to the use of A/B tests to compare the performance of different models in production.
Exercise: Create some challenger models and upload them to Wallaroo
Use the computer vision data from Notebook 1 to create at least one alternate computer vision model. You can do this by varying the modeling algorithm, the inputs, the feature engineering, or all of the above.
For the purpose of these exercises, please make sure that the predictions from the new model(s) are in the same units as the (champion) model that you created in Chapter 3. For example, if the champion model predicts log price, then the challenger models should also predict log price. If the champion model predicts price in units of $10,000, then the challenger models should as well.
- If you prefer to shortcut this step, you can use some of the pretrained Python model files in the `models` directory.
  - If the Python models are used, ensure that the proper input and output schemas are set. See the N1_deploy_a_model notebook for instructions.
- Upload your new model(s) to Wallaroo, into your workspace.
At the end of this exercise, you should have at least one challenger model to compare to your champion model uploaded to your workspace.
# blank space to train, convert, and upload new model
resnet_model_name = 'resnet'
resnet_model_path = "../models/frcnn-resnet.pt.onnx"
resnet_model = wl.upload_model(
    resnet_model_name,
    resnet_model_path,
    framework=Framework.ONNX,
).configure('onnx', batch_config="single")
There are a number of considerations to designing an A/B test; you can check out the article The What, Why, and How of A/B Testing for more details. In these exercises, we will concentrate on the deployment aspects. You will need a champion model and at least one challenger model. You also need to decide on a data split: for example 50-50 between the champion and challenger, or a 2:1 ratio between champion and challenger (two-thirds of the data to the champion, one-third to the challenger).
As an example of creating an A/B test deployment, suppose you have a champion model called “champion”, that you have been running in a one-step pipeline called “pipeline”. You now want to compare it to a challenger model called “challenger”. For your A/B test, you will send two-thirds of the data to the champion, and the other third to the challenger. Both models have already been uploaded.
To help you with the exercises, here are some convenience functions to retrieve models and pipelines that have been previously uploaded to your workspace (in this example, `wl` is your `wallaroo.Client()` object).
# Get the most recent version of a model.
# Assumes that the most recent version is the last in the list of versions.
# wl.get_current_workspace().models() returns a list of models in the current workspace
def get_model(mname, modellist=None):
    # look the list up at call time so that newly uploaded models are found
    if modellist is None:
        modellist = wl.get_current_workspace().models()
    model = [m.versions()[-1] for m in modellist if m.name() == mname]
    if len(model) <= 0:
        raise KeyError(f"model {mname} not found in this workspace")
    return model[0]
# get a pipeline by name in the workspace
def get_pipeline(pname, plist=None):
    # look the list up at call time so that newly created pipelines are found
    if plist is None:
        plist = wl.get_current_workspace().pipelines()
    pipeline = [p for p in plist if p.name() == pname]
    if len(pipeline) <= 0:
        raise KeyError(f"pipeline {pname} not found in this workspace")
    return pipeline[0]
# use the space here for retrieving the models and pipeline
mobilenet_model_name = 'mobilenet'
module_post_process_name = "cv-post-process-drift-detection"
mobilenet_model = get_model(mobilenet_model_name)
module_post_process_model = get_model(module_post_process_name)
pipeline_name = 'cv-retail-pipeline'
pipeline = get_pipeline(pipeline_name)
Pipelines may already have pipeline steps assigned. Pipeline steps can be removed or replaced with other steps.
The easiest way to clear all pipeline steps is with the Pipeline `clear()` method.
To remove one step, use the Pipeline `remove_step(index)` method, where `index` is the step number ordered from zero. For example, if a pipeline has one step, then `remove_step(0)` would remove that step.
To replace a pipeline step, use the Pipeline `replace_with_model_step(index, model)` method, where `index` is the step number ordered from zero, and `model` is the model replacing it.
Updated pipeline steps are not saved until the pipeline is redeployed with the Pipeline `deploy()` method.
Reference: Wallaroo SDK Essentials Guide: Pipeline Management.
For A/B testing, a random split step is either added to the pipeline or used to replace an existing step.
To add an A/B testing step, use the Pipeline `add_random_split` method with the following parameters:
Parameter | Type | Description |
---|---|---|
champion_weight | Float (Required) | The weight for the champion model. |
champion_model | Wallaroo.Model (Required) | The uploaded champion model. |
challenger_weight | Float (Required) | The weight of the challenger model. |
challenger_model | Wallaroo.Model (Required) | The uploaded challenger model. |
hash_key | String(Optional) | A key used instead of a random number for model selection. This must be between 0.0 and 1.0. |
Note that multiple challenger models with different weights can be added as the random split step.
In this example, a pipeline will be built with a 2:1 weighted ratio between the champion and a single challenger model.
pipeline.add_random_split([(2, champion), (1, challenger)])
To replace an existing pipeline step with an A/B testing step, use the Pipeline `replace_with_random_split` method.
Parameter | Type | Description |
---|---|---|
index | Integer (Required) | The pipeline step being replaced. |
champion_weight | Float (Required) | The weight for the champion model. |
champion_model | Wallaroo.Model (Required) | The uploaded champion model. |
challenger_weight | Float (Required) | The weight of the challenger model. |
challenger_model | Wallaroo.Model (Required) | The uploaded challenger model. |
hash_key | String(Optional) | A key used instead of a random number for model selection. This must be between 0.0 and 1.0. |
This example replaces the first pipeline step with a 2:1 champion-to-challenger ratio.
pipeline.replace_with_random_split(0, [(2, champion), (1, challenger)])
In either case, the random split will randomly send inference data to one model based on the weighted ratio. As more inferences are performed, the ratio between the champion and challengers will align more and more to the ratio specified.
Reference: Wallaroo SDK Essentials Guide: Pipeline Management A/B Testing.
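The convergence described above can be illustrated without a Wallaroo deployment. The following is a minimal sketch that only simulates a weighted random choice with Python's standard library; it is not how Wallaroo implements the split, and the function name `simulate_split` and the seed are assumptions made for this illustration.

```python
import random
from collections import Counter

def simulate_split(weights, n, seed=0):
    """Simulate n single-datum inferences routed by a weighted random split.

    `weights` maps a model name to its weight, e.g. {"champion": 2, "challenger": 1}.
    Returns the fraction of simulated inferences that each model received.
    """
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    counts = Counter(picks)
    return {name: counts[name] / n for name in names}

# With few inferences the observed shares can be far from 2:1;
# with many they settle close to two-thirds and one-third.
for n in (30, 300, 30000):
    print(n, simulate_split({"champion": 2, "challenger": 1}, n))
```

The practical consequence is that a handful of test inferences is not enough to judge whether the split is configured correctly.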
Then creating an A/B test deployment would look something like this:
First get the models used.
# retrieve handles to the most recent versions
# of the champion and challenger models
champion = get_model("champion")
challenger = get_model("challenger")
Second, retrieve the pipeline created in the previous notebook, then redeploy it with the A/B testing split step.
Here’s some sample code:
# get an existing single-step pipeline and undeploy it
pipeline = get_pipeline("pipeline")
pipeline.undeploy()
# clear the pipeline and add a random split
pipeline.clear()
pipeline.add_random_split([(2, champion), (1, challenger)])
# add in the post-processing step as a normal step
pipeline.add_model_step(module_post_process_model)
pipeline.deploy()
The above code clears out all the steps of the pipeline and adds a new A/B test step, where the incoming data is randomly sent in a 2:1 ratio to the champion and the challenger, respectively.
You can add multiple challengers to an A/B test:
pipeline.add_random_split([ (2, champion), (1, challenger01), (1, challenger02) ])
This pipeline will distribute data in the ratio 2:1:1 to the champion and the two challenger models, respectively (half to the champion, and a quarter to each challenger).
You can also create an A/B test deployment from scratch:
pipeline = wl.build_pipeline("pipeline")
pipeline.add_random_split([(2, champion), (1, challenger)])
Exercise: Create an A/B test deployment of your computer vision models
Use the champion and challenger models that you created in the previous exercises to create an A/B test deployment. You can either create one from scratch, or reconfigure an existing pipeline.
- Send half the data to the champion, and distribute the rest among the challenger(s).
At the end of this exercise, you should have an A/B test deployment and be ready to compare multiple models.
# blank space to retrieve pipeline and redeploy with a/b testing step
# blank space to get the model(s)
pipeline.undeploy()
pipeline.clear()
pipeline.add_random_split([(1, mobilenet_model), (1, resnet_model)])
pipeline.deploy()
name | cv-retail-pipeline |
---|---|
created | 2023-08-04 19:23:11.179176+00:00 |
last_updated | 2023-08-14 14:34:08.984836+00:00 |
deployed | True |
tags | |
versions | cfe4f208-f01e-4fe2-a3f4-2da0b42b3547, 5bc400fb-4ea7-4e84-80a7-8c60794c3575, 6500b4d3-4031-4b0c-b219-4af552e3100c, db77cdac-d02f-4f85-a7fa-35833a480eff, fdee0a8f-e540-4c48-bd14-628a3417f24f, 5a59498b-f2f2-4254-9ca4-5c19460bb42a, e6be0eb9-f387-471a-8e6b-4b5d845e8aec, 4e74bc78-3501-4135-a3ff-8e24a9132d0f, a719a198-bb04-462e-bd6f-85644c357e62, 5f280423-8ac1-45b7-b645-27b15e0bd7d4, 9eb22dbc-c035-4ac4-bba9-b7cd3a9f30ba, 5ce99fc6-4463-4ab0-abbe-8b490ce9fc29, 8faa0d21-11ed-4186-8f5d-a586ead7ab00, 305db319-db20-4be8-94a7-ecb3d8bee4d4, 15cc7825-03a1-4794-8a31-744d290db193 |
steps | mobilenet |
The pipeline steps are displayed with the Pipeline `steps()` method. This is used to verify the current deployed steps in the pipeline.
- IMPORTANT NOTE: Verify that the pipeline is deployed before checking for pipeline steps. Deploying the pipeline sets the steps into the Wallaroo system - until that happens, the steps only exist in the local system as potential steps.
# blank space to get the current pipeline steps
pipeline.steps()
[{'RandomSplit': {'hash_key': None, 'weights': [{'model': {'name': 'mobilenet', 'version': '484fffe8-70fe-44b9-937f-e98838bcc245', 'sha': '9044c970ee061cc47e0c77e20b05e884be37f2a20aa9c0c3ce1993dbd486a830'}, 'weight': 1}, {'model': {'name': 'resnet', 'version': '5aaf7fbc-81aa-40ad-b784-721f203d9532', 'sha': '43326e50af639105c81372346fb9ddf453fea0fe46648b2053c375360d9c1647'}, 'weight': 1}]}}]
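The weights in this output can be turned into the share of traffic each model is expected to receive. The sketch below works on a plain Python structure copied from the output above (with the `version` and `sha` fields left out for brevity); the helper name `split_shares` is an assumption for illustration and is not part of the Wallaroo SDK.

```python
# Structure copied from the pipeline.steps() output above, with the
# 'version' and 'sha' fields omitted for brevity.
steps = [{'RandomSplit': {'hash_key': None, 'weights': [
    {'model': {'name': 'mobilenet'}, 'weight': 1},
    {'model': {'name': 'resnet'}, 'weight': 1},
]}}]

def split_shares(steps):
    """Return one {model name: expected traffic share} dict per RandomSplit step."""
    shares = []
    for step in steps:
        split = step.get('RandomSplit')
        if split is None:
            continue  # not an A/B test step
        total = sum(w['weight'] for w in split['weights'])
        shares.append({w['model']['name']: w['weight'] / total
                       for w in split['weights']})
    return shares

print(split_shares(steps))
```

Here both weights are 1, so each model is expected to handle half of the single-datum inferences.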
Please note that for batch inferences, the entire batch will be sent to the same model. So in order to verify that your pipeline is distributing inferences in the proportion you specified, you will need to send your queries one datum at a time.
To help with the next exercise, here is another convenience function you might find useful.
# get the names of the inferring models
# from a dataframe of a/b test results
def get_names(resultframe):
    modelcol = resultframe['out._model_split']
    jsonstrs = [mod[0] for mod in modelcol]
    return [json.loads(jstr)['name'] for jstr in jsonstrs]
Here’s an example of how to send a large number of queries one at a time to your pipeline in the SDK:
results = []
# get a list of result frames
for i in range(1000):
    query = get_singleton(testdata, i)
    results.append(pipeline.infer(query))
# make one data frame of all results
allresults = pd.concat(results, ignore_index=True)
# add a column to indicate which model made the inference
allresults['modelname'] = get_names(allresults)
# get the counts of how many inferences were made by each model
allresults.modelname.value_counts()
- NOTE: Performing 1,000 inferences sequentially may take several minutes to complete. Adjust the range for time as required.
As with the single-step pipeline, the model predictions will be in a column named `out.<outputname>`. In addition, there will be a column named `out._model_split` that contains information about the model that made a particular prediction. The `get_names()` convenience function above extracts the model name from the `out._model_split` column.
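To see what `get_names()` does without running a pipeline, here is a small sketch on synthetic data. It assumes, based only on how `get_names()` indexes the column, that each `out._model_split` cell is a one-element list holding a JSON string with a `name` key. A plain dict stands in for the result DataFrame, since the function only needs column lookup and iteration.

```python
import json
from collections import Counter

def get_names(resultframe):
    modelcol = resultframe['out._model_split']
    jsonstrs = [mod[0] for mod in modelcol]
    return [json.loads(jstr)['name'] for jstr in jsonstrs]

# Synthetic stand-in for A/B test results (not real pipeline output).
# Each cell is assumed to be a one-element list containing a JSON string;
# only the 'name' key is relied on here.
fake_results = {
    'out._model_split': [
        [json.dumps({'name': 'mobilenet'})],
        [json.dumps({'name': 'resnet'})],
        [json.dumps({'name': 'resnet'})],
    ]
}

names = get_names(fake_results)
print(names)
print(Counter(names))
```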
Exercise: Send some queries to your A/B test deployment
- Send a single datum to the A/B test pipeline you created in the previous exercise. You can use the same test data set that you created/downloaded in the previous notebook. Observe what the inference result looks like. If you send the singleton through the pipeline multiple times, you should observe that the model making the inference changes.
- Send a large number of queries (at least 100) one at a time to the pipeline.
- Note that approximately half the inferences were made by the champion model.
- The remaining inferences should be distributed as you specified.
The more queries you send, the closer the distribution should be to what you specified.
If you can align the ground-truth labels from your test data to the predictions, you can also compare the accuracy of the different models.
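Once each result row is tagged with the model that produced it and can be matched to a known correct answer, a per-model comparison is a simple grouped average. The sketch below uses made-up rows; the field names `modelname` and `correct` and all values are assumptions for illustration, not output from a real deployment.

```python
from collections import defaultdict

# Synthetic rows: which model answered, and whether the answer matched
# the known correct one (1) or not (0). All values are made up.
rows = [
    {'modelname': 'mobilenet', 'correct': 1},
    {'modelname': 'resnet', 'correct': 1},
    {'modelname': 'mobilenet', 'correct': 0},
    {'modelname': 'resnet', 'correct': 1},
]

def accuracy_by_model(rows):
    """Return {model name: fraction of rows where 'correct' is 1}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row['modelname']] += 1
        hits[row['modelname']] += row['correct']
    return {name: hits[name] / totals[name] for name in totals}

print(accuracy_by_model(rows))
```

If the results are held in a pandas DataFrame with the same two columns, `groupby('modelname')['correct'].mean()` gives the equivalent summary.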
Don’t forget to undeploy your pipeline after you are done, to free up resources.
## blank space to test one inference
## blank space to create test data, and send some data to your model
image = '../data/images/input/example/dairy_bottles.png'
width, height = 640, 480
dfImage, resizedImage = cvDemo.loadImageAndConvertToDataframe(image, width, height)
for i in range(10):
    results = pipeline.infer(dfImage, timeout=60)
    # display(results)
    display(get_names(results))
['resnet']
['resnet']
['resnet']
['mobilenet']
['mobilenet']
['resnet']
['resnet']
['resnet']
['mobilenet']
['mobilenet']
Shadow Deployments
Another way to vet your new model is to set it up in a shadow deployment. With shadow deployments, all the models in the experiment pipeline get all the data, and all inferences are recorded. However, the pipeline returns only one “official” prediction: the one from the default, or champion, model.
Shadow deployments are useful for “sanity checking” a model before it goes truly live. For example, you might have built a smaller, leaner version of an existing model using knowledge distillation or other model optimization techniques, as discussed here. A shadow deployment of the new model alongside the original model can help ensure that the new model meets desired accuracy and performance requirements before it’s put into production.
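The idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Wallaroo's implementation: the function `shadow_infer`, the toy lambda "models", and the log format are all assumptions made for this example.

```python
def shadow_infer(datum, champion, challengers, shadow_log):
    """Run every model on the datum, but return only the champion's prediction.

    `champion` is a (name, callable) pair and `challengers` is a list of such pairs.
    Challenger outputs are appended to `shadow_log` instead of being returned.
    """
    _, champion_fn = champion
    official = champion_fn(datum)
    for name, fn in challengers:
        shadow_log.append({'model': name, 'input': datum, 'output': fn(datum)})
    return official

# Toy stand-ins for real models.
champion = ('champion', lambda x: x * 2)
challenger = ('challenger', lambda x: x * 2 + 1)

log = []
result = shadow_infer(10, champion, [challenger], log)
print(result)  # the only value a caller would see
print(log)     # recorded for later comparison
```

Unlike the random split, every model sees every datum, so the recorded outputs can be compared row by row.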
As an example of creating a shadow deployment, suppose you have a champion model called “champion”, that you have been running in a one-step pipeline called “pipeline”. You now want to put a challenger model called “challenger” into a shadow deployment with the champion. Both models have already been uploaded.
Shadow deployments can be added as a pipeline step, or replace an existing pipeline step.
Shadow deployment steps are added with the `add_shadow_deploy(champion, [model2, model3,...])` method, where `champion` is the model whose inference results are returned. The models listed in the array after it also receive the inference data, and their results are displayed as shadow inference results.
Shadow deployment steps replace an existing pipeline step with the `replace_with_shadow_deploy(index, champion, [model2, model3,...])` method. The `index` is the step being re