For more details on this tutorial’s setup and process, see 00_Introduction.ipynb.
Stage 4: Regular Batch Inference
In Stage 3: Deploy the Model in Wallaroo, the housing model created and tested in Stage 2: Training Process Automation Setup was uploaded to a Wallaroo instance and added to the pipeline housing-pipe in the workspace housepricing. This pipeline can be deployed at any point in time and used for new inferences.
For the purposes of this demo, let’s say that every month we find the newly entered, still-unsold houses and predict their sale prices.
The predictions are entered into a staging table for further inspection before being joined to the primary housing data table.
We show this as a notebook, but the same logic can be scripted and scheduled with cron or another scheduling process, as in the sketch below.
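As a rough illustration of that scheduling, the notebook’s logic could be collected into a standalone script and run from a crontab entry. The script name and paths here are assumptions for this sketch, not part of the tutorial:
# monthly_batch_inference.py -- hypothetical standalone version of this
# notebook's steps: connect, deploy, infer, store results, undeploy.
# Example crontab entry running it at midnight on the first of each month:
#   0 0 1 * * /usr/bin/python3 /path/to/monthly_batch_inference.py
import wallaroo

def main():
    wl = wallaroo.Client()
    # ...then repeat the steps shown below: set the housepricing workspace,
    # deploy housing-pipe, run inference on the new listings, write the
    # results to the staging table, and undeploy the pipeline.

if __name__ == "__main__":
    main()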
Resources
The following resources are used as part of this tutorial:
- data
  - data/seattle_housing_col_description.txt: Describes the columns used as part of the data analysis.
  - data/seattle_housing.csv: Sample data of the Seattle, Washington housing market between 2014 and 2015.
- code
  - postprocess.py: Formats the data after inference by the model is complete.
  - preprocess.py: Formats the incoming data for the model.
  - simdb.py: A simulated database to demonstrate sending and receiving queries.
  - wallaroo_client.py: Additional methods used with the Wallaroo instance to create workspaces, etc.
- models
  - housing_model_xgb.onnx: Model created in Stage 2: Training Process Automation Setup.
Steps
This process uses the following steps:
- Connect to Wallaroo: Connect to the Wallaroo instance and set the housepricing workspace.
- Deploy the Pipeline: Deploy the pipeline to prepare it to run inferences.
- Read In New House Listings: Read in the previous month’s house listings and submit them to the pipeline for inference.
- Send Predictions to Results Staging Table: Add the inference results to the results staging table.
Connect to Wallaroo
Connect to the Wallaroo instance and set the housepricing workspace as the current workspace.
import json
import pickle

import wallaroo
import pandas as pd
import numpy as np

import simdb  # module used by this demo to simulate pulling data from a database
from wallaroo_client import get_workspace

# used to display dataframe information without truncating
from IPython.display import display
pd.set_option('display.max_colwidth', None)

# Login through local Wallaroo instance
wl = wallaroo.Client()

# SSO login through keycloak
# wallarooPrefix = "YOUR PREFIX"
# wallarooSuffix = "YOUR SUFFIX"
# wl = wallaroo.Client(api_endpoint=f"https://{wallarooPrefix}.api.{wallarooSuffix}",
#                      auth_endpoint=f"https://{wallarooPrefix}.keycloak.{wallarooSuffix}",
#                      auth_type="sso")
Arrow Support
As of the 2023.1 release, Wallaroo supports both pandas DataFrame and Apache Arrow inference inputs. This tutorial allows users to adjust their experience based on whether they have enabled Arrow support in their Wallaroo instance.
If Arrow support has been enabled, set arrowEnabled=True. If it is disabled or you are not sure, set arrowEnabled=False.
The examples below are shown in an Arrow-enabled environment.
import os
# Uncomment the line below to set the OS environment variable ARROW_ENABLED to True.
# Otherwise, leave as is.
# os.environ["ARROW_ENABLED"]="True"

if "ARROW_ENABLED" not in os.environ or os.environ["ARROW_ENABLED"].casefold() == "false":
    arrowEnabled = False
else:
    arrowEnabled = True
print(arrowEnabled)
True
def get_workspace(name):
    # return the named workspace, creating it if it does not exist
    workspace = None
    for ws in wl.list_workspaces():
        if ws.name() == name:
            workspace = ws
    if workspace is None:
        workspace = wl.create_workspace(name)
    return workspace
from wallaroo.object import EntityNotFoundError  # raised when a named pipeline does not exist

def get_pipeline(name):
    # return the named pipeline, building it if it does not exist
    try:
        pipeline = wl.pipelines_by_name(name)[0]
    except EntityNotFoundError:
        pipeline = wl.build_pipeline(name)
    return pipeline
new_workspace = get_workspace("housepricing")
_ = wl.set_current_workspace(new_workspace)
Deploy the Pipeline
Deploy the housing-pipe pipeline established in Stage 3: Deploy the Model in Wallaroo (03_deploy_model.ipynb).
pipeline = wl.pipelines_by_name("housing-pipe")[-1]
pipeline.deploy()
| name | housing-pipe |
|---|---|
| created | 2023-02-27 21:00:26.107908+00:00 |
| last_updated | 2023-02-27 21:01:46.967774+00:00 |
| deployed | True |
| tags | |
| versions | fbb12aa3-7fa2-4553-9955-3dc7146bcd36, d92c7f3d-0b61-44fa-83e2-264d8a045879, b309144d-b5b0-4ca7-a073-4f4ad4145de7 |
| steps | preprocess |
Read In New House Listings
From the data store, load the previous month’s house listings and submit them to the deployed pipeline.
conn = simdb.simulate_db_connection()
# create the query
query = f"select * from {simdb.tablename} where date > DATE(DATE(), '-1 month') AND sale_price is NULL"
print(query)
# read in the data
newbatch = pd.read_sql_query(query, conn)
newbatch.shape
select * from house_listings where date > DATE(DATE(), '-1 month') AND sale_price is NULL
(1090, 22)
if arrowEnabled:
    query = {'query': newbatch.to_json()}
    result = pipeline.infer(query)
    # display(result)
    predicted_prices = result[0]['prediction']
else:
    query = {'query': newbatch.to_json()}
    result = pipeline.infer(query)[0]
    predicted_prices = result.data()[0]
len(predicted_prices)
1090
Send Predictions to Results Staging Table
Take the predicted prices from the inference results and write them to the results staging table so they can later be joined to the house_listings table.
Once complete, undeploy the pipeline to return its resources to the Kubernetes environment.
result_table = pd.DataFrame({
    'id': newbatch['id'],
    'saleprice_estimate': predicted_prices,
})
result_table.to_sql('results_table', conn, index=False, if_exists='append')
# Display the top of the table for confirmation
pd.read_sql_query("select * from results_table limit 5", conn)
|  | id | saleprice_estimate |
|---|---|---|
| 0 | 9215400105 | 508255.0 |
| 1 | 1695900060 | 500198.0 |
| 2 | 9545240070 | 539598.0 |
| 3 | 1432900240 | 270739.0 |
| 4 | 6131600075 | 191304.0 |
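After the staged estimates have been inspected, they could be merged back into the primary table. The following is a minimal sketch only: it assumes the simulated database behaves like SQLite 3.33+ (for UPDATE ... FROM support) and that house_listings has a sale_price_estimate column to receive the values; neither assumption is part of this tutorial’s schema.
# Hedged sketch: merge reviewed estimates from the staging table into the
# primary table. The sale_price_estimate column is hypothetical, and
# UPDATE ... FROM requires SQLite 3.33 or later.
merge_query = """
    UPDATE house_listings
    SET sale_price_estimate = results_table.saleprice_estimate
    FROM results_table
    WHERE house_listings.id = results_table.id
"""
conn.execute(merge_query)
conn.commit()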
conn.close()
pipeline.undeploy()
| name | housing-pipe |
|---|---|
| created | 2023-02-27 21:00:26.107908+00:00 |
| last_updated | 2023-02-27 21:01:46.967774+00:00 |
| deployed | False |
| tags | |
| versions | fbb12aa3-7fa2-4553-9955-3dc7146bcd36, d92c7f3d-0b61-44fa-83e2-264d8a045879, b309144d-b5b0-4ca7-a073-4f4ad4145de7 |
| steps | preprocess |
From here, organizations can automate this process. Other Wallaroo features can extend it, such as data analysis using Wallaroo assays, or shadow deployments that test champion and challenger models to find which provides the best results.
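For instance, a shadow deployment could be sketched with the Wallaroo SDK’s add_shadow_deploy pipeline step. The champion_model and challenger_model variables below stand in for previously uploaded models, which this tutorial does not create:
# Hedged sketch of a shadow deployment: the champion's results are returned
# to the caller while challengers run silently on the same inputs so their
# outputs can be compared later. champion_model and challenger_model are
# hypothetical models uploaded elsewhere.
shadow_pipeline = wl.build_pipeline("housing-shadow-pipe")
shadow_pipeline.add_shadow_deploy(champion_model, [challenger_model])
shadow_pipeline.deploy()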