Regular Batch Inference

Stage 4: Regular Batch Inference

In Stage 3: Deploy the Model in Wallaroo, the housing model created and tested in Stage 2: Training Process Automation Setup was uploaded to a Wallaroo instance and added to the pipeline housing-pipe in the workspace housepricing. This pipeline can be deployed at any time and used for new inferences.

For the purposes of this demo, let’s say that every month we find the newly entered and still-unsold houses and predict their sale price.

The predictions are entered into a staging table, for further inspection before being joined to the primary housing data table.

We show this as a notebook, but this can also be scripted and scheduled using cron or some other process.
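For example, the steps below could be collected into a standalone script and run on a schedule. The sketch here is illustrative only: the script name, path, and crontab entry are hypothetical and not part of this tutorial.

# batch_inference.py -- hypothetical script version of this notebook,
# scheduled monthly with a crontab entry such as:
#   0 2 1 * * /usr/bin/python3 /opt/housepricing/batch_inference.py
import wallaroo

def main():
    # Mirror the notebook steps below: connect to Wallaroo, deploy the
    # housing-pipe pipeline, pull the new listings, run inference, write
    # the estimates to the staging table, and undeploy the pipeline.
    pass

if __name__ == "__main__":
    main()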

Resources

The following resources are used as part of this tutorial:

  • data
    • data/seattle_housing_col_description.txt: Describes the columns used as part of the data analysis.
    • data/seattle_housing.csv: Sample data of the Seattle, Washington housing market between 2014 and 2015.
  • code
    • postprocess.py: Formats the data after the model inference is complete.
    • preprocess.py: Formats the incoming data for the model.
    • simdb.py: A simulated database to demonstrate sending and receiving queries.
    • wallaroo_client.py: Additional methods used with the Wallaroo instance to create workspaces, etc.
  • models
    • housing_model_xgb.onnx: Model created in Stage 2: Training Process Automation Setup.

Steps

This process will use the following steps:

  • Connect to Wallaroo
  • Deploy the Pipeline
  • Read In New House Listings
  • Send Predictions to Results Staging Table

Connect to Wallaroo

Connect to the Wallaroo instance and set the housepricing workspace as the current workspace.

import json
import pickle
import wallaroo
import pandas as pd
import numpy as np

import simdb # module for the purpose of this demo to simulate pulling data from a database

from wallaroo_client import get_workspace
wl = wallaroo.Client()
new_workspace = get_workspace("housepricing")
_ = wl.set_current_workspace(new_workspace)
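The get_workspace helper is provided by wallaroo_client.py. As a rough sketch of what such a helper typically does (the actual implementation may differ), it returns an existing workspace with the requested name, or creates one if none exists:

# Hypothetical sketch of a get_workspace helper; the version imported above
# from wallaroo_client.py may differ.
def get_workspace(name, client=None):
    client = client or wallaroo.Client()
    # Reuse an existing workspace with this name if one is found ...
    for workspace in client.list_workspaces():
        if workspace.name() == name:
            return workspace
    # ... otherwise create a new workspace with that name.
    return client.create_workspace(name)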

Deploy the Pipeline

Deploy the housing-pipe pipeline established in Stage 3: Deploy the Model in Wallaroo (03_deploy_model.ipynb).

pipeline = wl.pipelines_by_name("housing-pipe")[-1]
pipeline.deploy()
Waiting for deployment - this will take up to 45s ...... ok
name          housing-pipe
created       2022-09-28 20:53:36.296407+00:00
last_updated  2022-09-28 21:19:22.233409+00:00
deployed      True
tags
steps         preprocess
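Optionally, the deployment can be verified before submitting data. This quick check is not part of the original notebook and assumes the SDK's pipeline.status() method in this version reports a Running state once deployment completes.

# Optional check: confirm the pipeline reports a running status
# before submitting the batch.
status = pipeline.status()
print(status['status'])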

Read In New House Listings

From the data store, load the previous month’s house listings and submit them to the deployed pipeline.

conn = simdb.simulate_db_connection()

# create the query
query = f"select * from {simdb.tablename} where date > DATE(DATE(), '-1 month') AND sale_price is NULL"
print(query)

# read in the data
newbatch = pd.read_sql_query(query, conn)
newbatch.shape
select * from house_listings where date > DATE(DATE(), '-1 month') AND sale_price is NULL


(1090, 22)
query = {'query': newbatch.to_json()}
result = pipeline.infer(query)[0]
predicted_prices = result.data()[0]
len(predicted_prices)
1090
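Before staging the results, a quick sanity check on the returned values can catch obvious issues. This optional step is not part of the original notebook and only uses pandas.

# Optional sanity check: summary statistics of the predicted sale prices.
pd.Series(predicted_prices, name='saleprice_estimate').describe()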

Send Predictions to Results Staging Table

Take the predicted prices from the inference results and write them to a staging table so they can later be joined into the house_listings table.

Once complete, undeploy the pipeline to return the resources back to the Kubernetes environment.

result_table = pd.DataFrame({
    'id': newbatch['id'],
    'saleprice_estimate': predicted_prices,
})

result_table.to_sql('results_table', conn, index=False, if_exists='append')
# Display the top of the table for confirmation
pd.read_sql_query("select * from results_table limit 5", conn)
           id  saleprice_estimate
0  9215400105            508255.0
1  1695900060            500198.0
2  9545240070            539598.0
3  1432900240            270739.0
4  6131600075            191304.0
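After the staged estimates have been inspected, they would be joined back into the primary housing data table as described earlier. That review and join happens outside this notebook; the query below is a hypothetical illustration run against the same simulated database.

# Hypothetical follow-up step (not part of this notebook): join the reviewed
# estimates back to the primary house_listings table by id.
join_query = """
    select h.*, r.saleprice_estimate
    from house_listings h
    join results_table r on h.id = r.id
"""
joined = pd.read_sql_query(join_query, conn)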
conn.close()
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s .................................... ok
name          housing-pipe
created       2022-09-28 20:53:36.296407+00:00
last_updated  2022-09-28 21:19:22.233409+00:00
deployed      False
tags
steps         preprocess

From here, organizations can automate this process. Other Wallaroo features can extend it, such as data analysis using Wallaroo assays, or shadow deployments that test champion and challenger models to find which provides the best results.