Real Estate: ML Workload Orchestrations
Tutorial Notebook 4: Automation with ML Workload Orchestrations
Wallaroo provides Data Connections and ML Workload Orchestrations to provide organizations with a method of creating and managing automated tasks that can either be run on demand or a regular schedule.
Prerequisites
- A Wallaroo instance version 2023.2.1 or greater.
References
- Wallaroo SDK Essentials Guide: Model Uploads and Registrations: Python Models
- Wallaroo SDK Essentials Guide: Pipeline Management
- Wallaroo SDK Essentials Guide: ML Workload Orchestration
Orchestrations, Taks, and Tasks Runs
We’ve details how Wallaroo Connections work. Now we’ll use Orchestrations, Tasks, and Task Runs.
Item | Description |
---|---|
Orchestration | ML Workload orchestration allows data scientists and ML Engineers to automate and scale production ML workflows in Wallaroo to ensure a tight feedback loop and continuous tuning of models from training to production. Wallaroo platform users (data scientists or ML Engineers) have the ability to deploy, automate and scale recurring batch production ML workloads that can ingest data from predefined data sources to run inferences in Wallaroo, chain pipelines, and send inference results to predefined destinations to analyze model insights and assess business outcomes. |
Task | An implementation of an Orchestration. Tasks can be either Run Once : They run once and upon completion, stop. Run Scheduled : The task runs whenever a specific cron like schedule is reached. Scheduled tasks will run until the kill command is issued. |
Task Run | The execusion of a task. For Run Once tasks, there will be only one Run Task . A Run Scheduled tasks will have multiple tasks, one for every time the schedule parameter is met. Task Runs have their own log files that can be examined to track progress and results. |
Statsmodel Forecast Connection Steps
Import Libraries
The first step is to import the libraries that we will need.
import json
import os
import datetime
import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework
# used to display dataframe information without truncating
from IPython.display import display
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)
import time
display(wallaroo.__version__)
'2023.2.1'
Connect to the Wallaroo Instance
The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client()
command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client()
. For more information on Wallaroo Client settings, see the Client Connection guide.
# Login through local Wallaroo instance
wl = wallaroo.Client()
Set the Workspace and Pipeline
The workspace will be either used or created if it does not exist, along with the pipeline. The models uploaded in the Upload and Deploy tutorial are referenced in this step.
## convenience functions from the previous notebooks
## these functions assume your connection to wallaroo is called wl
# return the workspace called <name>, or create it if it does not exist.
# this function assumes your connection to wallaroo is called wl
def get_workspace(name):
workspace = None
for ws in wl.list_workspaces():
if ws.name() == name:
workspace= ws
if(workspace == None):
workspace = wl.create_workspace(name)
return workspace
# pull a single datum from a data frame
# and convert it to the format the model expects
def get_singleton(df, i):
singleton = df.iloc[i,:].to_numpy().tolist()
sdict = {'tensor': [singleton]}
return pd.DataFrame.from_dict(sdict)
# pull a batch of data from a data frame
# and convert to the format the model expects
def get_batch(df, first=0, nrows=1):
last = first + nrows
batch = df.iloc[first:last, :].to_numpy().tolist()
return pd.DataFrame.from_dict({'tensor': batch})
# Translated a column from a dataframe into a single array
# used for the Statsmodel forecast model
def get_singleton_forecast(df, field):
singleton = pd.DataFrame({field: [df[field].values.tolist()]})
return singleton
# Get the most recent version of a model in the workspace
# Assumes that the most recent version is the first in the list of versions.
# wl.get_current_workspace().models() returns a list of models in the current workspace
def get_model(mname):
modellist = wl.get_current_workspace().models()
model = [m.versions()[-1] for m in modellist if m.name() == mname]
if len(model) <= 0:
raise KeyError(f"model {mname} not found in this workspace")
return model[0]
# get a pipeline by name in the workspace
def get_pipeline(pname):
plist = wl.get_current_workspace().pipelines()
pipeline = [p for p in plist if p.name() == pname]
if len(pipeline) <= 0:
raise KeyError(f"pipeline {pname} not found in this workspace")
return pipeline[0]
# retrieve the workspace, pipeline and model
workspace_name = "tutorial-workspace-jch"
workspace = get_workspace(workspace_name)
# set your current workspace to the workspace that you just created
wl.set_current_workspace(workspace)
# optionally, examine your current workspace
wl.get_current_workspace()
pipeline_name = 'tutorialpipeline-jch'
pipeline = get_pipeline(pipeline_name)
control_model = get_model('tutorial-model')
Deploy Pipeline
The pipeline is already set witht the model. For our demo we’ll verify that it’s deployed.
# Set the deployment to allow for additional engines to run
# Undeploy and clear the pipeline in case it was used in other demonstrations
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(control_model)
# pipeline.add_model_step(step)
pipeline.deploy()
name | tutorialpipeline-jch |
---|---|
created | 2023-08-03 19:36:31.732163+00:00 |
last_updated | 2023-08-03 19:51:09.492808+00:00 |
deployed | True |
tags | |
versions | 489fdc01-e8c2-4d72-9640-2f33416d3941, 1a73448b-9012-4258-bb2c-a4d25a1e6f19, d1d0cafe-78a9-4193-84af-cec1b3ed608b, 70438418-4802-4ced-a295-65ef78706fd4 |
steps | tutorial-model |
Sample Inference
Verify the pipeline is deployed properly with a sample inference.
# sample inference from previous code here
df_from_csv = pd.read_csv('../data/test_data.csv')
singleton = get_singleton(df_from_csv, 0)
display(singleton)
single_result = pipeline.infer(singleton)
display(single_result)
tensor | |
---|---|
0 | [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2900.0, 0.0, 47.6063, -122.02, 2970.0, 5251.0, 12.0, 0.0, 0.0] |
time | in.tensor | out.variable | check_failures | |
---|---|---|---|---|
0 | 2023-08-03 19:51:27.001 | [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2900.0, 0.0, 47.6063, -122.02, 2970.0, 5251.0, 12.0, 0.0, 0.0] | [704901.9] | 0 |
Sample Orchestration
The orchestration that will automate this process is ./real-estate-orchestration/real-estate-orchestration.zip
. The files used are stored in the directory forecast-orchestration
, created with the command:
zip -r real-estate-orchestration.zip real-estate-orchestration/
.
This contains the following:
requirements.txt
: The Python requirements file to specify the following libraries used. For this example, that will be empty since we will be using themain.py
: The entry file that uses a deployed pipeline and performs an inference request against it visible from its log files.data/
: Inference data sources.
The main.py
script performs a workspace and pipeline retrieval, then an inference against the inference input file.
import wallaroo
from wallaroo.object import EntityNotFoundError
import pandas as pd
wl = wallaroo.Client()
# get the arguments
arguments = wl.task_args()
if "workspace_name" in arguments:
workspace_name = arguments['workspace_name']
else:
workspace_name="forecast-model-tutorial"
if "pipeline_name" in arguments:
pipeline_name = arguments['pipeline_name']
else:
pipeline_name="bikedaypipe"
def get_workspace(name):
workspace = None
for ws in wl.list_workspaces():
if ws.name() == name:
workspace= ws
return workspace
def get_pipeline(name):
try:
pipeline = wl.pipelines_by_name(name)[0]
except EntityNotFoundError:
print(f"Pipeline not found:{name}")
return pipeline
def get_singleton_forecast(df, field):
singleton = pd.DataFrame({field: [df[field].values.tolist()]})
return singleton
print(f"Workspace: {workspace_name}")
workspace = get_workspace(workspace_name)
wl.set_current_workspace(workspace)
print(workspace)
# the pipeline is assumed to be deployed
print(f"Pipeline: {pipeline_name}")
pipeline = get_pipeline(pipeline_name)
print(pipeline)
print(pipeline.status())
# get sample inference data
df_from_csv = pd.read_csv('./data/test_data.csv')
singleton = get_singleton(df_from_csv, 0)
display(singleton)
single_result = pipeline.infer(singleton)
display(single_result)
print(results)
A few things to go over here. You’ll notice this is almost the exact procedures we’ve been following so far: we get a workspace and pipeline, pull data from a CSV file, and perform an inference off the data.
This script assumes that the pipeline has already been deployed, and also includes this part:
arguments = wl.task_args()
This allows us to pass arguments into a Task created from an Orchestration, so we can specify a different workspace, pipeline, or any other arguments we construct. This allows orchestrations to be very flexible.
Also, notice that it refers to a specific file:
sample_count = pd.read_csv('./data/test_data.csv')
In the forecast-orchestration
directory is the data
directory with our sample CSV file. Orchestrations can include additional artifacts. We could have used a Wallaroo Connection instead, and we encourage you to try that if you want.
Upload the Orchestration
Orchestrations are uploaded with the Wallaroo client upload_orchestration(path)
method with the following parameters.
Parameter | Type | Description |
---|---|---|
path | string (Required) | The path to the ZIP file to be uploaded. |
Once uploaded, the deployment will be prepared and any requirements will be downloaded and installed. A typical orchestration upload looks like this:
my_orchestration = wl.upload_orchestration(path-to-zip-file)
Try uploading our orchestration from ./forecast-orchestration/forecast-orchestration.zip
- or make your own and upload it.
Once uploaded, you can check the status with the status()
. If using the orchestration example above, that would be my_orchestration.status()
This is handy to make into a loop to check the status until is shows ready
.
orchestration = wl.upload_orchestration(name="house price orchestration example", path="../real-estate-orchestration/real-estate-orchestration.zip")
while orchestration.status() != 'ready':
print(orchestration.status())
time.sleep(5)
pending_packaging
pending_packaging
packaging
packaging
packaging
packaging
packaging
packaging
packaging
packaging
List Orchestrations
Orchestrations are listed with the Wallaroo Client list_orchestrations()
method. Orchestrations can be retrieved to a variable by allocated their position in the array - for example: orchestration = wl.list_orchestrations()[0]
would return the first orchestration on the list.
# list orchestration here
wl.list_orchestrations()
id | name | status | filename | sha | created at | updated at |
---|---|---|---|---|---|---|
68fc016f-a013-46a8-95e3-149d46f2ca0c | house price orchestration example | ready | real-estate-orchestration.zip | 2ca71f...8bff63 | 2023-03-Aug 19:51:33 | 2023-03-Aug 19:52:26 |
# retrieve the orchestration from the list
orchestration_from_list = wl.list_orchestration()[-1]
orchestration_from_list
Create the Task
The orchestration is now ready to be implemented as a Wallaroo Task. We’ll just run it once as an example. This specific Orchestration that creates the Task assumes that the pipeline is deployed, and accepts the arguments:
- workspace_name
- pipeline_name
Tasks are either Run Once, or Run Scheduled. We create a new task from the Orchestration with either run_once(task_name, json_args, timeout)
or with run_scheduled(name, timeout,schedule,json_args)
. The schedule is based on the Kubernetes cron scheduler. For example:
schedule={'42 * * * *'}
Runs every 42 minutes and contains the answer to life, the universe, and everything.
Creating a scheduled task might be:
task_scheduled = orchestration.run_scheduled(name="schedule example",
timeout=600,
schedule=schedule,
json_args={"workspace_name": workspace_name,
"pipeline_name": pipeline_name})
Using the uploaded orchestration, create a Run Once task using your workspace and pipeline names as the json_args
.
# create your task here
# create the task here
task = orchestration.run_once(name="real estate task",
json_args={"workspace_name":workspace_name,
"pipeline_name":pipeline_name}
)
Monitor Run with Task Status
We’ll monitor the run first with it’s status with the Task.status()
command.
Get the status of the task, and once it is started
proceed to the next step. Try doing it as a while
loop if you feel confident.
while task.status() != "started":
display(task.status())
time.sleep(5)
'pending'
List Tasks
We’ll use the Wallaroo client list_tasks
method to view the tasks currently running or scheduled.
wl.list_tasks()
id | name | last run status | type | active | schedule | created at | updated at |
---|---|---|---|---|---|---|---|
8bec66fb-4f6e-4257-a476-cf39b593db22 | real estate task | success | Temporary Run | True | - | 2023-03-Aug 19:52:51 | 2023-03-Aug 19:52:56 |
Display Task Run Results
The Task Run is the implementation of the task - the actual running of the script and it’s results. Tasks that are Run Once will only have one Task Run, while a Task set to Run Scheduled will have a Task Run for each time the task is executed. Each Task Run has its own set of logs and results that are monitored through the Task Run logs()
method.
First, get the Task Run - this is the actual execution of a Task. The Task is the scheduled run of an Orchestration. The Task Run is the implementation of a scheduled Task. A Run Once Task while generate one Task Run, while a Scheduled Task generated a new Task Run every time the schedule pattern is met until the Task is killed.
We retrieve the task runs with the Task last_runs()
method, and assign a single Task Run to a variable by selecting it with the list with last_runs()[index]
. If you only have one Task Run from a Task, then you can just set the index
to 0.
Retrieve the task run for our generated task, then start checking the logs for our task run. It may take longer than 30 seconds to launch the task, so be prepared to run the .logs()
method again to view the logs.
task
Field | Value |
---|---|
ID | 8bec66fb-4f6e-4257-a476-cf39b593db22 |
Name | real estate task |
Last Run Status | success |
Type | Temporary Run |
Active | True |
Schedule | - |
Created At | 2023-03-Aug 19:52:51 |
Updated At | 2023-03-Aug 19:52:56 |
task_run = task.last_runs()[0]
The Task Run Status is checked with the _status
method. This lets you know if there was a failure or if it ran successfully. If it didn’t, you can still get the task run logs to find out why.
task_run._status
'success'
Retrieve Task Run Logs
The Task Run logs are retrieved with the Wallaroo task runs log()
, and shows the outputs of the results. This is why it’s useful to have print
commands in your code to track what it’s doing.
task_run.logs()
2023-03-Aug 19:52:58 Workspace: tutorial-workspace-jch
2023-03-Aug 19:52:58 {'name': 'tutorial-workspace-jch', 'id': 19, 'archived': False, 'created_by': '0a36fba2-ad42-441b-9a8c-bac8c68d13fa', 'created_at': '2023-08-03T19:34:42.324336+00:00', 'models': [{'name': 'tutorial-model', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2023, 8, 3, 19, 36, 31, 13200, tzinfo=tzutc()), 'created_at': datetime.datetime(2023, 8, 3, 19, 36, 31, 13200, tzinfo=tzutc())}], 'pipelines': [{'name': 'tutorialpipeline-jch', 'create_time': datetime.datetime(2023, 8, 3, 19, 36, 31, 732163, tzinfo=tzutc()), 'definition': '[]'}]}
2023-03-Aug 19:52:58 Pipeline: tutorialpipeline-jch
2023-03-Aug 19:52:58 {'name': 'tutorialpipeline-jch', 'create_time': datetime.datetime(2023, 8, 3, 19, 36, 31, 732163, tzinfo=tzutc()), 'definition': '[]'}
2023-03-Aug 19:52:58 {'status': 'Running', 'details': [], 'engines': [{'ip': '10.244.2.246', 'name': 'engine-6586d95b48-5k6zl', 'status': 'Running', 'reason': None, 'details': [], 'pipeline_statuses': {'pipelines': [{'id': 'tutorialpipeline-jch', 'status': 'Running'}]}, 'model_statuses': {'models': [{'name': 'tutorial-model', 'version': '68b3f094-1b0f-4f6e-940e-4dc1cb1500b2', 'sha': 'ed6065a79d841f7e96307bb20d5ef22840f15da0b587efb51425c7ad60589d6a', 'status': 'Running'}]}}], 'engine_lbs': [{'ip': '10.244.4.154', 'name': 'engine-lb-584f54c899-k8n25', 'status': 'Running', 'reason': None, 'details': []}], 'sidekicks': []}
2023-03-Aug 19:52:58 0 [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0,...
2023-03-Aug 19:52:58 tensor
2023-03-Aug 19:52:58 0 2023-08-03 19:52:58.828 ... 0
2023-03-Aug 19:52:58 time ... check_failures
2023-03-Aug 19:52:58
2023-03-Aug 19:52:58 [1 rows x 4 columns]
You have now walked through setting up a basic assay and running it over historical data.
Congratulations!
In this tutorial you have
- Deployed a single step house price prediction pipeline and sent data to it.
- Uploaded an ML Orchestration into Wallaroo.
- Created a Run Once Task from the Orchestration.
- Viewed the Task Run’s status generated from the Task.
- Viewed the Task Run’s logs.
Great job!
Cleaning up.
Now that the tutorial is complete, don’t forget to undeploy your pipeline to free up the resources.
pipeline.undeploy()
name | tutorialpipeline-jch |
---|---|
created | 2023-08-03 19:36:31.732163+00:00 |
last_updated | 2023-08-03 19:51:09.492808+00:00 |
deployed | False |
tags | |
versions | 489fdc01-e8c2-4d72-9640-2f33416d3941, 1a73448b-9012-4258-bb2c-a4d25a1e6f19, d1d0cafe-78a9-4193-84af-cec1b3ed608b, 70438418-4802-4ced-a295-65ef78706fd4 |
steps | tutorial-model |