Wallaroo SDK Essentials Guide: ML Workload Orchestration

How to create and manage ML Workload Orchestration through the Wallaroo SDK

Wallaroo provides ML Workload Orchestrations and Tasks to automate processes in a Wallaroo instance. For example:

Deploy a pipeline, retrieve data through a data connector, submit the data for inferences, undeploy the pipeline
Replace a model with a new version
Retrieve shadow deployed inference results and submit them to a database

Orchestration Flow

ML Workload Orchestration flow works within 3 tiers:

Tier	Description
ML Workload Orchestration	User-created custom instructions that provide automated processes that follow the same steps every time without error. Orchestrations contain the instructions to be performed, uploaded as a .ZIP file with the instructions, requirements, and artifacts.
Task	Instructions on when to run an Orchestration as a scheduled Task. Tasks can be Run Once, which creates a single Task Run, or Run Scheduled, which creates a Task Run on a regular schedule based on the Kubernetes cronjob specifications. If a Task is Run Scheduled, it will create a new Task Run every time the schedule parameters are met until the Task is killed.
Task Run	The execution of a Task. Task Runs are used to validate that business operations are successful and to identify any unsuccessful runs. If the Task is Run Once, then only one Task Run is generated. If the Task is a Run Scheduled task, then a new Task Run will be created each time the schedule parameters are met, with each Task Run having its own results and logs.

Consider the example of making donuts.

The ML Workload Orchestration is the recipe.
The Task is the order to make the donuts. It might be Run Once, so only one set of donuts is made, or Run Scheduled, so donuts are made every 2nd Sunday at 6 AM. If Run Scheduled, the donuts are made every time the schedule hits until the order is cancelled (aka killed).
The Task Runs are the donuts, each batch with its own receipt of creation (logs, etc.).

Orchestration Requirements

Orchestrations are uploaded to the Wallaroo instance as a ZIP file with the following requirements:

Parameter	Type	Description
User Code	(Required) Python script as `.py` files	If `main.py` exists in the root of the archive, then that will be used as the task entrypoint. Otherwise, the first `main.py` found in any subdirectory will be used as the entrypoint. If no `main.py` is found, the orchestration will not be accepted.
Python Library Requirements	(Optional) `requirements.txt` file in the requirements file format.	A standard Python requirements.txt for any dependencies to be provided in the task environment. The Wallaroo SDK will already be present and should not be included in the requirements.txt. Multiple requirements.txt files are not allowed.
Other artifacts		Other artifacts such as files, data, or code to support the orchestration.
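
The requirements in the table above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; the function name check_orchestration_zip is hypothetical and not part of the Wallaroo SDK, and the checks only mirror the rules stated in the table (a main.py must exist somewhere in the archive, and at most one requirements.txt is allowed). Wallaroo may apply further validation when it packages the orchestration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def check_orchestration_zip(zip_source):
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    with zipfile.ZipFile(zip_source) as archive:
        # Skip directory entries; only files matter here.
        names = [n for n in archive.namelist() if not n.endswith("/")]

    # An entrypoint is any file named main.py, at the root or in a subdirectory.
    if not any(PurePosixPath(n).name == "main.py" for n in names):
        problems.append("no main.py found")

    # Only one requirements.txt is allowed in the whole archive.
    req_count = sum(PurePosixPath(n).name == "requirements.txt" for n in names)
    if req_count > 1:
        problems.append(f"{req_count} requirements.txt files found, only one is allowed")

    return problems


# Demonstration with a throwaway in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("main.py", "print('hello')\n")
    archive.writestr("requirements.txt", "google-auth==2.11.0\n")
buffer.seek(0)
print(check_orchestration_zip(buffer))
```

The same function accepts a path on disk, for example check_orchestration_zip("hello.zip").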

Zip Instructions

In a terminal with the zip command, assemble artifacts as above and then create the archive. The zip command is included by default with the Wallaroo JupyterHub service.

zip commands take the following format, with {zipfilename}.zip as the zip file to save the artifacts to, and each file thereafter as the files to add to the archive.

zip {zipfilename}.zip file1 file2 file3 ...

For example, the following command will add the files main.py and requirements.txt into the file hello.zip.

$ zip hello.zip main.py requirements.txt 
  adding: main.py (deflated 47%)
  adding: requirements.txt (deflated 52%)

Example requirements.txt file

dbt-bigquery==1.4.3
dbt-core==1.4.5
dbt-extractor==0.4.1
dbt-postgres==1.4.5
google-api-core==2.8.2
google-auth==2.11.0
google-auth-oauthlib==0.4.6
google-cloud-bigquery==3.3.2
google-cloud-bigquery-storage==2.15.0
google-cloud-core==2.3.2
google-cloud-storage==2.5.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.4

Orchestration Recommendations

The following recommendations will make using Wallaroo orchestrations easier.

The version of Python used should match the version in the Wallaroo JupyterHub service.
The version of the Wallaroo SDK should match the version of the Wallaroo server. For a 2023.2.1 Wallaroo instance, use the Wallaroo SDK version 2023.2.1.
Specify the version of pip dependencies.
The wallaroo.Client constructor auth_type argument is ignored. Using wallaroo.Client() is sufficient.
The following methods will assist with orchestrations:
- wallaroo.in_task() : Returns True if the code is running within an orchestration task.
- wallaroo.task_args(): Returns a Dict of invocation-specific arguments passed to the run_ calls.
Orchestrations run in the same environment as the Wallaroo JupyterHub service: the same versions of Python libraries (unless specifically overridden by requirements.txt, which is not recommended) and the same virtualized directory /home/jovyan/.

Orchestration Code Samples

The following demonstrates using the wallaroo.in_task() and wallaroo.task_args() methods within an Orchestration. This sample code uses wallaroo.in_task() to verify whether or not the script is running as a Wallaroo Task. If True, it will gather the wallaroo.task_args() and use them to set the workspace and pipeline. If False, then it sets the pipeline and workspace manually.

import wallaroo

# connect to the Wallaroo instance
wl = wallaroo.Client()

# if True, we are running as a Task: get the arguments passed to it
if wl.in_task():
  arguments = wl.task_args()

  # arguments is a dict of key/value pairs; set the workspace and pipeline name
  workspace_name = arguments['workspace_name']
  pipeline_name = arguments['pipeline_name']

# False: we are not in a Task, so set the workspace and pipeline manually
else:
  workspace_name="bigqueryworkspace"
  pipeline_name="bigquerypipeline"
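
One variation on the sample above is to layer the task arguments over a set of defaults, so that an argument missing from json_args does not raise a KeyError. The following is a sketch under assumptions: resolve_settings and FakeClient are hypothetical names introduced here, and FakeClient only imitates the in_task() and task_args() calls used in the sample so the code runs without a Wallaroo instance.

```python
def resolve_settings(client, defaults):
    """Return task arguments layered over defaults when running as a Task, else the defaults."""
    settings = dict(defaults)
    if client.in_task():
        # Values passed via json_args override the defaults.
        settings.update(client.task_args())
    return settings


class FakeClient:
    """Hypothetical stand-in for wallaroo.Client() so the sketch runs anywhere."""

    def __init__(self, args=None):
        self._args = args

    def in_task(self):
        # Pretend we are in a Task only when arguments were supplied.
        return self._args is not None

    def task_args(self):
        return self._args or {}


defaults = {"workspace_name": "bigqueryworkspace", "pipeline_name": "bigquerypipeline"}

# Outside a Task: the defaults are used unchanged.
print(resolve_settings(FakeClient(), defaults))

# Inside a Task: only the supplied argument is overridden.
print(resolve_settings(FakeClient({"pipeline_name": "nightlypipeline"}), defaults))
```

In an orchestration's main.py, the client passed in would be the wl object created with wallaroo.Client().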

Orchestration Methods

The following methods are provided for creating and listing orchestrations.

Create Orchestration

An orchestration is created through the Wallaroo Client upload_orchestration(path) with the following parameters.

For uploads, either path (the path to the .zip file) is required, or both bytes_buffer and name are required. path cannot be used together with bytes_buffer and name, and vice versa.

Parameter	Type	Description
path	String (Optional)	The path to the .zip file that contains the orchestration package. Cannot be used when `bytes_buffer` and `name` are used.
file_name	String (Optional)	The file name to give to the zip file when uploaded.
bytes_buffer	[bytes] (Optional)	The .zip file object to be uploaded. Cannot be used with `path`. Note that if the zip file is uploaded through the `bytes_buffer` parameter and `file_name` is not included, then the file name in the Wallaroo orchestrations list will be `-`.
name	String (Optional)	Sets the name of the byte uploaded zip file.
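
When the archive is assembled in code rather than on disk, the bytes_buffer option applies. The following sketch builds the ZIP bytes in memory with the Python standard library; build_orchestration_bytes is a hypothetical helper, and the upload call is shown only as a comment because it needs a connected Wallaroo client. The parameter names in that comment come from the table above, and the orchestration name is made up.

```python
import io
import zipfile


def build_orchestration_bytes(files):
    """Pack a {archive_path: text} mapping into ZIP bytes held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for archive_path, text in files.items():
            archive.writestr(archive_path, text)
    return buffer.getvalue()


zip_bytes = build_orchestration_bytes({
    "main.py": "print('hello from the orchestration')\n",
    "requirements.txt": "google-auth==2.11.0\n",
})

# List what ended up in the archive.
print(zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist())

# With a connected client `wl`, the upload would then look like this:
# orchestration = wl.upload_orchestration(bytes_buffer=zip_bytes, name="hello orchestration")
```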

List Orchestrations

All orchestrations for a Wallaroo instance are listed via the Wallaroo Client list_orchestrations() method. It returns an array with the following.

Parameter	Type	Description
id	String	The UUID identifier for the orchestration.
last run status	String	The last reported status of the orchestration. Valid values are: `packaging`: The orchestration has been uploaded and is being prepared. `ready`: The orchestration is available to be used as a task.
sha	String	The sha value of the uploaded orchestration.
name	String	The name of the orchestration
filename	String	The name of the uploaded orchestration file.
created at	DateTime	The date and time the orchestration was uploaded to the Wallaroo instance.
updated at	DateTime	The date and time a new version of the orchestration was uploaded.

wl.list_orchestrations()

id	name	status	filename	sha	created at	updated at
0f90e606-09f8-409b-a306-cb04ec4c011a	comprehensive sample	ready	remote_inference.zip	b88e93...2396fb	2023-22-May 19:55:15	2023-22-May 19:56:09

Task Methods

Tasks are the implementation of an orchestration. Think of the orchestration as the instructions to follow, and the Task as the unit actually carrying them out.

Tasks are set at the workspace level.

Create Tasks

Tasks are created from an orchestration through the following methods.

Task Type	Description
`run_once`	Run the task once.
`run_scheduled`	Run on a schedule, repeating every time the schedule parameters are met until the task is killed.

Tasks have the following parameters.

Parameter	Type	Description
id	String	The UUID identifier for the task.
last run status	String	The last reported status of the task. Values are: `unknown`: The task has not been started or is being prepared. `ready`: The task is scheduled to execute. `running`: The task has started. `failure`: The task failed. `success`: The task completed.
type	String	The type of the task. Values are: `Temporary Run`: The task runs once then stops. `Scheduled Run`: The task repeats on a `cron` like schedule. `Service Run`: The task runs as a service and executes when its service port is activated.
active	Boolean	`True`: The task is scheduled or running. `False`: The task has completed or has been issued the `kill` command.
schedule	String	The `cron` style schedule for the task. If the task is not a scheduled one, then the schedule will be `-`.
created at	DateTime	The date and time the task was started.
updated at	DateTime	The date and time the task was updated.

Run Task Once

Temporary Run tasks are created from the Orchestration run_once(name, json_args, timeout) with the following parameters.

Parameter	Type	Description
name	String (Required)	The designated name of the task.
json_args	Dict (Required)	Arguments for the orchestration, such as `{ "dogs": 3.9, "cats": 8.1}`
timeout	int (Optional)	Timeout period in seconds.

task = orchestration.run_once(name="house price run once 2", json_args={"workspace_name": workspace_name, 
                                                                           "pipeline_name":pipeline_name,
                                                                           "connection_name": connection_name
                                                                           }
                            )
task

Field	Value
ID	f0e27d6a-6a98-4d26-b240-266f08560c48
Name	house price run once 2
Last Run Status	unknown
Type	Temporary Run
Active	True
Schedule	-
Created At	2023-22-May 19:58:32
Updated At	2023-22-May 19:58:32

Run Task Scheduled

A task can be scheduled via the Orchestration run_scheduled method.

Scheduled tasks are run every time the schedule period is met. This uses the same settings as the cron utility.

Scheduled tasks include the following parameters.

Parameter	Type	Description
name	String (Required)	The name of the task.
schedule	String (Required)	Schedule in the `cron` format of: `minute hour day_of_month month day_of_week`.
timeout	int (Optional)	Timeout period in seconds.
json_args	Dict (Required)	Arguments for the task, such as `{ "dogs": 3.9, "cats": 8.1}`

The schedule uses the same method as the cron service. For example, the following schedule:

schedule='42 * * * *'

Runs on the 42nd minute of every hour. The following schedule:

schedule='00 1 * * 0'

Indicates “At 1:00 AM on Sunday.”

For a shortcut in creating cron formatted schedules, see sites such as the Cron expression generator by Cronhub.
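
To make the five cron fields concrete, the following sketch names each field of a schedule string and checks whether a given moment would match it. It is a simplified illustration, not the scheduler Wallaroo or Kubernetes uses: the helper names are hypothetical, and only `*`, `*/n`, and single integers are supported (no ranges, lists, or month and weekday names). All restricted fields must match at once, which differs from some cron implementations when both day fields are restricted.

```python
from datetime import datetime

# Field names in cron order, each with the smallest value the field can take.
FIELDS = (
    ("minute", 0),
    ("hour", 0),
    ("day_of_month", 1),
    ("month", 1),
    ("day_of_week", 0),
)


def split_schedule(schedule):
    """Map the five space-separated cron fields to their names."""
    parts = schedule.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {schedule!r}")
    return {name: part for (name, _), part in zip(FIELDS, parts)}


def field_matches(expr, value, minimum):
    """Match a value against '*', '*/n', or a single integer."""
    if expr == "*":
        return True
    if expr.startswith("*/"):
        # Steps are counted from the start of the field's range.
        return (value - minimum) % int(expr[2:]) == 0
    return int(expr) == value


def schedule_matches(schedule, moment):
    """Return True if the moment satisfies every field of the schedule."""
    fields = split_schedule(schedule)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        "day_of_week": moment.isoweekday() % 7,
    }
    return all(field_matches(fields[name], actual[name], minimum) for name, minimum in FIELDS)


print(split_schedule("00 1 * * 0"))
# 22 May 2023 was a Monday, 21 May 2023 a Sunday.
print(schedule_matches("42 * * * *", datetime(2023, 5, 22, 19, 42)))
print(schedule_matches("00 1 * * 0", datetime(2023, 5, 21, 1, 0)))
print(schedule_matches("00 1 * * 0", datetime(2023, 5, 22, 1, 0)))
```

The last three calls print True, True, and False: the first two moments fall on the schedules described above, while the third is 1:00 AM on a Monday.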

task_scheduled = orchestration.run_scheduled(name="schedule example", 
                                             timeout=600, 
                                             schedule=schedule, 
                                             json_args={"workspace_name": workspace_name, 
                                                        "pipeline_name": pipeline_name,
                                                        "connection_name": connection_name
                                            })
task_scheduled

Field	Value
ID	4af57c61-dfa9-43eb-944e-559135495df4
Name	schedule example
Last Run Status	unknown
Type	Scheduled Run
Active	True
Schedule	*/5 * * * *
Created At	2023-22-May 20:08:25
Updated At	2023-22-May 20:08:25

List Tasks

The list of tasks in the Wallaroo instance is retrieved through the Wallaroo Client list_tasks() method, which accepts the following parameters.

Parameter	Type	Description
killed	Boolean (Optional Default: `False`)	Returns tasks depending on whether they have been issued the `kill` command. `False` returns all tasks whether killed or not. `True` only returns killed tasks.

This returns an array of the following, in reverse chronological order by updated at.

Parameter	Type	Description
id	String	The UUID identifier for the task.
last run status	String	The last reported status of the task. Values are: `unknown`: The task has not been started or is being prepared. `ready`: The task is scheduled to execute. `running`: The task has started. `failure`: The task failed. `success`: The task completed.
type	String	The type of the task. Values are: `Temporary Run`: The task runs once then stops. `Scheduled Run`: The task repeats on a `cron` like schedule. `Service Run`: The task runs as a service and executes when its service port is activated.
active	Boolean	`True`: The task is scheduled or running. `False`: The task has completed or has been issued the `kill` command.
schedule	String	The `cron` style schedule for the task. If the task is not a scheduled one, then the schedule will be `-`.
created at	DateTime	The date and time the task was started.
updated at	DateTime	The date and time the task was updated.

For example:

wl.list_tasks()

id	name	last run status	type	active	schedule	created at	updated at
f0e27d6a-6a98-4d26-b240-266f08560c48	house price run once 2	running	Temporary Run	True	-	2023-22-May 19:58:32	2023-22-May 19:58:38
36509ef8-98da-42a0-913f-e6e929dedb15	house price run once	success	Temporary Run	True	-	2023-22-May 19:56:37	2023-22-May 19:56:48

An individual task can be retrieved through the list_tasks() by specifying the task from the array returned. In this example, the first task listed from the list_tasks() method will be assigned to the task variable.

task = wl.list_tasks()[0]

Get Task Status

The status of a task is retrieved through the Task status() method and returns the following.

Parameter	Type	Description
status	String	The current status of the task. Values are: `pending`: The task has not been started or is being prepared. `started`: The task has started to execute.

display(task2.status())
'started'
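
Because a new task can report pending before it reports started, a script that depends on the task running may want to poll the status. The following is a generic polling sketch; wait_for_status is a hypothetical helper, not an SDK method, and the fake status source stands in for a Task's status() method so the code runs without a Wallaroo instance.

```python
import time


def wait_for_status(get_status, wanted, timeout=60.0, interval=5.0):
    """Call get_status() until it returns a value in `wanted` or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"last status was {status!r}")
        time.sleep(interval)


# Stand-in for task.status(): reports pending twice, then started.
fake_statuses = iter(["pending", "pending", "started"])
print(wait_for_status(lambda: next(fake_statuses), {"started"}, timeout=5, interval=0.01))
```

With a real task, the call would be wait_for_status(task.status, {"started"}), passing the bound method rather than calling it.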

Kill a Task

Killing a task removes the schedule or removes it from a service. Tasks are killed with the Task kill() method, which returns a message with the status of the kill procedure.

Note that a Task set to Run Scheduled will generate a new Task Run each time the schedule parameters are met until the Task is killed. A Task set to Run Once will generate only one Task Run, so does not need to be killed.

task2.kill()

<ArbexStatus.PENDING_KILL: 'pending_kill'>

Task Runs

Task Runs are generated from a Task. If the Task is Run Once, then only one Task Run is generated. If the Task is a Run Scheduled task, then a new Task Run will be created each time the schedule parameters are met, with each Task Run having its own results and logs.

Task Last Runs History

Each execution of a task is known as a task run. The history of a task's runs is retrieved with the Task last_runs method, which takes the following arguments.

Parameter	Type	Description
status	String (Optional, Default: `all`)	Filters the task history by the `status`. If `all`, returns all statuses. Status values are: `running`: The task has started. `failure`: The task failed. `success`: The task completed.
limit	Integer (Optional)	Limits the number of task runs returned.

This returns the following in reverse chronological order by updated at.

Parameter	Type	Description
task id	String	Task id in UUID format.
pod id	String	Pod id in UUID format.
status	String	Status of the task. Status values are: `running`: The task has started. `failure`: The task failed. `success`: The task completed.
created at	DateTime	Date and time the task was created.
updated at	DateTime	Date and time the task was updated.

task.last_runs()

task id	pod id	status	created at	updated at
f0e27d6a-6a98-4d26-b240-266f08560c48	7d9d73d5-df11-44ed-90c1-db0e64c7f9b8	success	2023-22-May 19:58:35	2023-22-May 19:58:35

Task Run Logs

The output of a task is displayed with the Task Run logs() method that takes the following parameters.

Parameter	Type	Description
limit	Integer (Optional)	Limits the lines returned from the task run log. The `limit` parameter is based on the log tail - starting from the last line of the log file, then working up until the limit of lines is reached. This is useful for viewing final outputs, exceptions, etc.

The Task Run logs() returns the log entries as a string list, with each entry as an item in the list.

IMPORTANT NOTE: It may take around a minute for task run logs to be integrated into the Wallaroo log database.

import time

# give time for the task to complete and the log entries to be recorded
time.sleep(60)
recent_run = task.last_runs()[0]
display(recent_run.logs())

2023-22-May 19:59:29 Getting the pipeline orchestrationpipelinetgiq
2023-22-May 19:59:29 Getting arrow table file
2023-22-May 19:59:29 Inference time.  Displaying results after.
2023-22-May 19:59:29 pyarrow.Table
2023-22-May 19:59:29 time: timestamp[ms]
2023-22-May 19:59:29 in.tensor: list<item: float> not null
2023-22-May 19:59:29   child 0, item: float
2023-22-May 19:59:29 out.variable: list<inner: float not null> not null
2023-22-May 19:59:29 check_failures: int8
2023-22-May 19:59:29   child 0, inner: float not null
2023-22-May 19:59:29 ----
2023-22-May 19:59:29 time: [[2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,...,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767,2023-05-22 19:58:49.767]]
2023-22-May 19:59:29 in.tensor: [[[4,2.5,2900,5505,2,...,2970,5251,12,0,0],[2,2.5,2170,6361,1,...,2310,7419,6,0,0],...,[3,1.75,2910,37461,1,...,2520,18295,47,0,0],[3,2,2005,7000,1,...,1750,4500,34,0,0]]]
2023-22-May 19:59:29 check_failures: [[0,0,0,0,0,...,0,0,0,0,0]]
2023-22-May 19:59:29 out.variable: [[[718013.75],[615094.56],...,[706823.56],[581003]]]
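
Since logs() returns a list of strings, each entry can be split into a timestamp and a message for filtering or sorting. The following sketch assumes every entry starts with a date and time in the form shown in the sample output above (year, day, month name, then HH:MM:SS). The sample only shows May, which reads the same abbreviated or in full, so both forms are tried. parse_log_line is a hypothetical helper, not part of the SDK.

```python
from datetime import datetime

# "May" is identical as an abbreviated (%b) and full (%B) month name,
# so both formats are attempted. English month names are assumed.
LOG_TIME_FORMATS = ("%Y-%d-%b %H:%M:%S", "%Y-%d-%B %H:%M:%S")


def parse_log_line(line):
    """Split one log entry into (timestamp, message)."""
    date_part, _, rest = line.partition(" ")
    time_part, _, message = rest.partition(" ")
    stamp = f"{date_part} {time_part}"
    for fmt in LOG_TIME_FORMATS:
        try:
            return datetime.strptime(stamp, fmt), message
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp in log line: {line!r}")


# Two entries copied from the sample output above.
sample = [
    "2023-22-May 19:59:29 Getting arrow table file",
    "2023-22-May 19:59:29 check_failures: int8",
]
for when, message in map(parse_log_line, sample):
    print(when.isoformat(), "|", message)
```

With a real task run, the list passed in would be recent_run.logs() instead of sample.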