Data Connections Tutorials

How to use Wallaroo Connections to define external data connections and use them for inferences.

This tutorial can be downloaded as part of the Wallaroo Tutorials repository.

Connections Comprehensive Tutorial

This tutorial provides a complete set of methods and examples regarding Wallaroo connections.

Wallaroo provides data connections, orchestrations, and tasks to provide organizations with a method of creating and managing automated tasks that can either be run on demand, on a regular schedule, or as a service so they respond to requests.

Object | Description
Connection | Definitions set by MLOps engineers that are used by other Wallaroo users for connection information to a data source. Usually paired with orchestrations.

This tutorial demonstrates the following.

  1. Create a connection defined with information such as username, connection URL, tokens, etc.
  2. Apply one or more connections to a workspace for users to implement in their code or orchestrations.
  3. Perform sample inferences using the data connection.

Tutorial Required Libraries

The following libraries are required for this tutorial, and included by default in a Wallaroo instance’s JupyterHub service.

  • IMPORTANT NOTE: These libraries are already installed in the Wallaroo JupyterHub service. Do not uninstall and reinstall the Wallaroo SDK with the command below.

  • wallaroo: The Wallaroo SDK.

  • pandas: The pandas data analysis library.

  • pyarrow: The Apache Arrow Python library.

The specific versions used are set in the file ./resources/requirements.txt. Supported libraries are automatically installed with the pip or conda commands. For example, from the root of this tutorials folder:

pip install -r ./resources/requirements.txt
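Before running the rest of the tutorial, it can help to confirm which versions of the required libraries are actually installed. The following is a minimal sketch using only the Python standard library. It assumes the distribution names match the library names listed above (wallaroo, pandas, pyarrow), and the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the list of required libraries above.
REQUIRED_LIBRARIES = ("wallaroo", "pandas", "pyarrow")


def installed_versions(names=REQUIRED_LIBRARIES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

The reported versions can then be compared by eye against ./resources/requirements.txt.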

Initialization

The first step is to connect to a Wallaroo instance. We’ll load the libraries and set our client connection settings.

Workspace, Model and Pipeline Setup

For this tutorial, we’ll create a workspace, upload our sample model and deploy a pipeline. We’ll perform some quick sample inferences to verify that everything is working.

import wallaroo
from wallaroo.object import EntityNotFoundError

# to display dataframe tables
from IPython.display import display
# used to display dataframe information without truncating
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
import pyarrow as pa

import requests

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

# Login through local Wallaroo instance

wl = wallaroo.Client()
# Setting variables for later steps

workspace_name = 'orchestrationworkspace'
pipeline_name = 'orchestrationpipeline'
model_name = 'orchestrationmodel'
model_file_name = './models/rf_model.onnx'
connection_name = 'houseprice_arrow_table'

Create the Workspace and Pipeline

We’ll now create our workspace and pipeline for the tutorial. If this tutorial has been run previously, then this will retrieve the existing ones with the assumption they’re for use with this tutorial.

We’ll set the retrieved workspace as the current workspace in the SDK, so all commands will default to that workspace.

workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)
wl.set_current_workspace(workspace)

pipeline = wl.build_pipeline(pipeline_name)

Upload the Model and Deploy Pipeline

We’ll upload our model into our sample workspace, then add it as a pipeline step before deploying the pipeline so it’s ready to accept inference requests.

# Upload the model

housing_model_control = (wl.upload_model(model_name, 
                                         model_file_name, 
                                         framework=wallaroo.framework.Framework.ONNX)
                                         .configure(tensor_fields=["tensor"])
                        )

# Add the model as a pipeline step

pipeline.add_model_step(housing_model_control)
name | orchestrationpipeline
created | 2024-12-09 21:09:36.107470+00:00
last_updated | 2024-12-09 21:10:01.573281+00:00
deployed | True
workspace_id | 12
workspace_name | orchestrationworkspace
arch | x86
accel | none
tags | (none)
versions | 0492779b-3b18-4d29-a87c-b4cfaae53c5a, 67aed035-88b0-42a7-9a30-3d85c2617d37, 7f62cc3a-2f7f-41fc-bdc8-f598fe6a1926
steps | orchestrationmodel
published | False

#deploy the pipeline
pipeline.deploy(wait_for_status=False)
Deployment initiated for orchestrationpipeline. Please check pipeline status.

name | orchestrationpipeline
created | 2024-12-09 21:09:36.107470+00:00
last_updated | 2024-12-09 21:10:04.259743+00:00
deployed | True
workspace_id | 12
workspace_name | orchestrationworkspace
arch | x86
accel | none
tags | (none)
versions | ff669947-df11-4971-acd8-9f71c31e51a0, 0492779b-3b18-4d29-a87c-b4cfaae53c5a, 67aed035-88b0-42a7-9a30-3d85c2617d37, 7f62cc3a-2f7f-41fc-bdc8-f598fe6a1926
steps | orchestrationmodel
published | False

# check the pipeline status before performing an inference

import time

while pipeline.status()['status'] != 'Running':
    time.sleep(15)

pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.28.0.8',
   'name': 'engine-8574cf699d-4nwg8',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'orchestrationpipeline',
      'status': 'Running',
      'version': 'ff669947-df11-4971-acd8-9f71c31e51a0'}]},
   'model_statuses': {'models': [{'model_version_id': 7,
      'name': 'orchestrationmodel',
      'sha': 'e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6',
      'status': 'Running',
      'version': 'd2ff6835-e736-489e-8b4b-c3aa3684a96a'}]}}],
 'engine_lbs': [{'ip': '10.28.0.7',
   'name': 'engine-lb-6676794678-g6j6q',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': []}
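
The polling loop above waits indefinitely if the pipeline never reaches the Running state. One possible variation, sketched below, adds a timeout. The helper wait_for_status and its parameters are hypothetical and not part of the Wallaroo SDK. It only assumes that the status callable returns a dict with a 'status' key, as in the output above. A stand-in callable is used here so the sketch runs without a Wallaroo instance.

```python
import time


def wait_for_status(get_status, target="Running", timeout=300.0, interval=15.0):
    """Poll get_status() until its 'status' field equals target.

    get_status is any zero-argument callable returning a dict shaped like the
    pipeline.status() output above. Raises TimeoutError if the target status
    is not reached within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status.get("status") == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status was {status.get('status')!r} after {timeout} seconds"
            )
        time.sleep(interval)


# Stand-in for pipeline.status: reports 'Starting' twice, then 'Running'.
_responses = iter(["Starting", "Starting", "Running"])
final = wait_for_status(lambda: {"status": next(_responses)}, interval=0.01)
print(final["status"])
```

With a real pipeline, the call would presumably look like wait_for_status(pipeline.status), keeping the default 15 second interval.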

Sample Inferences

We’ll perform some quick sample inferences using an Apache Arrow table as the input. Once that’s finished, we’ll undeploy the pipeline and return the resources back to the Wallaroo instance.

# sample inferences

batch_inferences = pipeline.infer_from_file('./data/xtest-1k.arrow')

large_inference_result =  batch_inferences.to_pandas()
display(large_inference_result.head(20))
index | time | in.tensor | out.variable | anomaly.count
0 | 2024-12-09 21:10:09.099 | [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2900.0, 0.0, 47.6063, -122.02, 2970.0, 5251.0, 12.0, 0.0, 0.0] | [718013.75] | 0
1 | 2024-12-09 21:10:09.099 | [2.0, 2.5, 2170.0, 6361.0, 1.0, 0.0, 2.0, 3.0, 8.0, 2170.0, 0.0, 47.7109, -122.017, 2310.0, 7419.0, 6.0, 0.0, 0.0] | [615094.56] | 0
2 | 2024-12-09 21:10:09.099 | [3.0, 2.5, 1300.0, 812.0, 2.0, 0.0, 0.0, 3.0, 8.0, 880.0, 420.0, 47.5893, -122.317, 1300.0, 824.0, 6.0, 0.0, 0.0] | [448627.72] | 0
3 | 2024-12-09 21:10:09.099 | [4.0, 2.5, 2500.0, 8540.0, 2.0, 0.0, 0.0, 3.0, 9.0, 2500.0, 0.0, 47.5759, -121.994, 2560.0, 8475.0, 24.0, 0.0, 0.0] | [758714.2] | 0
4 | 2024-12-09 21:10:09.099 | [3.0, 1.75, 2200.0, 11520.0, 1.0, 0.0, 0.0, 4.0, 7.0, 2200.0, 0.0, 47.7659, -122.341, 1690.0, 8038.0, 62.0, 0.0, 0.0] | [513264.7] | 0
5 | 2024-12-09 21:10:09.099 | [3.0, 2.0, 2140.0, 4923.0, 1.0, 0.0, 0.0, 4.0, 8.0, 1070.0, 1070.0, 47.6902, -122.339, 1470.0, 4923.0, 86.0, 0.0, 0.0] | [668288.0] | 0
6 | 2024-12-09 21:10:09.099 | [4.0, 3.5, 3590.0, 5334.0, 2.0, 0.0, 2.0, 3.0, 9.0, 3140.0, 450.0, 47.6763, -122.267, 2100.0, 6250.0, 9.0, 0.0, 0.0] | [1004846.5] | 0
7 | 2024-12-09 21:10:09.099 | [3.0, 2.0, 1280.0, 960.0, 2.0, 0.0, 0.0, 3.0, 9.0, 1040.0, 240.0, 47.602, -122.311, 1280.0, 1173.0, 0.0, 0.0, 0.0] | [684577.2] | 0
8 | 2024-12-09 21:10:09.099 | [4.0, 2.5, 2820.0, 15000.0, 2.0, 0.0, 0.0, 4.0, 9.0, 2820.0, 0.0, 47.7255, -122.101, 2440.0, 15000.0, 29.0, 0.0, 0.0] | [727898.1] | 0
9 | 2024-12-09 21:10:09.099 | [3.0, 2.25, 1790.0, 11393.0, 1.0, 0.0, 0.0, 3.0, 8.0, 1790.0, 0.0, 47.6297, -122.099, 2290.0, 11894.0, 36.0, 0.0, 0.0] | [559631.1] | 0
10 | 2024-12-09 21:10:09.099 | [3.0, 1.5, 1010.0, 7683.0, 1.5, 0.0, 0.0, 5.0, 7.0, 1010.0, 0.0, 47.72, -122.318, 1550.0, 7271.0, 61.0, 0.0, 0.0] | [340764.53] | 0
11 | 2024-12-09 21:10:09.099 | [3.0, 2.0, 1270.0, 1323.0, 3.0, 0.0, 0.0, 3.0, 8.0, 1270.0, 0.0, 47.6934, -122.342, 1330.0, 1323.0, 8.0, 0.0, 0.0] | [442168.06] | 0
12 | 2024-12-09 21:10:09.099 | [4.0, 1.75, 2070.0, 9120.0, 1.0, 0.0, 0.0, 4.0, 7.0, 1250.0, 820.0, 47.6045, -122.123, 1650.0, 8400.0, 57.0, 0.0, 0.0] | [630865.6] | 0
13 | 2024-12-09 21:10:09.099 | [4.0, 1.0, 1620.0, 4080.0, 1.5, 0.0, 0.0, 3.0, 7.0, 1620.0, 0.0, 47.6696, -122.324, 1760.0, 4080.0, 91.0, 0.0, 0.0] | [559631.1] | 0
14 | 2024-12-09 21:10:09.099 | [4.0, 3.25, 3990.0, 9786.0, 2.0, 0.0, 0.0, 3.0, 9.0, 3990.0, 0.0, 47.6784, -122.026, 3920.0, 8200.0, 10.0, 0.0, 0.0] | [909441.1] | 0
15 | 2024-12-09 21:10:09.099 | [4.0, 2.0, 1780.0, 19843.0, 1.0, 0.0, 0.0, 3.0, 7.0, 1780.0, 0.0, 47.4414, -122.154, 2210.0, 13500.0, 52.0, 0.0, 0.0] | [313096.0] | 0
16 | 2024-12-09 21:10:09.099 | [4.0, 2.5, 2130.0, 6003.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2130.0, 0.0, 47.4518, -122.12, 1940.0, 4529.0, 11.0, 0.0, 0.0] | [404040.8] | 0
17 | 2024-12-09 21:10:09.099 | [3.0, 1.75, 1660.0, 10440.0, 1.0, 0.0, 0.0, 3.0, 7.0, 1040.0, 620.0, 47.4448, -121.77, 1240.0, 10380.0, 36.0, 0.0, 0.0] | [292859.5] | 0
18 | 2024-12-09 21:10:09.099 | [3.0, 2.5, 2110.0, 4118.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2110.0, 0.0, 47.3878, -122.153, 2110.0, 4044.0, 25.0, 0.0, 0.0] | [338357.88] | 0
19 | 2024-12-09 21:10:09.099 | [4.0, 2.25, 2200.0, 11250.0, 1.5, 0.0, 0.0, 5.0, 7.0, 1300.0, 900.0, 47.6845, -122.201, 2320.0, 10814.0, 94.0, 0.0, 0.0] | [682284.6] | 0
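
Each out.variable entry in the results above is a single-element list, so it usually needs unwrapping before any numeric work. The sketch below shows one way to do that in plain Python. It assumes the rows have been converted to a list of dicts (for example with the pandas DataFrame.to_dict("records") call), copies only the first three rows of the output above, and omits the in.tensor column for brevity. The helper name predictions is hypothetical.

```python
from statistics import mean

# First three rows of the output above, in the shape produced by
# DataFrame.to_dict("records"); the in.tensor column is omitted for brevity.
rows = [
    {"time": "2024-12-09 21:10:09.099", "out.variable": [718013.75], "anomaly.count": 0},
    {"time": "2024-12-09 21:10:09.099", "out.variable": [615094.56], "anomaly.count": 0},
    {"time": "2024-12-09 21:10:09.099", "out.variable": [448627.72], "anomaly.count": 0},
]


def predictions(records, column="out.variable"):
    """Unwrap the single-element list in each record's output column."""
    return [record[column][0] for record in records]


values = predictions(rows)
flagged = sum(1 for record in rows if record["anomaly.count"] > 0)

print(f"mean model output: {mean(values):.2f}")
print(f"rows flagged as anomalies: {flagged}")
```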

Create Wallaroo Connection

Connections are created at the Wallaroo instance level, typically by an MLOps or DevOps engineer, then applied to a workspace.

For this section:

  1. We will create a sample connection that just has a URL to the same Arrow table file we used in the previous step.
  2. We’ll apply the data connection to the workspace above.
  3. For a quick demonstration, we’ll use the connection to retrieve the Arrow table file and use it for a quick sample inference.

Create Connection

Connections are created with the Wallaroo client command create_connection with the following parameters.

Parameter | Type | Description
name | string (Required) | The name of the connection. This must be unique; submitting the name of an existing connection returns an error.
type | string (Required) | The user defined type of connection.
details | Dict (Required) | User defined configuration details for the data connection. These can be {'username':'dataperson', 'password':'datapassword', 'port': 3339}, or {'token':'abcde123==', 'host':'example.com', 'port': 1234}, or other user defined combinations.
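
Because details can carry credentials, it may be worth masking them before printing or logging a details dict, much as the connection listings in this tutorial show ***** in place of the details. The following is a minimal sketch, not an SDK feature. The redact helper is hypothetical, and the set of key names treated as sensitive is an assumption to adapt to your own connections.

```python
# Assumption: these key names mark sensitive values; adjust to your own details.
SENSITIVE_KEYS = frozenset({"password", "token"})


def redact(details, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of details with sensitive values replaced by '*****'."""
    return {
        key: "*****" if str(key).lower() in sensitive_keys else value
        for key, value in details.items()
    }


# The two example details dicts from the parameter description above.
examples = [
    {"username": "dataperson", "password": "datapassword", "port": 3339},
    {"token": "abcde123==", "host": "example.com", "port": 1234},
]

for details in examples:
    print(redact(details))
```

The original dicts are left unchanged, so the unmasked values remain available to the code that actually opens the connection.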

We’ll create the connection named houseprice_arrow_table, set it to the type HTTPFILE, and provide the details as 'host':'https://github.com/WallarooLabs/Wallaroo_Tutorials/raw/refs/heads/wallaroo2025.1.2_tutorials/wallaroo-model-operations-tutorials/automation/orchestration-sdk-comprehensive-tutorial/data/xtest-1k.arrow', the location of our sample Arrow table inference input.

wl.create_connection(connection_name, 
                  "HTTPFILE", 
                  {'host':'https://github.com/WallarooLabs/Wallaroo_Tutorials/raw/refs/heads/wallaroo2025.1.2_tutorials/wallaroo-model-operations-tutorials/automation/orchestration-sdk-comprehensive-tutorial/data/xtest-1k.arrow'}
                  )
Field | Value
Name | houseprice_arrow_table
Connection Type | HTTPFILE
Details | *****
Created At | 2024-12-09T21:20:26.268422+00:00
Linked Workspaces | []
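
Since connection names must be unique, re-running the create step fails once the connection exists. A get-or-create pattern is one way around that, sketched below. This is an illustration under stated assumptions: it assumes get_connection raises an exception when the name is unknown (this tutorial does not say which one, so any Exception is treated as "missing"), and it uses a hypothetical in-memory FakeClient with a placeholder URL so the sketch runs without a Wallaroo instance.

```python
def get_or_create_connection(client, name, connection_type, details):
    """Reuse the connection called `name` if it exists, otherwise create it.

    Assumption: client.get_connection raises when the name is unknown.
    """
    try:
        return client.get_connection(name)
    except Exception:
        return client.create_connection(name, connection_type, details)


class FakeClient:
    """In-memory stand-in for the Wallaroo client, for illustration only."""

    def __init__(self):
        self._connections = {}

    def get_connection(self, name):
        return self._connections[name]  # raises KeyError if missing

    def create_connection(self, name, connection_type, details):
        if name in self._connections:
            raise ValueError(f"connection {name!r} already exists")
        record = {"name": name, "type": connection_type, "details": details}
        self._connections[name] = record
        return record


client = FakeClient()
args = ("houseprice_arrow_table", "HTTPFILE", {"host": "https://example.com/xtest-1k.arrow"})

first = get_or_create_connection(client, *args)   # created
second = get_or_create_connection(client, *args)  # reused
print(first is second)
```

With the real SDK, narrowing the except clause to the specific exception raised for a missing connection would be safer than catching Exception.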

List Data Connections

The Wallaroo Client list_connections() method lists all connections for the Wallaroo instance.

wl.list_connections()
name | connection type | details | created at | linked workspaces
external_inference_connection_sample_2 | HTTP | ***** | 2024-12-09T19:48:44.311264+00:00 | ['simpleorchestrationworkspace2']
external_inference_connection_sample | HTTP | ***** | 2024-12-09T20:58:43.896309+00:00 | ['simpleorchestrationworkspace']
houseprice_arrow_table | HTTPFILE | ***** | 2024-12-09T21:20:26.268422+00:00 | []

Add Connection to Workspace

The Workspace method add_connection(connection_name) adds a data connection to a workspace, and takes the following parameters.

Parameter | Type | Description
name | string (Required) | The name of the data connection.

We’ll add this connection to our sample workspace.

workspace.add_connection(connection_name)

Get Connection

Connections are retrieved by the Wallaroo Client get_connection(name) method.

connection = wl.get_connection(connection_name)

Connection Details

The Connection method details() retrieves the connection details as a dict.

display(connection.details())
{'host': 'https://github.com/WallarooLabs/Wallaroo_Tutorials/raw/refs/heads/wallaroo2025.1.2_tutorials/wallaroo-model-operations-tutorials/automation/orchestration-sdk-comprehensive-tutorial/data/xtest-1k.arrow'}

Using a Connection Example

For this example, the connection is used to retrieve the Apache Arrow file it references. The file is read into an Apache Arrow table, which is then submitted for a sample inference.

connection.details()['host']
'https://github.com/WallarooLabs/Wallaroo_Tutorials/raw/refs/heads/wallaroo2025.1.2_tutorials/wallaroo-model-operations-tutorials/automation/orchestration-sdk-comprehensive-tutorial/data/xtest-1k.arrow'
# Deploy the pipeline 
pipeline.deploy()

# Retrieve the file
# set accept as apache arrow table
headers = {
    'Accept': 'application/vnd.apache.arrow.file'
}

response = requests.get(
                    connection.details()['host'], 
                    headers=headers
                )

# Arrow table is retrieved 
with pa.ipc.open_file(response.content) as reader:
    arrow_table = reader.read_all()

results = pipeline.infer(arrow_table)

result_table = results.to_pandas()
display(result_table.head(20))
index | time | in.tensor | out.variable | anomaly.count
0 | 2024-12-09 21:20:35.324 | [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2900.0, 0.0, 47.6063, -122.02, 2970.0, 5251.0, 12.0, 0.0, 0.0] | [718013.75] | 0
1 | 2024-12-09 21:20:35.324 | [2.0, 2.5, 2170.0, 6361.0, 1.0, 0.0, 2.0, 3.0, 8.0, 2170.0, 0.0, 47.7109, -122.017, 2310.0, 7419.0, 6.0, 0.0, 0.0] | [615094.56] | 0
2 | 2024-12-09 21:20:35.324 | [3.0, 2.5, 1300.0, 812.0, 2.0, 0.0, 0.0, 3.0, 8.0, 880.0, 420.0, 47.5893, -122.317, 1300.0, 824.0, 6.0, 0.0, 0.0] | [448627.72] | 0
3 | 2024-12-09 21:20:35.324 | [4.0, 2.5, 2500.0, 8540.0, 2.0, 0.0, 0.0, 3.0, 9.0, 2500.0, 0.0, 47.5759, -121.994, 2560.0, 8475.0, 24.0, 0.0, 0.0] | [758714.2] | 0
4 | 2024-12-09 21:20:35.324 | [3.0, 1.75, 2200.0, 11520.0, 1.0, 0.0, 0.0, 4.0, 7.0, 2200.0, 0.0, 47.7659, -122.341, 1690.0, 8038.0, 62.0, 0.0, 0.0] | [513264.7] | 0
5 | 2024-12-09 21:20:35.324 | [3.0, 2.0, 2140.0, 4923.0, 1.0, 0.0, 0.0, 4.0, 8.0, 1070.0, 1070.0, 47.6902, -122.339, 1470.0, 4923.0, 86.0, 0.0, 0.0] | [668288.0] | 0
6 | 2024-12-09 21:20:35.324 | [4.0, 3.5, 3590.0, 5334.0, 2.0, 0.0, 2.0, 3.0, 9.0, 3140.0, 450.0, 47.6763, -122.267, 2100.0, 6250.0, 9.0, 0.0, 0.0] | [1004846.5] | 0
7 | 2024-12-09 21:20:35.324 | [3.0, 2.0, 1280.0, 960.0, 2.0, 0.0, 0.0, 3.0, 9.0, 1040.0, 240.0, 47.602, -122.311, 1280.0, 1173.0, 0.0, 0.0, 0.0] | [684577.2] | 0
8 | 2024-12-09 21:20:35.324 | [4.0, 2.5, 2820.0, 15000.0, 2.0, 0.0, 0.0, 4.0, 9.0, 2820.0, 0.0, 47.7255, -122.101, 2440.0, 15000.0, 29.0, 0.0, 0.0] | [727898.1] | 0
9 | 2024-12-09 21:20:35.324 | [3.0, 2.25, 1790.0, 11393.0, 1.0, 0.0, 0.0, 3.0, 8.0, 1790.0, 0.0, 47.6297, -122.099, 2290.0, 11894.0, 36.0, 0.0, 0.0] | [559631.1] | 0
10 | 2024-12-09 21:20:35.324 | [3.0, 1.5, 1010.0, 7683.0, 1.5, 0.0, 0.0, 5.0, 7.0, 1010.0, 0.0, 47.72, -122.318, 1550.0, 7271.0, 61.0, 0.0, 0.0] | [340764.53] | 0
11 | 2024-12-09 21:20:35.324 | [3.0, 2.0, 1270.0, 1323.0, 3.0, 0.0, 0.0, 3.0, 8.0, 1270.0, 0.0, 47.6934, -122.342, 1330.0, 1323.0, 8.0, 0.0, 0.0] | [442168.06] | 0
12 | 2024-12-09 21:20:35.324 | [4.0, 1.75, 2070.0, 9120.0, 1.0, 0.0, 0.0, 4.0, 7.0, 1250.0, 820.0, 47.6045, -122.123, 1650.0, 8400.0, 57.0, 0.0, 0.0] | [630865.6] | 0
13 | 2024-12-09 21:20:35.324 | [4.0, 1.0, 1620.0, 4080.0, 1.5, 0.0, 0.0, 3.0, 7.0, 1620.0, 0.0, 47.6696, -122.324, 1760.0, 4080.0, 91.0, 0.0, 0.0] | [559631.1] | 0
14 | 2024-12-09 21:20:35.324 | [4.0, 3.25, 3990.0, 9786.0, 2.0, 0.0, 0.0, 3.0, 9.0, 3990.0, 0.0, 47.6784, -122.026, 3920.0, 8200.0, 10.0, 0.0, 0.0] | [909441.1] | 0
15 | 2024-12-09 21:20:35.324 | [4.0, 2.0, 1780.0, 19843.0, 1.0, 0.0, 0.0, 3.0, 7.0, 1780.0, 0.0, 47.4414, -122.154, 2210.0, 13500.0, 52.0, 0.0, 0.0] | [313096.0] | 0
16 | 2024-12-09 21:20:35.324 | [4.0, 2.5, 2130.0, 6003.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2130.0, 0.0, 47.4518, -122.12, 1940.0, 4529.0, 11.0, 0.0, 0.0] | [404040.8] | 0
17 | 2024-12-09 21:20:35.324 | [3.0, 1.75, 1660.0, 10440.0, 1.0, 0.0, 0.0, 3.0, 7.0, 1040.0, 620.0, 47.4448, -121.77, 1240.0, 10380.0, 36.0, 0.0, 0.0] | [292859.5] | 0
18 | 2024-12-09 21:20:35.324 | [3.0, 2.5, 2110.0, 4118.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2110.0, 0.0, 47.3878, -122.153, 2110.0, 4044.0, 25.0, 0.0, 0.0] | [338357.88] | 0
19 | 2024-12-09 21:20:35.324 | [4.0, 2.25, 2200.0, 11250.0, 1.5, 0.0, 0.0, 5.0, 7.0, 1300.0, 900.0, 47.6845, -122.201, 2320.0, 10814.0, 94.0, 0.0, 0.0] | [682284.6] | 0
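
When a file is fetched over HTTP as in the example above, a failed or redirected request can return an HTML error page instead of Arrow data, and the reader then fails with a less obvious error. A cheap guard is to check the payload before parsing it. The sketch below relies on the Arrow IPC file format beginning and ending with the magic bytes ARROW1. The helper name looks_like_arrow_file is hypothetical, and the sample payloads are synthetic stand-ins rather than real Arrow files.

```python
ARROW_FILE_MAGIC = b"ARROW1"


def looks_like_arrow_file(payload: bytes) -> bool:
    """True if payload begins and ends with the Arrow IPC file magic bytes."""
    return (
        len(payload) >= 2 * len(ARROW_FILE_MAGIC)
        and payload.startswith(ARROW_FILE_MAGIC)
        and payload.endswith(ARROW_FILE_MAGIC)
    )


# Synthetic payloads: one framed like an Arrow file, one like an HTML error page.
framed = ARROW_FILE_MAGIC + b"\x00\x00" + b"\x00" * 16 + ARROW_FILE_MAGIC
html_error = b"<!DOCTYPE html><html><body>Not Found</body></html>"

print(looks_like_arrow_file(framed))
print(looks_like_arrow_file(html_error))
```

In the retrieval code above, this check could run on response.content after calling response.raise_for_status() and before pa.ipc.open_file. It only inspects the framing bytes, so it does not prove the file is valid.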

Remove Connection from Workspace

The Workspace method remove_connection(connection_name) removes the connection from the workspace, but does not delete the connection from the Wallaroo instance. This method takes the following parameters.

Parameter | Type | Description
name | string (Required) | The name of the connection to be removed.

The previous connection will be removed from the workspace, then the workspace connections displayed to verify it has been removed.

workspace.remove_connection(connection_name)

display(workspace.list_connections())

(no connections)

Delete Connection

The Connection method delete_connection() deletes the connection from the Wallaroo instance and detaches it from every workspace it was added to.

connection.delete_connection()

wl.list_connections()
name | connection type | details | created at | linked workspaces
external_inference_connection_sample_2 | HTTP | ***** | 2024-12-09T19:48:44.311264+00:00 | ['simpleorchestrationworkspace2']
external_inference_connection_sample | HTTP | ***** | 2024-12-09T20:58:43.896309+00:00 | ['simpleorchestrationworkspace']
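
The difference between removing a connection from a workspace and deleting it from the instance can be summarized with a toy in-memory model. This is purely illustrative and is not how Wallaroo stores connections. The ConnectionRegistry class and its method names are hypothetical, chosen only to mirror the steps in this tutorial.

```python
class ConnectionRegistry:
    """Toy model of instance-level connections and their workspace links."""

    def __init__(self):
        self._links = {}  # connection name -> set of linked workspace names

    def create(self, name):
        if name in self._links:
            raise ValueError(f"connection {name!r} already exists")
        self._links[name] = set()

    def add_to_workspace(self, name, workspace):
        self._links[name].add(workspace)

    def remove_from_workspace(self, name, workspace):
        # The connection itself stays defined at the instance level.
        self._links[name].discard(workspace)

    def delete(self, name):
        # Deleting drops the connection together with all its workspace links.
        del self._links[name]

    def list_for_workspace(self, workspace):
        return sorted(n for n, linked in self._links.items() if workspace in linked)

    def list_all(self):
        return sorted(self._links)


registry = ConnectionRegistry()
registry.create("houseprice_arrow_table")
registry.add_to_workspace("houseprice_arrow_table", "orchestrationworkspace")

registry.remove_from_workspace("houseprice_arrow_table", "orchestrationworkspace")
after_remove = (registry.list_for_workspace("orchestrationworkspace"), registry.list_all())

registry.delete("houseprice_arrow_table")
after_delete = registry.list_all()

print(after_remove)
print(after_delete)
```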

Cleaning Up

With the tutorial complete, we will undeploy the pipeline and ensure the resources are returned to the Wallaroo instance.

pipeline.undeploy()
name | orchestrationpipeline
created | 2024-12-09 21:09:36.107470+00:00
last_updated | 2024-12-09 21:20:33.470614+00:00
deployed | False
workspace_id | 12
workspace_name | orchestrationworkspace
arch | x86
accel | none
tags | (none)
versions | d507b058-ca47-41a0-a934-eb6fb401308a, 64337279-4d6a-450d-9ddb-cb2d2299c5ce, 7d9d4dc7-aa29-4e44-a349-a5e2a6d975f7, ff669947-df11-4971-acd8-9f71c31e51a0, 0492779b-3b18-4d29-a87c-b4cfaae53c5a, 67aed035-88b0-42a7-9a30-3d85c2617d37, 7f62cc3a-2f7f-41fc-bdc8-f598fe6a1926
steps | orchestrationmodel
published | False
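
If a step between deploying and undeploying raises an error, the undeploy call is never reached and the pipeline keeps its resources. One way to guard against that is a small context manager, sketched below. The deployed helper is hypothetical and not part of the Wallaroo SDK. It relies only on the deploy() and undeploy() methods used in this tutorial, and a stand-in pipeline class is used so the sketch runs on its own.

```python
from contextlib import contextmanager


@contextmanager
def deployed(pipeline):
    """Deploy on entry and undeploy on exit, even if the body raises."""
    pipeline.deploy()
    try:
        yield pipeline
    finally:
        pipeline.undeploy()


class FakePipeline:
    """Stand-in with the same deploy/undeploy method names, for illustration."""

    def __init__(self):
        self.events = []

    def deploy(self):
        self.events.append("deploy")

    def undeploy(self):
        self.events.append("undeploy")


fake = FakePipeline()
try:
    with deployed(fake):
        raise RuntimeError("simulated inference failure")
except RuntimeError:
    pass

print(fake.events)
```

With a real pipeline, the inference calls would go inside the with block in place of the simulated failure.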