Aloha Quick Tutorial

The Aloha Quick Start Guide demonstrates how to use Wallaroo to identify malicious websites from their URLs.

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Aloha Demo

In this notebook we will walk through a simple pipeline deployment and run inferences on a model. For this example we will use an open source Aloha CNN LSTM model that classifies domain names as either legitimate or used for nefarious purposes such as malware distribution.

Prerequisites

  • An installed Wallaroo instance.
  • The following Python libraries installed:
    • os
    • wallaroo: The Wallaroo SDK. Included with the Wallaroo JupyterHub service by default.
    • pandas: Pandas, mainly used for Pandas DataFrame
    • pyarrow: PyArrow for Apache Arrow support

Tutorial Goals

For our example, we will perform the following:

  • Create a workspace for our work.
  • Upload the Aloha model.
  • Create a pipeline that can ingest our submitted data, submit it to the model, and export the results
  • Run a sample inference through our pipeline by loading a file
  • Run a sample inference through our pipeline’s URL and store the results in a file.

All sample data and models are available through the Wallaroo Quick Start Guide Samples repository.

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

import wallaroo
from wallaroo.object import EntityNotFoundError

# to display dataframe tables
from IPython.display import display
# used to display dataframe information without truncating
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
import pyarrow as pa
# Login through local Wallaroo instance

wl = wallaroo.Client()

Create the Workspace

We will create a workspace to work in and call it “alohaworkspace”, then set it as the current workspace. We’ll also create our pipeline in advance as alohapipeline. The model name and the model file will be specified for use in later steps.

workspace_name = 'alohaworkspace'
pipeline_name = 'alohapipeline'
model_name = 'alohamodel'
model_file_name = './alohacnnlstm.zip'
workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)

wl.set_current_workspace(workspace)

aloha_pipeline = wl.build_pipeline(pipeline_name)
aloha_pipeline
name            alohapipeline
created         2024-12-09 21:42:13.603663+00:00
last_updated    2024-12-09 21:42:13.603663+00:00
deployed        (none)
workspace_id    13
workspace_name  alohaworkspace
arch            None
accel           None
tags
versions        bce5c7d8-d70d-407e-af2a-d5ccbcd499a7
steps
published       False

We can verify that the workspace was created and is the current default workspace with the get_current_workspace() command.

wl.get_current_workspace()
{'name': 'alohaworkspace', 'id': 13, 'archived': False, 'created_by': '51aefe14-5431-4c0f-9427-10b5f5d75def', 'created_at': '2024-12-09T21:42:13.449487+00:00', 'models': [], 'pipelines': [{'name': 'alohapipeline', 'create_time': datetime.datetime(2024, 12, 9, 21, 42, 13, 603663, tzinfo=tzutc()), 'definition': '[]'}]}

Upload the Models

Now we will upload our model. Note that for this example we are uploading the model from a .ZIP file. The Aloha model is a protobuf file that has been defined for evaluating web pages, and we will set its framework to TensorFlow when uploading.

from wallaroo.framework import Framework

model = wl.upload_model(model_name, 
                        model_file_name,
                        framework=Framework.TENSORFLOW
                        )

Deploy a model

Now that we have a model that we want to use we will create a deployment for it.

We will tell the deployment we are using a tensorflow model and give the deployment name and the configuration we want for the deployment.

To do this, we’ll use the alohapipeline pipeline created earlier to ingest the data, pass the data to our Aloha model, and give us a final output. We’ll add the model as a pipeline step, then deploy the pipeline so it’s ready to receive data. The deployment process usually takes about 45 seconds.

  • Note: If you receive an error that the pipeline could not be deployed because there are not enough resources, undeploy any other pipelines and deploy this one again. The following command undeploys all pipelines to regain resources. We recommend not running it in a production environment, since it stops every running pipeline:
for p in wl.list_pipelines(): p.undeploy()
aloha_pipeline.add_model_step(model)
name            alohapipeline
created         2024-12-09 21:42:13.603663+00:00
last_updated    2024-12-09 21:42:13.603663+00:00
deployed        (none)
workspace_id    13
workspace_name  alohaworkspace
arch            None
accel           None
tags
versions        bce5c7d8-d70d-407e-af2a-d5ccbcd499a7
steps
published       False
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(0.5).memory("1Gi").build()
aloha_pipeline.deploy(deployment_config=deploy_config, wait_for_status=False)
Deployment initiated for alohapipeline. Please check pipeline status.
name            alohapipeline
created         2024-12-09 21:42:13.603663+00:00
last_updated    2024-12-09 21:42:17.268309+00:00
deployed        True
workspace_id    13
workspace_name  alohaworkspace
arch            x86
accel           none
tags
versions        c7d48fe3-4072-4168-bb0a-6bd6b9154cf6, bce5c7d8-d70d-407e-af2a-d5ccbcd499a7
steps           alohamodel
published       False

We can verify that the pipeline is running and list what models are associated with it.

# check the pipeline status before performing an inference

import time

if aloha_pipeline.status()['status'] != 'Running':
   time.sleep(15)

aloha_pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.28.0.10',
   'name': 'engine-866bdccf8-pmsxk',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'alohapipeline',
      'status': 'Running',
      'version': 'c7d48fe3-4072-4168-bb0a-6bd6b9154cf6'}]},
   'model_statuses': {'models': [{'model_version_id': 9,
      'name': 'alohamodel',
      'sha': 'd71d9ffc61aaac58c2b1ed70a2db13d1416fb9d3f5b891e5e4e2e97180fe22f8',
      'status': 'Running',
      'version': 'fd34d80c-8ff2-4ebb-882b-42b48f5b3696'}]}}],
 'engine_lbs': [{'ip': '10.28.0.9',
   'name': 'engine-lb-6676794678-rmgj7',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': []}
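
The single check-and-sleep above can be generalised into a bounded polling loop. The sketch below is a minimal illustration and not part of the Wallaroo SDK: wait_until_running and model_statuses are hypothetical helper names, and get_status stands for any callable that returns a dictionary shaped like the status output shown above (for example lambda: aloha_pipeline.status()). A stub takes the place of a live pipeline so the block runs on its own, and the 'Starting' value it returns is only a placeholder for any non-running state.

```python
import time
from typing import Callable, Dict, List, Tuple


def wait_until_running(get_status: Callable[[], Dict],
                       timeout_s: float = 120.0,
                       interval_s: float = 5.0) -> Dict:
    """Poll get_status() until it reports 'Running' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status.get("status") == "Running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pipeline not running after {timeout_s} s")
        time.sleep(interval_s)


def model_statuses(status: Dict) -> List[Tuple[str, str]]:
    """Collect (model name, model status) pairs from every engine entry."""
    pairs = []
    for engine in status.get("engines", []):
        for model in engine.get("model_statuses", {}).get("models", []):
            pairs.append((model["name"], model["status"]))
    return pairs


# Stub standing in for aloha_pipeline.status(): a placeholder non-running
# state twice, then a trimmed copy of the 'Running' output shown above.
_responses = iter([
    {"status": "Starting", "engines": []},
    {"status": "Starting", "engines": []},
    {"status": "Running",
     "engines": [{"model_statuses": {"models": [
         {"name": "alohamodel", "status": "Running"}]}}]},
])

final_status = wait_until_running(lambda: next(_responses), interval_s=0.01)
print(model_statuses(final_status))
```

The timeout keeps a notebook cell from hanging indefinitely if a deployment never reaches the Running state; the default values here are arbitrary and would need tuning for a real cluster.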

Inferences

Infer 1 row

Now that the pipeline is deployed and our Aloha model is in place, we’ll perform a smoke test to verify the pipeline is up and running properly. We’ll use the infer method to submit a DataFrame containing a single encoded URL to the inference engine and print the results back out.

The result should tell us whether the tokenized URL is legitimate (0) or malicious (1). This sample data should return a value close to 1 in out.main.

smoke_test = pd.DataFrame.from_records(
    [
        {
            "text_input": [
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                28, 16, 32, 23, 29, 32, 30, 19, 26, 17
            ]
        }
    ]
)

result = aloha_pipeline.infer(smoke_test)
display(result.loc[:, ["time","out.main"]])
     time                     out.main
0    2024-12-09 21:42:44.098  [0.997564]
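
The smoke-test row holds 50 integers: 40 leading zeros followed by 10 non-zero token values. A common way to produce fixed-length input of this kind is to left-pad a shorter token sequence with zeros. The sketch below shows only that padding step. The length of 50 and the left-padding are inferred from the sample above, pad_tokens is a hypothetical helper, and the encoding that maps a URL to integers is not described in this tutorial, so it is not reproduced here.

```python
from typing import List

SEQUENCE_LENGTH = 50  # assumption: inferred from the 50-element sample row


def pad_tokens(tokens: List[int], length: int = SEQUENCE_LENGTH) -> List[int]:
    """Left-pad tokens with zeros to exactly `length` entries."""
    if len(tokens) > length:
        raise ValueError(f"expected at most {length} tokens, got {len(tokens)}")
    return [0] * (length - len(tokens)) + list(tokens)


# The ten non-zero values from the smoke-test row above.
sample_tokens = [28, 16, 32, 23, 29, 32, 30, 19, 26, 17]
padded = pad_tokens(sample_tokens)
print(len(padded), padded[:3], padded[-3:])
```

The padded list can then be wrapped as {"text_input": padded} in a DataFrame record, in the same shape as the smoke test.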

Infer From File

This time, we’ll give it a bigger set of data to infer. ./data/data_1k.arrow is an Apache Arrow table with 1,000 records in it. Once submitted, we’ll convert the result into a DataFrame, display it, and then display rows 0 through 5 on their own.

result = aloha_pipeline.infer_from_file('./data/data_1k.arrow')
display(result.to_pandas().loc[:, ["time","out.main"]])
     time                     out.main
0    2024-12-09 21:42:44.882  [0.997564]
1    2024-12-09 21:42:44.882  [0.9885122]
2    2024-12-09 21:42:44.882  [0.9993358]
3    2024-12-09 21:42:44.882  [0.99999857]
4    2024-12-09 21:42:44.882  [0.9984837]
...  ...                      ...
995  2024-12-09 21:42:44.882  [0.9999754]
996  2024-12-09 21:42:44.882  [0.9999727]
997  2024-12-09 21:42:44.882  [0.66066873]
998  2024-12-09 21:42:44.882  [0.9998954]
999  2024-12-09 21:42:44.882  [0.99999803]

1000 rows × 2 columns

outputs =  result.to_pandas()
display(outputs.loc[:5, ["time","out.main"]])
     time                     out.main
0    2024-12-09 21:42:44.882  [0.997564]
1    2024-12-09 21:42:44.882  [0.9885122]
2    2024-12-09 21:42:44.882  [0.9993358]
3    2024-12-09 21:42:44.882  [0.99999857]
4    2024-12-09 21:42:44.882  [0.9984837]
5    2024-12-09 21:42:44.882  [1.0]
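
Each out.main value is a one-element list holding a score, where values near 1 indicate a malicious domain and values near 0 a legitimate one. To turn scores into labels, a cut-off has to be chosen. The following sketch uses 0.5, which is an assumption for illustration and not a value prescribed by this tutorial; label_scores is a hypothetical helper. The scores are copied from the output above.

```python
from typing import List

THRESHOLD = 0.5  # assumption: illustrative cut-off, not taken from the tutorial


def label_scores(out_main: List[List[float]],
                 threshold: float = THRESHOLD) -> List[str]:
    """Map each one-element score list to 'malicious' or 'legitimate'."""
    return ["malicious" if row[0] >= threshold else "legitimate"
            for row in out_main]


# Scores copied from rows 0 through 5 of the output above.
scores = [[0.997564], [0.9885122], [0.9993358],
          [0.99999857], [0.9984837], [1.0]]
labels = label_scores(scores)
print(labels.count("malicious"), "of", len(labels), "flagged as malicious")
```

With the DataFrame from the previous cell, the same helper could be applied as label_scores(outputs["out.main"].tolist()).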

Batch Inference

Now that our smoke test is successful, let’s really give it some data. We have two inference files we can use:

  • data_1k.arrow: Contains 1,000 inferences
  • data_25k.arrow: Contains 25,000 inferences

When Apache Arrow tables are submitted to a Wallaroo Pipeline, the inference is processed natively as an Arrow table, and the results are returned as an Arrow table. This allows for faster data processing than with JSON files or DataFrame objects.

We’ll pipe the data_25k.arrow file through the aloha_pipeline deployment URL and place the results in a file named curl_response.df. The curl progress output also shows how long the request takes. Note that batches of 50,000 inferences or more can be difficult to view in JupyterHub because of their size, so we’ll only display the first five rows.

  • IMPORTANT NOTE: The _deployment._url() method returns an internal URL when Python commands are run from within the Wallaroo instance, for example from the Wallaroo JupyterHub service. When connecting via an external connection, _deployment._url() returns an external URL. External URL connections require that authentication be included in the HTTP request, and that external endpoints are enabled in the Wallaroo configuration options, as described in the Model Endpoints Guide.
inference_url = aloha_pipeline._deployment._url()
inference_url
'https://doc-test.wallarooexample.ai/v1/api/pipelines/infer/alohapipeline-5/alohapipeline'
token = wl.auth.auth_header()['Authorization']
token
'Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6IC...'
dataFile="./data/data_25k.arrow"
contentType="application/vnd.apache.arrow.file"
!curl -X POST {inference_url} -H \
    "Authorization: {token}" -H \
    "Content-Type:{contentType}" \
    --data-binary @{dataFile} > curl_response.df
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4875k  100   171  100 4874k     41  1172k  0:00:04  0:00:04 --:--:-- 1173k
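
The same request can be prepared from Python instead of curl. The sketch below uses only the standard library's urllib.request and builds the request without sending it, so it runs without a Wallaroo instance. build_inference_request is a hypothetical helper, and the URL, token, and payload are placeholders; in the notebook they would come from inference_url, token, and the bytes of the data file.

```python
import urllib.request


def build_inference_request(url: str, token: str, payload: bytes,
                            content_type: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call above."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": token,  # value already includes the 'Bearer ' prefix
            "Content-Type": content_type,
        },
    )


# Placeholder values for illustration only.
request = build_inference_request(
    url="https://wallaroo.example.invalid/infer",
    token="Bearer <token>",
    payload=b"arrow-file-bytes",
    content_type="application/vnd.apache.arrow.file",
)
print(request.get_method(), request.full_url)

# To send it against a live deployment:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```

Building the request as an object makes it possible to inspect the headers before anything is sent, which helps when an external endpoint rejects a call.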
cc_data_from_file =  pd.read_json('./curl_response.df', orient="records")
cc_data_from_file

#display(cc_data_from_file.head(5).loc[:5, ["time","out"]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/rs/yt_dh9xn6y39_h0_jth1mjb40000gq/T/ipykernel_67993/1471128229.py in <module>
----> 1 cc_data_from_file = pd.read_json('./curl_response.df', orient="records")
      2 cc_data_from_file
      3
      4 #display(cc_data_from_file.head(5).loc[:5, ["time","out"]])

...

ValueError: If using all scalar values, you must pass an index

In this recorded run the curl transfer received only 171 bytes, and the ValueError shows that curl_response.df held a single JSON object with scalar values rather than a list of inference records, so no DataFrame could be built from it.
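
One way to make a failure like this easier to diagnose is to check the shape of the decoded response before handing it to pandas. This is a sketch under assumptions: load_inference_records is a hypothetical helper, and both example bodies are invented for illustration, since the actual response content is not shown in this tutorial.

```python
import json
from typing import Any, Dict, List


def load_inference_records(text: str) -> List[Dict[str, Any]]:
    """Return the decoded records, or raise if the body is not a list of records."""
    body = json.loads(text)
    if isinstance(body, list) and all(isinstance(row, dict) for row in body):
        return body
    # Anything else (for example a single JSON object) is treated as a failure.
    raise ValueError(f"expected a list of records, got: {body!r}")


# A well-formed body: a list of records (values are illustrative).
ok_body = '[{"out.main": [0.997564]}, {"out.main": [0.9885122]}]'
print(len(load_inference_records(ok_body)))

# A hypothetical error body: one object with scalar values only.
bad_body = '{"status": "error", "message": "placeholder"}'
try:
    load_inference_records(bad_body)
except ValueError as exc:
    print("rejected:", exc)
```

Records that pass the check can be handed to pd.DataFrame.from_records(records).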

Undeploy Pipeline

When finished with our tests, we will undeploy the pipeline so we have the Kubernetes resources back for other tasks. Note that if the deployment variable is unchanged, aloha_pipeline.deploy() will restart the inference engine in the same configuration as before.

aloha_pipeline.undeploy()
name            alohapipeline
created         2024-12-09 21:42:13.603663+00:00
last_updated    2024-12-09 21:42:17.268309+00:00
deployed        False
workspace_id    13
workspace_name  alohaworkspace
arch            x86
accel           none
tags
versions        c7d48fe3-4072-4168-bb0a-6bd6b9154cf6, bce5c7d8-d70d-407e-af2a-d5ccbcd499a7
steps           alohamodel
published       False