IMDB Tutorial

The IMDB Tutorial demonstrates how to use Wallaroo to determine if reviews are positive or negative.

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

IMDB Sample

The following example demonstrates how to use Wallaroo with chained models. In this example, we will use reviews from the IMDB (Internet Movie Database) with a sentiment model to detect whether a given review is positive or negative. Imagine using this to automatically scan Tweets about your product to find customers who either need help or have nice things to say about it.

Note that this example is considered a “toy” model - only the first 100 words in the review were tokenized, and the embedding is very small.
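
To make the 100-token limit concrete, here is a minimal sketch of how a review could be turned into a fixed-length input. The tutorial does not show the actual tokenizer, so the vocabulary, the tokenize_and_pad helper, the out-of-vocabulary index, and the zero padding are all assumptions made for illustration.

```python
import re

MAX_TOKENS = 100  # the tutorial's models expect 100 integers per review

# Hypothetical vocabulary: word -> integer id. Index 0 is reserved for padding
# and 1 for out-of-vocabulary words (both are assumptions, not taken from the
# tutorial's real tokenizer).
VOCAB = {"the": 2, "movie": 3, "was": 4, "great": 5, "terrible": 6}
PAD_ID = 0
OOV_ID = 1


def tokenize_and_pad(text, vocab=VOCAB, max_tokens=MAX_TOKENS):
    """Lowercase, split into words, map to ids, then truncate/pad to max_tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    ids = [vocab.get(word, OOV_ID) for word in words[:max_tokens]]
    # Pad on the right so every review has exactly max_tokens entries.
    ids.extend([PAD_ID] * (max_tokens - len(ids)))
    return [float(i) for i in ids]


tokens = tokenize_and_pad("The movie was great")
print(len(tokens), tokens[:6])
```

Reviews longer than 100 words are cut off and shorter ones are padded, so every input has the same length regardless of the original text.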

The following example is based on the Large Movie Review Dataset, and sample data can be downloaded from the aclIMDB dataset.

Prerequisites

  • An installed Wallaroo instance.
  • The following Python libraries installed:
    • os
    • wallaroo: The Wallaroo SDK. Included with the Wallaroo JupyterHub service by default.
    • pandas: Pandas, mainly used for Pandas DataFrame
    • pyarrow: PyArrow for Apache Arrow support
    • polars: Polars for DataFrame with native Apache Arrow support
import wallaroo
from wallaroo.object import EntityNotFoundError

# to display dataframe tables
from IPython.display import display
# used to display dataframe information without truncating
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
import pyarrow as pa

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to access your specific Wallaroo environment. When the URL is displayed, enter it into a browser and confirm permissions. Store the connection in a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

# Login through local Wallaroo instance

wl = wallaroo.Client()

To test this model, we will perform the following:

  • Create a workspace for our models.
  • Upload two models:
    • embedder: Takes pre-tokenized text documents (model input: 100 integers/datum; output 800 numbers/datum) and creates an embedding from them.
    • sentiment: Classifies the resulting embeddings with a score from 0 to 1, with 0 being an unfavorable review and 1 being a favorable review.
  • Create a pipeline that will take incoming data and pass it to the embedder, which will pass the output to the sentiment model, and then export the final result.
  • To test it, we will use information that has already been tokenized and submit it to our pipeline and gauge the results.
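
The chaining described above can be pictured as plain function composition. The sketch below is a toy stand-in in pure Python, not the ONNX models or Wallaroo's pipeline engine: toy_embedder, toy_sentiment, run_pipeline, and the 8-values-per-token layout are assumptions chosen only to reproduce the 100-in, 800-out shapes mentioned in the list.

```python
import math

EMBED_DIM = 8  # assumption: 100 tokens x 8 values = the 800 numbers per datum


def toy_embedder(tokens):
    """Map 100 token ids to a flat list of 800 numbers (a stand-in, not the real model)."""
    if len(tokens) != 100:
        raise ValueError(f"expected 100 tokens, got {len(tokens)}")
    embedding = []
    for token in tokens:
        # Deterministic pseudo-embedding so the example needs no trained weights.
        embedding.extend(math.sin(token * (k + 1)) for k in range(EMBED_DIM))
    return embedding


def toy_sentiment(embedding):
    """Squash the mean of the embedding into a score between 0 and 1."""
    mean = sum(embedding) / len(embedding)
    return 1.0 / (1.0 + math.exp(-mean))


def run_pipeline(tokens, steps):
    """Feed each step's output into the next step, in order."""
    data = tokens
    for step in steps:
        data = step(data)
    return data


tokens = [float(i % 50) for i in range(100)]
score = run_pipeline(tokens, [toy_embedder, toy_sentiment])
print(len(toy_embedder(tokens)), 0.0 <= score <= 1.0)
```

The order of the steps matters: the sentiment step only makes sense when it receives the embedder's output, which is why the pipeline adds the embedder first.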

Just for the sake of this tutorial, we’ll use the SDK below to create our workspace, assign it as our current workspace, then display the current workspace to confirm. We’ll also set the workspace and pipeline names up front, so there is one spot to change names to whatever fits your organization’s standards best.

When we create our new workspace, we’ll save it in the Python variable workspace so we can refer to it as needed.

First we’ll create a workspace for our environment, and call it imdbworkspace. We’ll also set up our pipeline so it’s ready for our models.

workspace_name = 'imdbworkspace'
pipeline_name = 'imdbpipeline'
workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)

wl.set_current_workspace(workspace)

imdb_pipeline = wl.build_pipeline(pipeline_name)
imdb_pipeline
name            imdbpipeline
created         2024-07-22 21:20:35.217511+00:00
last_updated    2024-07-22 21:20:35.217511+00:00
deployed        (none)
workspace_id    158
workspace_name  imdbworkspace
arch            None
accel           None
tags
versions        8c681982-d813-4a73-9b7a-45e679a2c08e
steps
published       False

Just to make sure, let’s check our current workspace. If everything is going right, it will show us we’re in imdbworkspace.

wl.get_current_workspace()
{'name': 'imdbworkspace', 'id': 158, 'archived': False, 'created_by': 'ff775520-72b5-4f8f-a755-f3cd28b8462f', 'created_at': '2024-07-22T21:20:35.193044+00:00', 'models': [], 'pipelines': [{'name': 'imdbpipeline', 'create_time': datetime.datetime(2024, 7, 22, 21, 20, 35, 217511, tzinfo=tzutc()), 'definition': '[]'}]}

Now we’ll upload our two models:

  • embedder.onnx: This will be used to embed the tokenized documents for evaluation.
  • sentiment_model.onnx: This will be used to analyze the review and determine if it is a positive or negative review. The closer to 0, the more likely it is a negative review, while the closer to 1 the more likely it is to be a positive review.

# upload the embedder; its input field is named "tensor"
embedder = (
    wl.upload_model('embedder-o',
                    './embedder.onnx',
                    framework=wallaroo.framework.Framework.ONNX)
    .configure(tensor_fields=["tensor"])
)
# upload the sentiment model; its input field is named "flatten_1"
smodel = (
    wl.upload_model('smodel-o',
                    './sentiment_model.onnx',
                    framework=wallaroo.framework.Framework.ONNX)
    .configure(runtime="onnx", tensor_fields=["flatten_1"])
)

With our models uploaded, now we’ll create our pipeline that will contain two steps:

  • First, it runs the data through the embedder.
  • Second, it passes the embedder’s output to our sentiment model.
# now make a pipeline
imdb_pipeline.add_model_step(embedder)
imdb_pipeline.add_model_step(smodel)
name            imdbpipeline
created         2024-07-22 21:20:35.217511+00:00
last_updated    2024-07-22 21:20:35.217511+00:00
deployed        (none)
workspace_id    158
workspace_name  imdbworkspace
arch            None
accel           None
tags
versions        8c681982-d813-4a73-9b7a-45e679a2c08e
steps
published       False

Now that we have our pipeline set up with the steps, we can deploy the pipeline.

imdb_pipeline.deploy()

We’ll check the pipeline status to verify it’s deployed and the models are ready.

imdb_pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.3.4',
   'name': 'engine-57f676cbc8-f6bcf',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'imdbpipeline',
      'status': 'Running',
      'version': '9091559c-453b-44cc-a520-d8b96c1d8249'}]},
   'model_statuses': {'models': [{'name': 'smodel-o',
      'sha': '3473ea8700fbf1a1a8bfb112554a0dde8aab36758030dcde94a9357a83fd5650',
      'status': 'Running',
      'version': '35c3682a-5acf-49e2-86be-f50ec58baba9'},
     {'name': 'embedder-o',
      'sha': 'd083fd87fa84451904f71ab8b9adfa88580beb92ca77c046800f79780a20b7e4',
      'status': 'Running',
      'version': 'abe077cb-637b-48bc-8c06-855d2a24e4db'}]}}],
 'engine_lbs': [{'ip': '10.4.2.22',
   'name': 'engine-lb-75cf576f7f-nzcx4',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': []}
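
Because a deployment can take a moment to come up, it may help to poll the status until everything reports Running. The helpers below are a hypothetical sketch that relies only on the dictionary shape shown in the output above; all_running and wait_until_running are illustrative names, not part of the Wallaroo SDK, and the 'Starting' value in the stand-in data is a made-up placeholder for any non-running state.

```python
import time


def all_running(status):
    """Return True if the pipeline, every engine, and every model report 'Running'."""
    if status.get("status") != "Running":
        return False
    for engine in status.get("engines", []):
        if engine.get("status") != "Running":
            return False
        models = engine.get("model_statuses", {}).get("models", [])
        if any(model.get("status") != "Running" for model in models):
            return False
    return True


def wait_until_running(get_status, timeout=60.0, interval=2.0):
    """Call get_status() repeatedly until all_running() is True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if all_running(get_status()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for imdb_pipeline.status: reports a placeholder state once, then 'Running'.
_responses = iter([
    {"status": "Starting", "engines": []},
    {"status": "Running",
     "engines": [{"status": "Running",
                  "model_statuses": {"models": [
                      {"name": "smodel-o", "status": "Running"},
                      {"name": "embedder-o", "status": "Running"}]}}]},
])
print(wait_until_running(lambda: next(_responses), timeout=5.0, interval=0.01))
```

With a live pipeline, the same helper could be called as wait_until_running(imdb_pipeline.status), since status() returns a dictionary like the one shown above.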

To test this out, we’ll start with a single piece of information from our data directory.

singleton = pd.DataFrame.from_records(
    [
    {
        "tensor":[
            1607.0,
            2635.0,
            5749.0,
            199.0,
            49.0,
            351.0,
            16.0,
            2919.0,
            159.0,
            5092.0,
            2457.0,
            8.0,
            11.0,
            1252.0,
            507.0,
            42.0,
            287.0,
            316.0,
            15.0,
            65.0,
            136.0,
            2.0,
            133.0,
            16.0,
            4311.0,
            131.0,
            286.0,
            153.0,
            5.0,
            2826.0,
            175.0,
            54.0,
            548.0,
            48.0,
            1.0,
            17.0,
            9.0,
            183.0,
            1.0,
            111.0,
            15.0,
            1.0,
            17.0,
            284.0,
            982.0,
            18.0,
            28.0,
            211.0,
            1.0,
            1382.0,
            8.0,
            146.0,
            1.0,
            19.0,
            12.0,
            9.0,
            13.0,
            21.0,
            1898.0,
            122.0,
            14.0,
            70.0,
            14.0,
            9.0,
            97.0,
            25.0,
            74.0,
            1.0,
            189.0,
            12.0,
            9.0,
            6.0,
            31.0,
            3.0,
            244.0,
            2497.0,
            3659.0,
            2.0,
            665.0,
            2497.0,
            63.0,
            180.0,
            1.0,
            17.0,
            6.0,
            287.0,
            3.0,
            646.0,
            44.0,
            15.0,
            161.0,
            50.0,
            71.0,
            438.0,
            351.0,
            31.0,
            5749.0,
            2.0,
            0.0,
            0.0
        ]
    }
]
)
results = imdb_pipeline.infer(singleton)
display(results)
   time  in.tensor  out.dense_1  anomaly.count
0  2024-07-22 21:22:07.225  [1607.0, 2635.0, 5749.0, 199.0, 49.0, 351.0, 16.0, 2919.0, 159.0, 5092.0, 2457.0, 8.0, 11.0, 1252.0, 507.0, 42.0, 287.0, 316.0, 15.0, 65.0, 136.0, 2.0, 133.0, 16.0, 4311.0, 131.0, 286.0, 153.0, 5.0, 2826.0, 175.0, 54.0, 548.0, 48.0, 1.0, 17.0, 9.0, 183.0, 1.0, 111.0, 15.0, 1.0, 17.0, 284.0, 982.0, 18.0, 28.0, 211.0, 1.0, 1382.0, 8.0, 146.0, 1.0, 19.0, 12.0, 9.0, 13.0, 21.0, 1898.0, 122.0, 14.0, 70.0, 14.0, 9.0, 97.0, 25.0, 74.0, 1.0, 189.0, 12.0, 9.0, 6.0, 31.0, 3.0, 244.0, 2497.0, 3659.0, 2.0, 665.0, 2497.0, 63.0, 180.0, 1.0, 17.0, 6.0, 287.0, 3.0, 646.0, 44.0, 15.0, 161.0, 50.0, 71.0, 438.0, 351.0, 31.0, 5749.0, 2.0, 0.0, 0.0]  [0.37142318]  0

Since that works, let’s load up all 50,000 rows and do a full inference on each of them via an Apache Arrow file. Wallaroo pipeline inferences use Apache Arrow as their core data type, making this inference fast.

We’ll do a demonstration with a pandas DataFrame and display the first six results (rows 0 through 5, since .loc slicing includes the end label).

results = imdb_pipeline.infer_from_file('./data/test_data_50K.arrow')
# using pandas DataFrame

outputs = results.to_pandas()
display(outputs.loc[:5, ["time","out.dense_1"]])
   time                     out.dense_1
0  2024-07-22 21:22:08.940  [0.8980188]
1  2024-07-22 21:22:08.940  [0.056596935]
2  2024-07-22 21:22:08.940  [0.9260802]
3  2024-07-22 21:22:08.940  [0.926919]
4  2024-07-22 21:22:08.940  [0.6618577]
5  2024-07-22 21:22:08.940  [0.48736304]
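
To turn these scores into labels, one option is a simple threshold. The 0.5 cutoff and the label_review helper are assumptions for illustration; the tutorial only states that values near 0 indicate a negative review and values near 1 a positive one. The scores are copied from the output above.

```python
def label_review(score, threshold=0.5):
    """Label a sentiment score; the 0.5 threshold is an illustrative assumption."""
    return "positive" if score >= threshold else "negative"


# out.dense_1 values from the six rows displayed above (each is a 1-element list).
scores = [[0.8980188], [0.056596935], [0.9260802],
          [0.926919], [0.6618577], [0.48736304]]

labels = [label_review(row[0]) for row in scores]
for row, label in zip(scores, labels):
    print(f"{row[0]:.3f} -> {label}")
```

Scores close to the cutoff, such as the last row, are the least certain, so a real application might route those to a human reviewer rather than trust the label.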

Undeploy

With our pipeline’s work done, we’ll undeploy it and give our Kubernetes environment back its resources.

imdb_pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ..................................... ok
name            imdbpipeline
created         2024-07-22 21:20:35.217511+00:00
last_updated    2024-07-22 21:20:36.657422+00:00
deployed        False
workspace_id    158
workspace_name  imdbworkspace
arch            x86
accel           none
tags
versions        9091559c-453b-44cc-a520-d8b96c1d8249, 8c681982-d813-4a73-9b7a-45e679a2c08e
steps           embedder-o
published       False

And there is our example. Please feel free to contact us at Wallaroo if you have any questions.