IMDB Tutorial

The IMDB Tutorial demonstrates how to use Wallaroo to determine if reviews are positive or negative.

The following example demonstrates how to use Wallaroo with chained models. In this example, we will use reviews from the IMDB (Internet Movie Database) with a sentiment model to detect whether a given review is positive or negative. Imagine using the same approach to automatically scan Tweets about your product and find customers who either need help or have nice things to say.

The following example is based on the Large Movie Review Dataset, and sample data can be downloaded from the aclIMDB dataset.

Connect to Wallaroo

If you’ve installed Wallaroo into your Kubernetes cluster and started the Jupyter Hub service, the Wallaroo Python library is already available. We’ll create a connection to Wallaroo. Note that in this example, our authentication is auth_type="user_password", while the call below passes no arguments and relies on the client defaults. Modify this connection based on your Wallaroo configuration.

import wallaroo
wl = wallaroo.Client()

To test this model, we will perform the following:

  • Create a workspace for our models.
  • Upload two models:
    • embedder: Takes pre-tokenized text documents (model input: 100 integers/datum; output 800 numbers/datum) and creates an embedding from them.
    • sentiment: Classifies the resulting embeddings on a scale from 0 to 1, with 0 being an unfavorable review and 1 being a favorable review.
  • Create a pipeline that will take incoming data and pass it to the embedder, which will pass the output to the sentiment model, and then export the final result.
  • Test the pipeline by submitting data that has already been tokenized and gauging the results.
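
The embedder expects exactly 100 integers per review. As a minimal sketch of what that shape requirement means, the hypothetical helper below pads or truncates a list of token IDs to length 100. The function name, the pad value of 0, padding on the left, and keeping the last tokens when truncating are all assumptions for illustration; the tokenizer and padding scheme used to prepare the sample data are not described in this tutorial.

```python
SEQUENCE_LENGTH = 100  # embedder input size stated in the list above


def fit_to_length(token_ids, length=SEQUENCE_LENGTH, pad_id=0):
    """Return exactly `length` token IDs (hypothetical helper, not a Wallaroo API)."""
    tokens = list(token_ids)
    # Assumption: when a review is too long, keep the last `length` tokens.
    trimmed = tokens[max(len(tokens) - length, 0):]
    # Assumption: pad short reviews on the left with `pad_id`.
    padding = [pad_id] * (length - len(trimmed))
    return padding + trimmed


short_review = fit_to_length([11, 42, 7])
long_review = fit_to_length(range(1, 251))
print(len(short_review), short_review[-3:])
print(len(long_review), long_review[0], long_review[-1])
```

Whatever scheme is used in practice, it has to match the one the embedder was trained with, so treat the defaults here as placeholders.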

First we’ll create a workspace for our environment, and call it imdb-workspace:

new_workspace = wl.create_workspace("imdb-workspace")
_ = wl.set_current_workspace(new_workspace)

Just to make sure, let’s list our current workspace. If everything is going right, it will show us we’re in the imdb-workspace.

wl.get_current_workspace()
Result

{
'name': 'imdb-workspace',
'id': 4,
'archived': False,
'created_by': '45e6b641-fe57-4fb2-83d2-2c2bd201efe8',
'created_at': '2022-03-29T20:23:08.742676+00:00',
'models': [],
'pipelines': []
}

Now we’ll upload our two models:

  • embedder.onnx: This will be used to embed the tokenized documents for evaluation.
  • sentiment_model.onnx: This will be used to analyze the review and determine if it is a positive or negative review. The closer to 0, the more likely it is a negative review, while the closer to 1 the more likely it is to be a positive review.
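
# Upload the embedder with its default configuration.
embedder = wl.upload_model('embedder-o', './embedder.onnx').configure()
# Upload the sentiment model; the parentheses let the method chain span lines.
smodel = (
    wl.upload_model('smodel-o', './sentiment_model.onnx')
    .configure(runtime="onnx", tensor_fields=["flatten_1"])
)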

With our models uploaded, now we’ll create our pipeline that will contain two steps:

  • First, it runs the data through the embedder.
  • Second, it applies it to our sentiment model.

# Now make a pipeline: the embedder runs first, then the sentiment model.
imdb_pipeline = (
    wl.build_pipeline("imdb-pipeline")
    .add_model_step(embedder)
    .add_model_step(smodel)
)
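
To make the chaining concrete, here is a toy sketch in plain Python that mimics the data flow only. It does not use the Wallaroo API, and the two stand-in functions are invented for illustration; they are not the real ONNX models. Only the shapes (100 integers in, 800 numbers between the steps, one score out) come from the description earlier in this tutorial.

```python
import math


def toy_embedder(token_ids):
    """Stand-in for the embedder: 100 integers in, 800 numbers out."""
    assert len(token_ids) == 100
    # Emit 8 made-up values per token to reach the 800-number shape.
    return [(t % 7) / 7.0 for t in token_ids for _ in range(8)]


def toy_sentiment(embedding):
    """Stand-in for the sentiment model: 800 numbers in, one score in (0, 1)."""
    assert len(embedding) == 800
    mean = sum(embedding) / len(embedding)
    return 1.0 / (1.0 + math.exp(-(mean - 0.5)))  # logistic squashing


def run_steps(data, steps):
    """Feed each step's output into the next step, in order."""
    for step in steps:
        data = step(data)
    return data


score = run_steps(list(range(100)), [toy_embedder, toy_sentiment])
print(0.0 < score < 1.0)
```

The order of the steps matters: swapping them would hand the sentiment stand-in 100 values instead of the 800 it expects.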

Now that we have our pipeline set up with the steps, we can deploy the pipeline.

imdb_pipeline.deploy()
Result

Waiting for deployment - this will take up to 45s ...... ok

{
    'name': 'imdb-pipeline',
    'create_time': datetime.datetime(2022,3,29,20,23,28,518946,tzinfo=tzutc()),
    'definition': 
        "[
            {
                'ModelInference': 
                    {
                        'models': 
                            [
                                {
                                    'name': 'embedder-o',
                                    'version': '23a33c3d-68e6-4bdb-a8bc-32ea846908ee',
                                    'sha': 'd083fd87fa84451904f71ab8b9adfa88580beb92ca77c046800f79780a20b7e4'
                                }
                            ]
                    }
            },
            {
                'ModelInference': 
                    {
                        'models': 
                        [
                            {
                                'name': 'smodel-o',
                                'version': '2c298aa9-be9d-482d-8188-e3564bdbab43',
                                'sha': '3473ea8700fbf1a1a8bfb112554a0dde8aab36758030dcde94a9357a83fd5650'
                            }
                        ]
                    }
            }
        ]"
}

We’ll check the pipeline status to verify it’s deployed and the models are ready.

imdb_pipeline.status()
Result

{'status': 'Running',
    'details': None,
    'engines': [{'ip': '10.12.1.35',
    'name': 'engine-7b95b5695d-qjjtl',
    'status': 'Running',
    'reason': None,
    'pipeline_statuses': {'pipelines': [{'id': 'imdb-pipeline',
        'status': 'Running'}]},
    'model_statuses': {'models': [{'name': 'embedder-o',
        'version': '23a33c3d-68e6-4bdb-a8bc-32ea846908ee',
        'sha': 'd083fd87fa84451904f71ab8b9adfa88580beb92ca77c046800f79780a20b7e4',
        'status': 'Running'},
        {'name': 'smodel-o',
        'version': '2c298aa9-be9d-482d-8188-e3564bdbab43',
        'sha': '3473ea8700fbf1a1a8bfb112554a0dde8aab36758030dcde94a9357a83fd5650',
        'status': 'Running'}]}}],
    'engine_lbs': [{'ip': '10.12.1.34',
    'name': 'engine-lb-85846c64f8-z6vq9',
    'status': 'Running',
    'reason': None}]}
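
Reading the nested status by eye gets tedious, so a small check can help. The sketch below assumes that status() returns a dictionary shaped like the output above; the helper name deployment_ready is hypothetical, and the sample dictionary is a trimmed copy of that output rather than a live response.

```python
def deployment_ready(status):
    """Return True when the pipeline, its engines, and all models report 'Running'."""
    if status.get("status") != "Running":
        return False
    for engine in status.get("engines", []):
        if engine.get("status") != "Running":
            return False
        pipelines = (engine.get("pipeline_statuses") or {}).get("pipelines", [])
        models = (engine.get("model_statuses") or {}).get("models", [])
        if any(item.get("status") != "Running" for item in pipelines + models):
            return False
    return True


# Trimmed copy of the status output shown above (version and sha fields omitted).
sample_status = {
    "status": "Running",
    "details": None,
    "engines": [{
        "name": "engine-7b95b5695d-qjjtl",
        "status": "Running",
        "pipeline_statuses": {"pipelines": [{"id": "imdb-pipeline", "status": "Running"}]},
        "model_statuses": {"models": [
            {"name": "embedder-o", "status": "Running"},
            {"name": "smodel-o", "status": "Running"},
        ]},
    }],
}

print(deployment_ready(sample_status))
```

Under the same assumption, the check would be called as deployment_ready(imdb_pipeline.status()).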

To test this out, we’ll start with a single tokenized review from our data directory.

results = imdb_pipeline.infer_from_file('data/singleton.json')

results[0].data()
Result

[array([[0.37142318]])]

Since that works, let’s load all 50 rows and run inference on each of them. Note that Jupyter Hub has a size limitation, so production systems should send the outputs to a different destination.

# for the victory lap, infer on all 50 rows
results = imdb_pipeline.infer_from_file('./data/test_data.json')
results[0].data()
Result

[array([[3.71423185e-01],
        [9.65576112e-01],
        [7.60161877e-02],
        [2.46452361e-01],
        [8.63283277e-02],
        [6.39613509e-01],
        [2.47336328e-02],
        [5.02990067e-01],
        [9.34223831e-01],
        [7.17751265e-01],
        [2.04768777e-03],
        [3.55861127e-01],
        [2.48722464e-01],
        [2.73299277e-01],
        [9.60162282e-03],
        [4.95020479e-01],
        [8.30442309e-02],
        [5.34835458e-02],
        [2.74230242e-02],
        [1.26478374e-02],
        [2.39091218e-02],
        [8.63728166e-01],
        [1.57089770e-01],
        [3.46490622e-01],
        [3.56459022e-01],
        [7.97988474e-02],
        [6.78595304e-02],
        [3.17764282e-03],
        [4.39540178e-01],
        [3.33117247e-02],
        [1.46508217e-04],
        [7.39861846e-01],
        [1.51472032e-01],
        [2.41219997e-04],
        [2.69098580e-02],
        [9.06612277e-01],
        [8.55922699e-04],
        [4.60651517e-03],
        [4.51257825e-02],
        [6.71328604e-02],
        [3.86106908e-01],
        [2.73625672e-01],
        [3.87400389e-01],
        [1.92073256e-01],
        [1.40319228e-01],
        [1.50666535e-02],
        [1.26731277e-01],
        [7.53879547e-03],
        [9.44640994e-01],
        [7.55301118e-03]])]
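
The raw scores are easier to act on once they are turned into labels. The sketch below is one way to do that, assuming a cutoff of 0.5; the tutorial only says that scores closer to 0 are negative and closer to 1 are positive, so the threshold and the helper name label_scores are assumptions. The sample values are the first five scores from the output above.

```python
def label_scores(scores, threshold=0.5):
    """Label each [[score], ...] row as positive or negative (threshold is an assumption)."""
    flat = [row[0] for row in scores]
    return ["positive" if s >= threshold else "negative" for s in flat]


# First five scores copied from the inference output above.
sample = [
    [3.71423185e-01],
    [9.65576112e-01],
    [7.60161877e-02],
    [2.46452361e-01],
    [8.63283277e-02],
]

labels = label_scores(sample)
print(labels)
print(labels.count("positive"), "positive out of", len(labels))
```

If data() returns a list of NumPy arrays, as the output above suggests, then results[0].data()[0].tolist() should produce the nested list this helper expects.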

Undeploy

With our pipeline’s work done, we’ll undeploy it to free up computational resources in the environment.

imdb_pipeline.undeploy()
Result

{
    'name': 'imdb-pipeline',
    'create_time': datetime.datetime(2022,3,29,20,23,28,518946,tzinfo=tzutc()),
    'definition':
        "[
            {
                'ModelInference':
                    {
                        'models':
                            [
                                {
                                    'name': 'embedder-o',
                                    'version': '23a33c3d-68e6-4bdb-a8bc-32ea846908ee',
                                    'sha': 'd083fd87fa84451904f71ab8b9adfa88580beb92ca77c046800f79780a20b7e4'
                                }
                            ]
                    }
            },
            {
                'ModelInference':
                    {
                        'models':
                            [
                                {
                                    'name': 'smodel-o',
                                    'version': '2c298aa9-be9d-482d-8188-e3564bdbab43',
                                    'sha': '3473ea8700fbf1a1a8bfb112554a0dde8aab36758030dcde94a9357a83fd5650'
                                }
                            ]
                    }
            }
        ]"
}

And there is our example. Please feel free to contact us at Wallaroo if you have any questions.