LLM Monitoring Listener Example

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

LLM Listener Monitoring with Llama V3 Instruct

The following example demonstrates using LLM Monitoring Listeners to monitor a deployed Llama V3 Instruct LLM and score it based on a set of criteria.

This example uses the Llama V3 Instruct LLM. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs, contact your Wallaroo support representative.

LLM Monitoring Listeners leverage Wallaroo Inference Automation. LLM Monitoring Listeners are offline processes that score the LLM’s inference outputs against standard metrics including:

  • Toxicity
  • Sentiment
  • Profanity
  • Hate
  • Etc

Users can also create custom LLM Monitoring Listeners to score the LLM against custom metrics. LLM Monitoring Listeners are composed of models trained to evaluate LLM outputs, so they can be updated or refined according to the organization’s needs.

Tutorial Overview

This tutorial demonstrates the following:

  • Upload an LLM Monitoring Listener developed to score LLM outputs against a set of standard criteria.
  • Using Wallaroo Inference Automation, orchestrate the LLM Monitoring Listener to evaluate the Llama V3 Instruct LLM and display the scores.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework
from wallaroo.deployment_config import DeploymentConfigBuilder

from IPython.display import display

# used to display DataFrame information without truncating
import pandas as pd
pd.set_option('display.max_colwidth', None)

import pyarrow as pa
import json
import datetime
import time
import zipfile

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client(request_timeout=120)

Set Workspace and Variables

The following creates or connects to an existing workspace, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.

We will set the variables used for our deployed LLM model, and the models used for our LLM Listener.

workspace_name = "llm-models"  
model_name = "toxic-bert"
post_model_name = "postprocess"

wl.set_current_workspace(wl.get_workspace(workspace_name))
{'name': 'llm-models', 'id': 322863, 'archived': False, 'created_by': 'adf08921-fc3a-4018-b55f-775cd0796538', 'created_at': '2024-03-25T20:33:10.564383+00:00', 'models': [{'name': 'llama-instruct', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 3, 25, 20, 53, 31, 707885, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 3, 25, 20, 53, 31, 707885, tzinfo=tzutc())}, {'name': 'llama-v2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 4, 18, 20, 58, 52, 684374, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 4, 18, 20, 58, 52, 684374, tzinfo=tzutc())}, {'name': 'llama3-instruct', 'versions': 2, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 5, 1, 19, 19, 18, 437490, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 5, 1, 18, 13, 47, 784249, tzinfo=tzutc())}, {'name': 'toxic-bert', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 5, 2, 23, 22, 2, 675607, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 5, 2, 23, 22, 2, 675607, tzinfo=tzutc())}], 'pipelines': [{'name': 'llama-instruct-pipeline', 'create_time': datetime.datetime(2024, 4, 11, 17, 5, 46, 75486, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama2-pipeline', 'create_time': datetime.datetime(2024, 4, 18, 21, 17, 44, 893427, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llamav3-instruct', 'create_time': datetime.datetime(2024, 5, 1, 19, 51, 8, 240637, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama-shadow', 'create_time': datetime.datetime(2024, 5, 2, 21, 50, 54, 293036, tzinfo=tzutc()), 'definition': '[]'}]}
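
Depending on the SDK version, get_workspace may or may not create a missing workspace. A small find-or-create helper makes the behavior explicit; the following is a minimal sketch that assumes the SDK's list_workspaces, create_workspace, and set_current_workspace methods:

# hypothetical helper: reuse the workspace if it exists, otherwise create it
def get_or_create_workspace(client, name):
    for ws in client.list_workspaces():
        if ws.name() == name:
            return ws
    return client.create_workspace(name)

workspace = get_or_create_workspace(wl, workspace_name)
wl.set_current_workspace(workspace)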

Upload LLM Listener Models and Create a Monitoring Pipeline

This monitoring pipeline consists of a Hugging Face text classification model (toxic_bert) and a Python post-processing step.

The following models are used:

  • toxic_bert: A Hugging Face Text Classification model that evaluates LLM outputs and returns an array of scores.
  • postprocess: A Python model that takes the toxic_bert outputs and converts them into the following field outputs, scored from 0 to 1, with 1 being the worst (a minimal sketch of this conversion is shown after this list):
    • identity_hate
    • insult
    • obscene
    • severe_toxic
    • threat
    • toxic
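
To illustrate the conversion the postprocess model performs, the following is a minimal sketch (not the packaged postprocess.py itself) that pairs each toxic_bert label with its score and emits one named float column per label:

import pandas as pd

# minimal sketch of the postprocess logic: pair each label with its score and
# emit one float column per label (identity_hate, insult, obscene,
# severe_toxic, threat, toxic)
def postprocess_scores(bert_output: pd.DataFrame) -> pd.DataFrame:
    rows = [dict(zip(labels, scores))
            for labels, scores in zip(bert_output['label'], bert_output['score'])]
    return pd.DataFrame(rows)
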
# upload the toxic_bert text classification model

input_schema = pa.schema([
     pa.field('inputs', pa.string()), # required
     pa.field('top_k', pa.int64()),  
])

output_schema = pa.schema([
     pa.field('label', pa.list_(pa.string(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
     pa.field('score', pa.list_(pa.float64(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
])

framework=Framework.HUGGING_FACE_TEXT_CLASSIFICATION
model_file_name = './models/unitary-toxic-bert.zip'

bert_model = wl.upload_model(model_name,
                         model_file_name,
                         framework=framework,
                         input_schema=input_schema,
                         output_schema=output_schema,
                         convert_wait=True)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime....................................................successful

Ready
# upload the postprocessor 

input_schema = pa.schema([
        pa.field('label', pa.list_(pa.string(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
        pa.field('score', pa.list_(pa.float64(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
    ])

# Define the schema for the 'output' DataFrame
output_schema = pa.schema([
    pa.field('identity_hate', pa.float64()),
    pa.field('insult', pa.float64()),        
    pa.field('obscene', pa.float64()),      
    pa.field('severe_toxic', pa.float64()),  
    pa.field('threat', pa.float64()),        
    pa.field('toxic', pa.float64())           
])

# upload the post process model
post_model = wl.upload_model("postprocess", 
                             "./models/postprocess.zip", 
                             framework=wallaroo.framework.Framework.PYTHON,
                             input_schema=input_schema, 
                             output_schema=output_schema 
                            )
display(bert_model)
Name          toxic-bert
Version       e511643c-30a4-48b9-a45e-f458d991a916
File Name     unitary-toxic-bert.zip
SHA           30b5c2d0c1a2102ad63ef7d84e953b018b45a0c021ea14916708ea1c8142ff38
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.2-4668
Architecture  None
Updated At    2024-02-May 23:26:34
display(post_model)
Name          postprocess
Version       3a02feda-7336-4ef4-820a-f13265e3f251
File Name     postprocess.py
SHA           0d230ee260e4a86b2cc62c66445c7173e23a6f1bf696d239b45b4f0e2086ca85
Status        ready
Image Path    None
Architecture  None
Updated At    2024-02-May 23:30:10

Deploy the Listener Models

We deploy the listener models. We create a deployment configuration and set the toxic_bert Hugging Face model to 4 CPUs and 8 Gi RAM.

We create the pipeline with the build_pipeline method, and add the models as the pipeline steps.

# deployment configuration for the LLM Listener models
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus(0.25).memory('1Gi') \
    .sidekick_cpus(bert_model, 4) \
    .sidekick_memory(bert_model, "8Gi") \
    .build()

pipeline_name = 'full-toxmonitor-pipeline'
pipeline=wl.build_pipeline(pipeline_name)
pipeline.add_model_step(bert_model)
pipeline.add_model_step(post_model)
name          full-toxmonitor-pipeline
created       2024-05-02 23:30:28.397835+00:00
last_updated  2024-05-02 23:30:28.397835+00:00
deployed      (none)
arch          None
tags          
versions      d75bc057-6392-42a9-8bdb-5d3661b731c4
steps         
published     False

With the pipeline set, we deploy the pipeline with the deployment configuration. This allocates resources from the cluster for the LLM Listener models’ use.

Once the models are deployed, we check the status and verify it’s running.

pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.60.4.215',
   'name': 'engine-86569ff7c-6rcdp',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'full-toxmonitor-pipeline',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'toxic-bert',
      'version': 'e511643c-30a4-48b9-a45e-f458d991a916',
      'sha': '30b5c2d0c1a2102ad63ef7d84e953b018b45a0c021ea14916708ea1c8142ff38',
      'status': 'Running'},
     {'name': 'postprocess',
      'version': '3a02feda-7336-4ef4-820a-f13265e3f251',
      'sha': '0d230ee260e4a86b2cc62c66445c7173e23a6f1bf696d239b45b4f0e2086ca85',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.60.2.44',
   'name': 'engine-lb-5df9b487cf-mjmfl',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.60.2.45',
   'name': 'engine-sidekick-toxic-bert-9-7f69f7f58f-98z55',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
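
As an optional check, a single test inference can be run against the deployed listener pipeline using the input schema defined above; the sample text here is illustrative only:

# optional smoke test against the deployed listener pipeline
test_df = pd.DataFrame({
    "inputs": ["This is a perfectly friendly sentence."],
    "top_k": [6],
})
display(pipeline.infer(test_df))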

Orchestrate LLM Listener

The LLM Listener leverages Wallaroo Inference Automation for its execution. It is uploaded from the file llm-monitor.zip as a Workload Orchestration, which includes a Python script detailing how to deploy the LLM Listener models and evaluate the outputs of the LLM models.

The orchestration performs the following when executed:

  • Accept the following arguments to determine which LLM to evaluate:
    • llm_workspace: The name of the workspace the LLM is deployed from.
    • llm_pipeline: The pipeline the LLM is deployed from.
    • llm_output_field: The LLM’s text output field.
    • monitor_workspace: The workspace the LLM Listener models are deployed from.
    • monitor_pipeline: The pipeline the LLM listener models are deployed from.
    • window_length: The amount of time, in hours, to evaluate backwards from when the task is executed. For example, 1 would evaluate the past hour. Use -1 for no limit, which gathers the standard inference results window.
    • n_toxlabels: The number of toxic labels. For our toxic_bert LLM Listener, the number of fields is 6.
  • Deploy LLM Listener models.
  • Gather the llama3-instruct LLM’s Inference Results and process the out.generated_text field through the LLM Listener models.
    • These are either the default inference result outputs or those from a specified date range of inference results.
  • Score the LLM’s outputs and provide the scores listed above. These can be retrieved at any time as the LLM Listener’s own Inference Results.

See Inference Automation for more details.

The LLM Monitoring Listener contains a Python script that deploys the LLM Monitoring Listener models, then performs an inference against the LLM’s outputs to determine the scores. These scores are saved as standard Wallaroo Inference Results.

Once complete, it undeploys the LLM Monitoring Listener to save resources. The following shows using the LLM’s logs and isolating the specific output field, then performing the inference from the LLM’s inference data.

# llm_logs, llm_output_field, n_toxlabels, and toxmonitor are set earlier in
# the orchestration script (see the sketch below)
import numpy as np

# create the input for the toxicity model
input_data = {
    "inputs": llm_logs[llm_output_field],
}
dataframe = pd.DataFrame(input_data)
dataframe['top_k'] = n_toxlabels

toxresults = toxmonitor.infer(dataframe)
print(toxresults)

# this is mostly for demo purposes
print("Avg Batch Toxicity:", np.mean(toxresults['out.toxic'].apply(lambda x: x[0])))
print("Over Threshold:", sum(toxresults['out.toxic'].apply(lambda x: x[0]) > 0.001))

The LLM Monitoring Listener is orchestrated from the file llm-monitor.zip.

llm_listener = wl.upload_orchestration(name="llm-toxicity-listener", path='./llm-monitor.zip')
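
Packaging the orchestration can take a few minutes. A simple polling loop, assuming the orchestration object’s status() method reports 'ready' once packaging completes, can wait for it before creating any tasks:

# wait for the orchestration to finish packaging before creating tasks
# (assumes llm_listener.status() returns 'ready' when packaging completes)
while llm_listener.status() != 'ready':
    print(llm_listener.status())
    time.sleep(30)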

Execute LLM Listener

As a Workload Orchestration, the LLM Listener is executed either as a Run Once - which executes once, reports its results, then stops, or Run Scheduled, which is executed on set schedule (every 5 minutes, every hour, etc).

The following shows running the LLM Listener as a scheduled task that evaluates the llama3-instruct LLM. The LLM Listener arguments can be modified to evaluate any other deployed LLMs with their own text output fields.

This assumes that the LLM Listener was already uploaded and is ready to accept new tasks, and we have saved it to the variable llm_listener.

Here we create the Run Scheduled task to execute every hour, providing it the deployed LLM’s workspace and pipeline and the LLM Listener models’ workspace and pipeline. We give the task the name monitor-initial-test.

# these are the default args
args = {
    'llm_workspace' : 'llm-models' ,
    'llm_pipeline': 'llamav3-instruct',
    'llm_output_field': 'out.generated_text',
    'monitor_workspace': 'llm-models',
    'monitor_pipeline' : 'full-toxmonitor-pipeline',
    'window_length': -1,  # in hours. If -1, no limit (for testing)
    'n_toxlabels': 6,
}

# cron format: run at the top of every hour
schedule = '00 * * * *'

task = llm_listener.run_scheduled(name="monitor-initial-test", 
                                  schedule=schedule, 
                                  json_args=args, 
                                  timeout=1000)
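
The same orchestration can instead be executed as a Run Once task, which performs a single evaluation, reports its results, and stops. A minimal sketch, with an arbitrary task name:

# alternative: execute a single evaluation as a Run Once task
task = llm_listener.run_once(name="monitor-one-shot",  # hypothetical task name
                             json_args=args,
                             timeout=1000)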

The LLM Listener models’ results are stored in the Inference Results logs. Each task run generates a new entry.

From these results we can monitor the performance of the LLM and check for toxicity or other issues. These results can also be used with Wallaroo assays to track against an established baseline.

llm_listener_results = pipeline.logs()
display(llm_listener_results)
time               2024-05-02 23:38:11.716
in.inputs          Wallaroo.AI is an AI platform that enables developers to build, deploy, and manage AI and machine learning models at scale. It provides a cloud-based infrastructure for building, training, and deploying AI models, as well as a set of tools and APIs for integrating AI into various applications.\n\nWallaroo.AI is designed to make it easy for developers to build and deploy AI models, regardless of their level of expertise in machine learning. It provides a range of features, including support for popular machine learning frameworks such as TensorFlow and PyTorch, as well as a set of pre-built AI models and APIs for common use cases such as image and speech recognition, natural language processing, and predictive analytics.\n\nWallaroo.AI is particularly well-suited for developers who are looking to build AI-powered applications, but may not have extensive expertise in machine learning or AI development. It provides a range of tools and resources to help developers get started with building AI-powered applications, including a cloud-based development environment, a set of pre-built AI models and APIs, and a range of tutorials and documentation.
in.top_k           6
out.identity_hate  [0.00014974642544984818]
out.insult         [0.00017831822333391756]
out.obscene        [0.00018145183275919408]
out.severe_toxic   [0.00012232053268235177]
out.threat         [0.00013229982869233936]
out.toxic          [0.0006922021857462823]
anomaly.count      0
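
The summary statistics computed inside the orchestration can also be reproduced directly from these logs, for example flagging results whose toxic score exceeds a chosen threshold:

import numpy as np

# summarize the listener scores pulled from the pipeline logs
toxic_scores = llm_listener_results['out.toxic'].apply(lambda x: x[0])
print("Avg Batch Toxicity:", np.mean(toxic_scores))
print("Over Threshold:", sum(toxic_scores > 0.001))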

For access to these sample models and for a demonstration, contact your Wallaroo support representative.