LLM Monitoring Listener Example
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
LLM Listener Monitoring with Llama V3 Instruct
The following example demonstrates using LLM Monitoring Listeners to monitor a deployed Llama V3 Instruct LLM and score it based on a set of criteria.
This example uses the Llama V3 Instruct LLM. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
LLM Monitoring Listeners leverage Wallaroo Inference Automation. LLM Monitoring Listeners are offline processes that score the LLM’s inference outputs against standard metrics including:
- Toxicity
- Sentiment
- Profanity
- Hate
- Etc
Users can also create custom LLM Monitoring Listeners to score the LLM against custom metrics. LLM Monitoring Listeners are composed of models trained to evaluate LLM outputs, so they can be updated or refined according to the organization’s needs.
Tutorial Overview
This tutorial demonstrates the following:
- Upload an LLM Monitoring Listener developed to score LLMs against a set of standard criteria.
- Using Wallaroo Inference Automation, orchestrate the LLM Monitoring Listener to evaluate the Llama V3 Instruct LLM and display the scores.
Tutorial Steps
Import libraries
The first step is to import the libraries required.
import wallaroo
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import Framework
from wallaroo.deployment_config import DeploymentConfigBuilder
from IPython.display import display
# used to display DataFrame information without truncating
import pandas as pd
pd.set_option('display.max_colwidth', None)
import pyarrow as pa
import json
import datetime
import time
import zipfile
Connect to the Wallaroo Instance
The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client(request_timeout=120)
Set Workspace and Variables
The following creates or connects to an existing workspace, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.
We will set the variables used for our deployed LLM model, and the models used for our LLM Listener.
workspace_name = "llm-models"
model_name = "toxic-bert"
post_model_name = "postprocess"
wl.set_current_workspace(wl.get_workspace(workspace_name))
{'name': 'llm-models', 'id': 322863, 'archived': False, 'created_by': 'adf08921-fc3a-4018-b55f-775cd0796538', 'created_at': '2024-03-25T20:33:10.564383+00:00', 'models': [{'name': 'llama-instruct', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 3, 25, 20, 53, 31, 707885, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 3, 25, 20, 53, 31, 707885, tzinfo=tzutc())}, {'name': 'llama-v2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 4, 18, 20, 58, 52, 684374, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 4, 18, 20, 58, 52, 684374, tzinfo=tzutc())}, {'name': 'llama3-instruct', 'versions': 2, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 5, 1, 19, 19, 18, 437490, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 5, 1, 18, 13, 47, 784249, tzinfo=tzutc())}, {'name': 'toxic-bert', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 5, 2, 23, 22, 2, 675607, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 5, 2, 23, 22, 2, 675607, tzinfo=tzutc())}], 'pipelines': [{'name': 'llama-instruct-pipeline', 'create_time': datetime.datetime(2024, 4, 11, 17, 5, 46, 75486, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama2-pipeline', 'create_time': datetime.datetime(2024, 4, 18, 21, 17, 44, 893427, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llamav3-instruct', 'create_time': datetime.datetime(2024, 5, 1, 19, 51, 8, 240637, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama-shadow', 'create_time': datetime.datetime(2024, 5, 2, 21, 50, 54, 293036, tzinfo=tzutc()), 'definition': '[]'}]}
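Note that wl.get_workspace(workspace_name) above assumes the llm-models workspace already exists. If it may not, a small get-or-create helper is a common pattern; the sketch below uses list_workspaces() and create_workspace(), and the helper name itself is illustrative rather than part of the SDK.
# minimal get-or-create sketch; the helper name is illustrative, not an SDK call
def get_or_create_workspace(client, name):
    for ws in client.list_workspaces():
        if ws.name() == name:
            return ws
    return client.create_workspace(name)

workspace = get_or_create_workspace(wl, workspace_name)
wl.set_current_workspace(workspace)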
Upload LLM Listener Models and Create a Monitoring Pipeline
This monitoring pipeline consists of a Hugging Face text classification model and a Python post-processing step.
The following models are used:
- toxic_bert: A Hugging Face Text Classification model that evaluates LLM outputs and outputs an array of scores.
- postprocess: A Python model that takes the toxic_bert outputs and converts them into the following field outputs, scored from 0 to 1, with 1 being the worst:
  - identity_hate
  - insult
  - obscene
  - severe_toxic
  - threat
  - toxic
# upload the sentiment analyzer
input_schema = pa.schema([
pa.field('inputs', pa.string()), # required
pa.field('top_k', pa.int64()),
])
output_schema = pa.schema([
pa.field('label', pa.list_(pa.string(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
pa.field('score', pa.list_(pa.float64(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
])
framework=Framework.HUGGING_FACE_TEXT_CLASSIFICATION
model_file_name = './models/unitary-toxic-bert.zip'
bert_model = wl.upload_model(model_name,
model_file_name,
framework=framework,
input_schema=input_schema,
output_schema=output_schema,
convert_wait=True)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime....................................................successful
Ready
# upload the postprocessor
input_schema = pa.schema([
pa.field('label', pa.list_(pa.string(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
pa.field('score', pa.list_(pa.float64(), list_size=6)), # list with the same number of items as top_k; list_size can be skipped but may lead to worse performance
])
# Define the schema for the 'output' DataFrame
output_schema = pa.schema([
pa.field('identity_hate', pa.float64()),
pa.field('insult', pa.float64()),
pa.field('obscene', pa.float64()),
pa.field('severe_toxic', pa.float64()),
pa.field('threat', pa.float64()),
pa.field('toxic', pa.float64())
])
# upload the post process model
post_model = wl.upload_model("postprocess",
"./models/postprocess.zip",
framework=wallaroo.framework.Framework.PYTHON,
input_schema=input_schema,
output_schema=output_schema
)
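The contents of postprocess.zip are not shown in this tutorial; the following is a minimal sketch of the transformation it performs, pairing each toxic_bert label with its score and emitting one column per label. It is written as plain pandas logic for illustration, not as the packaged Wallaroo Python step itself.
# illustration only: map toxic_bert's parallel label/score lists into named columns
def flatten_toxicity_scores(results: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for labels, scores in zip(results['label'], results['score']):
        # e.g. labels = ['toxic', 'insult', ...], scores = [0.0007, 0.0002, ...]
        rows.append(dict(zip(labels, scores)))
    # resulting columns: identity_hate, insult, obscene, severe_toxic, threat, toxic
    return pd.DataFrame(rows)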
display(bert_model)
Name | toxic-bert |
---|---|
Version | e511643c-30a4-48b9-a45e-f458d991a916 |
File Name | unitary-toxic-bert.zip |
SHA | 30b5c2d0c1a2102ad63ef7d84e953b018b45a0c021ea14916708ea1c8142ff38 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.2-4668 |
Architecture | None |
Updated At | 2024-02-May 23:26:34 |
display(post_model)
Name | postprocess |
---|---|
Version | 3a02feda-7336-4ef4-820a-f13265e3f251 |
File Name | postprocess.py |
SHA | 0d230ee260e4a86b2cc62c66445c7173e23a6f1bf696d239b45b4f0e2086ca85 |
Status | ready |
Image Path | None |
Architecture | None |
Updated At | 2024-02-May 23:30:10 |
Deploy the Listener Models
We deploy the listener models. We create a deployment configuration and set the Hugging Face sentiment analyzer to 4 CPUs and 8 Gi RAM.
We create the pipeline with the build_pipeline method and add the models as the pipeline steps.
# deployment configuration: 0.25 CPU / 1 Gi for the engine, 4 CPUs / 8 Gi for the toxic_bert sidekick
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus(0.25).memory('1Gi') \
    .sidekick_cpus(bert_model, 4) \
    .sidekick_memory(bert_model, "8Gi") \
    .build()
pipeline_name = 'full-toxmonitor-pipeline'
pipeline=wl.build_pipeline(pipeline_name)
pipeline.add_model_step(bert_model)
pipeline.add_model_step(post_model)
name | full-toxmonitor-pipeline |
---|---|
created | 2024-05-02 23:30:28.397835+00:00 |
last_updated | 2024-05-02 23:30:28.397835+00:00 |
deployed | (none) |
arch | None |
tags | |
versions | d75bc057-6392-42a9-8bdb-5d3661b731c4 |
steps | |
published | False |
With the pipeline set, we deploy the pipeline with the deployment configuration. This allocates the resources from the cluster for the LLM Listener models' use.
Once the models are deployed, we check the status and verify it’s running.
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.60.4.215',
'name': 'engine-86569ff7c-6rcdp',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'full-toxmonitor-pipeline',
'status': 'Running'}]},
'model_statuses': {'models': [{'name': 'toxic-bert',
'version': 'e511643c-30a4-48b9-a45e-f458d991a916',
'sha': '30b5c2d0c1a2102ad63ef7d84e953b018b45a0c021ea14916708ea1c8142ff38',
'status': 'Running'},
{'name': 'postprocess',
'version': '3a02feda-7336-4ef4-820a-f13265e3f251',
'sha': '0d230ee260e4a86b2cc62c66445c7173e23a6f1bf696d239b45b4f0e2086ca85',
'status': 'Running'}]}}],
'engine_lbs': [{'ip': '10.60.2.44',
'name': 'engine-lb-5df9b487cf-mjmfl',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.60.2.45',
'name': 'engine-sidekick-toxic-bert-9-7f69f7f58f-98z55',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
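Before handing the listener pipeline over to the orchestration, a quick smoke test confirms the deployed models return the expected toxicity fields. The sample sentence below is our own; the input columns follow the inputs/top_k schema defined earlier.
# quick smoke test of the deployed listener pipeline
smoke_test = pd.DataFrame({
    'inputs': ["This is a perfectly friendly sentence."],
    'top_k': [6]
})
smoke_result = pipeline.infer(smoke_test)
# show only the post-processed score columns
display(smoke_result.filter(regex=r'^out'))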
Orchestrate LLM Listener
The LLM Listener leverages Wallaroo Inference Automation for its execution. This is uploaded from the file llm-monitor.zip as a Workload Orchestration; this includes a Python script detailing how to deploy the LLM Listener models and evaluate the outputs of the LLM models.
The orchestration performs the following when executed:
- Accept the following arguments to determine which LLM to evaluate:
  - llm_workspace: The name of the workspace the LLM is deployed from.
  - llm_pipeline: The pipeline the LLM is deployed from.
  - llm_output_field: The LLM's text output field.
  - monitor_workspace: The workspace the LLM Listener models are deployed from.
  - monitor_pipeline: The pipeline the LLM Listener models are deployed from.
  - window_length: The amount of time to evaluate, in hours, from when the task is executed. For example, 1 evaluates the past hour. Use -1 for no limit; this gathers the standard inference results window.
  - n_toxlabels: The number of toxic labels. For our toxic_bert LLM Listener, the number of fields is 6.
- Deploy the LLM Listener models.
- Gather the llama3-instruct LLM's Inference Results and process the out.generated_text field through the LLM Listener models (a sketch of this step follows the list). These are either the default inference result outputs or those specified by a date range of inference results.
- The LLM Listener then scores the LLM's outputs and provides the scores listed above. These are extracted at any time as their own Inference Results.
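A rough sketch of the gather step is shown below. It assumes the SDK's get_workspace, set_current_workspace, and get_pipeline accessors together with the pipeline logs() call used elsewhere in this tutorial; the variable names mirror the arguments above but are otherwise illustrative.
# sketch: gather the monitored LLM's inference results and isolate its text output
llm_ws = wl.get_workspace('llm-models')               # llm_workspace
wl.set_current_workspace(llm_ws)
llm_pipeline = wl.get_pipeline('llamav3-instruct')    # llm_pipeline
llm_output_field = 'out.generated_text'               # llm_output_field

llm_logs = llm_pipeline.logs()                        # default inference results window
generated_text = llm_logs[llm_output_field]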
See Inference Automation for more details.
The LLM Monitoring Listener contains a Python script that deploys the LLM Monitoring Listener models, performs an inference against the LLM's outputs to determine the scores, and saves those scores as standard Wallaroo Inference Results.
Once complete, it undeploys the LLM Monitoring Listener to save resources. The following excerpt from the script takes the LLM's logs, isolates the specified output field, and performs the inference against that data.
# create the input for the toxicity model
input_data = {
"inputs": llm_logs[llm_output_field],
}
dataframe = pd.DataFrame(input_data)
dataframe['top_k'] = n_toxlabels
toxresults = toxmonitor.infer(dataframe)
print(toxresults)
# this is mostly for demo purposes
print("Avg Batch Toxicity:", np.mean(toxresults['out.toxic'].apply(lambda x:x[0])))
print("Over Threshold:", sum(toxresults['out.toxic'].apply(lambda x:x[0]) > 0.001))
The LLM Monitoring Listener is orchestrated from the file llm-monitor.zip.
llm_listener = wl.upload_orchestration(name="llm-toxicity-listener", path='./llm-monitor.zip')
Execute LLM Listener
As a Workload Orchestration, the LLM Listener is executed either as a Run Once task, which executes once, reports its results, then stops, or as a Run Scheduled task, which executes on a set schedule (every 5 minutes, every hour, etc).
The following shows running the LLM Listener as a Run Scheduled task that evaluates the llama3-instruct LLM every hour. The LLM Listener arguments can be modified to evaluate any other deployed LLM with its own text output field.
This assumes that the LLM Listener was already uploaded and is ready to accept new tasks, and that we have saved it to the variable llm_listener.
Here we create the Run Scheduled task to execute every hour, providing it the deployed LLM's workspace and pipeline, and the LLM Listener models' workspace and pipeline. We give the task the name monitor-initial-test.
# these are the default args
args = {
'llm_workspace' : 'llm-models' ,
'llm_pipeline': 'llamav3-instruct',
'llm_output_field': 'out.generated_text',
'monitor_workspace': 'llm-models',
'monitor_pipeline' : 'full-toxmonitor-pipeline',
'window_length': -1, # in hours. If -1, no limit (for testing)
'n_toxlabels': 6,
}
schedule = '00 * * * *' # cron expression: at the top of every hour
task = llm_listener.run_scheduled(name="monitor-initial-test",
schedule=schedule,
json_args=args,
timeout=1000)
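If a single on-demand evaluation is preferred, the same orchestration can instead be started as a Run Once task reusing the args defined above; the sketch below assumes the orchestration's run_once method with the same json_args and timeout parameters, and the task name here is our own.
# alternative: a one-shot evaluation instead of an hourly schedule
task = llm_listener.run_once(name="monitor-one-shot",
                             json_args=args,
                             timeout=1000)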
The LLM Listener model results are stored in the Inference Results logs. Each task run generates a new entry.
From these results we can monitor the performance of the LLM and check for toxicity or other issues. These results can also be used with Wallaroo assays to track drift against an established baseline.
llm_listener_results = pipeline.logs()
display(llm_listener_results)
time | in.inputs | in.top_k | out.identity_hate | out.insult | out.obscene | out.severe_toxic | out.threat | out.toxic | anomaly.count | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2024-05-02 23:38:11.716 | Wallaroo.AI is an AI platform that enables developers to build, deploy, and manage AI and machine learning models at scale. It provides a cloud-based infrastructure for building, training, and deploying AI models, as well as a set of tools and APIs for integrating AI into various applications.\n\nWallaroo.AI is designed to make it easy for developers to build and deploy AI models, regardless of their level of expertise in machine learning. It provides a range of features, including support for popular machine learning frameworks such as TensorFlow and PyTorch, as well as a set of pre-built AI models and APIs for common use cases such as image and speech recognition, natural language processing, and predictive analytics.\n\nWallaroo.AI is particularly well-suited for developers who are looking to build AI-powered applications, but may not have extensive expertise in machine learning or AI development. It provides a range of tools and resources to help developers get started with building AI-powered applications, including a cloud-based development environment, a set of pre-built AI models and APIs, and a range of tutorials and documentation. | 6 | [0.00014974642544984818] | [0.00017831822333391756] | [0.00018145183275919408] | [0.00012232053268235177] | [0.00013229982869233936] | [0.0006922021857462823] | 0 |
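As a quick follow-up, the same threshold check the orchestration script prints can be applied to these logs; the 0.001 threshold mirrors the script's demo value and is not a recommended production setting.
# flag listener results whose toxicity score exceeds the demo threshold
toxic_scores = llm_listener_results['out.toxic'].apply(lambda x: x[0])
print("Average toxicity:", toxic_scores.mean())
print("Entries over threshold:", int((toxic_scores > 0.001).sum()))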
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today