LLM Validation Listener Example
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following example demonstrates using an LLM Validation Listener to evaluate LLM performance at inference time.
An LLM Validation Listener validates an LLM’s inferences during the inference process. The validations run as an in-line step in the same Wallaroo pipeline as the LLM, and are customized for whatever monitoring the user requests, such as summary quality, translation quality score, and other use cases.
For access to these sample models and for a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
LLM Validation Listeners follow this process:
- Each validation step is uploaded into Wallaroo as a Wallaroo Custom Model (also known as Bring Your Own Predict, or BYOP) or as a Hugging Face model. These models monitor the outputs of the LLM and score them based on whatever criteria the data scientist develops.
- These model steps evaluate inference data directly from the LLM, creating additional fields based on the LLM’s inference output.
  - For example, if the LLM outputs the field text, the validation model’s outputs would be the fields summary_quality, translation_quality_score, etc.
- These steps are monitored with Wallaroo assays to analyze the scores each validation step produces and publish assay analyses based on established criteria.
Tutorial Overview
This tutorial demonstrates the following:
- Upload an LLM Validation Listener developed to evaluate the output of a Llama v3 Llamacpp LLM previously uploaded to Wallaroo.
- Add the LLM Validation Listener in the same pipeline as the Llama v3 Llamacpp LLM.
- Perform a sample inference and show how the LLM Validation Listener scores the LLM outputs.
Tutorial Steps
Import libraries
The first step is to import the libraries required.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Set Workspace
The following creates a workspace, or connects to an existing one, based on the variable workspace_name, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.
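The workspace step described above can be sketched as follows. This is a minimal sketch rather than the tutorial's original cell: it assumes the SDK's get_workspace method accepts the create_if_not_exist argument (available in recent Wallaroo SDK releases), the helper name use_workspace is hypothetical, and the workspace name in the usage comment is a placeholder.

```python
def use_workspace(wl, workspace_name: str):
    """Create or fetch the named workspace and make it the current one.

    `wl` is the client returned by wallaroo.Client().
    Assumption: get_workspace supports create_if_not_exist.
    """
    workspace = wl.get_workspace(name=workspace_name, create_if_not_exist=True)
    wl.set_current_workspace(workspace)
    return workspace

# Usage with the client created earlier (the name is a placeholder):
# workspace_name = "llm-validation-listener"
# workspace = use_workspace(wl, workspace_name)
```

Wrapping the two calls in a helper keeps the workspace name in one variable, so later steps can reuse it.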
Upload LLM Validation Listener Model
The LLM Validation Listener model is uploaded from the BYOP model summarisation_quality_final.zip, which is a summarization quality model that evaluates the LLM’s generated_text output and scores it. This has the following inputs and outputs:
- Inputs
  - text: String
  - generated_text: String; this is the output of the Llama v3 model.
- Outputs
  - generated_text: String; this is the same generated_text from the Llama v3 model, passed through as an inference output.
  - score: Float64; the total score based on the generated_text field.
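To make the input and output contract concrete, here is a toy stand-in for a validation step. It is not the scoring logic inside summarisation_quality_final.zip, which is not shown in this tutorial. The function name validate and the word-overlap score are assumptions chosen only to illustrate passing generated_text through and attaching a score.

```python
def validate(record: dict) -> dict:
    """Pass generated_text through and attach a toy score in [0, 1].

    Hypothetical scoring rule: the fraction of distinct summary words
    that also appear in the source text. The real listener model uses
    its own criteria.
    """
    source_words = set(record["text"].lower().split())
    summary_words = set(record["generated_text"].lower().split())
    if not summary_words:
        score = 0.0
    else:
        score = len(summary_words & source_words) / len(summary_words)
    return {"generated_text": record["generated_text"], "score": score}

result = validate({
    "text": "the cat sat on the mat",
    "generated_text": "cat sat",
})
print(result["score"])
```

The returned dictionary has the same two fields as the listener's output schema: the unchanged generated_text and a floating point score.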
Schema Definition
We set the model’s input and output schemas in Apache PyArrow Schema format.
input_schema = pa.schema([
pa.field('text', pa.string()),
pa.field('generated_text', pa.string())
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('score', pa.float64()),
])
Upload the Model
We now upload the model as the framework wallaroo.framework.Framework.CUSTOM. For more details on uploading models, see Model Upload. We store the model version reference in the variable validation_model.
validation_model = wl.upload_model('summquality',
'summarisation_quality_final.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema
)
display(validation_model)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...................................................................................................successful
Ready
Name | summquality |
Version | 14fca0ba-69d1-44b0-9fbb-ff39c07884b8 |
File Name | summarisation_quality_final.zip |
SHA | c221cf1cab35c089847138aeac5a2e179430fa45fbddd281bcb1614876541c81 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.1-4351 |
Architecture | None |
Updated At | 2024-23-May 18:27:38 |
Retrieve the LLM
If the LLM is already uploaded, we retrieve it with the method wallaroo.client.Client.get_model.
llama = wl.get_model('llamav3-llamacpp-passthrough-1')
display(llama)
Name | llamav3-llamacpp-passthrough-1 |
Version | 71993033-561b-455d-89ea-933f112eb523 |
File Name | byop_llamacpp_llama3_extra.zip |
SHA | 54f3b58c3efb4bf1c02a144683dd6431fcb606fb884ce7b1d853f9bffb71b6b4 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.1-4351 |
Architecture | None |
Updated At | 2024-23-May 19:56:57 |
Set Deployment Configuration
The deployment configuration sets the resources assigned to the Wallaroo engine, the LLM, and the LLM Validation Listener model. For this example:
- Wallaroo engine: 2 cpus, 2 Gi RAM
- LLM: 6 cpus, 10 Gi RAM
- LLM Validation Listener: 2 cpus, 8 Gi RAM
deployment_config = DeploymentConfigBuilder() \
.cpus(2).memory('2Gi') \
.sidekick_cpus(validation_model, 2) \
.sidekick_memory(validation_model, '8Gi') \
.sidekick_cpus(llama, 6) \
.sidekick_memory(llama, '10Gi') \
.build()
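As a rough capacity check, the requests in this configuration can be totalled before deploying. This is a small sketch: the dictionary keys are labels chosen for illustration, and it assumes the engine, the listener, and the LLM each run in their own container, so their requests add up.

```python
# Per-container requests copied from the deployment configuration above.
requests = {
    "engine": {"cpus": 2, "memory_gi": 2},
    "summquality": {"cpus": 2, "memory_gi": 8},
    "llamav3-llamacpp-passthrough-1": {"cpus": 6, "memory_gi": 10},
}

total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gi = sum(r["memory_gi"] for r in requests.values())

print(f"{total_cpus} cpus, {total_memory_gi} Gi")  # prints "10 cpus, 20 Gi"
```

If the cluster cannot supply these totals, the deployment may wait for resources instead of reaching the Running status.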
Deploy Models
We assign both models to the same pipeline, with the LLM as the first step and the LLM Validation Listener as the second step so that it scores the LLM’s results. The pipeline is then deployed with the deployment configuration defined above.
See Model Deploy for more details on deploying LLMs in Wallaroo.
pipeline = wl.build_pipeline("llm-summ-quality-1")
pipeline.add_model_step(llama)
pipeline.add_model_step(validation_model)
pipeline.deploy(deployment_config=deployment_config)
Once deployment is complete, we can check the deployment status.
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.208.2.14',
'name': 'engine-5b8586f4c8-fbzkx',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llm-summ-quality-1',
'status': 'Running'}]},
'model_statuses': {'models': [{'name': 'llamav3-llamacpp-passthrough-1',
'version': '71993033-561b-455d-89ea-933f112eb523',
'sha': '54f3b58c3efb4bf1c02a144683dd6431fcb606fb884ce7b1d853f9bffb71b6b4',
'status': 'Running'},
{'name': 'summquality',
'version': '14fca0ba-69d1-44b0-9fbb-ff39c07884b8',
'sha': 'c221cf1cab35c089847138aeac5a2e179430fa45fbddd281bcb1614876541c81',
'status': 'Running'}]}}],
'engine_lbs': [{'ip': '10.208.2.12',
'name': 'engine-lb-dcd9c8cd7-f64hr',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.208.2.13',
'name': 'engine-sidekick-summquality-204-585d4466ff-hr2gv',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'},
{'ip': '10.208.0.2',
'name': 'engine-sidekick-llamav3-llamacpp-passthrough-1-208-5fcd894vlc76',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Sample LLM and Validation Monitor Inference
We perform an inference by submitting an Apache Arrow table to the deployed LLM and LLM Validation Listener, and display the results. Apache Arrow tables provide a low-latency method of data transmission and inference.
The following fields are output from the inference:
- out.generated_text: The LLM’s generated text.
- out.score: The quality score.
text = "Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."
input_data = pa.Table.from_pydict({"text" : [text]})
pipeline.infer(input_data, timeout=600)
pyarrow.Table
time: timestamp[ms]
in.text: string not null
out.generated_text: string not null
out.score: float not null
check_failures: int8
----
time: [[2024-05-23 20:08:00.423]]
in.text: [["Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."]]
out.generated_text: [[" Here's a summary of the text:
This AI technology simplifies and streamlines self-checkout processes for retail stores, allowing them to offer efficient and modern shopping experiences at scale. It reduces technical complexity and makes it easy to deploy AI-driven self-checkout solutions across multiple locations. The system eliminates checkout delays, drives operational efficiencies by reducing labor costs, and enables continuous improvement through data insights, ensuring a consistent customer experience regardless of location."]]
out.score: [[0.837221]]
check_failures: [[0]]
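Once the score is part of the inference result, it can be checked in ordinary Python. The sketch below works on a plain dictionary of columns, which is the shape returned by calling to_pydict() on a pyarrow Table. The helper name low_quality_rows and the 0.5 threshold are hypothetical values for illustration, not recommendations from Wallaroo.

```python
def low_quality_rows(columns: dict, threshold: float = 0.5) -> list:
    """Return the indexes of rows whose out.score is below threshold."""
    return [i for i, score in enumerate(columns["out.score"]) if score < threshold]

# Shaped like the result above; with a live pipeline this would be:
# columns = pipeline.infer(input_data, timeout=600).to_pydict()
columns = {
    "out.generated_text": ["summary text"],
    "out.score": [0.837221],
}

print(low_quality_rows(columns))       # prints []
print(low_quality_rows(columns, 0.9))  # prints [0]
```

A check like this is only a spot test on individual results; tracking the score over time is the role of the Wallaroo assays mentioned earlier.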
Undeploy the Models
With the tutorial complete, we undeploy the pipeline to return its resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ..................................... ok
name | llm-summ-quality-1 |
---|---|
created | 2024-05-23 20:01:16.874284+00:00 |
last_updated | 2024-05-23 20:01:16.935710+00:00 |
deployed | False |
arch | None |
tags | |
versions | 1c9e9ec2-3dc9-4ef1-94e1-e9e2b6266d2c, b7c2c259-f900-471c-9ccd-cf3f95085969 |
steps | llamav3-llamacpp-passthrough-1 |
published | False |