LLM Validation Listener Example

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following example demonstrates using an LLM Validation Listener to evaluate LLM performance at inference time.

An LLM Validation Listener validates an LLM’s inferences during the inference process. These validations are implemented as in-line steps in the same Wallaroo pipeline as the LLM, and are customized for whatever monitoring the user requests, such as summary quality, translation quality score, and other use cases.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.

LLM Validation Listeners follow this process:

  • Each validation step is uploaded as a Bring Your Own Predict (BYOP) or Hugging Face model into Wallaroo. These models monitor the outputs of the LLM and score them based on whatever criteria the data scientist develops.
  • These model steps evaluate inference data directly from the LLM, creating additional fields based on the LLM’s inference output (a conceptual sketch follows this list).
    • For example, if the LLM outputs the field text, the validation model’s outputs would be fields such as summary_quality, translation_quality_score, etc.
  • These steps are monitored with Wallaroo assays to analyze the scores each validation step produces and publish assay analyses based on established criteria.
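
Conceptually, each validation step behaves like a post-processing function applied to the LLM’s output: it receives the LLM’s fields and appends its own scoring fields. The sketch below is illustrative only; summary_quality_score is a hypothetical stand-in for the trained quality model packaged in the BYOP step.

import pandas as pd

def summary_quality_score(text: str, generated_text: str) -> float:
    # Hypothetical stand-in: the real listener runs a trained summarization
    # quality model; a simple length ratio is used here as a placeholder.
    return min(len(generated_text) / max(len(text), 1), 1.0)

# The listener receives the LLM's inference output...
llm_output = pd.DataFrame({
    "text": ["Please summarize this text: ..."],
    "generated_text": ["Here's a summary of the text: ..."],
})

# ...passes generated_text through, and appends its own score field.
llm_output["score"] = [
    summary_quality_score(t, g)
    for t, g in zip(llm_output["text"], llm_output["generated_text"])
]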

Tutorial Overview

This tutorial demonstrates the following:

  • Upload an LLM Validation Listener developed to evaluate the output of a Llama v3 Llamacpp LLM previously uploaded to Wallaroo.
  • Add the LLM Validation Listener in the same pipeline as the Llama v3 Llamacpp LLM.
  • Perform a sample inference and show how the LLM Validation Listener scores the LLM outputs.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Set Workspace

The following creates or connects to an existing workspace based on the variable workspace_name, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.
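
A minimal sketch of this step, assuming the Wallaroo client wl from the previous step and a hypothetical workspace name:

workspace_name = "llm-validation-summarization"

# Reuse the workspace if it already exists, otherwise create it.
workspace = None
for ws in wl.list_workspaces():
    if ws.name() == workspace_name:
        workspace = ws
        break
if workspace is None:
    workspace = wl.create_workspace(workspace_name)

# Set it as the current workspace for the model uploads and deployment below.
wl.set_current_workspace(workspace)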

Upload LLM Validation Listener Model

The LLM Validation Listener model is uploaded from the BYOP model summarisation_quality_final.zip, a summarization quality model that evaluates the LLM’s generated_text output and scores it. It has the following inputs and outputs:

  • Inputs
    • text: String
    • generated_text: String; the output of the Llama v3 model.
  • Outputs
    • generated_text: String; the same generated_text from the Llama v3 model, passed through as an inference output.
    • score: Float64; the total score based on the generated_text field.

Schema Definition

We set the model’s input and output schemas in Apache PyArrow Schema format.

input_schema = pa.schema([
    pa.field('text', pa.string()),
    pa.field('generated_text', pa.string())
]) 

output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('score', pa.float64()),
])

Upload the Model

We now upload the model, specifying the framework wallaroo.framework.Framework.CUSTOM. For more details on uploading models, see Model Upload. We store the model version reference in the variable validation_model.

validation_model = wl.upload_model('summquality', 
    'summarisation_quality_final.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
display(validation_model)
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...................................................................................................successful

Ready
Name          summquality
Version       14fca0ba-69d1-44b0-9fbb-ff39c07884b8
File Name     summarisation_quality_final.zip
SHA           c221cf1cab35c089847138aeac5a2e179430fa45fbddd281bcb1614876541c81
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.1-4351
Architecture  None
Updated At    2024-23-May 18:27:38

Retrieve the LLM

If the LLM is already uploaded, we retrieve it with the method wallaroo.client.Client.get_model.

llama = wl.get_model('llamav3-llamacpp-passthrough-1')
display(llama)
Name          llamav3-llamacpp-passthrough-1
Version       71993033-561b-455d-89ea-933f112eb523
File Name     byop_llamacpp_llama3_extra.zip
SHA           54f3b58c3efb4bf1c02a144683dd6431fcb606fb884ce7b1d853f9bffb71b6b4
Status        ready
Image Path    proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mlflow-deploy:v2023.4.1-4351
Architecture  None
Updated At    2024-23-May 19:56:57

Set Deployment Configuration

The deployment configuration sets the resources assigned to the LLM and the LLM Validation Listener model. For this example:

  • LLM: 6 CPUs, 10 Gi RAM
  • LLM Validation Listener (in-line monitor): 2 CPUs, 8 Gi RAM
  • Wallaroo pipeline engine: 2 CPUs, 2 Gi RAM

deployment_config = DeploymentConfigBuilder() \
    .cpus(2).memory('2Gi') \
    .sidekick_cpus(validation_model, 2) \
    .sidekick_memory(validation_model, '8Gi') \
    .sidekick_cpus(llama, 6) \
    .sidekick_memory(llama, '10Gi') \
    .build()

Deploy Models

We assign both models to the same pipeline: the LLM first, and the validation model second so it scores the results of the LLM. Both are deployed with the defined deployment configuration.

See Model Deploy for more details on deploying LLMs in Wallaroo.

pipeline = wl.build_pipeline("llm-summ-quality-1")
pipeline.add_model_step(llama)
pipeline.add_model_step(validation_model)
pipeline.deploy(deployment_config=deployment_config)

Once deployment is complete, we can check the deployment status.

pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.208.2.14',
   'name': 'engine-5b8586f4c8-fbzkx',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llm-summ-quality-1',
      'status': 'Running'}]},
   'model_statuses': {'models': [{'name': 'llamav3-llamacpp-passthrough-1',
      'version': '71993033-561b-455d-89ea-933f112eb523',
      'sha': '54f3b58c3efb4bf1c02a144683dd6431fcb606fb884ce7b1d853f9bffb71b6b4',
      'status': 'Running'},
     {'name': 'summquality',
      'version': '14fca0ba-69d1-44b0-9fbb-ff39c07884b8',
      'sha': 'c221cf1cab35c089847138aeac5a2e179430fa45fbddd281bcb1614876541c81',
      'status': 'Running'}]}}],
 'engine_lbs': [{'ip': '10.208.2.12',
   'name': 'engine-lb-dcd9c8cd7-f64hr',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.208.2.13',
   'name': 'engine-sidekick-summquality-204-585d4466ff-hr2gv',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'},
  {'ip': '10.208.0.2',
   'name': 'engine-sidekick-llamav3-llamacpp-passthrough-1-208-5fcd894vlc76',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
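
Because pipeline.status() returns a plain Python dictionary, the deployment state can also be checked programmatically before submitting inferences. A minimal sketch, assuming a simple polling loop fits your environment:

import time

# Poll the pipeline status until it reports Running, for up to ~5 minutes.
for _ in range(30):
    if pipeline.status()["status"] == "Running":
        break
    time.sleep(10)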

Sample LLM and Validation Monitor Inference

We perform an inference by submitting an Apache Arrow table to the deployed LLM and LLM Validation Listener and display the results. Apache Arrow tables provide a low-latency method of data transmission and inference.

The following fields are output from the inference:

  • out.generated_text: The LLM’s generated text.
  • out.score: The quality score.
text = "Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."
input_data = pa.Table.from_pydict({"text" : [text]})
pipeline.infer(input_data, timeout=600)
pyarrow.Table
time: timestamp[ms]
in.text: string not null
out.generated_text: string not null
out.score: float not null
check_failures: int8
----
time: [[2024-05-23 20:08:00.423]]
in.text: [["Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."]]
out.generated_text: [[" Here's a summary of the text:

This AI technology simplifies and streamlines self-checkout processes for retail stores, allowing them to offer efficient and modern shopping experiences at scale. It reduces technical complexity and makes it easy to deploy AI-driven self-checkout solutions across multiple locations. The system eliminates checkout delays, drives operational efficiencies by reducing labor costs, and enables continuous improvement through data insights, ensuring a consistent customer experience regardless of location."]]
out.score: [[0.837221]]
check_failures: [[0]]
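
To work with these results downstream, the returned pyarrow.Table can be converted to a pandas DataFrame. A minimal sketch, assuming the inference result is captured in a variable:

results = pipeline.infer(input_data, timeout=600)

# Convert the Arrow table to pandas and pull out the listener's quality score.
df = results.to_pandas()
print(df.loc[0, "out.score"])
print(df.loc[0, "out.generated_text"])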

Undeploy the Models

With the tutorial complete, we undeploy the models to return the resources back to the cluster.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ..................................... ok
name            llm-summ-quality-1
created         2024-05-23 20:01:16.874284+00:00
last_updated    2024-05-23 20:01:16.935710+00:00
deployed        False
arch            None
tags
versions        1c9e9ec2-3dc9-4ef1-94e1-e9e2b6266d2c, b7c2c259-f900-471c-9ccd-cf3f95085969
steps           llamav3-llamacpp-passthrough-1
published       False

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.