LLM Validation Listeners
LLM Validation Listeners are in-line monitors that validate an LLM's inferences during the inference process. These validations are implemented as an in-line step in the same Wallaroo pipeline as the LLM, and are customized for whatever monitoring the user requests, such as summary quality, translation quality score, and other use cases.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
An LLM Validation Listener follows this process:
- Each validation step is uploaded as a Wallaroo Custom Model, also known as Bring Your Own Predict (BYOP), or as a Hugging Face model. These models monitor the outputs of the LLM and score them based on whatever criteria the data scientist develops (a minimal sketch of such scoring logic follows this list).
- These model steps evaluate inference data directly from the LLM, creating additional fields based on the LLM’s inference output.
  - For example, if the LLM outputs the field `text`, the validation model could output the fields `summary_quality`, `translation_quality_score`, etc.
- These steps are monitored with Wallaroo assays to analyze the scores each validation step produces and publish assay analyses based on established criteria.
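The scoring logic inside a validation step is ordinary Python. The following is a minimal, hypothetical sketch of the kind of scoring function a summarization listener might wrap; the `score_summary` name and its length-and-overlap heuristic are illustrative assumptions, and a production listener would typically call a trained quality model packaged inside the BYOP framework.
# hypothetical scoring function a summarization listener might wrap;
# the heuristic below is illustrative only, not Wallaroo's implementation
def score_summary(text: str, generated_text: str) -> float:
    """Return a 0.0-1.0 quality score for a generated summary."""
    if not generated_text:
        return 0.0
    # reward summaries that are meaningfully shorter than the source
    compression = 1.0 - min(len(generated_text) / max(len(text), 1), 1.0)
    # reward vocabulary overlap between the source and the summary
    source_words = set(text.lower().split())
    summary_words = set(generated_text.lower().split())
    overlap = len(source_words & summary_words) / max(len(summary_words), 1)
    return round(0.5 * compression + 0.5 * overlap, 6)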
LLM Validation Listener Example - Summarization
The following condensed example provides in-line monitoring for a Llama v3 Llamacpp LLM. This LLM was previously uploaded and set to the variable `llamav3_llamacpp`. For a full example, see the tutorial LLM In-Line Monitoring Example or download the sample Jupyter Notebooks from the Wallaroo Tutorials repository.
This LLM has the following inputs and outputs:
- Inputs
  - `text`: String
- Outputs
  - `generated_text`: String
We add the BYOP model `summarisation_quality_final.zip`, which evaluates the LLM's `generated_text` output and scores it. This model has the following inputs and outputs:
- Inputs
  - `text`: String
  - `generated_text`: String; this is the output of the Llama v3 model.
- Outputs
  - `generated_text`: String; this is the same `generated_text` from the Llama v3 model, passed through as an inference output.
  - `score`: Float64; the total score based on the `generated_text` field.
The following is an example of performing a sample inference on the Llama v3 model deployed in Wallaroo from the pipeline `pipeline_llm`. For details on uploading and deploying LLMs to Wallaroo, see How to Upload and Deploy LLM Models in Wallaroo.
# import pyarrow for constructing the input table
import pyarrow as pa

# set the input_data
text = "Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."
# convert to an Apache Arrow table
input_data = pa.Table.from_pydict({"text" : [text]})
# perform the inference
result = pipeline_llm.infer(input_data)
# display the result, output as an Apache Arrow table
display(result)
pyarrow.Table
time: timestamp[ms]
in.text: string not null
out.generated_text: string not null
anomaly.count: int8
----
time: [[2024-05-23 19:17:37.617]]
in.text: [["Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."]]
out.generated_text: [[" Here's a summary of the text:
This AI technology simplifies and streamlines self-checkout processes for retail stores, allowing them to offer efficient and modern shopping experiences at scale. It reduces technical complexity and makes it easy to deploy AI-driven self-checkout solutions across multiple locations. The system eliminates checkout delays, drives operational efficiencies by reducing labor costs, and enables continuous improvement through data insights, ensuring a consistent customer experience regardless of location."]]
anomaly.count: [[0]]
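Because the result is an Apache Arrow table, it can also be converted to a pandas DataFrame for easier inspection; this uses pyarrow's standard `to_pandas()` method rather than anything Wallaroo-specific.
# convert the Arrow result to a pandas DataFrame for easier inspection
result_df = result.to_pandas()
display(result_df["out.generated_text"][0])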
Upload and Deploy LLM Validation Listener
The BYOP model `summarisation_quality_final.zip` is uploaded and deployed as a model step within the same pipeline as the LLM.
# import the Framework enumeration used for the model upload
from wallaroo.framework import Framework

# set the input and output schemas
input_schema = pa.schema([
pa.field('text', pa.string()),
pa.field('generated_text', pa.string())
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('score', pa.float64()),
])
# upload the model with the BYOP framework
validation_model = wl.upload_model('summquality',
'summarisation_quality_final.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema
)
With the validation model uploaded, we add it to the same pipeline as the Llama v3 LLM. The models are provided with the following resources:
- `llamav3_llamacpp`: 6 CPUs, 10 Gi RAM
- `validation_model`: 2 CPUs, 8 Gi RAM
# import the deployment configuration builder
from wallaroo.deployment_config import DeploymentConfigBuilder

# create the pipeline
pipeline_llm = wl.build_pipeline("llm-summ-quality")

# add the model steps
pipeline_llm.add_model_step(llamav3_llamacpp)
pipeline_llm.add_model_step(validation_model)

# set the deployment configuration, assigning each model its resources
deployment_config = DeploymentConfigBuilder() \
    .cpus(2).memory('2Gi') \
    .sidekick_cpus(validation_model, 2) \
    .sidekick_memory(validation_model, '8Gi') \
    .sidekick_cpus(llamav3_llamacpp, 6) \
    .sidekick_memory(llamav3_llamacpp, '10Gi') \
    .build()

# deploy the LLM and validation model with the deployment configuration
pipeline_llm.deploy(deployment_config=deployment_config)
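Before running inferences, it is worth confirming that both models reached the Running state. A minimal check, assuming the status dictionary returned by the SDK's `status()` method (the exact keys may vary by Wallaroo version):
# verify the deployment before inferencing; the pipeline reports
# 'Running' once the LLM and the validation model are both available
status = pipeline_llm.status()
assert status['status'] == 'Running', status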
LLM Validation Listener Sample Inference
Once deployed, we perform the same inference with the same inputs. The validation model adds the additional field `score` to the inference output.
# perform the inference
result = pipeline_llm.infer(input_data)
# display the result, output as an Apache Arrow table
display(result)
pyarrow.Table
time: timestamp[ms]
in.text: string not null
out.generated_text: string not null
out.score: float not null
anomaly.count: int8
----
time: [[2024-05-23 20:08:00.423]]
in.text: [["Please summarize this text: Simplify production AI for seamless self-checkout or cashierless experiences at scale, enabling any retail store to offer a modern shopping journey. We reduce the technical overhead and complexity for delivering a checkout experience that’s easy and efficient no matter where your stores are located.Eliminate Checkout Delays: Easy and fast model deployment for a smooth self-checkout process, allowing customers to enjoy faster, hassle-free shopping experiences. Drive Operational Efficiencies: Simplifying the process of scaling AI-driven self-checkout solutions to multiple retail locations ensuring uniform customer experiences no matter the location of the store while reducing in-store labor costs. Continuous Improvement: Enabling integrated data insights for informing self-checkout improvements across various locations, ensuring the best customer experience, regardless of where they shop."]]
out.generated_text: [[" Here's a summary of the text:
This AI technology simplifies and streamlines self-checkout processes for retail stores, allowing them to offer efficient and modern shopping experiences at scale. It reduces technical complexity and makes it easy to deploy AI-driven self-checkout solutions across multiple locations. The system eliminates checkout delays, drives operational efficiencies by reducing labor costs, and enables continuous improvement through data insights, ensuring a consistent customer experience regardless of location."]]
out.score: [[0.837221]]
anomaly.count: [[0]]
The `out.score` field is observed through Model Inference Results or by using Wallaroo assays.
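As a lightweight complement to a full assay, the scores can also be pulled from the pipeline's inference logs and checked against a threshold. The sketch below uses the SDK's `logs()` method; the `0.8` threshold is an illustrative assumption, not a Wallaroo default.
# pull recent inference logs and flag low-scoring summaries;
# the 0.8 threshold is an illustrative assumption
logs = pipeline_llm.logs()
low_quality = logs[logs['out.score'] < 0.8]
print(f"{len(low_quality)} of {len(logs)} summaries scored below 0.8")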
LLM Validation Listener Example - Harmful Language Listener
The following example shows how an LLM with a Harmful Language Listener is deployed in Wallaroo. In this instance, the output of an LLM is evaluated by the Harmful Language Listener, which returns:
- `harmful` (Bool): Whether the LLM output is determined to be harmful.
- `reasoning` (String): The reason the LLM output is considered harmful, for example racism, foul language, etc.
- `confidence` (Float): The confidence the Listener has in its harmful-or-not determination.
As with other LLMs with in-line monitoring, the deployment process deploys the LLM and the Listener in one Wallaroo pipeline, with the output of the LLM providing the input for the Listener.
The following shows generic upload and deployment code for the two models by:
- Defining the input and output schemas.
- Uploading the models.
- Defining the Wallaroo pipeline with each of the model steps.
- Deploying the models and executing a sample inference.
# upload the LLM; llm_file and framework are placeholders for the
# model file and its Wallaroo framework
input_schema = pa.schema([
pa.field("text", pa.string())
])
output_schema = pa.schema([
pa.field("text", pa.string()), # preserve the input text for the HLL
pa.field("generated_text", pa.string()) # the output of the LLM
])
model = wl.upload_model('llm-name',
llm_file,
framework=framework,
input_schema=input_schema,
output_schema=output_schema
)
# upload the HLL
input_schema = pa.schema([
pa.field("text", pa.string()), # the original input text
pa.field("generated_text", pa.string()) # the text generated by the LLM
])
output_schema = pa.schema([
pa.field("harmful", pa.bool_()),
pa.field("reasoning", pa.string()),
pa.field("confidence", pa.float32()),
pa.field("generated_text", pa.string())
])
listener = wl.upload_model('listener-name',
listener_file,  # placeholder for the listener model file
framework=framework,
input_schema=input_schema,
output_schema=output_schema,
)
With both the LLM and the HLL uploaded, we deploy the models via the Wallaroo pipeline.
pipeline = wl.build_pipeline("llm-with-hll-listener")
pipeline.add_model_step(model)
pipeline.add_model_step(listener)
pipeline.deploy()
Once deployed, we perform our inferences.
# import pandas for constructing the inference input
import pandas as pd

data = pd.DataFrame({'text': ['Describe what Roland Garros is']})
result = pipeline.infer(data)
display(result)
|   | time | in.text | out.confidence | out.generated_text | out.harmful | out.reasoning | anomaly.count |
|---|------|---------|----------------|--------------------|-------------|---------------|---------------|
| 0 | 2024-12-12 15:54:38.440 | Describe what Roland Garros is | 0.95 | ' Roland Garros, also known as the French Open, is a prestigious Grand Slam tennis tournament …' | False | This response provides a neutral and informati… | 0 |
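Downstream code can act on the Listener's verdict directly from the inference result. For example, a minimal sketch that suppresses any responses flagged as harmful, using the column names shown above:
# keep only the responses the Listener did not flag as harmful
safe = result[~result['out.harmful']]
for answer in safe['out.generated_text']:
    print(answer)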
Tutorials
The following tutorials provide full details on uploading, deploying, and performing inferences on LLMs with Validation Listeners.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today