LLM Harmful Language Listener Tutorial
This tutorial can be downloaded as part of the Wallaroo Tutorials repository.
For access to these sample models and a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
The following tutorial demonstrates the Llama 3 70b Instruct Q5 Large Language Model (LLM) with a Harmful Language Listener. The Listener provides validation monitoring to detect language that could be considered harmful: obscene, racist, insulting, or harmful by other benchmarks.
This tutorial demonstrates how to:
- Upload the LLM and the Harmful Language Listener.
- Create a Wallaroo pipeline and set the LLM and then the Listener as pipeline steps.
- Deploy the models and perform sample inferences.
Model Overview
The LLM used in this demonstration has the following attributes.
- Framework: vllm for more optimized model deployment, uploaded to Wallaroo in the Wallaroo Custom Model, aka Bring Your Own Predict (BYOP), framework.
- Artifacts: The original model is the Llama 3 8B Instruct Hugging Face model.
- Input/Output Types: Both the input and outputs are text.
The Harmful Language Listener used in this demonstration has the following attributes:
- Framework: vllm for more optimized model deployment, uploaded to Wallaroo in the Wallaroo Custom Model, aka Bring Your Own Predict (BYOP), framework.
- Artifacts: The Listener model is encapsulated as part of the BYOP framework.
- Input/Output Types: The Listener takes the following inputs and outputs.
  - Listener Input:
    - text (String): The original input text to the LLM.
    - generated_text (String): The text created by the LLM. This is evaluated by the Listener for any harmful language.
  - Listener Output:
    - harmful (Boolean): Whether the generated_text is harmful.
    - reasoning (String): The reasons why the generated_text is considered harmful or not.
    - confidence (Float): The confidence the model has in its determination of whether the generated_text is harmful or not.
    - generated_text (String): The text generated by the LLM. This is passed on as part of the Listener's output.
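The Listener output fields listed above can be mirrored in a small typed container when handling results in client code. The following is a minimal sketch, not part of the Wallaroo SDK: the `ListenerOutput` class and `parse_listener_output` helper are hypothetical names introduced for illustration, and the sample record uses placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListenerOutput:
    """Hypothetical container mirroring the Listener's four output fields."""
    harmful: bool
    reasoning: str
    confidence: float
    generated_text: str


# Field names and Python types corresponding to the Listener output description.
EXPECTED_FIELDS = {
    "harmful": bool,
    "reasoning": str,
    "confidence": float,
    "generated_text": str,
}


def parse_listener_output(record: dict) -> ListenerOutput:
    """Check that a record has every expected field with the expected type."""
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            raise KeyError(f"missing field: {name}")
        if not isinstance(record[name], expected_type):
            raise TypeError(f"{name} should be of type {expected_type.__name__}")
    return ListenerOutput(**{name: record[name] for name in EXPECTED_FIELDS})


# Placeholder record for illustration only; not real model output.
example = parse_listener_output({
    "harmful": False,
    "reasoning": "placeholder reasoning",
    "confidence": 0.95,
    "generated_text": "placeholder generated text",
})
```

Validating the record once at the boundary keeps later code free of repeated key and type checks.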
Tutorial Steps
Import Libraries
We start by importing the required libraries. This includes the following:
- Wallaroo SDK: Used to upload and deploy the model in Wallaroo.
- pyarrow: The input and output schemas of models uploaded to Wallaroo are defined in Apache Arrow schema format.
- pandas: Data is submitted to models deployed in Wallaroo as either Apache Arrow Table format or pandas Record Format as a DataFrame.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
A connection to Wallaroo is set through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload the LLM
To upload the LLM and Listener, we use the wallaroo.client.Client.upload_model method, which takes the following parameters:
- The name to assign to the LLM.
- The file path to upload the LLM.
- The Framework, set to wallaroo.framework.Framework.CUSTOM for our Hugging Face model encapsulated in the BYOP framework.
- The input and output schemas.
For more information, see the Wallaroo Model Upload guide.
First we’ll set the input and output schemas for our LLM in Apache PyArrow Schema format.
input_schema = pa.schema([
    pa.field("text", pa.string())
])
output_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])
Then issue the upload command. For this example, we add a model configuration that enables Dynamic Batching, which improves LLM performance. Here it is set with a maximum batch delay of 1000 ms and a target batch size of 8.
llm = wl.upload_model('llama-cpp-sdk-safeguards',
    './models/byop_llamacpp_safeguards.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
).configure(input_schema=input_schema,
    output_schema=output_schema,
    dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000,
                                                  batch_size_target=8)
)
llm
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime......
Model is attempting loading to a container runtime............successful
Ready
Name | llama-cpp-sdk-safeguards |
Version | 9c03eaa2-d0d4-4adb-86a1-26df7bf3eb33 |
File Name | byop_llamacpp_safeguards.zip |
SHA | 45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5845 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-17-Dec 16:43:42 |
Workspace id | 5 |
Workspace name | john.hansarick@wallaroo.ai - Default Workspace |
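To build intuition for the two Dynamic Batching settings used in the upload above, `max_batch_delay_ms` and `batch_size_target`, the following is a conceptual sketch only. It is not Wallaroo's implementation, and the exact batching rules are an assumption: here a batch closes once it holds the target number of requests, or once the next request arrives later than the maximum delay after the batch's first request. The function name `group_into_batches` is hypothetical.

```python
def group_into_batches(arrival_times_ms, max_batch_delay_ms=1000, batch_size_target=8):
    """Group sorted request arrival times (in ms) into batches.

    Assumed rule: close the current batch when it already holds
    batch_size_target requests, or when the next request arrives more than
    max_batch_delay_ms after the first request in the batch.
    """
    batches = []
    current = []
    for t in arrival_times_ms:
        batch_is_full = len(current) >= batch_size_target
        waited_too_long = bool(current) and (t - current[0] > max_batch_delay_ms)
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Three requests arrive close together, then two more after a long pause.
sparse = group_into_batches([0, 100, 200, 1500, 1600])

# Ten requests arrive within 10 ms; the size target splits them.
burst = group_into_batches(list(range(10)))
```

In this model a larger delay trades latency for fuller batches, while the size target caps how much work a single batch carries.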
Next we upload the Listener in the same process: define the input and output schemas, and then upload the model.
Note that for the Listener, the inputs are the LLM's outputs. The Listener includes with its outputs the LLM's generated_text field so it is passed back to the original receiver.
# Safeguards Harmful Language Listener
# Define schemas
input_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])
output_schema = pa.schema([
    pa.field("harmful", pa.bool_()),
    pa.field("reasoning", pa.string()),
    pa.field("confidence", pa.float32()),
    pa.field("generated_text", pa.string())
])
# Upload the harmful language listener
listener = wl.upload_model('byop-safeguards-harmful-5',
    './models/byop-safeguards-harmful.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
listener
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..............................successful
Ready
Name | byop-safeguards-harmful-5 |
Version | 6a71d544-89de-411e-97a7-2a5dc5cd92f6 |
File Name | byop-safeguards-harmful.zip |
SHA | c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5845 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-17-Dec 16:46:34 |
Workspace id | 5 |
Workspace name | john.hansarick@wallaroo.ai - Default Workspace |
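Because the Listener's inputs must match the LLM's outputs, it can be useful to compare the two field lists before chaining the models. The following is a minimal sketch that mirrors the pyarrow schemas above as plain dictionaries of field name to type name; the helper `unmet_inputs` and the dictionary representation are illustrative assumptions, not part of the Wallaroo SDK.

```python
# Plain-dictionary mirrors of the schemas defined in this tutorial.
llm_output_fields = {"text": "string", "generated_text": "string"}
listener_input_fields = {"text": "string", "generated_text": "string"}


def unmet_inputs(upstream_outputs, downstream_inputs):
    """Return the downstream input fields that the upstream step does not
    supply under the same name and type."""
    return {
        name: dtype
        for name, dtype in downstream_inputs.items()
        if upstream_outputs.get(name) != dtype
    }


# An empty result means every Listener input is supplied by the LLM.
problems = unmet_inputs(llm_output_fields, listener_input_fields)

# Counter-example: an upstream step that omits generated_text.
broken = unmet_inputs({"text": "string"}, listener_input_fields)
```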
Deployment
For our deployment, we deploy both the LLM and Listener in the same pipeline as pipeline steps. Input provided to the pipeline is submitted first to the LLM. The output from the LLM is then the input to the Listener, and the Listener’s output is then provided back to the requester.
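The step ordering described above can be pictured as simple function composition. The following is a conceptual sketch with stand-in functions; `run_steps`, `fake_llm`, and `fake_listener` are hypothetical names, and their return values are placeholders rather than real model output.

```python
def run_steps(steps, request):
    """Pass the request through each step in order; each step's output
    becomes the next step's input."""
    data = request
    for step in steps:
        data = step(data)
    return data


def fake_llm(inputs):
    # Stand-in for the LLM step: returns the prompt plus a placeholder response.
    return {
        "text": inputs["text"],
        "generated_text": f"(response to: {inputs['text']})",
    }


def fake_listener(inputs):
    # Stand-in for the Listener step: always reports "not harmful" and
    # passes generated_text through to the requester.
    return {
        "harmful": False,
        "reasoning": "placeholder reasoning",
        "confidence": 1.0,
        "generated_text": inputs["generated_text"],
    }


result = run_steps([fake_llm, fake_listener], {"text": "Describe what Wallaroo.AI is"})
```

The requester only sees the final step's output, which is why the Listener forwards generated_text alongside its own fields.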
The deployment configuration sets the resources allocated for the LLM and the Listener with the following options:
- LLM
- CPUs: 6
- Memory: 10 Gi
- Harmful Language Listener
- CPUs: 2
- Memory: 10 Gi
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 6) \
    .sidekick_memory(llm, '10Gi') \
    .sidekick_cpus(listener, 2) \
    .sidekick_memory(listener, '10Gi') \
    .sidekick_env(listener, json.load(open("credentials.json", 'r'))) \
    .build()
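As a quick capacity check before deploying, the requests in the configuration above can be added up. The following is a rough sketch that assumes a single replica and that the engine, LLM, and Listener requests are simply summed; the `parse_gi` helper is hypothetical and handles only the `Gi` suffix used in this tutorial.

```python
def parse_gi(value):
    """Convert a memory string such as '10Gi' to a number of GiB.
    Only the 'Gi' suffix is handled in this sketch."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported memory unit in {value!r}")
    return float(value[:-2])


# Resource requests copied from the deployment configuration above.
requests = {
    "engine": {"cpus": 1, "memory": "2Gi"},
    "llm": {"cpus": 6, "memory": "10Gi"},
    "listener": {"cpus": 2, "memory": "10Gi"},
}

total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gi = sum(parse_gi(r["memory"]) for r in requests.values())

print(f"Total requested: {total_cpus} CPUs, {total_memory_gi:g} Gi memory")
```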
The Wallaroo pipeline is created with the build_pipeline method. The LLM and Listener are set as the pipeline steps, then deployed with the previously defined deployment configuration.
pipeline = wl.build_pipeline("safeguards-llamacpp")
pipeline.add_model_step(llm)
pipeline.add_model_step(listener)
pipeline.deploy(deployment_config=deployment_config, wait_for_status=False)
Deployment initiated for safeguards-llamacpp. Please check pipeline status.
name | safeguards-llamacpp |
---|---|
created | 2024-12-17 16:52:43.998692+00:00 |
last_updated | 2024-12-17 16:52:44.049227+00:00 |
deployed | True |
workspace_id | 5 |
workspace_name | john.hansarick@wallaroo.ai - Default Workspace |
arch | x86 |
accel | none |
tags | |
versions | 7ecd5285-6576-4539-99b0-a067a88836c1, 3641c4fe-c5be-46e6-bd94-93326d57ede2 |
steps | llama-cpp-sdk-safeguards |
published | False |
Once deployed, we'll check on the status. When the status is Running, we continue to the inference steps.
# check the pipeline status before performing an inference
import time
time.sleep(15)
# poll every 15 seconds until the pipeline reports Running
while pipeline.status()['status'] != 'Running':
    time.sleep(15)
pipeline.status()['status']
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.28.3.7',
'name': 'engine-576b7f5b4-m9h9j',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'safeguards-llamacpp',
'status': 'Running',
'version': '7ecd5285-6576-4539-99b0-a067a88836c1'}]},
'model_statuses': {'models': [{'model_version_id': 1,
'name': 'llama-cpp-sdk-safeguards',
'sha': '45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110',
'status': 'Running',
'version': '9c03eaa2-d0d4-4adb-86a1-26df7bf3eb33'},
{'model_version_id': 2,
'name': 'byop-safeguards-harmful-5',
'sha': 'c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b',
'status': 'Running',
'version': '6a71d544-89de-411e-97a7-2a5dc5cd92f6'}]}}],
'engine_lbs': [{'ip': '10.28.2.7',
'name': 'engine-lb-6676794678-r4w99',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.28.3.8',
'name': 'engine-sidekick-byop-safeguards-harmful-5-2-654fb4d87f-c5jp8',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'},
{'ip': '10.28.2.8',
'name': 'engine-sidekick-llama-cpp-sdk-safeguards-1-6d8bfddc4-nfhhb',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
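The polling loop above waits indefinitely if the pipeline never reaches Running. One possible variation, sketched here under the assumption that a status can be fetched through any zero-argument callable, adds a timeout. The `wait_for_status` helper is a hypothetical name, not an SDK function, and the "Starting" value in the example is a placeholder status.

```python
import time


def wait_for_status(get_status, target="Running", interval_s=15.0, timeout_s=600.0):
    """Call get_status() until it returns target, sleeping interval_s between
    attempts. Raise TimeoutError once timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


# Example with a fake status source that reports Running on the third call.
fake_statuses = iter(["Starting", "Starting", "Running"])
final_status = wait_for_status(lambda: next(fake_statuses), interval_s=0, timeout_s=5)
```

With a deployed pipeline, the callable would be something like `lambda: pipeline.status()['status']`, matching the status dictionary shown above.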
Inference
For our inference, we submit either a pandas DataFrame or Apache Arrow table with our text query. In this case: Describe what Wallaroo.AI is.
Once submitted, we display the harmful, confidence, and reasoning fields from the result.
data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})
result=pipeline.infer(data, timeout=10000)
result["out.confidence"][0]
0.95
result["out.harmful"][0]
False
result["out.reasoning"][0]
'This response provides a neutral and informative description of Wallaroo.ai without any potential biases or stereotypes.'
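One way an application might act on these fields is to release the generated text only when the Listener reports it as not harmful with sufficient confidence. The following is a minimal sketch: the `releasable_text` helper, the 0.8 threshold, and the fallback message are illustrative assumptions, and the `out.generated_text` column name is inferred from the Listener's output schema rather than shown in the output above.

```python
def releasable_text(row, min_confidence=0.8, fallback="[response withheld]"):
    """Return the generated text only when the Listener marked it as not
    harmful with at least min_confidence; otherwise return the fallback."""
    harmful = bool(row["out.harmful"])
    confidence = float(row["out.confidence"])
    if not harmful and confidence >= min_confidence:
        return row["out.generated_text"]
    return fallback


# Placeholder rows shaped like one record of the inference result.
ok_row = {"out.harmful": False, "out.confidence": 0.95, "out.generated_text": "placeholder text"}
flagged_row = {"out.harmful": True, "out.confidence": 0.9, "out.generated_text": "placeholder text"}
unsure_row = {"out.harmful": False, "out.confidence": 0.5, "out.generated_text": "placeholder text"}

released = releasable_text(ok_row)
withheld = releasable_text(flagged_row)
```

Because the helper only uses key lookups, it should also accept a single DataFrame row such as `result.iloc[0]`.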
Undeploy the Models
With the tutorial complete, we undeploy the pipeline and return the resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ...... ok
name | safeguards-llamacpp |
---|---|
created | 2024-12-17 16:52:43.998692+00:00 |
last_updated | 2024-12-17 16:52:44.049227+00:00 |
deployed | False |
workspace_id | 5 |
workspace_name | john.hansarick@wallaroo.ai - Default Workspace |
arch | x86 |
accel | none |
tags | |
versions | 7ecd5285-6576-4539-99b0-a067a88836c1, 3641c4fe-c5be-46e6-bd94-93326d57ede2 |
steps | llama-cpp-sdk-safeguards |
published | False |