LLM Harmful Language Listener Tutorial


This tutorial can be downloaded as part of the Wallaroo Tutorials repository.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.

The following tutorial demonstrates the Llama 3 70b Instruct Q5 Large Language Model (LLM) paired with a Harmful Language Listener. The Listener provides validation monitoring that detects language that could be considered harmful: obscene, racist, insulting, or similar categories.

This tutorial demonstrates how to:

  • Upload the LLM and the Harmful Language Listener.
  • Create a Wallaroo pipeline and set the LLM and then the Listener as pipeline steps.
  • Deploy the models and perform sample inferences.

Model Overview

The LLM used in this demonstration is the Llama 3 70b Instruct Q5 model described above.

The Harmful Language Listener used in this demonstration has the following attributes:

  • Framework: vllm for optimized model deployment, uploaded to Wallaroo in the Wallaroo Custom Model, aka Bring Your Own Predict (BYOP), framework.
  • Artifacts: The Listener model is encapsulated as part of the BYOP framework.
  • Input/Output Types: The Listener takes the following inputs and outputs.
    • Listener Input:
      • text (String): The original input text to the LLM.
      • generated_text (String): The text created by the LLM. This will be evaluated by the Listener for any harmful language.
    • Listener Output:
      • harmful (Boolean): Whether the generated_text is considered harmful.
      • reasoning (String): The reasons why the generated_text is considered harmful or not.
      • confidence (Float): The model’s confidence that the generated_text is harmful or not.
      • generated_text (String): The text generated by the LLM. This is passed on as part of the Listener’s output.
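To make the Listener's contract concrete, the sketch below models it as a plain Python function. The field names match the schemas in this tutorial; the keyword scan and fixed confidence score are stand-ins for illustration only, not the actual model's logic.

```python
# Hypothetical sketch of the Listener's input/output contract.
# Field names match the tutorial's schemas; the scoring logic is a stand-in.

def listen(text: str, generated_text: str) -> dict:
    """Evaluate generated_text and return the Listener's output fields."""
    flagged_terms = {"obscene", "racist", "insulting"}  # illustrative only
    hits = [t for t in flagged_terms if t in generated_text.lower()]
    harmful = bool(hits)
    return {
        "harmful": harmful,
        "reasoning": (
            f"Flagged terms found: {hits}" if harmful
            else "No harmful language detected."
        ),
        "confidence": 0.95,                # placeholder score
        "generated_text": generated_text,  # passed through unchanged
    }
```

Note that generated_text appears in both the input and the output: the Listener evaluates it, then forwards it so the original requester still receives the LLM's response.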

Tutorial Steps

Import Libraries

We start by importing the required libraries. This includes the following:

  • Wallaroo SDK: Used to upload and deploy the model in Wallaroo.
  • pyarrow: Used to define the input and output schemas of models uploaded to Wallaroo in Apache Arrow format.
  • pandas: Data is submitted to deployed models as either an Apache Arrow Table or a pandas DataFrame in pandas Record format.
import json
import os

import wallaroo
from wallaroo.pipeline   import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is set through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload the LLM

To upload the LLM and Listener, we use the wallaroo.client.Client.upload_model method which takes the following parameters.

  • The name to assign to the LLM.
  • The file path to upload the LLM.
  • The Framework set to wallaroo.framework.Framework.CUSTOM for our Hugging Face model encapsulated in the BYOP framework.
  • The input and output schemas.

For more information, see the Wallaroo Model Upload guide.

First we’ll set the input and output schemas for our LLM in Apache PyArrow Schema format.

input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])

Then issue the upload command. For this example, we’ll add a model configuration to specify Dynamic Batching for LLMs which improves the performance of LLMs.

llm = wl.upload_model('llama-cpp-sdk-safeguards', 
    './models/byop_llamacpp_safeguards.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
).configure(input_schema=input_schema,
            output_schema=output_schema,
            dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000, 
                                                          batch_size_target=8)
            )
llm
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime......
Model is attempting loading to a container runtime............successful

Ready
Name: llama-cpp-sdk-safeguards
Version: 9c03eaa2-d0d4-4adb-86a1-26df7bf3eb33
File Name: byop_llamacpp_safeguards.zip
SHA: 45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5845
Architecture: x86
Acceleration: none
Updated At: 2024-17-Dec 16:43:42
Workspace id: 5
Workspace name: john.hansarick@wallaroo.ai - Default Workspace

Next we upload the Listener in the same process: define the input and output schemas, and then upload the model.

Note that for the Listener, the inputs are the LLM’s outputs. The Listener also includes the LLM’s generated_text field in its own output, so the generated text is passed back to the original requester.

#Safeguards Harmful Language Listener
#Define schemas
input_schema = pa.schema([
    pa.field("text", pa.string()),
    pa.field("generated_text", pa.string())
])

output_schema = pa.schema([
    pa.field("harmful", pa.bool_()),
    pa.field("reasoning", pa.string()),
    pa.field("confidence", pa.float32()),
    pa.field("generated_text", pa.string())
])
#upload harmful language listener
listener = wl.upload_model('byop-safeguards-harmful-5', 
    './models/byop-safeguards-harmful.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
listener
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..............................successful

Ready
Name: byop-safeguards-harmful-5
Version: 6a71d544-89de-411e-97a7-2a5dc5cd92f6
File Name: byop-safeguards-harmful.zip
SHA: c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.4.0-5845
Architecture: x86
Acceleration: none
Updated At: 2024-17-Dec 16:46:34
Workspace id: 5
Workspace name: john.hansarick@wallaroo.ai - Default Workspace

Deployment

For our deployment, we deploy both the LLM and Listener in the same pipeline as pipeline steps. Input provided to the pipeline is submitted first to the LLM. The output from the LLM is then the input to the Listener, and the Listener’s output is then provided back to the requester.
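Conceptually, the two pipeline steps compose like functions: the dict the LLM step returns is the dict the Listener step receives. The sketch below illustrates that flow with stand-in functions; the canned response and verdict are illustrative only, not real model output or the Wallaroo API.

```python
# Conceptual illustration of the pipeline's data flow; these stand-in
# functions are not the Wallaroo API or the real models.

def llm_step(payload: dict) -> dict:
    # Stand-in LLM: returns the original text plus a canned generated_text.
    return {
        "text": payload["text"],
        "generated_text": "A canned response standing in for LLM output.",
    }

def listener_step(payload: dict) -> dict:
    # Stand-in Listener: evaluates generated_text and passes it through.
    return {
        "harmful": False,                  # illustrative verdict
        "reasoning": "Stand-in reasoning string.",
        "confidence": 0.95,                # placeholder score
        "generated_text": payload["generated_text"],
    }

def pipeline_infer(payload: dict) -> dict:
    # The LLM's output is the Listener's input; the Listener's
    # output is returned to the requester.
    return listener_step(llm_step(payload))
```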

The deployment configuration sets the resources allocated for the LLM and the Listener with the following options:

  • LLM
    • CPUs: 6
    • Memory: 10 Gi
  • Harmful Language Listener
    • CPUs: 2
    • Memory: 10 Gi
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 6) \
    .sidekick_memory(llm, '10Gi') \
    .sidekick_cpus(listener, 2) \
    .sidekick_memory(listener, '10Gi') \
    .sidekick_env(listener, json.load(open("credentials.json", 'r'))) \
    .build()

The Wallaroo pipeline is created with the build_pipeline method. The LLM and Listener are set as the pipeline steps, then deployed with the previously defined deployment configuration.

pipeline = wl.build_pipeline("safeguards-llamacpp")
pipeline.add_model_step(llm)
pipeline.add_model_step(listener)
pipeline.deploy(deployment_config=deployment_config, wait_for_status=False)
Deployment initiated for safeguards-llamacpp. Please check pipeline status.
name: safeguards-llamacpp
created: 2024-12-17 16:52:43.998692+00:00
last_updated: 2024-12-17 16:52:44.049227+00:00
deployed: True
workspace_id: 5
workspace_name: john.hansarick@wallaroo.ai - Default Workspace
arch: x86
accel: none
tags:
versions: 7ecd5285-6576-4539-99b0-a067a88836c1, 3641c4fe-c5be-46e6-bd94-93326d57ede2
steps: llama-cpp-sdk-safeguards
published: False

Once deployed, we’ll check on the status. When the status is Running, we continue to the inference steps.

# check the pipeline status before performing an inference

import time
time.sleep(15)

while pipeline.status()['status'] != 'Running':
    time.sleep(15)

pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.28.3.7',
   'name': 'engine-576b7f5b4-m9h9j',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'safeguards-llamacpp',
      'status': 'Running',
      'version': '7ecd5285-6576-4539-99b0-a067a88836c1'}]},
   'model_statuses': {'models': [{'model_version_id': 1,
      'name': 'llama-cpp-sdk-safeguards',
      'sha': '45752b3566691a641787abd9b1b9d94809f8a74d545283d599e8a2cdc492d110',
      'status': 'Running',
      'version': '9c03eaa2-d0d4-4adb-86a1-26df7bf3eb33'},
     {'model_version_id': 2,
      'name': 'byop-safeguards-harmful-5',
      'sha': 'c41ff30b7032262e6ceffed2da658a44d16e698c1e826c3526b6a2379c8d2b1b',
      'status': 'Running',
      'version': '6a71d544-89de-411e-97a7-2a5dc5cd92f6'}]}}],
 'engine_lbs': [{'ip': '10.28.2.7',
   'name': 'engine-lb-6676794678-r4w99',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.28.3.8',
   'name': 'engine-sidekick-byop-safeguards-harmful-5-2-654fb4d87f-c5jp8',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'},
  {'ip': '10.28.2.8',
   'name': 'engine-sidekick-llama-cpp-sdk-safeguards-1-6d8bfddc4-nfhhb',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference

For our inference, we submit either a pandas DataFrame or Apache Arrow table with our text query. In this case: Describe what Wallaroo.AI is.

Once submitted, we display the harmful, confidence, and reason.

data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})
result=pipeline.infer(data, timeout=10000)
result["out.confidence"][0]
0.95
result["out.harmful"][0]
False
result["out.reasoning"][0]
'This response provides a neutral and informative description of Wallaroo.ai without any potential biases or stereotypes.'

Undeploy the Models

With the tutorial complete, we undeploy the models and return the resources to the cluster.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ...... ok
name: safeguards-llamacpp
created: 2024-12-17 16:52:43.998692+00:00
last_updated: 2024-12-17 16:52:44.049227+00:00
deployed: False
workspace_id: 5
workspace_name: john.hansarick@wallaroo.ai - Default Workspace
arch: x86
accel: none
tags:
versions: 7ecd5285-6576-4539-99b0-a067a88836c1, 3641c4fe-c5be-46e6-bd94-93326d57ede2
steps: llama-cpp-sdk-safeguards
published: False
