RAG LLM Deployment Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
RAG LLMs: Inference in Wallaroo
The following demonstrates using a BAAI (Beijing Academy of Artificial Intelligence) general embedding (BGE) model with a RAG LLM to perform inference requests through Wallaroo against a vector database. The vector database is pre-embedded with the same BAAI BGE model. See the accompanying notebook “RAG LLMs: Automated Vector Database Enrichment in Wallaroo”.
This process uses Wallaroo features to:
- Receive an inference request from a requester.
- Convert the inference request into an embedding.
- Query the vector database for context data based on the embedding.
- Generate the response from the RAG LLM with the appropriate context, and return the final result to the requester.
For this example, MongoDB Atlas is used as the vector database.
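As a mental model, the end-to-end flow resembles the following sketch. Every helper here is a hypothetical stand-in for the BGE model, vector database, and RAG LLM steps deployed later in this tutorial.
def embed_with_bge(text: str) -> list:
    # Stand-in for the BGE model: converts the request text into a 768-dim embedding.
    return [0.0] * 768

def query_vector_db(embedding: list) -> str:
    # Stand-in for the MongoDB Atlas vector search over the pre-embedded documents.
    return "retrieved context"

def generate_with_llm(text: str, context: str) -> str:
    # Stand-in for the RAG LLM: generates a response grounded in the retrieved context.
    return f"A response to {text!r} based on {context!r}"

def rag_inference(text: str) -> str:
    embedding = embed_with_bge(text)
    context = query_vector_db(embedding)
    return generate_with_llm(text, context)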
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today.
Tutorial Steps
Imports
We start by importing the libraries used for the tutorial.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
This step establishes a connection to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo installation and is available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client()
command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload BGE Model
Before uploading the BGE model, we define the input and output schemas in Apache Arrow schema format using the pyarrow library.
input_schema = pa.schema([
pa.field('text', pa.string())
])
output_schema = pa.schema([
pa.field('embedding',
pa.list_(
pa.float64(), list_size=768
),
),
pa.field('text', pa.string())
])
The BGE model is a Hugging Face model packaged in the Wallaroo BYOP (Bring Your Own Predict) framework as the file byop_bge_base2.zip. We upload it to Wallaroo via the wallaroo.client.Client.upload_model method, providing the following parameters:
- The name to assign to the BGE model.
- The file path of the model to upload.
- The Framework, set to wallaroo.framework.Framework.CUSTOM for our Hugging Face model encapsulated in the BYOP framework.
- The input and output schemas.
For more information, see the Wallaroo Model Upload guide.
bge = wl.upload_model('byop-bge-base-v2',
'byop_bge_base2.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema,
)
bge
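For reference, the embedding logic a BYOP package like this wraps is conceptually similar to the following sketch. The sentence-transformers library and the BAAI/bge-base-en checkpoint are assumptions for illustration; the actual implementation ships inside byop_bge_base2.zip.
from sentence_transformers import SentenceTransformer

# Assumption for illustration: a base BGE checkpoint, which emits 768-dim vectors.
embedder = SentenceTransformer("BAAI/bge-base-en")

# Normalized embeddings are standard practice for similarity search with BGE.
embedding = embedder.encode("Suggest me an action movie", normalize_embeddings=True)
print(embedding.shape)  # (768,)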
Upload Modified RAG Llama LLM
For the RAG LLM, we use a modified Llama.cpp LLM. It uses the embedding generated by the BGE model to query the vector database index, then uses the result as the context to generate the text.
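Conceptually, the retrieval-augmented generation inside byop_llamacpp_rag.zip resembles the sketch below. The connection string, collection, index, and field names are hypothetical, and MongoDB Atlas's $vectorSearch aggregation stage is assumed as the retrieval mechanism.
import pymongo
from llama_cpp import Llama

# Hypothetical connection details; the real values live inside the BYOP package.
client = pymongo.MongoClient("mongodb+srv://user:password@cluster.mongodb.net/")
collection = client["sample_db"]["embedded_documents"]
llm = Llama(model_path="llama_model.gguf")

def generate(text: str, embedding: list) -> str:
    # Retrieve the closest pre-embedded documents from the Atlas vector index.
    matches = collection.aggregate([
        {
            "$vectorSearch": {
                "index": "vector_index",   # hypothetical index name
                "path": "embedding",       # field holding the stored vectors
                "queryVector": embedding,
                "numCandidates": 40,
                "limit": 4,
            }
        }
    ])
    context = " ".join(doc["text"] for doc in matches)
    # Generate the response grounded in the retrieved context.
    prompt = f"Answer using this context: {context}\n\nQuestion: {text}\nAnswer:"
    return llm(prompt, max_tokens=256)["choices"][0]["text"]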
As before, we set the input and output schemas, then upload the model.
input_schema = pa.schema([
pa.field('text', pa.string()),
pa.field('embedding', pa.list_(pa.float64(), list_size=768))
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
])
llama = wl.upload_model('byop-llamacpp-rag-v1',
'byop_llamacpp_rag.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema,
)
llama
Deploy BGE and RAG LLM
The models are deployed by:
- Setting the deployment configuration that allocates cluster resources for the exclusive use of the BGE model and the LLM. The following settings are used:
- BGE: 4 CPUs, 3 Gi RAM
- LLM: 4 CPUs, 6 Gi RAM
- Adding the BGE model and LLM to a Wallaroo pipeline as model steps.
- Deploying the models. Once the deployment is complete, they are ready to accept inference requests.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(bge, 4) \
.sidekick_memory(bge, '3Gi') \
.sidekick_cpus(llama, 4) \
.sidekick_memory(llama, '6Gi') \
.build()
pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
pipeline.add_model_step(bge)
pipeline.add_model_step(llama)
pipeline.deploy(deployment_config=deployment_config)
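Deployment takes a few moments while the cluster provisions resources. Before submitting inference requests, check the pipeline status; once it reports Running, the models are ready.
# Confirm the deployment is ready; status() reports 'Running' once the
# BGE model and RAG LLM have finished deploying.
pipeline.status()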
Inference
Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. The following example submits a pandas DataFrame with a query asking for an action movie suggestion. The response is returned as a pandas DataFrame, from which we extract the generated text; an Apache Arrow version follows the example output.
data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})
result = pipeline.infer(data, timeout=10000)
result['out.generated_text'].values[0]
'1. "The Battle of Algiers" (1966) - This film follows the story of the National Liberation Front (FLN) fighters during the Algerian Revolution, and their struggle against French colonial rule. 2. "The Goodfather" (1977) - A mobster's rise to power is threatened by his weaknesses, including his loyalty to his family and his own moral code. 3. "Dog Day Afternoon" (1975) - A desperate bank clerk turns to a life of crime when he can't pay his bills, but things spiral out'
Undeploy
pipeline.undeploy()
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today.