Retrieval-Augmented Generation LLMs
Retrieval-Augmented Generation (RAG) LLMs give organizations the ability to deploy LLMs with guide rails that prevent LLM hallucination. The following guides demonstrate:
- RAG LLM Prerequisites
- RAG LLM Deployment and example inferences
The following LLM models have been tested with Wallaroo, with the majority packaged as BYOP models.
- llama v2 7B standard on 1 GPU
- llama v2 7B chat on 1 GPU
- llama v2 7B instruct on 1 GPU
- llama v2 70B quantized on 1 GPU
- llama v3 8B standard on 1 GPU
- llama v3 8B instruct on 1 GPU
- llama v2 7B Chat quantized with llamacpp on ARM and X86
- llama v3 8B instruct quantized with llamacpp on ARM and X86
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Prerequisites
Before inferencing with the RAG LLM, the following prerequisites must be met.
- Vector indexed database: RAG LLMs require a vector indexed database, such as MongoDB Atlas Vector Search.
- Embedder model: This processes the incoming text into a vector used by the RAG LLM to query the database for the context. From the context, the RAG LLM is able to build the response without hallucinations.
Retrieval-Augmented Generation (RAG) LLMs deployed to Wallaroo utilize a two-step pipeline, with two models:
- A Wallaroo BYOP framework model that implements the feature extractor to create the embedding from the input text. For this example, the bge-base-en model is used.
- A RAG LLM packaged in a Wallaroo BYOP framework model.
Overview
The RAG LLM process takes the following steps:
- Inference requests first pass through the feature extractor model that outputs the embedding. This is a list of floats that the RAG LLM uses to query the database for its context.
- Both the embedding and the original input are passed to the RAG LLM.
- The RAG LLM queries the vector database for the context from which to build its response. This context prevents hallucinations by providing guide rails that the RAG LLM uses to construct its response.
- Once finished, the response is submitted as the inference output.
Details on the Feature Extractor model and an example of querying the database by its vector index are available below.
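The steps above can be sketched in plain Python. This is a minimal illustration of the data flow only, not Wallaroo code: run_two_step and the three callables passed to it are hypothetical stand-ins for the feature extractor, the vector database query, and the LLM. The prompt layout follows the Q:/C:/A: format used by the sample RAG LLM in this guide.

```python
from typing import Callable, Dict, List


def run_two_step(
    text: str,
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float]], List[str]],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Chain the steps: embed the text, fetch context, then generate."""
    # Step 1: the feature extractor returns the embedding and keeps the text.
    step_one = {"text": text, "embedding": embed(text)}
    # Step 2: the RAG LLM uses the embedding to look up its context ...
    context = " ".join(retrieve(step_one["embedding"]))
    # ... and builds the answer from the original text plus that context.
    prompt = f"Q: {step_one['text']} C: {context} A: "
    return {"generated_text": generate(prompt)}


# Toy stand-ins so the flow runs without a model or a database.
demo = run_two_step(
    "Suggest an action movie",
    embed=lambda t: [float(len(t)), 0.0],
    retrieve=lambda vec: ["A cop fights crime.", "A heist goes wrong."],
    generate=lambda p: p.upper(),
)
print(demo["generated_text"])
```

In the deployed pipeline, each stand-in is replaced by the corresponding BYOP model step described in the following sections.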
Feature Extractor Details
The feature extractor performs two functions:
- Passes the input text to the RAG LLM.
- Converts the input text into the embedding that the RAG LLM uses to query the database for the proper context.
The following Bring Your Own Predict (BYOP) code snippet demonstrates the predict function that receives the input data, tokenizes it, and then extracts the embeddings from the model. The embeddings are then normalized and returned alongside the original input text.
In our two-step pipeline, this output is then passed to the RAG LLM.
def _predict(self, input_data: InferenceData):
    # Collect the incoming text for the tokenizer and for pass-through.
    inputs = input_data["text"].tolist()
    texts = np.array([str(x) for x in input_data["text"]])
    # Tokenize the batch, padding and truncating to a common length.
    encoded_inputs = self.model["tokenizer"](
        inputs, padding=True, truncation=True, return_tensors="pt"
    )
    # Run the model without tracking gradients.
    with torch.no_grad():
        model_output = self.model["model"](**encoded_inputs)
        # Use the first token's hidden state as the sentence embedding.
        sentence_embeddings = model_output[0][:, 0]
    # L2-normalize each embedding.
    sentence_embeddings = torch.nn.functional.normalize(
        sentence_embeddings, p=2, dim=1
    )
    embeddings = np.array(
        [sentence_embeddings[i].cpu().numpy() for i in range(len(inputs))]
    )
    return {"embedding": embeddings, "text": texts}
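The pooling and normalization in the snippet above can be illustrated without torch. The following is a pure-Python sketch under toy assumptions (2-dimensional hidden states, two tokens per sequence); cls_pool and l2_normalize are hypothetical helper names that mirror the `[:, 0]` slice and the `normalize(p=2, dim=1)` call.

```python
import math
from typing import List


def cls_pool(token_vectors: List[List[List[float]]]) -> List[List[float]]:
    """Keep the first token's vector of each sequence (the [:, 0] slice)."""
    return [sequence[0] for sequence in token_vectors]


def l2_normalize(vectors: List[List[float]]) -> List[List[float]]:
    """Scale each vector to unit length, like normalize(p=2, dim=1)."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        # Guard against division by zero with a small epsilon.
        norm = max(norm, 1e-12)
        normalized.append([x / norm for x in vec])
    return normalized


# Two sequences, two tokens each, 2-dimensional hidden states (toy sizes).
hidden = [
    [[3.0, 4.0], [9.0, 9.0]],
    [[0.0, 2.0], [1.0, 1.0]],
]
embeddings = l2_normalize(cls_pool(hidden))
print(embeddings)
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which is convenient for similarity search over the stored vectors.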
LLM Details
The following sample RAG LLM packaged as a BYOP framework model performs the following:
- Receives the input query text and the embedding generated by the Feature Extractor Model.
- Queries the MongoDB Atlas database vector index with the embedding to retrieve the context.
- This example retrieves the 10 documents most similar to the input, based on the provided embedding.
- Using the returned data as context, generates the response to the input query.
The BYOP predict function shown below processes each request: it retrieves the context from the database, then passes the prompt and the context to the LLM.
def _predict(self, input_data: InferenceData):
    # `client` is assumed to be a MongoDB client created elsewhere in the module.
    db = client.sample_mflix
    collection = db.movies
    generated_texts = []
    prompts = input_data["text"].tolist()
    embeddings = input_data["embedding"].tolist()
    for prompt, embedding in zip(prompts, embeddings):
        # Retrieve the 10 plots closest to the query embedding.
        query_results = collection.aggregate(
            [
                {
                    "$vectorSearch": {
                        "queryVector": embedding,
                        "path": "plot_embedding_hf",
                        "numCandidates": 50,
                        "limit": 10,
                        "index": "PlotSemanticSearch",
                    }
                }
            ]
        )
        # Join the retrieved plots into a single context string.
        context = " ".join([result["plot"] for result in query_results])
        # Generate the answer from the prompt and the retrieved context.
        result = self.model(
            f"Q: {prompt} C: {context} A: ",
            max_tokens=512,
            stop=["Q:", "\n"],
            echo=False,
        )
        generated_texts.append(result["choices"][0]["text"])
    return {"generated_text": np.array(generated_texts)}
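The query stage and the prompt from the function above can be factored into small helpers that are easy to test without a database or a model. This is a sketch only: build_vector_search_stage and build_prompt are hypothetical names, and the defaults are copied from the example above. The check that limit does not exceed numCandidates reflects our understanding of the Atlas Vector Search constraints and should be confirmed against the MongoDB documentation.

```python
from typing import Any, Dict, List


def build_vector_search_stage(
    embedding: List[float],
    limit: int = 10,
    num_candidates: int = 50,
    path: str = "plot_embedding_hf",
    index: str = "PlotSemanticSearch",
) -> Dict[str, Any]:
    """Build the $vectorSearch stage passed to collection.aggregate."""
    if limit > num_candidates:
        raise ValueError("limit should not exceed numCandidates")
    return {
        "$vectorSearch": {
            "queryVector": embedding,
            "path": path,
            "numCandidates": num_candidates,
            "limit": limit,
            "index": index,
        }
    }


def build_prompt(prompt: str, plots: List[str]) -> str:
    """Join the retrieved plots into the Q:/C:/A: prompt layout."""
    return f"Q: {prompt} C: {' '.join(plots)} A: "


stage = build_vector_search_stage([0.1, 0.2, 0.3])
example_prompt = build_prompt("Suggest an action movie", ["Plot one.", "Plot two."])
print(stage["$vectorSearch"]["limit"])
print(example_prompt)
```

Keeping the stage in one helper makes it simpler to tune numCandidates and limit together when trading retrieval quality against latency.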
Upload the Models
Both the feature extractor model and the RAG LLM are uploaded via either the Wallaroo SDK or the Wallaroo MLOps API.
Both models are packaged in the Wallaroo BYOP framework.
Upload via the Wallaroo SDK
ML models and LLMs are uploaded via the Wallaroo SDK through the wallaroo.client.Client.upload_model method with the following parameters. For more details and parameters on model uploads, see Model Upload.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework . |
input_schema | pyarrow.lib.Schema | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema | The output schema in Apache Arrow schema format. |
Feature Extractor Upload
The Feature Extractor uses the following input and output schemas. Note that the output contains both the original text input and the vectorized embedding, a list of floats with a fixed length of 768.
import pyarrow as pa
input_schema = pa.schema([
    pa.field('text', pa.string())
])
output_schema = pa.schema([
    pa.field('embedding',
        pa.list_(
            pa.float64(), list_size=768
        ),
    ),
    pa.field('text', pa.string())
])
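Because the embedding field has a fixed size, a quick shape check on sample rows can catch mismatches before a model is uploaded. The following is a small standard-library sketch; check_feature_extractor_row is a hypothetical helper and is not part of the Wallaroo SDK.

```python
from typing import Any, Dict, List

EMBEDDING_SIZE = 768  # list_size from the schemas above


def check_feature_extractor_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row fits."""
    problems = []
    if not isinstance(row.get("text"), str):
        problems.append("text must be a string")
    embedding = row.get("embedding")
    if not isinstance(embedding, list):
        problems.append("embedding must be a list of floats")
    elif len(embedding) != EMBEDDING_SIZE:
        problems.append(
            f"embedding has {len(embedding)} values, expected {EMBEDDING_SIZE}"
        )
    elif not all(isinstance(x, float) for x in embedding):
        problems.append("embedding must contain only floats")
    return problems


good = {"text": "hello", "embedding": [0.0] * EMBEDDING_SIZE}
bad = {"text": "hello", "embedding": [0.0] * 10}
print(check_feature_extractor_row(good))
print(check_feature_extractor_row(bad))
```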
The model is uploaded from the file ./models/byop_bge_base_pipe.zip with the name byop-bge-base-pipe-v1, specifying the framework, input and output schemas as required. Once uploaded, the model version reference is saved to the variable bge.
import wallaroo
from wallaroo.framework import Framework
# establish the Wallaroo client
wl = wallaroo.Client()
bge = wl.upload_model('byop-bge-base-pipe-v1',
    './models/byop_bge_base_pipe.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
RAG LLM Upload
The RAG LLM is uploaded as a packaged Wallaroo BYOP framework with the following input and output schemas. Note that the RAG LLM’s input schema accepts the same outputs as the feature extractor model. The embedding input is used to retrieve the context from the vectorized database.
This example demonstrates with a quantized version of Llama V2 Chat that leverages the llamacpp library.
import pyarrow as pa
input_schema = pa.schema([
    pa.field('text', pa.string()),
    pa.field('embedding', pa.list_(pa.float32(), list_size=768))
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
])
With the input and output schemas set, we upload the RAG LLM packaged in the Wallaroo BYOP framework and save the model version reference to the variable rag_llm.
# Python identifiers cannot contain hyphens, so the variable is named rag_llm.
rag_llm = wl.upload_model('byop-llamacpp-rag-v1',
    './models/byop_llamacpp_rag.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
Upload via the Wallaroo MLOps API
Models and LLMs are uploaded via the Wallaroo MLOps API endpoint POST /v1/api/models/upload_and_convert.
This endpoint has the following settings. For full examples of uploading models via the Wallaroo MLOps API, see Wallaroo MLOps API Essentials Guide: Model Upload and Registrations.
- Endpoint: POST /v1/api/models/upload_and_convert
- Headers
  - Authorization: Bearer {token}: The authentication token for a Wallaroo user with access to the workspace the model is uploaded to. See How to generate the MLOps API Authentication Token.
  - Content-Type: multipart/form-data
- Files are uploaded in the multipart/form-data format with two parts:
  - metadata: Provides the model parameter data as Content-Type application/json. See Upload Model to Workspace Parameters.
  - file: The binary file (ONNX, .zip, etc) as Content-Type application/octet-stream.
How to generate the MLOps API Authentication Token
The API based upload process requires an authentication token. The following is required to retrieve the token.
- The Wallaroo instance authentication service address: The Wallaroo Authentication Service URL for the Wallaroo instance. By default, this is the Wallaroo Domain with the suffix /auth. For example, if the Wallaroo Domain Name is wallaroo.example.com, the authentication service URL is wallaroo.example.com/auth. For more details on Wallaroo and DNS services, see the Wallaroo DNS Configuration Guide (https://docs.wallaroo.ai/202403/wallaroo-platform-operations/wallaroo-platform-operations-configure/wallaroo-dns-guide/).
- The confidential client: sdk-client.
- The Wallaroo username making the MLOps API request: Typically this is the user’s email address.
- The Wallaroo user’s password.
These values are provided through the following environment variables:
- $WALLAROO_USERNAME: The user name of the entity authenticating to the Wallaroo model ops center.
- $WALLAROO_PASSWORD: The password for the entity authenticating to the Wallaroo model ops center.
- $WALLAROO_AUTH_URL: The authentication URL.
The following example shows retrieving the authentication token using curl. Update the variables based on your instance.
export WALLAROO_USERNAME="username"
export WALLAROO_PASSWORD="password"
export WALLAROO_AUTH_URL="wallaroo.example.com/auth"
# jq -r prints the raw token without surrounding quotes,
# so it can be used directly in the Authorization header.
TOKEN=$(curl -s -X POST "https://$WALLAROO_AUTH_URL/realms/master/protocol/openid-connect/token" \
    -d client_id=sdk-client \
    -d username="${WALLAROO_USERNAME}" \
    -d password="${WALLAROO_PASSWORD}" \
    -d grant_type=password | jq -r .access_token)
echo "$TOKEN"
abc123
MLOps API Upload LLM Model Parameters
The following parameters are part of the metadata part of the request.
Field | Type | Description |
---|---|---|
name | String (Required) | The model name. |
visibility | String (Required) | Either public or private . |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. |
conversion | Dict (Required) | The conversion parameters. The following values of conversion.* are parameters of this field. |
conversion.framework | String (Required) | The framework of the model being uploaded. See the list of supported models for more details. |
conversion.python_version | String (Required) | The version of Python required for the model. |
conversion.requirements | String (Required) | Required libraries. Can be [] if the requirements are default Wallaroo JupyterHub libraries. |
conversion.input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for Containerized Wallaroo Runtime models. See How to Convert Input and Output Schema to Base64 Format. |
conversion.output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for non-native runtime models. See How to Convert Input and Output Schema to Base64 Format. |
How to Convert Input and Output Schema to Base64 Format
Models packaged as Wallaroo Containerized Runtimes require the input and output schemas in Apache Arrow PyArrow Schema format, encoded in base64. The following demonstrates converting an Apache Arrow PyArrow Schema to base64.
import pyarrow as pa
import base64
input_schema = pa.schema([
    pa.field('input_1', pa.list_(pa.float32(), list_size=10)),
    pa.field('input_2', pa.list_(pa.float32(), list_size=5))
])
output_schema = pa.schema([
    pa.field('output_1', pa.list_(pa.float32(), list_size=3)),
    pa.field('output_2', pa.list_(pa.float32(), list_size=2))
])
# Serialize each schema, then base64-encode the bytes as a UTF-8 string.
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
MLOps API Upload LLM Model Example
The following example demonstrates uploading a model via the Wallaroo MLOps API using curl. It uses the following environment variables:
- $TOKEN: The bearer authentication token.
- $NAME: The name of the model.
- $WORKSPACE_ID: The workspace to upload the model to.
- $FRAMEWORK: The model framework.
- $INPUT_SCHEMA: The input schema in PyArrow Schema converted to Base64 format.
- $OUTPUT_SCHEMA: The output schema in PyArrow Schema converted to Base64 format.
- $MODEL_PATH: The path to the model file being uploaded.
- $URL: The base URL that the endpoint path is appended to.
# The metadata part is double-quoted so that the shell expands the variables;
# the JSON quotes inside it are escaped.
curl --progress-bar -X POST \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer $TOKEN" \
    -F "metadata={
        \"name\": \"$NAME\",
        \"visibility\": \"private\",
        \"workspace_id\": $WORKSPACE_ID,
        \"conversion\": {
            \"framework\": \"$FRAMEWORK\",
            \"python_version\": \"3.8\",
            \"requirements\": []
        },
        \"input_schema\": \"$INPUT_SCHEMA\",
        \"output_schema\": \"$OUTPUT_SCHEMA\",
        \"arch\": \"x86\"
    };type=application/json" \
    -F "file=@$MODEL_PATH;type=application/octet-stream" \
    "$URL/v1/api/models/upload_and_convert" | cat
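Hand-quoting JSON in a shell command is error-prone, so the metadata part can instead be generated with a JSON serializer. The following is a minimal standard-library sketch that mirrors the field layout of the curl example; build_upload_metadata is a hypothetical helper, and the framework and schema values shown are placeholders rather than confirmed API values.

```python
import json
from typing import List, Optional


def build_upload_metadata(
    name: str,
    workspace_id: int,
    framework: str,
    input_schema: str,
    output_schema: str,
    python_version: str = "3.8",
    requirements: Optional[List[str]] = None,
    visibility: str = "private",
    arch: str = "x86",
) -> str:
    """Serialize the metadata part using the layout of the curl example."""
    metadata = {
        "name": name,
        "visibility": visibility,
        "workspace_id": workspace_id,
        "conversion": {
            "framework": framework,
            "python_version": python_version,
            "requirements": requirements or [],
        },
        "input_schema": input_schema,
        "output_schema": output_schema,
        "arch": arch,
    }
    return json.dumps(metadata)


# Placeholder values: replace them with the base64-encoded schemas
# and the framework name used by your own model.
metadata_json = build_upload_metadata(
    name="byop-llamacpp-rag-v1",
    workspace_id=1,
    framework="custom",
    input_schema="<base64 input schema>",
    output_schema="<base64 output schema>",
)
print(metadata_json)
```

The resulting string can be written to a file and supplied as the metadata part of the multipart request, which avoids escaping quotes by hand.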
RAG LLM Inference Example
The following example assumes that the two models are already uploaded and saved to the following variables:
- bge: The Feature Extractor that generates the embedding for the RAG LLM.
- rag_llm: The RAG LLM that uses the embedding to query the vector database index, and uses that result as the context to generate the text.
With the models uploaded, they are deployed in a Wallaroo pipeline through the following process:
- Define the deployment configuration: This sets what resources are applied to each model on deployment. For more details, see Deployment Configuration.
- Add the feature extractor model and RAG LLM as model steps: This sets the structure where the feature extractor model converts the request to a vector, which is used as the input by the RAG LLM to generate the final response.
- Deploy the models: This step allocates resources to the feature extractor and LLM. At this point, the models are ready for inference requests.
Set the Deployment Configuration
The deployment configuration assigns the following resources to the models:
- Feature extractor
  - 4 CPUs
  - 3 Gi RAM
- RAG LLM
  - 4 CPUs
  - 6 Gi RAM
This is represented in the following code.
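from wallaroo.deployment_config import DeploymentConfigBuilder
# rag_llm holds the RAG LLM model version (hyphens are not valid in Python names).
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(rag_llm, 4) \
    .sidekick_memory(rag_llm, '6Gi') \
    .build()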
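When sizing a cluster for this deployment, it helps to total the resources that the configuration requests. The following arithmetic sketch uses the values from the code above and assumes that the leading .cpus(1).memory('2Gi') call applies to the pipeline engine rather than to either model.

```python
# Resource requests taken from the deployment configuration above.
# Assumption: .cpus(1).memory('2Gi') applies to the pipeline engine.
requests = {
    "engine": {"cpus": 1, "memory_gi": 2},
    "feature extractor": {"cpus": 4, "memory_gi": 3},
    "RAG LLM": {"cpus": 4, "memory_gi": 6},
}

total_cpus = sum(r["cpus"] for r in requests.values())
total_memory_gi = sum(r["memory_gi"] for r in requests.values())
print(f"{total_cpus} CPUs, {total_memory_gi} Gi RAM")
```

Under that assumption, the deployment requests 9 CPUs and 11 Gi of RAM in total.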
Deployment configurations are adjusted as required based on attributes including model size, throughput, latency, and performance requirements. Deployment configuration changes do not impact Wallaroo Inference endpoints (including name, url, etc), providing no interruption for production deployments.
Set the Model Steps
In this step, we add the feature extractor model and the RAG LLM as pipeline steps.
We create the pipeline with the wallaroo.client.Client.build_pipeline method, then add each model as a pipeline step with the wallaroo.pipeline.Pipeline.add_model_step method, with the feature extractor as the first step. This sets the stage for the feature extractor model to provide its outputs as the inputs for the RAG LLM.
pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
pipeline.add_model_step(bge)
pipeline.add_model_step(rag_llm)
Deploy the RAG LLM
Everything is now set, and we deploy the models through the wallaroo.pipeline.Pipeline.deploy(deployment_config) method, providing the deployment configuration we set earlier. This assigns the resources from the cluster to the models’ exclusive use.
Once the deployment is complete, the RAG LLM is ready for inference requests.
pipeline.deploy(deployment_config=deployment_config)
Inference Example
Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. The following example shows submitting a pandas DataFrame with the query to suggest an action movie. The response is returned as a pandas DataFrame, and we extract the generated text from there.
import pandas as pd
data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})
result = pipeline.infer(data)
print(result['out.generated_text'].values[0])
Tutorials
For tutorials on using RAG LLMs and using Wallaroo Inference Automation with the embedding model to populate the vector database, see RAG LLM Tutorials.
FAQs
Why use MongoDB Atlas for RAG-LLMs?
MongoDB’s recently released Atlas Vector Search integrates vector search capabilities into MongoDB Atlas. It supports creating a vector index on a collection and performing vector searches on that collection. This is particularly useful for RAG LLMs, as it brings vector search into the llama.cpp pipeline. Wallaroo supports other vector database connections. Please contact us to learn more!