Retrieval-Augmented Generation LLMs


Retrieval-Augmented Generation (RAG) LLMs provide organizations the ability to deploy LLMs with guardrails that prevent LLM hallucinations. The following guides demonstrate:

  • RAG LLM Prerequisites
  • RAG LLM Deployment and example inferences

The LLMs referenced in these guides have been tested with Wallaroo, with the majority packaged as Bring Your Own Predict (BYOP) models.

For access to these sample models and a demonstration on using LLMs with Wallaroo, contact us.

Prerequisites

Before inferencing with the RAG LLM, the following prerequisites must be met.

  • Vector indexed database: RAG LLMs require a vector indexed database, such as MongoDB Atlas Vector Search.
  • Embedder model: This processes the incoming text into a vector embedding that the RAG LLM uses to query the database for context. From that context, the RAG LLM builds its response without hallucinations. A standalone sketch of such an embedder follows this list.
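
The following is a minimal, standalone sketch of such an embedder, assuming the Hugging Face BAAI/bge-base-en-v1.5 model; any sentence embedder that produces vectors matching the dimensions of the vector index works the same way. The BYOP feature extractor shown later wraps the same pattern.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed embedder: BAAI/bge-base-en-v1.5, which produces 768-dimension embeddings.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")

encoded = tokenizer(
    ["Suggest me an action movie"], padding=True, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    # CLS-token embedding, then L2 normalization (mirrors the BYOP code below).
    embedding = model(**encoded)[0][:, 0]
embedding = torch.nn.functional.normalize(embedding, p=2, dim=1)

print(embedding.shape)  # torch.Size([1, 768])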

Retrieval-Augmented Generation (RAG) LLMs deployed to Wallaroo use a two-step pipeline with two models: a feature extractor that generates the embedding, and the RAG LLM that generates the response from the retrieved context.

Overview

The RAG LLM process takes the following steps:

  • Inference requests first pass through the feature extractor model, which outputs the embedding: a list of floats that the RAG LLM uses to query the database for its context.
  • Both the embedding and the original input are passed to the RAG LLM.
  • The RAG LLM queries the vector database for the context from which to build its response. This context prevents hallucinations by providing guardrails that the RAG LLM uses to construct its response.
  • Once finished, the response is submitted as the inference output.

Details on the Feature Extractor model and an example of querying the database by its vector index are available below.

Feature Extractor Details

The feature extractor performs two functions:

  • Passes the input text to the RAG LLM.
  • Converts the input text into the embedding that the RAG LLM uses to query the database for the proper context.

The following Bring Your Own Predict (BYOP) code snippet demonstrates the predict function that receives the input data, tokenizes it, and then extracts the embeddings from the model. The embeddings are then normalized and returned alongside the original input text.

In our two-step pipeline, this output is then passed to the RAG LLM.

def _predict(self, input_data: InferenceData):
    # Excerpt from the BYOP model class; numpy (np), torch, and InferenceData
    # are imported at the module level.
    # Collect the incoming text as a Python list for the tokenizer and as a
    # numpy array to return unchanged alongside the embedding.
    inputs = input_data["text"].tolist()
    texts = np.array([str(x) for x in input_data["text"]])

    # Tokenize the batch of input text.
    encoded_inputs = self.model["tokenizer"](
        inputs, padding=True, truncation=True, return_tensors="pt"
    )

    # Run the embedding model and take the CLS-token embedding for each input.
    with torch.no_grad():
        model_output = self.model["model"](**encoded_inputs)
        sentence_embeddings = model_output[0][:, 0]

    # L2-normalize the embeddings.
    sentence_embeddings = torch.nn.functional.normalize(
        sentence_embeddings, p=2, dim=1
    )

    embeddings = np.array(
        [sentence_embeddings[i].cpu().numpy() for i in range(len(inputs))]
    )

    # Return the embedding plus the original text for the RAG LLM step.
    return {"embedding": embeddings, "text": texts}

LLM Details

The following sample RAG LLM, packaged as a BYOP framework model, performs the following:

  • Receives the input query text and the embedding generated by the Feature Extractor Model.
  • Queries the MongoDB Atlas database vector index using the embedding.
    • This example retrieves the 10 documents most similar to the input based on the provided embedding.
  • Generates the response to the input query using the returned documents as context.

The BYOP predict function shown below processes the request using the context retrieved from the vector index. A standalone version of the vector index query follows the code.

def _predict(self, input_data: InferenceData):
    # Excerpt from the BYOP model class; `client` is the MongoDB client
    # initialized at the module level, and numpy (np) is imported there as well.
    db = client.sample_mflix
    collection = db.movies

    generated_texts = []
    prompts = input_data["text"].tolist()
    embeddings = input_data["embedding"].tolist()

    for prompt, embedding in zip(prompts, embeddings):
        # Query the Atlas vector index for the documents most similar to the
        # embedding produced by the feature extractor.
        query_results = collection.aggregate(
            [
                {
                    "$vectorSearch": {
                        "queryVector": embedding,
                        "path": "plot_embedding_hf",
                        "numCandidates": 50,
                        "limit": 10,
                        "index": "PlotSemanticSearch",
                    }
                }
            ]
        )

        # Join the returned plots into a single context string.
        context = " ".join([result["plot"] for result in query_results])

        # Generate the response from the prompt and the retrieved context.
        result = self.model(
            f"Q: {prompt} C: {context} A: ",
            max_tokens=512,
            stop=["Q:", "\n"],
            echo=False,
        )
        generated_texts.append(result["choices"][0]["text"])

    return {"generated_text": np.array(generated_texts)}

Upload the Models

Both the feature extractor model and the RAG LLM are uploaded via either the Wallaroo SDK or the Wallaroo MLOps API.

Both models are packaged in the Wallaroo BYOP framework.

Upload via the Wallaroo SDK

ML models and LLMs are uploaded via the Wallaroo SDK through the wallaroo.client.Client.upload_model method with the following parameters. For more details and parameters on model uploads, see Model Upload.

  • name: string (Required). The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model.
  • path: string (Required). The path to the model file being uploaded.
  • framework: string (Required). The framework of the model from wallaroo.framework.
  • input_schema: pyarrow.lib.Schema (Optional for native Wallaroo runtimes; Required for non-native Wallaroo runtimes). The input schema in Apache Arrow schema format.
  • output_schema: pyarrow.lib.Schema (Optional for native Wallaroo runtimes; Required for non-native Wallaroo runtimes). The output schema in Apache Arrow schema format.

Feature Extractor Upload

The Feature Extractor uses the following input and output schemas. Note that the output includes both the original text input and the vectorized embedding as a list of floats with a length of 768.

import pyarrow as pa

input_schema = pa.schema([
    pa.field('text', pa.string())
])

output_schema = pa.schema([
    pa.field('embedding', 
        pa.list_(
            pa.float64(), list_size=768
        ),
    ),
    pa.field('text', pa.string())
])

The model is uploaded from the file ./models/byop_bge_base_pipe.zip with the name byop-bge-base-pipe-v1, specifying the framework, input and output schemas as required. Once uploaded, the model version reference is saved to the variable bge.

import wallaroo
from wallaroo.framework import Framework

# establish the Wallaroo client
wl = wallaroo.Client()

# upload the feature extractor packaged in the Wallaroo BYOP framework
bge = wl.upload_model('byop-bge-base-pipe-v1',
    './models/byop_bge_base_pipe.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)

RAG LLM Upload

The RAG LLM is uploaded as a packaged Wallaroo BYOP framework with the following input and output schemas. Note that the RAG LLM’s input schema accepts the same outputs as the feature extractor model. The embedding input is used to retrieve the context from the vectorized database.

This example uses a quantized version of Llama 2 Chat that leverages the llama.cpp library.


import pyarrow as pa

input_schema = pa.schema([
    pa.field('text', pa.string()),
    pa.field('embedding', pa.list_(pa.float32(), list_size=768))
]) 

output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
])

With the input and output schemas set, we upload our RAG LLM packaged in the Wallaroo BYOP framework and save the model version reference to the variable rag_llm.

rag_llm = wl.upload_model('byop-llamacpp-rag-v1',
    './models/byop_llamacpp_rag.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)

Upload via the Wallaroo MLOps API

Models and LLMs are uploaded via the Wallaroo MLOps API endpoint POST /v1/api/models/upload_and_convert.

This endpoint has the following settings. For full examples of uploading models via the Wallaroo MLOps API, see the Wallaroo MLOps API Essentials Guide: Model Upload and Registrations.

  • Endpoint: POST /v1/api/models/upload_and_convert
  • Headers
    • Authorization: Bearer {token}: The authentication token for a Wallaroo user with access to the workspace the model is uploaded to. See How to generate the MLOps API Authentication Token below.
    • Content-Type: multipart/form-data: Files are uploaded in the multipart/form-data format with two parts:
      • metadata: Provides the model parameter data as Content-Type application/json. See Upload Model to Workspace Parameters.
      • file: The binary file (ONNX, .zip, etc) as Content-Type application/octet-stream.

How to generate the MLOps API Authentication Token

The API-based upload process requires an authentication token. The following are required to retrieve the token.

  • The Wallaroo instance authentication service address: The Wallaroo Authentication Service URL for the Wallaroo instance. By default, this is the Wallaroo Domain with the suffix /auth. For example, if the Wallaroo Domain Name is wallaroo.example.com, the authentication service URL is wallaroo.example.com/auth. For more details on Wallaroo and DNS services, see the Wallaroo DNS Configuration Guide (https://docs.wallaroo.ai/202402/wallaroo-platform-operations/wallaroo-platform-operations-configure/wallaroo-dns-guide/).

  • The confidential client: sdk-client.

  • The Wallaroo username making the MLOps API request: Typically this is the user’s email address.

  • The Wallaroo user’s password.

The curl example below supplies these values through the following environment variables:

  • $WALLAROO_USERNAME: The username of the entity authenticating to the Wallaroo model ops center.

  • $WALLAROO_PASSWORD: The password for the entity authenticating to the Wallaroo model ops center.

  • $WALLAROO_AUTH_URL: The Wallaroo authentication service URL.

The following example shows retrieving the authentication token using curl. Update the variables based on your instance.

export WALLAROO_USERNAME="username"
export WALLAROO_PASSWORD="password"
export WALLAROO_AUTH_URL="wallaroo.example.com/auth"

TOKEN=$(curl -s -X POST "https://$WALLAROO_AUTH_URL/realms/master/protocol/openid-connect/token" \
                 -d client_id=sdk-client \
                 -d username=${WALLAROO_USERNAME} \
                 -d password=${WALLAROO_PASSWORD} \
                 -d grant_type=password | jq .access_token )
echo $TOKEN

"abc123"

MLOps API Upload LLM Model Parameters

The following parameters are part of the metadata field.

  • name: String (Required). The model name.
  • visibility: String (Required). Either public or private.
  • workspace_id: String (Required). The numerical ID of the workspace to upload the model to.
  • conversion: Dict (Required). The conversion parameters. The following conversion.* values are parameters of this field.
  • conversion.framework: String (Required). The framework of the model being uploaded. See the list of supported models for more details.
  • conversion.python_version: String (Required). The version of Python required for the model.
  • conversion.requirements: String (Required). Required libraries. Can be [] if the requirements are the default Wallaroo JupyterHub libraries.
  • conversion.input_schema: String (Optional). The input schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. Only required for containerized Wallaroo runtime models. See How to Convert Input and Output Schema to Base64 Format below.
  • conversion.output_schema: String (Optional). The output schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. Only required for non-native runtime models. See How to Convert Input and Output Schema to Base64 Format below.

How to Convert Input and Output Schema to Base64 Format

Models packaged as Wallaroo Containerized Runtimes require the input and output schemas formatted in the Apache Arrow PyArrow Schema format and encoded in base64. The following demonstrates converting an Apache Arrow PyArrow Schema to base64.

import pyarrow as pa
import base64

input_schema = pa.schema([
    pa.field('input_1', pa.list_(pa.float32(), list_size=10)),
    pa.field('input_2', pa.list_(pa.float32(), list_size=5))
])
output_schema = pa.schema([
    pa.field('output_1', pa.list_(pa.float32(), list_size=3)),
    pa.field('output_2', pa.list_(pa.float32(), list_size=2))
])

encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
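
As an optional sanity check, the encoded string can be decoded back and compared with the original schema. This is a short sketch that continues from the block above.

import base64

import pyarrow as pa

# Decode the base64 string and parse it back into a pyarrow schema.
decoded_schema = pa.ipc.read_schema(
    pa.py_buffer(base64.b64decode(encoded_input_schema))
)
assert decoded_schema.equals(input_schema)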

MLOps API Upload LLM Model Example

The following example demonstrates uploading a model via the Wallaroo MLOps API using curl. It uses the following environment variables:

  • $TOKEN: The bearer authentication token.
  • $NAME: The name of the model.
  • $WORKSPACE_ID: The workspace to upload the model to.
  • $FRAMEWORK: The model framework.
  • $INPUT_SCHEMA: The input schema in PyArrow Schema converted to Base64 format.
  • $OUTPUT_SCHEMA: The output schema in PyArrow Schema converted to Base64 format.
  • $MODEL_PATH: The path to the model file to upload.
  • $URL: The Wallaroo instance URL.

curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $TOKEN" \
  -F "metadata={
        \"name\": \"$NAME\",
        \"visibility\": \"private\",
        \"workspace_id\": $WORKSPACE_ID,
        \"conversion\": {
          \"framework\": \"$FRAMEWORK\",
          \"python_version\": \"3.8\",
          \"requirements\": [],
          \"input_schema\": \"$INPUT_SCHEMA\",
          \"output_schema\": \"$OUTPUT_SCHEMA\",
          \"arch\": \"x86\"
        }
      };type=application/json" \
  -F "file=@$MODEL_PATH;type=application/octet-stream" \
  $URL/v1/api/models/upload_and_convert | cat
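
The same request can be made from Python with the requests library. The following is a hedged sketch: it reads the same environment variables used in the curl example, and places the schemas under conversion per the parameter list above.

import json
import os
import requests

# Build the metadata part from the same environment variables as the curl example.
metadata = {
    "name": os.environ["NAME"],
    "visibility": "private",
    "workspace_id": int(os.environ["WORKSPACE_ID"]),
    "conversion": {
        "framework": os.environ["FRAMEWORK"],
        "python_version": "3.8",
        "requirements": [],
        "input_schema": os.environ["INPUT_SCHEMA"],
        "output_schema": os.environ["OUTPUT_SCHEMA"],
        "arch": "x86",
    },
}

# Submit the multipart/form-data request with the metadata and model file parts.
with open(os.environ["MODEL_PATH"], "rb") as model_file:
    response = requests.post(
        f"{os.environ['URL']}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("model.zip", model_file, "application/octet-stream"),
        },
    )
print(response.json())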

RAG LLM Inference Example

The following example assumes that the two models are already uploaded and saved to the following variables:

  • bge: The feature extractor that generates the embedding for the RAG LLM.
  • rag_llm: The RAG LLM that uses the embedding to query the vector database index, and uses the result as the context to generate the text.

With the models uploaded, they are deployed in a Wallaroo pipeline through the following process:

  • Define the deployment configuration: This sets what resources are applied to each model on deployment. For more details, see Deployment Configuration.
  • Add the feature extractor model and RAG LLM as model steps: This sets the structure where the feature extractor model converts the request to a vector, which is used as the input by the RAG LLM to generate the final response.
  • Deploy the models: This step allocates resources to the feature extractor and LLM. At this point, the models are ready for inference requests.

Set the Deployment Configuration

The deployment configuration sets the following resources to the models:

  • Feature extractor
    • 4 CPUs
    • 3 Gi RAM
  • RAG LLM
    • 4 CPUs
    • 6 Gi RAM

This is represented in the following code.

from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(rag_llm, 4) \
    .sidekick_memory(rag_llm, '6Gi') \
    .build()

Deployment configurations are adjusted as required based on attributes including model size, throughput, latency, and performance requirements. Deployment configuration changes do not impact Wallaroo inference endpoints (including the name, URL, etc.), so production deployments are not interrupted.

Set the Model Steps

In this step, we add the feature extractor model and the RAG LLM as pipeline steps.

We create the pipeline with the wallaroo.client.Client.build_pipeline method, then add each model as a pipeline step with the wallaroo.pipeline.Pipeline.add_model_step method, with the feature extractor as the first step. This sets the stage for the feature extractor model to provide its outputs as the inputs for the RAG LLM.

pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
pipeline.add_model_step(bge)
pipeline.add_model_step(rag_llm)

Deploy the RAG LLM

Everything is now set, and we deploy the models through the wallaroo.pipeline.Pipeline.deploy(deployment_config) method, providing the deployment configuration we set earlier. This assigns the resources from the cluster to the models' exclusive use.

Once the deployment is complete, the RAG LLM is ready for inference requests.

pipeline.deploy(deployment_config=deployment_config)
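
As an optional check (a sketch, not a required step), the deployment state can be confirmed before submitting inference requests.

# Returns the deployment status; a status of 'Running' indicates the models are ready.
pipeline.status()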

Inference Example

Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. The following example shows submitting a pandas DataFrame with the query to suggest an action movie. The response is returned as a pandas DataFrame, and we extract the generated text from there.

import pandas as pd

data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})

result = pipeline.infer(data)
print(result['out.generated_text'].values[0])
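
The same request can also be submitted as an Apache Arrow table. The following is a minimal sketch; the output column name is assumed to mirror the DataFrame example above.

import pyarrow as pa

# Build the request as an Arrow table with the same 'text' column.
arrow_table = pa.table({"text": ["Suggest me an action movie, including its name"]})

arrow_result = pipeline.infer(arrow_table)
# Assumed output column name, mirroring the DataFrame result above.
print(arrow_result["out.generated_text"][0])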


Tutorials

For tutorials on using RAG LLMs and using Wallaroo Inference Automation with the embedding model to populate the vector database, see RAG LLM Tutorials.

FAQs

Why use MongoDB Atlas for RAG-LLMs?

MongoDB's recently released Atlas Vector Search integrates vector search capabilities into MongoDB Atlas. This allows for the creation of a vector index on a collection and the ability to perform vector searches on that collection, which is particularly useful for RAG LLMs such as the llama.cpp pipeline shown above. Wallaroo supports other vector database connections. Please contact us to learn more!
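
The following is a hedged sketch of creating such a vector index with pymongo, assuming PyMongo 4.5 or later and the sample_mflix.movies collection used in the examples above. The knnVector field definition and similarity metric are illustrative assumptions; consult the MongoDB Atlas Vector Search documentation for the options that match your deployment.

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Placeholder connection string; replace with your MongoDB Atlas URI.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net")
collection = client.sample_mflix.movies

# Illustrative index definition: a 768-dimension knnVector field over
# plot_embedding_hf, matching the PlotSemanticSearch index queried above.
index = SearchIndexModel(
    name="PlotSemanticSearch",
    definition={
        "mappings": {
            "dynamic": True,
            "fields": {
                "plot_embedding_hf": {
                    "type": "knnVector",
                    "dimensions": 768,
                    "similarity": "cosine",
                }
            }
        }
    },
)
collection.create_search_index(index)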