RAG LLM Deployment Tutorial

This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

RAG LLMs: Inference in Wallaroo

The following demonstrates using a Beijing Academy of Artificial Intelligence (BAAI) general embedding (BGE) model with a RAG LLM to perform inference requests through Wallaroo against a vector database. The vector database is pre-embedded with the same BAAI BGE model. See the accompanying notebook “RAG LLMs: Automated Vector Database Enrichment in Wallaroo”.

This process uses Wallaroo features to:

  • Receive an inference request from a requester.
  • Convert the inference request into an embedding.
  • Query the vector database for data based on the embedding.
  • Generate the response from the RAG LLM with the appropriate context, and return the final result to the requester.

For this example, MongoDB Atlas Vector Search is used as the vector database.
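
For reference, a collection like this is typically backed by an Atlas Vector Search index over the embedding field. The following is a hypothetical sketch of creating such an index with pymongo; the connection string, database, collection, field, and index names are illustrative assumptions, not details of this tutorial's database.

# Hypothetical Atlas Vector Search index setup; all names here are assumptions.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

atlas = MongoClient("mongodb+srv://user:password@cluster.mongodb.net/")  # placeholder URI
collection = atlas["sample_db"]["embedded_documents"]  # assumed database and collection

index = SearchIndexModel(
    definition={
        "fields": [{
            "type": "vector",
            "path": "embedding",      # field holding the pre-computed BGE embeddings
            "numDimensions": 768,     # matches the BGE base embedding size
            "similarity": "cosine",   # assumed similarity metric
        }]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index)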

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo support representative.

Tutorial Steps

Imports

We start by importing the libraries used for the tutorial.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

This step establishes a connection to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL used to grant the SDK permission to your specific Wallaroo environment. When displayed, open the URL in a browser and confirm the permissions. Store the connection in a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload BGE Model

Before uploading the BGE model, we define the input and output schemas in Apache PyArrow Schema format.

input_schema = pa.schema([
    pa.field('text', pa.string())  # the text to embed
])
output_schema = pa.schema([
    pa.field('embedding',
        pa.list_(
            pa.float64(), list_size=768  # BGE base models produce 768-dimensional embeddings
        ),
    ),
    pa.field('text', pa.string())  # the original text, passed through to the next step
])

The BGE model is a Hugging Face model packaged in the Wallaroo BYOP (Bring Your Own Predict) framework as the file byop_bge_base2.zip. We upload it to Wallaroo via the wallaroo.client.Client.upload_model method, providing the following parameters:

  • The name to assign to the BGE model.
  • The file path of the model to upload.
  • The Framework set to wallaroo.framework.Framework.CUSTOM for our Hugging Face model encapsulated in the BYOP framework.
  • The input and output schemas.

For more information, see the Wallaroo Model Upload guide.

bge = wl.upload_model('byop-bge-base-v2', 
    'byop_bge_base2.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
bge
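
The BYOP implementation itself is not shown in this tutorial, but its embedding step can be sketched as follows. This is a hypothetical sketch: the checkpoint name and helper function are assumptions, not the contents of byop_bge_base2.zip.

# Hypothetical sketch of the embedding logic packaged in byop_bge_base2.zip.
from sentence_transformers import SentenceTransformer

bge_model = SentenceTransformer("BAAI/bge-base-en")  # assumed BGE base checkpoint

def embed(text: str) -> list:
    # BGE base models produce 768-dimensional embeddings, matching the output schema above.
    return bge_model.encode(text, normalize_embeddings=True).tolist()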

Upload Modified RAG Llama LLM

For the RAG LLM, we use a modified Llama.cpp LLM. It takes the embedding generated by the BGE model, queries the vector database index with it, and uses the returned results as the context for generating its text.
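
A minimal sketch of that retrieve-then-generate step is shown below. It assumes a MongoDB Atlas collection and a llama-cpp-python model; the connection string, collection, index, field names, and prompt template are illustrative assumptions, not the contents of byop_llamacpp_rag.zip.

# Hypothetical sketch of the RAG logic inside byop_llamacpp_rag.zip.
from llama_cpp import Llama
from pymongo import MongoClient

atlas = MongoClient("mongodb+srv://user:password@cluster.mongodb.net/")  # placeholder URI
collection = atlas["sample_db"]["embedded_documents"]  # assumed collection
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")  # assumed gguf checkpoint

def generate(text: str, embedding: list) -> str:
    # Retrieve the documents closest to the query embedding via Atlas Vector Search.
    docs = collection.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",   # assumed index name
            "path": "embedding",       # field holding the pre-computed BGE embeddings
            "queryVector": embedding,
            "numCandidates": 40,
            "limit": 4,
        }
    }])
    # Concatenate the retrieved documents into the generation context.
    context = " ".join(doc["text"] for doc in docs)  # assumed document field
    prompt = f"Answer using the provided context.\nContext: {context}\nQuestion: {text}\nAnswer: "
    return llm(prompt, max_tokens=256)["choices"][0]["text"]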

As before, we set the input and output schemas, then upload the model.

input_schema = pa.schema([
    pa.field('text', pa.string()),  # the original query text
    pa.field('embedding', pa.list_(pa.float32(), list_size=768))  # embedding from the BGE step
])

output_schema = pa.schema([
    pa.field('generated_text', pa.string()),  # the RAG LLM's final response
])
llama = wl.upload_model('byop-llamacpp-rag-v1', 
    'byop_llamacpp_rag.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
llama

Deploy BGE and RAG LLM

The models are deployed by:

  1. Setting the deployment configuration, which allocates cluster resources for the exclusive use of the BGE model and the LLM. The following settings are used:
    1. BGE: 4 CPUs, 3 Gi RAM
    2. LLM: 4 CPUs, 6 Gi RAM
  2. Adding the BGE model and the LLM to a Wallaroo pipeline as model steps.
  3. Deploying the models. Once the deployment is complete, they are ready to accept inference requests.

# The cpus/memory settings apply to the Wallaroo engine; the sidekick settings
# reserve dedicated resources for each model's container.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(llama, 4) \
    .sidekick_memory(llama, '6Gi') \
    .build()
pipeline = wl.build_pipeline("byop-rag-llm-bge-v1")
pipeline.add_model_step(bge)
pipeline.add_model_step(llama)
pipeline.deploy(deployment_config=deployment_config)
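
Deployment can take several minutes while the model containers start. Optionally, confirm the pipeline reports a Running status before submitting inference requests:

# Optional: verify the pipeline is ready to accept inference requests.
pipeline.status()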

Inference

Inference requests are submitted either as pandas DataFrames or Apache Arrow tables. The following example shows submitting a pandas DataFrame with the query to suggest an action movie. The response is returned as a pandas DataFrame, and we extract the generated text from there.

data = pd.DataFrame({"text": ["Suggest me an action movie, including its name"]})
result = pipeline.infer(data, timeout=10000)
result['out.generated_text'].values[0]
'1. "The Battle of Algiers" (1966) - This film follows the story of the National Liberation Front (FLN) fighters during the Algerian Revolution, and their struggle against French colonial rule. 2. "The Goodfather" (1977) - A mobster's rise to power is threatened by his weaknesses, including his loyalty to his family and his own moral code. 3. "Dog Day Afternoon" (1975) - A desperate bank clerk turns to a life of crime when he can't pay his bills, but things spiral out'

Undeploy

pipeline.undeploy()
