Deploy RAG Llama with OpenAI compatibility on QAIC


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Deploy Custom LLM using QAIC Acceleration with a MongoDB Vector Database Connection for RAG with OpenAI API Compatibility

The following tutorial demonstrates deploying a Llama LLM using QAIC Acceleration with Retrieval-Augmented Generation (RAG) in Wallaroo with OpenAI API compatibility enabled. This allows developers to:

  • Take advantage of Wallaroo’s inference optimization to improve inference response times through more efficient resource allocation.
  • Increase the speed of LLM inferences with QAIC AI acceleration at a lower power cost.
  • Migrate existing OpenAI client code with a minimum of changes.
  • Extend their LLMs’ capabilities with the Wallaroo Custom Model framework to add RAG functionality to an existing LLM.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

A typical situation is to either deploy the native vLLM runtime as a single model in a Wallaroo pipeline, or to deploy both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM's capabilities. In this tutorial, RAG is added to improve the context of inference requests, providing better responses and preventing AI hallucinations.

This example uses one model for RAG, and one LLM with OpenAI compatibility enabled.

For access to these sample models and for a demonstration, contact your Wallaroo support representative.

Tutorial Outline

This tutorial demonstrates how to:

  • Upload an LLM with the Wallaroo native vLLM framework and a Wallaroo Custom Model with the Custom Model framework, with QAIC acceleration enabled.
  • Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
  • Set resource configurations for allocating CPUs, memory, etc.
  • Set the Custom Model runtime and native vLLM runtime as pipeline steps and deploy in Wallaroo.
  • Submit inference requests via:
    • The Wallaroo SDK methods openai_completion and openai_chat_completion.
    • Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.

Tutorial Requirements

This tutorial requires the following:

  • Wallaroo version 2025.1 and above.
  • Tiny Llama model and the Wallaroo RAG Custom Model. These are available from Wallaroo representatives upon request.

Tutorial Steps

Import Libraries

The following libraries are used for this tutorial, primarily the Wallaroo SDK.

import base64 
import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.framework import CustomConfig, VLLMConfig
from wallaroo.engine_config import QaicConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.openai_config import OpenaiConfig

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload the Wallaroo Native vLLM Runtime

The model is uploaded with the following parameters:

  • The model name.
  • The file path to the model.
  • The framework set to the Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM.
  • The input and output schemas, defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas.
  • Acceleration set to Qualcomm QAIC for the LLM. In this example, an acceleration configuration is applied with Acceleration.QAIC.with_config to fine-tune hardware performance.

qaic_config = QaicConfig(
    num_devices=4, 
    full_batch_size=16, 
    ctx_len=1024, 
    prefill_seq_len=128, 
    mxfp6_matmul=True, 
    mxint8_kv_cache=True
)

llama = wl.upload_model(
    "llama-qaic-openai", 
    "llama-31-8b.zip", 
    framework=Framework.VLLM,
    framework_config=VLLMConfig(
        max_num_seqs=16,
        max_model_len=1024,
        max_seq_len_to_capture=128, 
        quantization="mxfp6",
        kv_cache_dtype="mxint8", 
        gpu_memory_utilization=1
    ),
    input_schema=pa.schema([]),
    output_schema=pa.schema([]), 
    accel=Acceleration.QAIC.with_config(qaic_config)
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...................................................................................................................................................................................................................................................
Successful
Ready
llama = wl.get_model("llama-qaic-openai")

Enable OpenAI Compatibility and Continuous Batch Config

OpenAI compatibility and continuous batching configuration options are enabled at the model configuration level after the model is uploaded.

OpenAI compatibility is enabled via the model configuration using the class wallaroo.openai_config.OpenaiConfig, which includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.

| Parameter | Type | Description |
|---|---|---|
| enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled and all other parameters are ignored. |
| completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests. |
| chat_completion_config | Dict | The OpenAI API chat/completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests. |

Once the OpenaiConfig object is defined, it is applied to the LLM configuration through the openai_config parameter.

cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)

openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
        {% for message in messages %}
            {% if message['role'] == 'user' %}
                {{ '<|user|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'system' %}
                {{ '<|system|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'assistant' %}
                {{ '<|assistant|>\n'  + message['content'] + eos_token }}
            {% endif %}
            
            {% if loop.last and add_generation_prompt %}
                {{ '<|assistant|>' }}
            {% endif %}
        {% endfor %}"""
    })
llama = llama.configure(continuous_batching_config=cbc,
                        openai_config=openai_config)
llama
Name: llama-qaic-openai
Version: 0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d
File Name: llama-31-8b.zip
SHA: 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6231
Architecture: x86
Acceleration: {'qaic': {'ctx_len': 1024, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}}
Updated At: 2025-02-Jul 17:54:00
Workspace id: 9
Workspace name: younes@wallaroo.ai - Default Workspace

Upload Embedder Model

The RAG embedder model is uploaded with the Wallaroo Custom Model framework. This provides the flexibility of Python scripts to retrieve the context from the MongoDB database through a vector query. Once uploaded, the configuration is updated to include OpenAI compatibility.

Custom Model Framework

The embedder model includes the following artifacts:

  • requirements.txt: The Python library requirements for the Custom Model.
  • openai_step.py: The Python script that generates the embedding, retrieves the context from the MongoDB vector database, and injects it into the inference request.

In this example, the requirements.txt file is:

sentence_transformers==4.1.0
pymongo==4.7.1

The script for our openai_step.py file is as follows:

from sentence_transformers import SentenceTransformer
import pymongo

model = SentenceTransformer("BAAI/bge-base-en")  # runs on CPU by default
client = pymongo.MongoClient("mongodb+srv://wallaroo_user:random123@example.wallaroo.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0")
db = client.sample_mflix
collection = db.movies

def lookup_context(text: str):
    embedding = model.encode(
        text,
        normalize_embeddings=True,
        convert_to_numpy=True,
    ).tolist()

    query_results = collection.aggregate(
            [
                {
                    "$vectorSearch": {
                        "queryVector": embedding,
                        "path": "plot_embedding_hf",
                        "numCandidates": 50,
                        "limit": 10,
                        "index": "PlotSemanticSearch",
                    }
                }
            ]
        )
    context = " ".join([result["plot"] for result in query_results])
    return context[:100]

def handle_chat_completion(request: dict) -> dict:
    messages = request["messages"]
    
    # Extract last 3 user messages
    user_text = "\n".join([m["content"] for m in messages if m.get("role") == "user"][-3:])
    context = lookup_context(user_text)
    
    # Inject as system message at the top
    context_msg = {"role": "system", "content": f"Context: {context}"}
    request["messages"] = [context_msg] + messages
    return request

def handle_completion(request: dict) -> dict:
    prompt = request.get("prompt", "")
    context = lookup_context(prompt)
    request["prompt"] = f"Context: {context}\n\n{prompt}"
    return request
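
As a quick, hypothetical illustration of what these handlers do (assuming the MongoDB vector index defined above is reachable), the completion handler rewrites the request so the retrieved context precedes the original prompt:

# Hypothetical request mirroring the OpenAI completion payload used later in this tutorial.
sample_request = {"prompt": "Give me the title of a good movie from the 1990's"}
modified = handle_completion(sample_request)
# modified["prompt"] now reads: "Context: <retrieved plot text>\n\n" followed by the original prompt.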

Uploading the Model

rag_step = wl.upload_model(
    "ragstep",
    "rag_step.zip",
    framework=Framework.CUSTOM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...........
Successful
Ready
rag_step = wl.get_model("ragstep")
openai_config_rag = OpenaiConfig(enabled=True)
rag_step = rag_step.configure(openai_config=openai_config_rag)
rag_step
Name: ragstep
Version: 161d2b87-ffa4-4bbf-b5ce-036e5dcd1db4
File Name: rag_step.zip
SHA: 5d47b8229b4410b63eb52af11f91a6c6e45eaa681e765624be2adec339f427a3
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6231
Architecture: x86
Acceleration: none
Updated At: 2025-02-Jul 18:01:55
Workspace id: 9
Workspace name: younes@wallaroo.ai - Default Workspace

Set the Deployment Configuration and Deploy

The deployment configuration defines what resources are allocated to the LLM’s exclusive use. For this tutorial, the LLM is allocated:

  • Llama LLM:
    • 4 CPUs
    • 12 Gi RAM
    • 4 GPUs. The GPU type is inherited from the model upload step. For QAIC, the deployment configuration's gpus value is the number of Systems-on-Chip (SoCs) to use.
  • RAG Model:
    • 1 CPU
    • 2 Gi RAM

Once the deployment configuration is set:

  • The pipeline is created.
  • The RAG model and the LLM are added as pipeline steps.
  • The pipeline is deployed with the deployment configuration.

Once the deployment is complete, the LLM is ready to receive inference requests.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(rag_step, 1) \
    .sidekick_memory(rag_step, '2Gi') \
    .sidekick_cpus(llama, 4) \
    .sidekick_memory(llama, '12Gi') \
    .sidekick_gpus(llama, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
pipeline = wl.build_pipeline('llama-openai-ragyns')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(rag_step)
pipeline.add_model_step(llama)
pipeline.deploy(deployment_config = deployment_config)
pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.69.143',
   'name': 'engine-7fb7bcb47d-ssfxj',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-openai-ragyns',
      'status': 'Running',
      'version': '1c8d179a-8dea-4e7b-8ffd-5d57b1707d6c'}]},
   'model_statuses': {'models': [{'model_version_id': 113,
      'name': 'llama-qaic-openai',
      'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
      'status': 'Running',
      'version': '0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d'},
     {'model_version_id': 114,
      'name': 'ragstep',
      'sha': '5d47b8229b4410b63eb52af11f91a6c6e45eaa681e765624be2adec339f427a3',
      'status': 'Running',
      'version': '161d2b87-ffa4-4bbf-b5ce-036e5dcd1db4'}]}}],
 'engine_lbs': [{'ip': '10.244.69.152',
   'name': 'engine-lb-7765599d45-9k6f9',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.69.130',
   'name': 'engine-sidekick-ragstep-114-6b7456f84c-wv2gq',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'},
  {'ip': '10.244.69.132',
   'name': 'engine-sidekick-llama-qaic-openai-113-5d9945ffd8-9jlfc',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference Requests on LLM with OpenAI Compatibility Enabled

Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK or via OpenAI API endpoint requests.

OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:

  • Parameters for chat/completion and completion override the existing OpenAI configuration options.
  • If the stream option is enabled:
    • Outputs are returned as a list of chunks, i.e. as an event stream.
    • The inference request completes when all chunks are returned.
    • The response metadata includes ttft, tps, and the user-specified OpenAI request parameters after the last chunk is generated.

OpenAI API Inference Requests via the Wallaroo SDK and Inference Result Logs

Inference requests on models with OpenAI compatibility enabled in Wallaroo via the Wallaroo SDK use the following methods:

  • wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
  • wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.

The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:

  • ttft
  • tps
  • The OpenAI request parameter values set during the inference request.

The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the inference results log and displaying the out.json field, which includes the tps and ttft fields.
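
A minimal sketch of this log retrieval follows (assuming the pipeline has already processed OpenAI-compatible inference requests; the exact columns depend on your deployment):

logs = pipeline.logs()   # returns a pandas DataFrame by default
# The OpenAI metrics (ttft, tps) and request parameters are stored in the out.json output field.
logs["out.json"]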

OpenAI API Inference Requests via Pipeline Deployment URLs with OpenAI Extensions

For native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled, inference requests via the OpenAI API client use the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:

  • {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
  • {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.

These requests require the following:

  • A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
  • Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
  • Access to the deployed pipeline’s OpenAI API endpoints.
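
As a minimal sketch, these endpoints can also be reached with the official openai Python client by pointing its base_url at the pipeline's OpenAI extension path and passing the Wallaroo MLOps API token as the API key. The deployment URL below is a placeholder; substitute your own pipeline's inference endpoint.

from openai import OpenAI

# Retrieve the Wallaroo MLOps API token to use as the API key.
token = wl.auth.auth_header()['Authorization'].split()[1]

client = OpenAI(
    base_url="https://example.wallaroo.ai/v1/api/pipelines/infer/llama-openai-ragyns-63/llama-openai-ragyns/openai/v1",  # placeholder endpoint
    api_key=token,
)

# Completion request against the deployed pipeline; the model field is left empty
# because the deployed pipeline determines which model serves the request.
response = client.completions.create(
    model="",
    prompt="Give me the title of a good movie from the 1990s",
    max_tokens=100,
)
print(response.choices[0].text)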

Inference and Inference Results Logs Examples

The following demonstrates performing an inference request using openai_chat_completion.

pipeline.openai_chat_completion(messages=[{"role": "user", "content": "closest movie title to good morning"}]).choices[0].message.content
'The closest movie title to "good morning" is likely "Good Morning, Vietnam" (1987), a comedy-drama film directed by Barry Levinson and starring Robin Williams.'

The following demonstrates performing an inference request using openai_completion with token streaming enabled.

for chunk in pipeline.openai_completion(prompt="Give me the title of a good movie from the 1990's", max_tokens=500, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
 that fits this description.

## Step 1: Identify key elements of the movie description
The movie is told in flashbacks, it's about an older man's obsession for a woman who can belong to no-one, and it's from the 1990's.

## Step 2: Consider movies from the 1990's that fit the description
One movie that fits this description is "The English Patient" (1996), but it's not the only one. Another movie that fits is "The Piano" (1993), but it's not the one I'm thinking of.

## Step 3: Think of another movie that fits the description
A movie that fits the description is "The Piano" is not it, but "The Piano" is a good guess, another movie that fits is "The English Patient" is not it, but "The English Patient" is a good guess, but the movie I'm thinking of is "The Piano" is not it, but "The English Patient" is not it, but "The Piano" is a good guess, but the movie I'm thinking of is "The English Patient" is not it, but I think I have it.

## Step 4: Identify the movie
The movie I'm thinking of is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient-shift" no, I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient

The following demonstrates using the pipeline inference URL with the OpenAI extension endpoints for completions. First the authentication token is retrieved, then the inference request is made.

token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'
# Streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "", "prompt": "Give me the title of a good movie from the 1990s", "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/llama-openai-ragyns-63/llama-openai-ragyns/openai/v1/completions
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":",","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":1,"total_tokens":43,"ttft":0.10214721,"tps":9.789792594433074}}

data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" and","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":2,"total_tokens":44,"ttft":0.10214721,"tps":14.570481440179023}}

data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" I","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":3,"total_tokens":45,"ttft":0.10214721,"tps":13.070685403999592}}

...

data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" from","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":99,"total_tokens":141,"ttft":0.10214721,"tps":9.90478908735264}}

data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" the","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":100,"total_tokens":142,"ttft":0.10214721,"tps":9.905616236577528}}

data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":42,"completion_tokens":100,"total_tokens":142,"ttft":0.10214721,"tps":9.905460621690295}}

data: [DONE]

For access to these sample models and for a demonstration, contact your Wallaroo support representative.