Deploy RAG Llama with OpenAI compatibility on QAIC
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Deploy Custom LLM using QAIC Acceleration with a MongoDB Vector Database Connection for RAG with OpenAI API Compatibility
The following tutorial demonstrates deploying a Llama LLM using QAIC Acceleration with Retrieval-Augmented Generation (RAG) in Wallaroo with OpenAI API compatibility enabled. This allows developers to:
- Take advantage of Wallaroo’s inference optimization to improve inference response times through more efficient resource allocation.
- Increase the speed of LLM inferences with QAIC AI acceleration at lower power cost.
- Migrate existing OpenAI client code with minimal changes.
- Extend their LLMs’ capabilities with the Wallaroo Custom Model framework to add RAG functionality to an existing LLM.
Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:
wallaroo.framework.Framework.VLLM
: Native async vLLM implementations.

wallaroo.framework.Framework.CUSTOM
: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.
A typical situation is to either deploy the native vLLM runtime as a single model in a Wallaroo pipeline, or to deploy both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM's capabilities. In this tutorial, RAG is added to improve the context of inference requests to provide better responses and prevent AI hallucinations.
This example uses one model for RAG, and one LLM with OpenAI compatibility enabled.
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Outline
This tutorial demonstrates how to:
- Upload an LLM with the Wallaroo native vLLM framework and a Wallaroo Custom Model with the Custom Model framework, with QAIC acceleration enabled.
- Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
- Set resource configurations for allocating CPUs, memory, etc.
- Set the Custom Model runtime and native vLLM runtime as pipeline steps and deploy in Wallaroo.
- Submit inference requests via:
  - The Wallaroo SDK methods completions and chat_completion.
  - Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.
Tutorial Requirements
This tutorial requires the following:
- Wallaroo version 2025.1 and above.
- Tiny Llama model and the Wallaroo RAG Custom Model. These are available from Wallaroo representatives upon request.
Tutorial Steps
Import Libraries
The following libraries are used for this tutorial, primarily the Wallaroo SDK.
import base64
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.framework import CustomConfig, VLLMConfig
from wallaroo.engine_config import QaicConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.openai_config import OpenaiConfig
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload the Wallaroo Native vLLM Runtime
The model is uploaded with the following parameters:
- The model name.
- The file path to the model.
- The framework set to Wallaroo native vLLM runtime:
wallaroo.framework.Framework.VLLM
- The input and output schemas are defined in Apache PyArrow format. For OpenAI compatibility, this is left as an empty List.
- Acceleration is set to Qualcomm QAIC for the LLM. In this example, an acceleration configuration is applied with Acceleration.QAIC.with_config to fine-tune hardware performance.
qaic_config = QaicConfig(
    num_devices=4,
    full_batch_size=16,
    ctx_len=1024,
    prefill_seq_len=128,
    mxfp6_matmul=True,
    mxint8_kv_cache=True
)

llama = wl.upload_model(
    "llama-qaic-openai",
    "llama-31-8b.zip",
    framework=Framework.VLLM,
    framework_config=VLLMConfig(
        max_num_seqs=16,
        max_model_len=1024,
        max_seq_len_to_capture=128,
        quantization="mxfp6",
        kv_cache_dtype="mxint8",
        gpu_memory_utilization=1
    ),
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    accel=Acceleration.QAIC.with_config(qaic_config)
)
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...................................................................................................................................................................................................................................................
Successful
Ready
llama = wl.get_model("llama-qaic-openai")
Enable OpenAI Compatibility and Continuous Batch Config
OpenAI compatibility and continuous batching config options are enabled in the model configuration after the model is uploaded.

OpenAI compatibility is enabled via the model configuration class wallaroo.openai_config.OpenaiConfig, which includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.
| Parameter | Type | Description |
|---|---|---|
| enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False. |
| completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests. |
| chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat/completion parameters are available except stream; the stream parameter is only set at inference requests. |
With the OpenaiConfig object defined, it is applied to the LLM configuration through the openai_config parameter.
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)

openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>\n' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}"""
    })

llama = llama.configure(continuous_batching_config=cbc,
                        openai_config=openai_config)
llama
Name | llama-qaic-openai |
Version | 0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d |
File Name | llama-31-8b.zip |
SHA | 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6231 |
Architecture | x86 |
Acceleration | {'qaic': {'ctx_len': 1024, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}} |
Updated At | 2025-02-Jul 17:54:00 |
Workspace id | 9 |
Workspace name | younes@wallaroo.ai - Default Workspace |
Upload Embedder Model
The RAG embedder model is uploaded with the Wallaroo Custom Model framework. This allows for flexibility with Python scripts to handle requesting the context from the Mongo database through a vector query. Once uploaded, the configuration is updated to include OpenAI compatibility.
Custom Model Framework
The embedder model includes the following artifacts:
requirements.txt
: Sets what Python libraries are used.

{python script}.py
: Any Python script that extends the Wallaroo classes for Custom frameworks. For more details, see Wallaroo SDK Essentials Guide: Model Uploads and Registrations: Custom Model.
In this example, the requirements.txt file is:
sentence_transformers==4.1.0
pymongo==4.7.1
The script for our openai_step.py file is as follows:
from sentence_transformers import SentenceTransformer
import pymongo

model = SentenceTransformer("BAAI/bge-base-en") # runs on CPU by default

client = pymongo.MongoClient("mongodb+srv://wallaroo_user:random123@example.wallaroo.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0")
db = client.sample_mflix
collection = db.movies

def lookup_context(text: str):
    # Embed the incoming text and run a vector search against the MongoDB Atlas index
    embedding = model.encode(
        text,
        normalize_embeddings=True,
        convert_to_numpy=True,
    ).tolist()

    query_results = collection.aggregate(
        [
            {
                "$vectorSearch": {
                    "queryVector": embedding,
                    "path": "plot_embedding_hf",
                    "numCandidates": 50,
                    "limit": 10,
                    "index": "PlotSemanticSearch",
                }
            }
        ]
    )
    context = " ".join([result["plot"] for result in query_results])
    return context[:100]

def handle_chat_completion(request: dict) -> dict:
    messages = request["messages"]
    # Extract last 3 user messages
    user_text = "\n".join([m["content"] for m in messages if m.get("role") == "user"][-3:])
    context = lookup_context(user_text)
    # Inject as system message at the top
    context_msg = {"role": "system", "content": f"Context: {context}"}
    request["messages"] = [context_msg] + messages
    return request

def handle_completion(request: dict) -> dict:
    prompt = request.get("prompt", "")
    context = lookup_context(prompt)
    request["prompt"] = f"Context: {context}\n\n{prompt}"
    return request
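As a purely hypothetical local check (not part of the tutorial assets), the prompt-rewriting pattern used by handle_completion can be exercised with a stubbed context lookup, avoiding the MongoDB dependency:

# Hypothetical stub standing in for lookup_context, so the request rewriting
# can be verified without a live MongoDB Atlas vector index.
def stub_lookup_context(text: str) -> str:
    return "A radio DJ brings irreverent humor to an army radio station."

request = {"prompt": "closest movie title to good morning"}
request["prompt"] = f"Context: {stub_lookup_context(request['prompt'])}\n\n{request['prompt']}"
print(request["prompt"])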
# Uploading the model
rag_step = wl.upload_model(
    "ragstep",
    "rag_step.zip",
    framework=Framework.CUSTOM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
)
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...........
Successful
Ready
rag_step = wl.get_model("ragstep")
openai_config_rag = OpenaiConfig(enabled=True)
rag_step = rag_step.configure(openai_config=openai_config_rag)
rag_step
Name | ragstep |
Version | 161d2b87-ffa4-4bbf-b5ce-036e5dcd1db4 |
File Name | rag_step.zip |
SHA | 5d47b8229b4410b63eb52af11f91a6c6e45eaa681e765624be2adec339f427a3 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6231 |
Architecture | x86 |
Acceleration | none |
Updated At | 2025-02-Jul 18:01:55 |
Workspace id | 9 |
Workspace name | younes@wallaroo.ai - Default Workspace |
Set the Deployment Configuration and Deploy
The deployment configuration defines what resources are allocated to the LLM’s exclusive use. For this tutorial, the LLM is allocated:
- Llama LLM:
  - 4 CPUs
  - 12 Gi RAM
  - 4 GPUs. The GPU type is inherited from the model upload step. For QAIC, the deployment configuration's gpus value is the number of System-on-Chips (SoCs) to use.
- RAG Model:
  - 1 CPU
  - 2 Gi RAM
Once the deployment configuration is set:
- The pipeline is created.
- The RAG model and the LLM are added as pipeline steps.
- The pipeline is deployed with the deployment configuration.
Once the deployment is complete, the LLM is ready to receive inference requests.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(rag_step, 1) \
    .sidekick_memory(rag_step, '2Gi') \
    .sidekick_cpus(llama, 4) \
    .sidekick_memory(llama, '12Gi') \
    .sidekick_gpus(llama, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
pipeline = wl.build_pipeline('llama-openai-ragyns')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(rag_step)
pipeline.add_model_step(llama)
pipeline.deploy(deployment_config = deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.244.69.143',
'name': 'engine-7fb7bcb47d-ssfxj',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llama-openai-ragyns',
'status': 'Running',
'version': '1c8d179a-8dea-4e7b-8ffd-5d57b1707d6c'}]},
'model_statuses': {'models': [{'model_version_id': 113,
'name': 'llama-qaic-openai',
'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
'status': 'Running',
'version': '0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d'},
{'model_version_id': 114,
'name': 'ragstep',
'sha': '5d47b8229b4410b63eb52af11f91a6c6e45eaa681e765624be2adec339f427a3',
'status': 'Running',
'version': '161d2b87-ffa4-4bbf-b5ce-036e5dcd1db4'}]}}],
'engine_lbs': [{'ip': '10.244.69.152',
'name': 'engine-lb-7765599d45-9k6f9',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.244.69.130',
'name': 'engine-sidekick-ragstep-114-6b7456f84c-wv2gq',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'},
{'ip': '10.244.69.132',
'name': 'engine-sidekick-llama-qaic-openai-113-5d9945ffd8-9jlfc',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Inference Requests on LLM with OpenAI Compatibility Enabled
Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK or via the OpenAI API endpoints.
OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:
- Parameters for chat/completion and completion override the existing OpenAI configuration options.
- If the stream option is enabled:
  - Outputs are returned as a list of chunks, aka as an event stream.
  - The inference request call completes when all chunks are returned.
  - The response metadata includes ttft, tps, and user-specified OpenAI request params after the last chunk is generated.
OpenAI API Inference Requests via the Wallaroo SDK and Inference Result Logs
Inference requests to models with OpenAI compatibility enabled in Wallaroo are made via the Wallaroo SDK using the following methods:

wallaroo.pipeline.Pipeline.openai_chat_completion
: Submits an inference request using the OpenAI API chat/completion endpoint parameters.

wallaroo.pipeline.Pipeline.openai_completion
: Submits an inference request using the OpenAI API completion endpoint parameters.
The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:

- ttft (time to first token)
- tps (tokens per second)
- The OpenAI request parameter values set during the inference request.
The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the inference results log and displaying the out.json field, which includes the tps and ttft fields.
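A minimal sketch of that retrieval, assuming the pipeline deployed earlier in this tutorial:

# Retrieve recent inference results as a pandas DataFrame (the default format)
logs = pipeline.logs()

# The out.json field holds the OpenAI metrics (ttft, tps) and the OpenAI request parameters
display(logs["out.json"])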
OpenAI API Inference Requests via Pipeline Deployment URLs with OpenAI Extensions
Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests from OpenAI API clients through the pipeline's deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:

{Deployment inference endpoint}/openai/v1/completions
: Compatible with the OpenAI API endpoint completion.

{Deployment inference endpoint}/openai/v1/chat/completions
: Compatible with the OpenAI API endpoint chat/completion.
These requests require the following:
- A Wallaroo pipeline deployed with the Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline's OpenAI API endpoints (see the example sketch following this list).
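For illustration, the following is a minimal sketch using the OpenAI Python client. The base_url shown is a hypothetical placeholder following the deployment endpoint pattern used in the curl example later in this tutorial; substitute your own pipeline's deployment inference endpoint.

from openai import OpenAI

# Retrieve the Wallaroo MLOps API bearer token for authentication
token = wl.auth.auth_header()['Authorization'].split()[1]

# Hypothetical placeholder endpoint; replace with your deployment inference endpoint
client = OpenAI(
    base_url="https://example.wallaroo.ai/v1/api/pipelines/infer/llama-openai-ragyns-63/llama-openai-ragyns/openai/v1",
    api_key=token
)

response = client.chat.completions.create(
    model="",  # left empty, as in the curl example below
    messages=[{"role": "user", "content": "closest movie title to good morning"}],
    max_tokens=200
)
print(response.choices[0].message.content)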
Inference and Inference Results Logs Examples
The following demonstrates performing an inference request using openai_chat_completion.
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "closest movie title to good morning"}]).choices[0].message.content
'The closest movie title to "good morning" is likely "Good Morning, Vietnam" (1987), a comedy-drama film directed by Barry Levinson and starring Robin Williams.'
The following demonstrates performing an inference request using openai_completion with token streaming enabled.
for chunk in pipeline.openai_completion(prompt="Give me the title of a good movie from the 1990's", max_tokens=500, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
that fits this description.
## Step 1: Identify key elements of the movie description
The movie is told in flashbacks, it's about an older man's obsession for a woman who can belong to no-one, and it's from the 1990's.
## Step 2: Consider movies from the 1990's that fit the description
One movie that fits this description is "The English Patient" (1996), but it's not the only one. Another movie that fits is "The Piano" (1993), but it's not the one I'm thinking of.
## Step 3: Think of another movie that fits the description
A movie that fits the description is "The Piano" is not it, but "The Piano" is a good guess, another movie that fits is "The English Patient" is not it, but "The English Patient" is a good guess, but the movie I'm thinking of is "The Piano" is not it, but "The English Patient" is not it, but "The Piano" is a good guess, but the movie I'm thinking of is "The English Patient" is not it, but I think I have it.
## Step 4: Identify the movie
The movie I'm thinking of is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient" is not it, but I think I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient-shift" no, I have it, the movie is "The Piano" is not it, but I think I have it, the movie is "The English Patient
The following demonstrates using the pipeline inference URL with the OpenAI extension endpoint for completions. First the authentication token is retrieved, then the inference request is made.
token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'
# Streaming: Completion
!curl -X POST \
    -H "Authorization: Bearer abc123" \
    -H "Content-Type: application/json" \
    -d '{"model": "", "prompt": "Give me the title of a good movie from the 1990s", "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
    https://example.wallaroo.ai/v1/api/pipelines/infer/llama-openai-ragyns-63/llama-openai-ragyns/openai/v1/completions
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":",","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":1,"total_tokens":43,"ttft":0.10214721,"tps":9.789792594433074}}
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" and","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":2,"total_tokens":44,"ttft":0.10214721,"tps":14.570481440179023}}
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" I","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":3,"total_tokens":45,"ttft":0.10214721,"tps":13.070685403999592}}
...
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" from","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":99,"total_tokens":141,"ttft":0.10214721,"tps":9.90478908735264}}
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[{"text":" the","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":42,"completion_tokens":100,"total_tokens":142,"ttft":0.10214721,"tps":9.905616236577528}}
data: {"id":"cmpl-12e8fff796b44a47a16fb74eff83e468","created":1751480108,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":42,"completion_tokens":100,"total_tokens":142,"ttft":0.10214721,"tps":9.905460621690295}}
data: [DONE]
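The chat/completions extension endpoint follows the same pattern. As a sketch using the same placeholder endpoint and token as above, a non-streaming chat completion request would look like:

# Chat completion (non-streaming), using the same placeholder endpoint and token as above
!curl -X POST \
    -H "Authorization: Bearer abc123" \
    -H "Content-Type: application/json" \
    -d '{"model": "", "messages": [{"role": "user", "content": "closest movie title to good morning"}], "max_tokens": 200}' \
    https://example.wallaroo.ai/v1/api/pipelines/infer/llama-openai-ragyns-63/llama-openai-ragyns/openai/v1/chat/completions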
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today