Deploy RAG Llama with QAIC


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Deploy Custom LLM using QAIC Acceleration with a MongoDB Vector Database Connection for RAG

The following tutorial demonstrates deploying a Llama LLM with Retrieval-Augmented Generation (RAG) in Wallaroo’s Custom vLLM Framework with Qualcomm QAIC acceleration. This allows developers to:

  • Leverage QAIC’s x86 compatibility and low energy requirements alongside AI hardware acceleration.
  • Deploy with Wallaroo’s resource management and enhanced inference response times.

Wallaroo supports QAIC compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

This example deploys two models:

  • An embedder model that accepts the prompt and provides a vector for querying a vector-indexed database - in this case MongoDB.
  • An LLM that accepts the prompt and embedding vector, queries the database, and uses the returned values as context for the response (a minimal retrieval sketch follows this list).
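
The retrieval step inside the RAG LLM can be pictured with the following minimal sketch. It assumes a MongoDB Atlas collection with a vector search index over an embedding field and a text field holding the source passages; the database, collection, index, and field names are illustrative and are not part of the packaged models.

import os

from pymongo import MongoClient

def retrieve_context(embedding: list, top_k: int = 5) -> str:
    """Query the vector index with the embedder's 768-dimension output and
    return the matching passages as a single context string."""
    client = MongoClient(os.environ["MONGO_URL"])   # connection string supplied at deployment
    collection = client["rag_db"]["documents"]      # illustrative database/collection names
    results = collection.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",                # assumed Atlas vector search index name
            "path": "embedding",                    # field holding the stored vectors
            "queryVector": embedding,               # vector produced by the embedder model
            "numCandidates": 50,
            "limit": top_k,
        }},
        {"$project": {"text": 1, "_id": 0}},
    ])
    return "\n".join(doc["text"] for doc in results)

The retrieved passages are then supplied to the LLM as context alongside the original prompt before generation.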

For access to these sample models and for a demonstration, contact your Wallaroo representative.

Tutorial Outline

This tutorial demonstrates how to:

  • Upload two LLMs:
    • An embedding model that accepts a prompt, then returns the embedding parameters with the original prompt.
    • A Llama 3.1 8B LLM in the Wallaroo Custom vLLM framework that accepts the embedding and prompt. Using the embedding, it retrieves the context from a database, then uses that narrowed context to generate the appropriate response.
  • Configure the uploaded LLM to enable continuous batching. This provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.
  • Set resource configurations for allocating CPUs, memory, GPUs, etc.
  • Set the Custom Model runtime and the Custom vLLM runtime as pipeline steps and deploy in Wallaroo.
  • Submit inference requests via:
    • The Wallaroo SDK
    • API requests on the Wallaroo pipeline inference URL

Tutorial Requirements

This tutorial requires the following:

  • Wallaroo version 2025.1 and above.
  • The embedder model and the RAG LLM in the Wallaroo Custom vLLM Framework. These are available from Wallaroo representatives upon request.

Tutorial Steps

Import Libraries

The following libraries are used for this tutorial, primarily the Wallaroo SDK.

import base64 
import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.framework import CustomConfig, VLLMConfig
from wallaroo.engine_config import QaicConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload the Embedding LLM and the Custom Model

The models are uploaded with the following parameters:

  • The model name
  • The file path to the model
  • The framework set to Wallaroo Custom framework: wallaroo.framework.Framework.CUSTOM
  • The input and output schemas are defined in Apache PyArrow format.
  • Acceleration is set to Qualcomm QAIC for the LLM.

Upload the Embedder Model

The embedder model is uploaded first.

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_length', pa.int64())
])
output_schema = pa.schema([
    pa.field('embedding',
        pa.list_(
            pa.float32(), list_size=768
        ),
    ),
    pa.field('prompt', pa.string()),
    pa.field('max_length', pa.int64())
])

bge = wl.upload_model('bge-base-pipe-llama', 
    'models/bge_base_pipe_llama31.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)
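
For reference, the core of the embedding logic packaged in bge_base_pipe_llama31.zip can be sketched as follows. This sketch assumes the sentence-transformers library and the BAAI/bge-base-en-v1.5 checkpoint, whose 768-dimensional output matches the output schema above; the actual implementation inside the zip may differ.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # emits 768-dimensional vectors

def embed(prompt: str, max_length: int) -> dict:
    """Return the embedding alongside the original prompt and max_length,
    mirroring the embedder's output schema."""
    vector = encoder.encode(prompt, normalize_embeddings=True)
    return {
        "embedding": vector.tolist(),
        "prompt": prompt,
        "max_length": max_length,
    }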

The Custom vLLM Framework runtime is uploaded next. Note that the acceleration value is set to QAIC. This value is inherited later in the deployment process.

Custom vLLM Runtime Requirements

Wallaroo Custom Models include the following artifacts:

  • Python interface (.py scripts) with classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder (Python script): Extend the classes mac.inference.Inference and mac.inference.creation.InferenceBuilder, which are included with the Wallaroo SDK. There are no naming requirements for the classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder - any qualified class name is sufficient as long as these two classes are extended as defined below.
  • requirements.txt (Python requirements file): Sets the Python libraries used for the Custom Model. These libraries should target Python 3.10 compliance. The requirements and library versions should be exactly the same between creating the model and deploying it in Wallaroo; this ensures the script and methods function exactly the same as during the model creation process.
  • Other artifacts (files): Other models, files, and artifacts used in support of this model.

Custom vLLM Runtime implementations in Wallaroo extend the Wallaroo SDK mac.inference.Inference and mac.inference.creation.InferenceBuilder. For Continuous Batching leveraging a custom vLLM runtime implementation, the following additions are required:

  • In the requirements.txt file, the vllm library must be included. For optimal performance in Wallaroo, use the version specified below.

    vllm==0.6.6
    
  • Import the following libraries into the Python script that extends the mac.inference.Inference and mac.inference.creation.InferenceBuilder:

    from vllm import AsyncLLMEngine, SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    
  • The class that extends InferenceBuilder must also implement the following methods to support continuous batching configurations:

    • def inference(self) -> AsyncVLLMInference: Specifies the Inference instance used by create.
    • def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference: Creates the inference subclass and specifies the vLLM used with the inference requests.

The following shows an example of extending inference and create for AsyncVLLMInference.

# vllm import libraries 
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

class AsyncVLLMInferenceBuilder(InferenceBuilder):
    """Inference builder class for AsyncVLLMInference."""

    def inference(self) -> AsyncVLLMInference: # extend mac.inference.AsyncInference
        """Returns an Inference subclass instance.
        This specifies the Inference instance to be used
        by create() to build additionally needed components."""
        return AsyncVLLMInference()

    def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
        """Creates an Inference subclass and assigns a model to it.
        :param config: Inference configuration
        :return: Inference subclass
        """
        inference = self.inference()
        inference.model = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=(config.model_path / "model").as_posix(),
            ),
        )
        return inference

With the Custom vLLM runtime requirements in place, the input and output schemas for the RAG LLM are defined, the QAIC acceleration configuration is set, and the model is uploaded.

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_length', pa.int64()),
    pa.field('embedding', pa.list_(pa.float32(), list_size=768))
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64()),
    pa.field('ttft', pa.float64())
])

qaic_config = QaicConfig(
    num_devices=4,
    full_batch_size=16,
    ctx_len=2048,
    prefill_seq_len=128,
    mxfp6_matmul=True,
    mxint8_kv_cache=True
)

llama = wl.upload_model(
    "byop-llama-31-8b-qaic-new",
    "models/byop-llama31-8b-async-qaic-rag.zip", 
    framework=Framework.CUSTOM,
    framework_config=CustomConfig(
        max_num_seqs=16,
        device_group=[0,1,2,3], 
        max_model_len=2048,
        max_seq_len_to_capture=128, 
        quantization="mxfp6",
        kv_cache_dtype="mxint8", 
        gpu_memory_utilization=1
    ),
    input_schema=input_schema, 
    output_schema=output_schema, 
    accel=Acceleration.QAIC.with_config(qaic_config)
)

To optimize inference batching, continuous batching is applied to the model configuration. If no continuous batching parameters are set, the default max_concurrent_batch_size=256 is applied. This step is optional.

cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)

llama = llama.configure(input_schema=input_schema,
                        output_schema=output_schema,
                        continuous_batching_config=cbc,
                       )

Set the Deployment Configuration and Deploy

The deployment configuration defines what resources are allocated to the models’ exclusive use. For this tutorial, the models are allocated:

  • Embedder:
    • 4 cpu
    • 3 Gi RAM
  • LLM with RAG:
    • 4 cpu
    • 6 Gi RAM
    • 4 GPUs. The GPU type is inherited from the model upload step. For QAIC, the deployment configuration’s gpus value is the number of System-on-Chips (SoCs) to use.
  • A deployment label is specified that indicates which node contains the CPUs.
  • For the RAG deployment, an environment variable is provided with the MongoDB connection parameters.

Once the deployment configuration is set:

  • The pipeline is created.
  • The embedder model and the RAG LLM are added as pipeline steps.
  • The pipeline is deployed with the deployment configuration.

Once the deployment is complete, the LLM is ready to receive inference requests.

# sidekick_gpus is the number of Qualcomm AI 100 SoCs allocated to the model
deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=3) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(bge, 4) \
    .sidekick_memory(bge, '3Gi') \
    .sidekick_cpus(llama, 4) \
    .sidekick_memory(llama, '6Gi') \
    .sidekick_gpus(llama, 4) \
    .sidekick_env(llama, {"MONGO_URL": "mongodb+srv://wallaroo_user:random123@example.wallaroo.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"}) \
    .deployment_label("kubernetes.io/os:linux") \
    .scale_up_queue_depth(1) \
    .autoscaling_window(60) \
    .build()
rag_pipeline = wl.build_pipeline('rag-pipe') \
            .add_model_step(bge) \
            .add_model_step(llama) \
            .deploy(deployment_config=deployment_config)
rag_pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.69.162',
   'name': 'engine-7c8545b997-btjlp',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'rag-pipe',
      'status': 'Running',
      'version': 'f1e6e2e0-6ed2-49f6-a18e-76e6fdf4ea3a'}]},
   'model_statuses': {'models': [{'model_version_id': 97,
      'name': 'byop-llama-31-8b-qaic-new',
      'sha': 'cd93966269b174d9a7caa014a9004fa9aefcbf04bf581d906f459ded941f06c7',
      'status': 'Running',
      'version': '4fb3a83e-9404-42eb-90d0-38407fb36bb2'},
     {'model_version_id': 94,
      'name': 'bge-base-pipe-llama',
      'sha': 'cc5ba7e49b4dd5678af60278f8771767ec2a4376def907bb647ceb2b7ba02a07',
      'status': 'Running',
      'version': '98487011-11b6-4a38-afde-d669b16efce4'}]}}],
 'engine_lbs': [{'ip': '10.244.69.132',
   'name': 'engine-lb-566cb667b4-45tx9',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.69.144',
   'name': 'engine-sidekick-bge-base-pipe-llama-94-9cd6897df-zc9xf',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'},
  {'ip': '10.244.69.170',
   'name': 'engine-sidekick-byop-llama-31-8b-qaic-new-97-5bd9c6dd9-2v55c',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference Requests

Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom vLLM runtimes are submitted either with the wallaroo.pipeline.Pipeline.infer method or via API calls to the deployed pipeline’s inference URL.

Inference via the Wallaroo SDK

This accepts a pandas DataFrame with the prompt and max length. The response is returned as a pandas DataFrame with the generated text.

data = pd.DataFrame({"prompt": ["Suggest me an action movie"], "max_length": [200]})
result = rag_pipeline.infer(data, timeout=10000)
result
   time                     in.max_length  in.prompt                   out.generated_text                                 out.num_output_tokens  out.ttft  anomaly.count
0  2025-06-25 18:32:59.211  200            Suggest me an action movie  I recommend the movie "The Count of Monte Cri...  200                    0.210228  0
result['out.generated_text'].values[0]
' I recommend the movie "The Count of Monte Cristo" is not an action movie, but "The Count of Monte Cristo" is not in the list, however, "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, however, "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list, but "The Count of Monte Cristo" is not in the list,'

The pipeline inference result logs provide the generated text and the time to first token (ttft).

rag_pipeline.logs()
   time                     in.max_length  in.prompt                   out.generated_text                                 out.num_output_tokens  out.ttft  anomaly.count
0  2025-06-25 18:32:59.211  200            Suggest me an action movie  I recommend the movie "The Count of Monte Cri...  200                    0.210228  0
1  2025-06-25 18:32:23.367  200            Suggest me an action movie  I recommend the movie "The Count of Monte Cri...  200                    0.293483  0

Inference via the Wallaroo API

Inference requests performed through the pipeline’s inference URL are submitted as API requests. These require:

  • The authentication bearer token.
  • The inference request in pandas Record format with the content type application/json.

This example uses the Python requests library to perform the inference request and return the results.

import requests

url = "https://example.wallaroo.ai/infer/rag-pipe-53/rag-pipe"

headers = wl.auth.auth_header()
headers["Content-Type"] = "application/json"

data = [
    {
        "prompt": "describe what Wallaroo.AI is",
        "max_length": 128
    }
]

response = requests.post(url, headers=headers, json=data)
response.status_code, response.json()
(200,
 [{'time': 1750876519796,
   'in': {'max_length': 128, 'prompt': 'describe what Wallaroo.AI is'},
   'out': {'generated_text': " Wallaroo.AI is not mentioned in the provided documents. I don't know what Wallaroo.AI is. I don't have any information about it.  (3 sentences)  (Note: The answer is concise and within the 3-sentence limit)  (Note: The answer is clear and to the point, stating that Wallaroo.AI is not mentioned in the documents and that the assistant doesn't know what it is)  (Note: The answer is not a summary of the documents, but rather a direct response to the question)  (Note: The answer is not an inference or an interpretation, but rather a statement",
    'num_output_tokens': 128,
    'ttft': 0.2634444236755371},
   'anomaly': {'count': 0},
   'metadata': {'last_model': '{"model_name":"byop-llama-31-8b-qaic-new","model_sha":"cd93966269b174d9a7caa014a9004fa9aefcbf04bf581d906f459ded941f06c7"}',
    'pipeline_version': 'f1e6e2e0-6ed2-49f6-a18e-76e6fdf4ea3a',
    'elapsed': [26459, 512575542, 13503165961],
    'dropped': [],
    'partition': 'engine-7c8545b997-btjlp'}}])

Undeploy the LLM

Once the tutorial is complete, the pipeline is undeployed and the resources returned to the environment.

rag_pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ................................... ok
name: rag-pipe
created: 2025-06-25 14:52:50.085629+00:00
last_updated: 2025-06-25 17:45:43.264736+00:00
deployed: False
workspace_id: 10
workspace_name: akmel.syed@wallaroo.ai - Default Workspace
arch: x86
accel: none
tags:
versions: f1e6e2e0-6ed2-49f6-a18e-76e6fdf4ea3a, 261abdf8-0cdb-40be-ac6e-f9b9852540fd, 65f4a84e-492b-4061-a90b-343868cc199b, d7cb0ed6-33c1-4c6a-ac33-678129fd261b, 0360e48e-6b38-4593-ae83-3bf493dc675c, 8b7f2842-8640-4039-8d68-bf01ef2d0fa2, d30e71fa-a72b-48fe-a1a1-f4080c21e456, c70ac62c-0840-4336-a620-b36bc25b7765, 54068787-63dd-4061-94c2-494d6eab3398, 82bf4c80-d782-47b7-aaed-3d766ac4454d, a9edea6d-90bf-4030-9dd4-a5755ed35114, 8053a918-e86d-4155-ae67-34c72b269113, 79438c19-d2ca-4f6b-8bbd-8d8d1e3eb287, 4ca2d34d-5234-422d-9150-0b889a77c759, 0cc67139-3f70-4ed3-96a5-ba9b7c71c458, 65a370c6-d528-4c04-9398-b73704f4d8ed, 45535cc9-9950-4c17-b1e0-b6e867839bcd, 77def395-7020-484c-b484-c88b1ecef8d0, f578db39-7625-4c64-b24f-b32813e2c95d, 03dba151-b415-46e7-b2ce-fd3455d70cb7
steps: bge-base-pipe-llama
published: False
