Deploy Llama with Continuous Batching Using Native vLLM Framework and QAIC AI Acceleration


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.


The following tutorial demonstrates deploying the Llama LLM with the following enhancements:

  • The Wallaroo Native vLLM Framework: Provides performance optimizations through framework configuration options.
  • Continuous Batching: Configurable batch sizes balance latency and throughput.
  • QAIC AI Acceleration: AI acceleration on an x86-compatible architecture at low power.

For access to these sample models and for a demonstration of how to use an LLM deployment with QAIC acceleration, continuous batching, and other features:

Tutorial Goals

This tutorial demonstrates the following procedure:

  • Upload a Llama LLM with:
    • The Wallaroo Native vLLM runtime
    • QAIC AI Acceleration enabled
    • Framework configuration options to enhance performance
  • Configure continuous batching as a model configuration option.
  • Set a deployment configuration to allocate hardware resources and deploy the LLM.
  • Perform sample inferences and show both the inference results and the inference result logs.

Prerequisites

Tutorial Steps

Import libraries

The first step is to import the required Python libraries, mainly the Wallaroo SDK.

import base64

import wallaroo
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework, VLLMConfig
from wallaroo.engine_config import Acceleration, QaicConfig
from wallaroo.object import EntityNotFoundError
import pyarrow as pa
import pandas as pd

Connect to the Wallaroo Instance

Next connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

LLM Upload

Uploading the LLM takes the following steps:

  • Define Schemas: The input and output schemas are defined in Apache PyArrow format. For this tutorial, they are converted to base64 strings used for uploading through the Wallaroo MLOps API.
  • Upload the model via either the Wallaroo SDK or the Wallaroo MLOps API.

Define Schemas

The schemas are defined in Apache PyArrow format for the inputs and outputs.

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])

Each is then converted to a base64 string that is later used for uploading via the Wallaroo MLOps API.

base64.b64encode(bytes(input_schema.serialize())).decode("utf8")
base64.b64encode(bytes(output_schema.serialize())).decode("utf8")
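
For reuse in the MLOps API upload shown later, the encoded strings can also be captured in variables. This is a minimal sketch; the variable names encoded_input_schema and encoded_output_schema are illustrative and not part of the tutorial assets:

# Capture the base64-encoded schemas for reuse in the MLOps API request body.
encoded_input_schema = base64.b64encode(bytes(input_schema.serialize())).decode("utf8")
encoded_output_schema = base64.b64encode(bytes(output_schema.serialize())).decode("utf8")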

Upload LLM

LLM uploads to Wallaroo are either via the Wallaroo SDK or the Wallaroo MLOps API.

The following demonstrates uploading the LLM via the SDK. In this example, the QAIC acceleration configuration is defined. This optional step fine-tunes the QAIC AI Acceleration hardware settings to best fit the LLM.

qaic_config = QaicConfig(
    num_devices=4,          # number of System-on-Chips (SoCs) the model is split across
    full_batch_size=16,     # maximum number of concurrent sequences for continuous batching
    ctx_len=256,            # maximum context length
    prefill_seq_len=128,    # prompt chunk length used during the prefill stage
    mxfp6_matmul=True,      # compress MatMul weights to MXFP6 precision
    mxint8_kv_cache=True    # store the KV cache in MXINT8 precision
)

LLMs are uploaded with the Wallaroo SDK method wallaroo.client.Client.upload_model. In this step, the following options are configured:

  • The model name and file path.
  • The framework, in this case the native vLLM runtime.
  • The optional framework configuration, which sets specific options for the LLM’s performance.
  • The input and output schemas.
  • The hardware acceleration, set to wallaroo.engine_config.Acceleration.QAIC.with_config. The with_config method accepts the hardware configuration options defined above.

llm = wl.upload_model(
    "llama-31-8b-qaic", 
    "llama-31-8b.zip", 
    framework=Framework.VLLM,
    framework_config=VLLMConfig(
        max_num_seqs=16,
        max_model_len=256,
        max_seq_len_to_capture=128, 
        quantization="mxfp6",
        kv_cache_dtype="mxint8", 
        gpu_memory_utilization=1
    ),
    input_schema=input_schema, 
    output_schema=output_schema, 
    accel=Acceleration.QAIC.with_config(qaic_config)
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime......................................................................................................................................................................................................................................
Successful
Ready

The other upload option is the Wallaroo MLOps API endpoint v1/api/models/upload_and_convert. For this option, the base64-converted input and output schemas are used, and the framework_config and accel options are specified in dict format. Otherwise, the same parameters are set:

  • The model name and file path.
  • The conversion parameter which defines:
    • The framework as native vLLM
    • The optional framework configuration, which sets specific options for the LLM’s performance.
  • The input and output schemas set as base64 strings.
  • The accel parameter, which specifies the AI accelerator as qaic along with the additional hardware configuration options.

curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer <your-token-here>" \
  -F 'metadata={"name": "vllm-llama-31-8b-qaic-new-v1", "visibility": "private", "workspace_id": 6, "conversion": {"framework": "vllm", "framework_config": {"framework": "vllm", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
  -F "file=@llama-31-8b.zip;type=application/octet-stream" \
  https://qaic-poc.pov.wallaroo.io/v1/api/models/upload_and_convert | cat
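
For reference, the same upload can also be made from Python. The following is a minimal sketch using the requests library and is not part of the tutorial assets: the host, bearer token, and workspace_id are placeholders, the metadata mirrors the curl command above, and the encoded_input_schema / encoded_output_schema values are the base64 schema strings produced earlier.

import json
import requests

host = "https://qaic-poc.pov.wallaroo.io"  # Wallaroo API endpoint (placeholder)
token = "<your-token-here>"                # bearer token for the MLOps API (placeholder)

metadata = {
    "name": "vllm-llama-31-8b-qaic-new-v1",
    "visibility": "private",
    "workspace_id": 6,
    "conversion": {
        "framework": "vllm",
        "framework_config": {
            "framework": "vllm",
            "config": {
                "max_num_seqs": 16,
                "max_model_len": 256,
                "max_seq_len_to_capture": 128,
                "quantization": "mxfp6",
                "kv_cache_dtype": "mxint8",
                "gpu_memory_utilization": 1,
            },
        },
        "accel": {
            "qaic": {
                "num_devices": 4,
                "full_batch_size": 16,
                "ctx_len": 256,
                "prefill_seq_len": 128,
                "mxfp6_matmul": True,
                "mxint8_kv_cache": True,
            },
        },
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": encoded_input_schema,    # base64 schema strings from the earlier sketch
    "output_schema": encoded_output_schema,
}

# Submit the model file and metadata as a multipart/form-data request.
with open("llama-31-8b.zip", "rb") as model_file:
    response = requests.post(
        f"{host}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("llama-31-8b.zip", model_file, "application/octet-stream"),
        },
    )
response.raise_for_status()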

When the LLM is uploaded, we retrieve it via the wallaroo.client.Client.get_model method for use in later steps.

llm = wl.get_model("llama-31-8b-qaic")
llm
Name            llama-31-8b-qaic
Version         0600dc44-c530-4425-a29d-9754406b0bb2
File Name       llama-31-8b.zip
SHA             62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status          ready
Image Path      proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6196
Architecture    x86
Acceleration    {'qaic': {'ctx_len': 256, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}}
Updated At      2025-12-Jun 17:46:32
Workspace id    9
Workspace name  younes@wallaroo.ai - Default Workspace

Configure Continuous Batching

Continuous batching options are applied to the model configuration with the model.Model.configure method. This method requires both the input and output schemas, along with the wallaroo.continuous_batching_config.ContinuousBatchingConfig settings.

from wallaroo.continuous_batching_config import ContinuousBatchingConfig
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)

llm = llm.configure(input_schema=input_schema,output_schema=output_schema,continuous_batching_config = cbc)

Deploy the LLM

Deploying the LLM takes the following steps:

  • Set the deployment configuration.
  • Deploy the LLM with the deployment configuration.

Set the Deployment Configuration

The deployment configuration determines the hardware resources allocated for the LLM's exclusive use. The LLM options are set via the sidekick options.

For this example, the deployment hardware includes a Qualcomm AI 100 and allocates the following resources:

  • Replicas: 1 minimum, 2 maximum. This provides scalability, with additional replicas scaled up or down automatically based on resource usage.
  • CPUs: 4
  • RAM: 12 Gi
  • GPUs: 4
    • For Wallaroo deployment configurations for QAIC, the gpus parameter specifies the number of System-on-Chips (SoCs) allocated.
  • Deployment label: Specifies the node with the GPUs.

# sidekick_gpus is the number of Qualcomm AI 100 SoCs allocated
deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=2) \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(llm, 4) \
    .sidekick_memory(llm, '12Gi') \
    .sidekick_gpus(llm, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .scale_up_queue_depth(5) \
    .autoscaling_window(600) \
    .build()

The LLM is applied to a Wallaroo pipeline as a pipeline step. Once set, the pipeline is deployed with the deployment configuration. When the deployment is complete, the LLM is ready for inference requests.

pipeline = wl.build_pipeline("llama-31-qaic-yns1")
pipeline.clear()
pipeline.undeploy()
pipeline.add_model_step(llm)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.244.69.157',
   'name': 'engine-f4bf767cd-hgffn',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-31-qaic-yns1',
      'status': 'Running',
      'version': 'bf637d55-0eca-4448-8417-8cf78570dc29'}]},
   'model_statuses': {'models': [{'model_version_id': 62,
      'name': 'llama-31-8b-qaic',
      'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
      'status': 'Running',
      'version': '0600dc44-c530-4425-a29d-9754406b0bb2'}]}}],
 'engine_lbs': [{'ip': '10.244.69.177',
   'name': 'engine-lb-664c6d8455-zfb4b',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.244.69.160',
   'name': 'engine-sidekick-llama-31-8b-qaic-62-5df4569fd5-nlhpm',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference Examples

LLMs deployed in Wallaroo accept pandas DataFrames as inference inputs. The DataFrame is submitted to the pipeline with the infer method, and the results are returned as a pandas DataFrame.

df = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [128]})
df.head()
                 prompt  max_tokens
0  What is Wallaroo.AI?         128
pipeline.infer(df, timeout=600)
                      time  in.max_tokens             in.prompt                                 out.generated_text  out.num_output_tokens  anomaly.count
0  2025-06-12 18:33:45.902            128  What is Wallaroo.AI?  \nWallaroo.AI is a high-performance, scalable...                    128              0
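
Because continuous batching is enabled, multiple prompts can also be submitted in a single request. The following is a minimal sketch with illustrative prompts; with continuous batching, the serving engine can interleave generation for multiple sequences rather than processing them strictly one at a time.

# Each row in the DataFrame is a separate prompt submitted in the same request.
batch_df = pd.DataFrame({
    "prompt": [
        "What is Wallaroo.AI?",
        "What is continuous batching in LLM serving?",
        "Describe the Qualcomm Cloud AI 100 accelerator.",
    ],
    "max_tokens": [128, 128, 128],
})
pipeline.infer(batch_df, timeout=600)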

The pipeline logs method returns a pandas DataFrame showing the inputs and outputs of the inference requests.

pipeline.logs()
                      time  in.max_tokens             in.prompt                                 out.generated_text  out.num_output_tokens  anomaly.count
0  2025-06-12 18:33:45.902            128  What is Wallaroo.AI?  \nWallaroo.AI is a high-performance, scalable...                    128              0

For access to these sample models and for a demonstration of how to use an LLM deployment with QAIC acceleration, continuous batching, and other features: