Deploy Llama with Continuous Batching Using the Native vLLM Framework with QAIC and OpenAI Inference


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Deploy Llama with Continuous Batching Using the Native vLLM Framework and QAIC AI Acceleration with OpenAI Compatibility

The following tutorial demonstrates deploying the Llama LLM with the following enhancements:

  • The Wallaroo Native vLLM Framework: Provide performance optimizations with framework configuration options.
  • Continuous Batching: Configurable batch sizes balance latency and throughput.
  • QAIC AI Acceleration: AI acceleration on low-power, x86-compatible hardware.
  • OpenAI API compatibility: The LLM accepts inference requests using the OpenAI completion and chat/completion endpoints, compatible with OpenAI API clients.

For access to these sample models and for a demonstration of how to use a LLM deployment with QAIC acceleration, OpenAI API compatibility, continuous batching, and other features, contact your Wallaroo support representative.

Tutorial Goals

This tutorial demonstrates the following procedure:

  • Upload a Llama LLM with:
    • The Wallaroo Native vLLM runtime
    • QAIC AI Acceleration enabled
    • Framework configuration options to enhance performance
  • After upload, set the LLM configuration options:
    • Configure continuous batching and settings.
    • Enable OpenAI API compatibility and set inference options.
  • Set a deployment configuration to allocate hardware resources and deploy the LLM.
  • Publish the model and deployment configuration to an Open Container Initiative (OCI) registry for deployment in edge environments with QAIC AI accelerators installed.
  • Perform sample inferences via OpenAI API inference methods with and without token streaming.

Prerequisites

Tutorial Steps

Import libraries

The first step is to import the Python libraries required, mainly the Wallaroo SDK.

Imports

import base64

import wallaroo
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.object import EntityNotFoundError
from wallaroo.engine_config import QaicConfig
from wallaroo.framework import VLLMConfig
import pyarrow as pa
import pandas as pd
from wallaroo.openai_config import OpenaiConfig
from wallaroo.continuous_batching_config import ContinuousBatchingConfig

Connect to the Wallaroo Instance

Next connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

LLM Upload

Uploading the LLM takes the following steps:

  • Define Schemas: The input and output schemas are defined in Apache PyArrow format. For this tutorial, they are converted to base64 strings for uploading through the Wallaroo MLOps API (a conversion sketch follows this list).
  • Upload the model via either the Wallaroo SDK or the Wallaroo MLOps API.
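
The following is a minimal sketch of that conversion, assuming empty PyArrow schemas (used here because OpenAI compatibility ignores them); each schema is serialized and base64-encoded for the input_schema and output_schema fields of the MLOps API upload request:

import base64
import pyarrow as pa

input_schema = pa.schema([])
output_schema = pa.schema([])

# Serialize each schema to its Arrow IPC form and base64-encode it for the
# MLOps API upload request.
encoded_input_schema = base64.b64encode(input_schema.serialize().to_pybytes()).decode("utf8")
encoded_output_schema = base64.b64encode(output_schema.serialize().to_pybytes()).decode("utf8")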

Upload LLM

LLM uploads to Wallaroo are either via the Wallaroo SDK or the Wallaroo MLOps API.

The following demonstrates uploading the LLM via the SDK. In this example the QAIC acceleration configuration is defined. This is an optional step that fine-tunes the QAIC AI Acceleration hardware performance to best fit the LLM.

qaic_config = QaicConfig(
    num_devices=4, 
    full_batch_size=16, 
    ctx_len=256, 
    prefill_seq_len=128, 
    mxfp6_matmul=True, 
    mxint8_kv_cache=True
)

LLMs are uploaded with the Wallaroo SDK method wallaroo.client.Client.upload_model. In this step, the following options are configured:

  • The model name and file path.
  • The framework, in this case the native vLLM runtime.
  • The optional framework configuration, which sets specific options for the LLM’s performance.
  • The input and output schemas. For OpenAI compatibility these are ignored, so they are set as empty schemas.
  • The hardware acceleration set to wallaroo.engine_config.Acceleration.QAIC.with_config. The additional with_config method accepts the hardware configuration options.
model = wl.upload_model(
    "llama-qaic-openai", 
    "llama-31-8b.zip", 
    framework=Framework.VLLM,
    framework_config=VLLMConfig(
        max_num_seqs=16,
        max_model_len=256,
        max_seq_len_to_capture=128, 
        quantization="mxfp6",
        kv_cache_dtype="mxint8", 
        gpu_memory_utilization=1
    ),
    input_schema=pa.schema([]),
    output_schema=pa.schema([]), 
    accel=Acceleration.QAIC.with_config(qaic_config)
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..................................................................................................................................................................................................................................
Successful
Ready

The other upload option is the Wallaroo MLOps API endpoint v1/api/models/upload_and_convert. For this option, the base64 converted input and output schemas are used, and the framework_config and accel options are specified in dict format. Otherwise, the same parameters are set:

  • The model name and file path.
  • The conversion parameter which defines:
    • The framework as native vLLM
    • The optional framework configuration, which sets specific options for the LLM’s performance.
  • The input and output schemas set as base64 strings.
  • The accel parameter which specifies the AI accelerator as qaic with the additional hardware configuration options.

curl --progress-bar -X POST \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer "abc123" \
    -F \'metadata={"name": "llama-qaic-openai", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": {"qaic": {"aic_enable_depth_first": false, "ctx_len": 256, "full_batch_size": 16, "mxfp6_matmul": true, "mxint8_kv_cache": true, "num_cores": 16, "num_devices": 4, "prefill_seq_len": 128}}, "framework": "vllm", "framework_config": {"config": {"gpu_memory_utilization": 0.9, "kv_cache_dtype": "auto", "max_num_seqs": 256, "max_seq_len_to_capture": 8192, "quantization": "none"}, "framework": "vllm"}, "python_version": "3.8", "requirements": []}, "input_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA=", "output_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA="};type=application/json\' \
    -F "file=@llama-31-8b.zip;type=application/octet-stream" \
    https://doc-test.wallarooexample.ai/v1/api/models/upload_and_convert | cat

When the LLM upload is complete, we retrieve it via wallaroo.client.Client.get_model for use in later steps.

model = wl.get_model("llama-qaic-openai")
model
Name: llama-qaic-openai
Version: 0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d
File Name: llama-31-8b.zip
SHA: 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6231
Architecture: x86
Acceleration: {'qaic': {'ctx_len': 1024, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}}
Updated At: 2025-02-Jul 17:54:00
Workspace id: 9
Workspace name: younes@wallaroo.ai - Default Workspace

Configure Continuous Batching

Continuous batching options are applied to the model configuration with the model.Model.configure method. This method accepts the input and output schemas and the wallaroo.continuous_batching_config.ContinuousBatchingConfig settings.

from wallaroo.continuous_batching_config import ContinuousBatchingConfig
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)

Configure OpenAI Compatibility

OpenAI compatibility options are set through the wallaroo.openai_config.OpenaiConfig object, with the most important being:

  • enabled: Enables OpenAI compatibility
  • completion_config: Sets the OpenAI completion endpoint options except stream; the stream option is only provided at inference.
  • chat_completion_config: Sets the OpenAI chat/completion endpoint options except stream; the stream option is only provided at inference.
openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
        {% for message in messages %}
            {% if message['role'] == 'user' %}
                {{ '<|user|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'system' %}
                {{ '<|system|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'assistant' %}
                {{ '<|assistant|>\n'  + message['content'] + eos_token }}
            {% endif %}
            
            {% if loop.last and add_generation_prompt %}
                {{ '<|assistant|>' }}
            {% endif %}
        {% endfor %}"""
    })

Set LLM Configuration

Both the continuous batching and OpenAI API compatibility options are set through the LLM’s configure method.

model = model.configure(openai_config=openai_config, continuous_batching_config = cbc)
model
Name: llama-qaic-openai
Version: 0c97b5ba-daac-4688-8d8e-fc1f0bcd9b9d
File Name: llama-31-8b.zip
SHA: 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6231
Architecture: x86
Acceleration: {'qaic': {'ctx_len': 1024, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}}
Updated At: 2025-02-Jul 17:54:00
Workspace id: 9
Workspace name: younes@wallaroo.ai - Default Workspace

Deploy the LLM

Deploying the LLM takes the following steps:

  • Set the deployment configuration.
  • Deploy the LLM with the deployment configuration.

Set the Deployment Configuration

The deployment configuration determines the hardware resources allocated for the LLM’s exclusive use. The LLM resources are set via the sidekick options.

For this example, the deployment hardware includes a Qualcomm AI 100 and allocates the following resources:

  • CPUs: 4
  • RAM: 12 Gi
  • GPUs: 4
    • For Wallaroo deployment configurations for QAIC, the gpu parameter specifies the number of System-on-Chips (SoCs) allocated.
  • Deployment label: Specifies the node with the QAIC accelerators.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()

The LLM is added to a Wallaroo pipeline as a pipeline step. Once set, the pipeline is deployed with the deployment configuration. When the deployment is complete, the LLM is ready for inference requests.

pipeline = wl.build_pipeline("llamaqaicopenaiedge")
pipeline.clear()
pipeline.undeploy()
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)

Inference Examples

LLMs deployed in Wallaroo accept pandas DataFrames as inference inputs; with OpenAI compatibility enabled, they also accept OpenAI style inference requests. These examples use the OpenAI API inference methods via the Wallaroo SDK and OpenAI API clients.

Parameters passed with OpenAI compatible inference requests override the OpenAI configuration options applied at the LLM level, providing additional flexibility as needed; an override sketch is shown after the SDK examples below.

OpenAI Inference via the Wallaroo SDK

Inference requests with OpenAI compatible enabled models in Wallaroo via the Wallaroo SDK use the following methods:

  • wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
  • wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.

Each example demonstrates using these methods with and without token streaming.

The following demonstrates performing an inference with openai_chat_completion. Note that the parameters passed match those used by the OpenAI chat/completion endpoint.

# Performing completions

pipeline.openai_chat_completion(messages=[{"role": "user", "content": "good morning"}]).choices[0].message.content
'Hello! How can I assist you today?'

This example uses the openai_completion method.

pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200).choices[0].text
'\nWallaroo.ai is a cloud-based platform that provides a suite of artificial intelligence (AI) and machine learning (ML) tools for data scientists, developers, and business users. The platform is designed to simplify the process of building, deploying, and managing AI and ML models, making it easier for organizations to leverage the power of AI to drive business outcomes.\nKey Features of Wallaroo.ai:\n1. Model Development: Wallaroo.ai provides a range of tools and libraries for building, training, and deploying AI and ML models, including support for popular frameworks like TensorFlow, PyTorch, and scikit-learn.\n2. Model Serving: The platform offers a scalable and secure model serving layer that allows users to deploy and manage AI and ML models in production environments.\n3. Data Integration: Wallaroo.ai provides seamless integration with various data sources, including relational databases, NoSQL databases, cloud storage, and streaming data sources.\n4. Model Monitoring: The platform offers real-time monitoring and analytics capabilities'
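
As noted above, parameters passed at inference take precedence over the OpenaiConfig values set during model configuration. The following is a minimal sketch of such an override, assuming openai_completion passes through the standard OpenAI completion parameters such as temperature:

# Per-request parameters override the OpenaiConfig defaults set at model configuration.
result = pipeline.openai_completion(
    prompt="tell me about wallaroo.AI",
    temperature=0.8,   # overrides the 0.3 set in OpenaiConfig
    max_tokens=50      # overrides the 200 set in OpenaiConfig
)
print(result.choices[0].text)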

The following examples show the same methods, with token streaming enabled.

# Now with streaming

for chunk in pipeline.openai_chat_completion(messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I'm happy to help you with your short story about love. What kind of love are you writing about? Is it romantic love, familial love, or something else?
# Now with streaming

for chunk in pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200, stream=True):
    print(chunk.choices[0].text, end="", flush=True)

Wallaroo.ai is an AI platform that enables developers to build, deploy, and manage AI-powered applications with ease. Here's a brief overview of what it offers:
Key Features of Wallaroo.ai:
1. **Model Serving**: Wallaroo.ai provides a scalable and secure model serving platform that allows developers to deploy and manage AI models in production environments.
2. **Model Management**: The platform offers a centralized model management system that enables developers to manage multiple models, track performance, and monitor metrics.
3. **Auto-Scaling**: Wallaroo.ai's auto-scaling feature ensures that AI models can handle 맞 variable workloads, ensuring high performance and availability.
4. **Security**: The platform provides robust security features, including encryption, access control, and auditing, to protect sensitive data and models.
5. **Integration**: Wallaroo.ai supports integration with popular AI frameworks, such as TensorFlow, PyTorch, and scikit-learn, making it easy to deploy and manage AI models.
6

OpenAI Inference via OpenAI API Requests

Inference requests via an OpenAI API client use the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:

  • {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
  • {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.

These requests require the following:

  • A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
  • Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
  • Access to the deployed pipeline’s OpenAI API endpoints.

The first example shows retrieving the authentication token for the Wallaroo instance.

token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'
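
Before the streaming curl examples, the following is a minimal non-streaming sketch in Python, assuming the requests library and the same deployment endpoint and token used below; the response is a standard OpenAI completion object:

import requests

# Non-streaming request to the OpenAI-compatible completions endpoint.
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
payload = {"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}
response = requests.post(
    "https://qaic-poc.pov.wallaroo.io/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/completions",
    headers=headers,
    json=payload,
)
print(response.json()["choices"][0]["text"])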

This example performs an inference request with token streaming enabled on the completions endpoint.

# Streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
  https://qaic-poc.pov.wallaroo.io/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/completions
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" about","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":1,"total_tokens":7,"ttft":0.091563446,"tps":10.921388869527693}}

data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" a","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":2,"total_tokens":8,"ttft":0.091563446,"tps":16.38253480899562}}

data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" character","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":3,"total_tokens":9,"ttft":0.091563446,"tps":16.485729285058817}}

...

data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" mere","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":99,"total_tokens":105,"ttft":0.091563446,"tps":10.019649983937533}}

data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" touch","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":100,"total_tokens":106,"ttft":0.091563446,"tps":10.020860629999026}}

data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":6,"completion_tokens":100,"total_tokens":106,"ttft":0.091563446,"tps":10.020269063389119}}

data: [DONE]

This example performs an inference request with token streaming enabled on the chat/completions endpoint.

# Streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "tell me a story"}], "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
  https://qaic-poc.pov.wallaroo.io/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/chat/completions
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant"}}],"usage":{"prompt_tokens":39,"completion_tokens":0,"total_tokens":39,"ttft":0.093523807,"tps":0.0}}

data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"Once"}}],"usage":{"prompt_tokens":39,"completion_tokens":1,"total_tokens":40,"ttft":0.093523807,"tps":10.679028938385025}}

data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" upon"}}],"usage":{"prompt_tokens":39,"completion_tokens":2,"total_tokens":41,"ttft":0.093523807,"tps":15.893348258670727}}

...

data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":" pointing"}}],"usage":{"prompt_tokens":39,"completion_tokens":100,"total_tokens":139,"ttft":0.093523807,"tps":10.511513285045494}}

data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":39,"completion_tokens":100,"total_tokens":139,"ttft":0.093523807,"tps":10.511361686804662}}

data: [DONE]

OpenAI Inference via the OpenAI Python Library

The following uses the OpenAI Python library to perform the inferences, using the same OpenAI endpoints.

from openai import OpenAI
client = OpenAI(
    base_url='https://qaic-poc.pov.wallaroo.io/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1',
    api_key=token
)
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I'd love to hear it. Please go ahead and share the short story about love. I'll be happy to listen and respond.
for chunk in client.completions.create(model="dummy", prompt="tell me about wallaroo.AI", max_tokens=100, stream=True):
    print(chunk.choices[0].text, end="", flush=True)

Introducing wallaroo.AI
Wallايي буду Towards a Optimization Approach
For AI-Driven Sports Optimization and Trading
Background: One-click strategy models, platform agnostic, similarity testing
Theory: Extreme Value Theory ( EVT ) , GARCH , Kalman Filter algorithm
Impact: speeding through complex sett FL startup focusing implementation wallaroo.Readingmy Business model
wallaroo.ai is an AI-driven platform that aims to revolutionize the way we approach sports optimization and trading. The platform leverages

Pipeline Publish

Pipelines are saved to an OCI registry via the wallaroo.pipeline.Pipeline.publish(deployment_config) method. During this step, the following is uploaded to an OCI registry configured to work with the Wallaroo Ops Center:

  • The LLM and pipeline steps.
  • The included deployment configuration.
  • The Wallaroo Inference Engine compatible with the architecture and acceleration inherited from the model settings, in this case the QAIC AI accelerator.

Once published, the LLM is deployed with Docker, Podman, or Helm - each command provided as part of the output. For more details, see Edge and Multi-cloud Deployment and Inference.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
....................................................................................... Published.
ID: 20
Pipeline Name: llamaqaicopenaiedge
Pipeline Version: a0db44db-cc58-437e-9e73-2da3e3ae45e9
Status: Published
Workspace Id: 9
Workspace Name: younes@wallaroo.ai - Default Workspace
Edges:
Engine URL: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
Pipeline URL: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9
Helm Chart URL: oci://us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts/llamaqaicopenaiedge
Helm Chart Reference: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts@sha256:bf9847efbca0c798d823afb11820b7e74f802233d0762d60b409b2023ab04d2e
Helm Chart Version: 0.0.1-a0db44db-cc58-437e-9e73-2da3e3ae45e9
Engine Config: {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '1Gi'}, 'requests': {'cpu': 1.0, 'memory': '1Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'llama-qaic-openai-113': {'resources': {'limits': {'cpu': 4.0, 'memory': '12Gi'}, 'requests': {'cpu': 4.0, 'memory': '12Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}}}}
User Images: []
Created By: sample.user@wallaroo.ai
Created At: 2025-07-16 18:51:02.396949+00:00
Updated At: 2025-07-16 18:51:02.396949+00:00
Replaces:
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
    us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Podman Run Command
podman run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
    us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts/llamaqaicopenaiedge \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

Deploy on Edge Devices

Deploying ML models with Qualcomm QAIC hardware in edge and multi-cloud environments via docker run requires additional parameters, depending on the devices used.

  • For All Devices: To give the container access to all QAIC devices on the edge deployment, the --privileged parameter is required. The following sample command deploys a Wallaroo pipeline published in an OCI registry on an edge device with QAIC AI accelerators, based on the previous example:

    docker run \
      -p $EDGE_PORT:8080 \
      -e OCI_USERNAME=$OCI_USERNAME \
      -e OCI_PASSWORD=$OCI_PASSWORD \
      -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
      -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
      --privileged \
      us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
    
  • For Specific Devices: Each device is specified via the --device parameter. The following example specifies devices accel4 through accel7:

    docker run \
      -p $EDGE_PORT:8080 \
      -e OCI_USERNAME=$OCI_USERNAME \
      -e OCI_PASSWORD=$OCI_PASSWORD \
      -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
      -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
      --device=/dev/accel/accel4 \
      --device=/dev/accel/accel5 \
      --device=/dev/accel/accel6 \
      --device=/dev/accel/accel7 \
      us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261