Inference on Qualcomm QAIC AI Acceleration

How to deploy AI/ML models with Qualcomm QAIC Processors

AI/ML models can be deployed in centralized Wallaroo Ops instances and on edge devices across a variety of infrastructures and processors. The CPU architecture and AI acceleration type are set during the model upload and packaging stage.

Wallaroo supports Qualcomm QAIC, providing high-performance, x86-compatible processing with AI acceleration at low power cost. This increases the performance of LLMs while lowering energy requirements.

For details on using QAIC with Wallaroo and setting up a demonstration, see the Tutorials section below.

QAIC AI Acceleration Features

QAIC AI Acceleration delivers an x86-compatible architecture with AI acceleration at low power cost. The following Wallaroo features are supported for LLMs with QAIC AI acceleration deployed in Wallaroo:

  • OpenAI API Compatibility: Provides OpenAI API client compatible inference requests with optional token streaming.
  • Replica autoscaling: Spin up or down replicas based on utilization criteria to optimize resource allocation and minimize costs.
  • Continuous Batching: Improves throughput by dynamically grouping incoming requests in real time to optimize inference processing.
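
The OpenAI API compatibility noted above means inference requests use the standard chat completion request shape. The sketch below builds such a payload; the endpoint URL and model name are hypothetical placeholders, not values defined by this guide.

```python
import json

# Hypothetical inference endpoint -- replace with your deployment's OpenAI-compatible URL.
ENDPOINT = "https://example.wallaroo.example/v1/chat/completions"

def build_chat_request(prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat completion payload.

    stream=True requests token streaming, which QAIC-backed
    deployments support per the feature list above.
    """
    return {
        "model": "sample-model-name",  # hypothetical model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

payload = build_chat_request("Summarize QAIC acceleration in one sentence.", stream=True)
print(json.dumps(payload, indent=2))
```

An OpenAI SDK client pointed at the deployment's endpoint would send an equivalent request body.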

Model Packaging and Deployment Prerequisites for QAIC

To upload and package a model for Wallaroo Ops or multicloud edge deployments, the following prerequisites must be met.

  • Wallaroo Ops
    • At least one QAIC node deployed in the cluster.

AI Workloads for QAIC via the Wallaroo SDK

The Wallaroo SDK provides QAIC support for models uploaded for Wallaroo Ops.

Upload Models for QAIC via the Wallaroo SDK

Models are uploaded to Wallaroo via the wallaroo.client.Client.upload_model method. For QAIC support, the following architecture and acceleration settings are used:

  • The AI acceleration is set with the accel parameter. For QAIC, this accepts wallaroo.engine_config.Acceleration.QAIC.
    • (Optional) Set the acceleration configuration options to fine-tune hardware performance.

Note that QAIC processors are x86 compatible, so no changes are needed to the model upload default architecture of X86.

The method wallaroo.client.Client.upload_model takes the following parameters:

  • name (string) (Required): The name of the model. Model names are unique per workspace. Models uploaded with the same name are assigned as a new version of the model.
  • path (string) (Required): The path to the model file being uploaded.
  • framework (string) (Required): The framework of the model from wallaroo.framework.Framework. For native vLLM, this framework is wallaroo.framework.Framework.VLLM.
  • input_schema (pyarrow.lib.Schema) (Required): The input schema in Apache Arrow schema format.
  • output_schema (pyarrow.lib.Schema) (Required): The output schema in Apache Arrow schema format.
  • framework_config (wallaroo.framework.VLLMConfig) (Optional): Sets the vLLM framework configuration options.
  • accel (wallaroo.engine_config.Acceleration.QAIC (Required), or wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig) (Optional)): The AI hardware accelerator used. Submitting with the with_config(QaicConfig) parameters overrides the hardware performance defaults.
  • convert_wait (bool) (Optional):
    • True: Waits in the script for the model conversion to complete.
    • False: Proceeds with the script without waiting for the model conversion process to complete.

QAIC hardware performance is configurable at model upload with wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig). This provides additional hardware fine-tuning. If no acceleration parameters are defined, the default values are applied.

wallaroo.engine_config.QaicConfig takes the following parameters.

  • num_cores (Integer) (Default: 16): Number of cores used to compile the model.
  • num_devices (Integer) (Default: 1): Number of System-on-Chips (SoCs) in a given card to compile the model for.
  • ctx_len (Integer) (Default: 128): Maximum context length the compiled model remembers.
  • prefill_seq_len (Integer): The length of the prefill prompt.
  • full_batch_size (Integer) (Default: None): Maximum number of sequences per iteration. Set to enable continuous batching mode.
  • mxfp6_matmul (Boolean) (Default: False): Enable compilation for MXFP6 precision.
  • mxint8_kv_cache (Boolean) (Default: False): Compress Present/Past KV cache to MXINT8.
  • aic_enable_depth_first (Boolean) (Default: False): Enables depth-first search (DFS) with default memory size.

Upload Model for QAIC AI Acceleration via the Wallaroo SDK Example

The following demonstrates uploading a model for deployment with QAIC AI acceleration. The input and output schemas are optional depending on the model runtime. For more details, see Model Upload.

The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.

import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

# upload the model with QAIC acceleration using the default hardware configuration
model = wl.upload_model(
    model_name,
    model_file_name,
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC
)

The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.

import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

# Set the QAIC acceleration parameters.  This is an **optional** step
qaic_config = wallaroo.engine_config.QaicConfig(
    num_devices=4, 
    full_batch_size=16, 
    ctx_len=256, 
    prefill_seq_len=128, 
    mxfp6_matmul=True, 
    mxint8_kv_cache=True
)

model = wl.upload_model(
    "sample-model-name", 
    "sample-model-file.zip", 
    framework=framework,
    input_schema=input_schema, 
    output_schema=output_schema, 
    accel=Acceleration.QAIC.with_config(qaic_config)
)

Deploy Models for QAIC AI Acceleration via the Wallaroo SDK

Models are added to a pipeline as pipeline steps. Models are then deployed through the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig] = None) method.

When deploying a model in a Wallaroo Ops instance, the deployment configuration inherits the model's acceleration setting. Other settings, such as the number of CPUs, can be changed without modifying the acceleration setting.

The deployment configuration sets what resources are allocated for the model. For this example, the model is allocated the following:

  • cpus: 4
  • RAM: 12 Gi
  • gpus: 4
    • For Wallaroo deployment configurations for QAIC, the gpu parameter specifies the number of SoCs allocated.
  • Deployment label: Specifies the node with the QAIC SoCs.

from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=2) \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()

To change the acceleration settings for model deployment, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure.

The following demonstrates deploying a generic AI/ML model with the acceleration set to QAIC. For this example, the model is deployed with a pre-determined deployment configuration saved to deployment_config.

# create the pipeline
pipeline = wl.build_pipeline("sample_pipeline")

# set the model as a pipeline step; the model's QAIC acceleration setting is inherited
pipeline.add_model_step(model)

# deploy the pipeline with the deployment configuration
pipeline.deploy(deployment_config=deployment_config)

Publish and Deploy for Edge and Multi-Cloud Environments

Wallaroo supports deploying models on edge and multi-cloud environments through publishing the Wallaroo pipeline with the model and deployment configuration to an Open Container Initiative (OCI) registry. These pipeline publishes are deployed on devices with QAIC hardware.

Publish to OCI Registry via the Wallaroo SDK

Wallaroo pipelines are published to an OCI registry with the following elements:

  • The ML models added as pipeline steps.
  • The Wallaroo inference engine inherited from the model’s acceleration settings - in this case, QAIC.
  • The deployment configuration specified in the publish command.

Pipelines are published as images to the edge registry configured in Enable Wallaroo Edge Registry, using the wallaroo.pipeline.Pipeline.publish method.

When a pipeline is published, the containerized pipeline with its models and the inference engine for the architecture and acceleration are uploaded to the OCI registry. Once published, the publish is deployed on edge locations with Docker, Podman, or Helm based deployments. For more details, see Edge and Multi-cloud Pipeline Publish.

Publish a Pipeline Parameters

The wallaroo.pipeline.Pipeline.publish method takes the following parameters. The containerized pipeline is pushed to the edge registry service with the model, pipeline configurations, and other artifacts needed to deploy the pipeline.

  • deployment_config (wallaroo.deployment_config.DeploymentConfig) (Optional): Sets the pipeline deployment configuration. For more information on pipeline deployment configuration, see the Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration.
  • replaces (List[wallaroo.pipeline_publish]) (Optional): The pipeline publish(es) to replace.

Publish a Pipeline Returns

The following publish fields are displayed via IPython.display.

  • ID (Integer): The numerical ID of the publish.
  • Pipeline Name (String): The pipeline the publish was generated from.
  • Pipeline Version (String): The pipeline version the publish was generated from, in UUID format.
  • Status (String): The status of the publish. Values include:
    • PendingPublish: The pipeline publication is about to be uploaded or is in the process of being uploaded.
    • Published: The pipeline is published and ready for use.
  • Workspace Id (Integer): The numerical id of the workspace the publish is associated with.
  • Workspace Name (String): The name of the workspace the publish is associated with.
  • Edges (List[String]): A list of edges associated with this publish. If no edges exist, this field is empty.
  • Engine URL (String): The OCI registry URL for the inference engine.
  • Pipeline URL (String): The OCI registry URL of the containerized pipeline.
  • Helm Chart URL (String): The OCI registry URL of the Helm chart.
  • Helm Chart Reference (String): The OCI registry URL of the Helm chart reference.
  • Helm Chart Version (String): The Helm chart version.
  • Engine Config (Dict): The details of the wallaroo.engine_config used for the publish. Unless specified, it uses the same engine config as the pipeline, which inherits its arch and accel settings from the model upon upload. See Wallaroo SDK Essentials Guide: Model Uploads and Registrations for more details.
  • User Images (List): Any user images used with the deployment.
  • Created By (String): The user name, typically the email address, of the user that created the publish.
  • Created At (DateTime): When the publish was created.
  • Updated At (DateTime): When the publish was last updated.
  • Replaces (List): A list of the publishes replaced by this one, with the following attributes. Note that each variable represents the value displayed:
    • Publish id: The replaced publish id.
    • Pipeline pipeline_name: The name of the replaced pipeline.
    • Version pipeline_version: The replaced pipeline version, in UUID format.
  • Docker Run Command (String): The Docker run command for the publish. The following variables must be set before executing the command: EDGE_PORT (the external port used to connect to the edge endpoints), OCI_USERNAME (the username for the OCI registry containing the publish), and OCI_PASSWORD (the password for the user used to authenticate to the OCI registry containing the publish). Additional options are detailed in DevOps - Pipeline Edge Deployment.
  • Podman Run Command (String): The Podman run command for each edge location for the publish. The same EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD variables must be set before executing the command. Additional options are detailed in DevOps - Pipeline Edge Deployment.
  • Helm Install Command (String): The Helm install or upgrade command for each location or replaced location for the pipeline. For replaced publishes, the helm upgrade command is shown for performing in-line model updates. The following variables must be set before executing the command:
    • OCI_USERNAME: The username for the OCI registry containing the publish.
    • OCI_PASSWORD: The password for the user used to authenticate to the OCI registry containing the publish.
    • HELM_INSTALL_NAME: The name of the Helm deployment installation.
    • HELM_INSTALL_NAMESPACE: The Kubernetes namespace the Helm based edge deployment is installed in.

Individual Publish Fields

The following fields are available from the PipelinePublish object.

  • id (Integer): Numerical Wallaroo id of the published pipeline.
  • pipeline_name (String): The name of the pipeline the publish is generated from.
  • pipeline_version_id (Integer): Numerical Wallaroo id of the pipeline version published.
  • status (String): The status of the pipeline publication. Values include:
    • PendingPublish: The pipeline publication is about to be uploaded or is in the process of being uploaded.
    • Published: The pipeline is published and ready for use.
  • engine_url (String): The URL of the published pipeline engine in the edge registry.
  • pipeline_url (String): The URL of the published pipeline in the edge registry.
  • pipeline_version_name (String): The pipeline version in UUID format.
  • helm (Dict): The details used for a helm based deployment, with the following attributes:
    • reference: The OCI URL of the Helm reference.
    • values: Any additional values.
    • chart: The Helm chart for the edge deployment.
    • version: The Helm version specifying which published pipeline Helm version to use.
  • additional_properties (Dict): Any additional properties for the publish.
  • docker_run_variables (Dict): The Docker run variables used for Docker based deployments. This includes:
    • PIPELINE_URL: The OCI registry URL of the containerized pipeline.
    • EDGE_BUNDLE: The Edge Bundle for edge locations.
  • list_edges() (wallaroo.edge.EdgeList): A List of wallaroo.edge.Edge associated with the publish.
  • engine_url (String): The URL for the inference engine used for the edge deployment.
  • user_images (List): A List of custom images used for the edge deployment.
  • created_by (String): The unique identifier of the user that created the publish, in UUID format.
  • error (String): Any errors associated with the publish.
  • engine_config (wallaroo.deployment_config.DeploymentConfig): The pipeline configuration included with the published pipeline.
  • created_at (DateTime): When the published pipeline was created.
  • updated_at (DateTime): When the published pipeline was updated.
  • created_on_version (String): The version of Wallaroo the publish was generated from.
  • replaces (List[Integer]): List of other publishes replaced by this one.

Publish a Pipeline Example

The following example shows how to publish a pipeline to the edge registry service associated with the Wallaroo instance for an Edge device with QAIC AI accelerator chips. In this example, the deployment configuration is set with:

  • cpus: 4
  • RAM: 12 Gi
  • gpus: 4
    • For Wallaroo deployment configurations for QAIC, the gpu parameter specifies the number of System-on-Chips (SoCs) allocated.
  • Deployment label: Specifies the node with the QAIC SoCs.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()

The pipeline is then published with the wallaroo.pipeline.Pipeline.publish(deployment_config) command.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
....................................................................................... Published.
ID: 20
Pipeline Name: llamaqaicopenaiedge
Pipeline Version: a0db44db-cc58-437e-9e73-2da3e3ae45e9
Status: Published
Workspace Id: 9
Workspace Name: younes@wallaroo.ai - Default Workspace
Edges:
Engine URL: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
Pipeline URL: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9
Helm Chart URL: oci://us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts/llamaqaicopenaiedge
Helm Chart Reference: us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts@sha256:bf9847efbca0c798d823afb11820b7e74f802233d0762d60b409b2023ab04d2e
Helm Chart Version: 0.0.1-a0db44db-cc58-437e-9e73-2da3e3ae45e9
Engine Config: {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '1Gi'}, 'requests': {'cpu': 1.0, 'memory': '1Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'llama-qaic-openai-113': {'resources': {'limits': {'cpu': 4.0, 'memory': '12Gi'}, 'requests': {'cpu': 4.0, 'memory': '12Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}}}}
User Images: []
Created By: sample.user@wallaroo.ai
Created At: 2025-07-16 18:51:02.396949+00:00
Updated At: 2025-07-16 18:51:02.396949+00:00
Replaces:
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
    us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Podman Run Command
podman run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    -e CONFIG_CPUS=1.0 --cpus=5.0 --memory=13g \
    us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts/llamaqaicopenaiedge \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.
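
The --cpus=5.0 and --memory=13g flags in the generated Docker and Podman commands above are the engine resources (1 CPU, 1 Gi) plus the sidekick resources allocated to the model (4 CPUs, 12 Gi). A quick sketch of that arithmetic:

```python
# Totals behind the generated docker/podman resource flags:
# engine resources plus the model's sidekick resources.
engine = {"cpus": 1.0, "memory_gi": 1}
sidekick = {"cpus": 4.0, "memory_gi": 12}

total_cpus = engine["cpus"] + sidekick["cpus"]
total_memory_gi = engine["memory_gi"] + sidekick["memory_gi"]

print(f"--cpus={total_cpus} --memory={total_memory_gi}g")  # --cpus=5.0 --memory=13g
```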

Deploy on Edge Devices

By default, the pipeline publishes include deployment commands for Docker, Podman, or Helm based deployments. The following examples cover Docker based deployments. For more details, see Edge and Multi-cloud Deployment and Inference.

Deploying ML models on Qualcomm QAIC hardware in edge and multi-cloud environments via docker run or podman run requires additional parameters, depending on the devices used.

  • For all devices: When using all QAIC devices on the edge deployment, the parameter --privileged is required. The following sample command deploys a Wallaroo pipeline published in an OCI registry on an edge device with QAIC AI accelerators:

    • Docker example:

      docker run -it -p 8080:8080 \
          -e DEBUG=true \
          -e CONFIG_CPUS=16 \
          -e OCI_USERNAME=oauth2accesstoken \
          -e OCI_PASSWORD=$tok \
          -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
          --privileged \
          --cpuset-cpus 0-15 --cpus 16 \
          us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
      
  • For specific devices, each device is specified via the --device parameter. The following example specifies devices accel4 through accel7.

    • Docker Example

      docker run -p 8080:8080 \
      -e CONFIG_CPUS=16 \
      -e OCI_USERNAME=oauth2accesstoken \
      -e OCI_PASSWORD=$tok \
      -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
      --device=/dev/accel/accel4 \
      --device=/dev/accel/accel5 \
      --device=/dev/accel/accel6 \
      --device=/dev/accel/accel7 \
      --cpuset-cpus 0-15 \
      --cpus 16 \
      us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
      

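The four --device flags in the example above can be generated programmatically when the set of QAIC device nodes varies per host. A small sketch (the helper name is ours, not part of any Wallaroo or Docker tooling):

```python
def qaic_device_flags(first: int, last: int) -> list[str]:
    """Build docker/podman --device flags for the QAIC accelerator
    nodes /dev/accel/accelN for N in [first, last]."""
    return [f"--device=/dev/accel/accel{n}" for n in range(first, last + 1)]

# Flags matching the example above (accel4 through accel7):
print(" ".join(qaic_device_flags(4, 7)))
```
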
Tutorials

Troubleshooting

The specified model optimization configuration is not available

  • If the model acceleration option is set to QAIC but the architecture is set to an incompatible architecture (i.e., anything other than X86):
    • The upload, deployment, and publish operations fail with the following error message: “The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help.”