Inference on Qualcomm QAIC AI Acceleration
AI/ML models can be deployed in centralized Wallaroo Ops instances and on edge devices across a variety of infrastructures and processors. The CPU architecture and AI acceleration type are set during the model upload and packaging stage.
Wallaroo supports Qualcomm QAIC, which provides x86 compatible processing with high performance AI acceleration at a low power cost. This increases the performance of LLMs while lowering energy requirements.
For details on using QAIC with Wallaroo and setting up a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
QAIC AI Acceleration Features
QAIC AI Acceleration delivers an x86 compatible architecture with AI acceleration at a low power cost. The following Wallaroo features are supported for LLMs deployed with QAIC AI acceleration:
- OpenAI API Compatibility: Provides OpenAI API client compatible inference requests with optional token streaming.
- Replica autoscaling: Spin up or down replicas based on utilization criteria to optimize resource allocation and minimize costs.
- Continuous Batching: Improves throughput by dynamically grouping incoming requests in real time to optimize inference processing.
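Because the deployed endpoint is OpenAI API compatible, it can be queried with a standard chat-completions style request. The sketch below only builds the request payload; the endpoint URL, bearer token, and model name are placeholders, not actual Wallaroo values.

```python
# Sketch of an OpenAI-compatible chat completions request payload.
# The endpoint URL, token, and model name below are placeholders.
import json

ENDPOINT = "https://example.wallaroo.ai/v1/chat/completions"  # placeholder
TOKEN = "<YOUR_TOKEN>"  # placeholder

payload = {
    "model": "sample-model-name",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize QAIC acceleration in one sentence."}
    ],
    "max_tokens": 128,
    "stream": True,  # optional token streaming
}

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

body = json.dumps(payload)
# A real request would POST `body` with these headers to the endpoint,
# e.g. with `requests.post(ENDPOINT, headers=headers, data=body)`.
```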
Model Packaging and Deployment Prerequisites for QAIC
To upload and package a model for Wallaroo Ops or multicloud edge deployments, the following prerequisites must be met.
- Wallaroo Ops
- At least one QAIC node deployed in the cluster.
AI Workloads for QAIC via the Wallaroo SDK
The Wallaroo SDK provides QAIC support for models uploaded for Wallaroo Ops.
Upload Models for QAIC via the Wallaroo SDK
Models are uploaded to Wallaroo via the `wallaroo.client.upload_model` method. For QAIC support, the following architecture and acceleration settings are used:
- The AI acceleration is set with the `accel` parameter. For QAIC, this accepts `wallaroo.engine_config.Acceleration.QAIC`.
- (Optional) Set the acceleration configuration options to fine-tune hardware performance.
Note that QAIC processors are x86 compatible, so no changes are needed to the model upload default architecture of `X86`.
The method `wallaroo.client.Client.upload_model` takes the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | *string* (Required) | The name of the model. Model names are unique per workspace. Models uploaded with the same name are assigned as a new version of the model. |
| `path` | *string* (Required) | The path to the model file being uploaded. |
| `framework` | *string* (Required) | The framework of the model from `wallaroo.framework.Framework`. For native vLLM, this framework is `wallaroo.framework.Framework.VLLM`. |
| `input_schema` | *pyarrow.lib.Schema* (Required) | The input schema in Apache Arrow schema format. |
| `output_schema` | *pyarrow.lib.Schema* (Required) | The output schema in Apache Arrow schema format. |
| `framework_config` | *wallaroo.framework.VLLMConfig* (Optional) | Sets the vLLM framework configuration options. |
| `accel` | `wallaroo.engine_config.Acceleration.QAIC` (Required) or `wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)` (Optional) | The AI hardware accelerator used. Submitting with the `with_config(QaicConfig)` parameters overrides the hardware performance defaults. |
| `convert_wait` | *bool* (Optional) | When `True` (the default), waits for model conversion to complete before returning. |
QAIC hardware performance is configurable at model upload with `wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)`. This provides additional hardware fine-tuning. If no acceleration parameters are defined, the default values are applied.
`wallaroo.engine_config.QaicConfig` takes the following parameters.
| Parameter | Type | Description |
|---|---|---|
| `num_cores` | *Integer* (Default: `16`) | Number of cores used to compile the model. |
| `num_devices` | *Integer* (Default: `1`) | Number of System-on-Chips (SoCs) in a given card to compile the model for. |
| `ctx_len` | *Integer* (Default: `128`) | Maximum context length that the compiled model remembers. |
| `prefill_seq_len` | *Integer* | The length of the prefill prompt. |
| `full_batch_size` | *Integer* (Default: `None`) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
| `mxfp6_matmul` | *Boolean* (Default: `False`) | Enables compilation for MXFP6 precision. |
| `mxint8_kv_cache` | *Boolean* (Default: `False`) | Compresses the Present/Past KV cache to MXINT8. |
| `aic_enable_depth_first` | *Boolean* (Default: `False`) | Enables depth-first search (DFS) with the default memory size. |
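When choosing values for `ctx_len`, `full_batch_size`, and `mxint8_kv_cache`, it can help to estimate the KV cache footprint they imply. The sketch below is a rough back-of-the-envelope calculation; the layer count and hidden size are assumed illustrative model dimensions, not values from this guide, and MXINT8 is approximated as halving the per-element size versus FP16.

```python
# Rough KV cache size estimate for illustrative model dimensions.
# num_layers and hidden_size are assumptions (roughly a 7B-class LLM),
# not values taken from this document.
def kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, bytes_per_elem):
    # 2x for the separate Key and Value tensors stored per layer.
    return 2 * num_layers * batch_size * ctx_len * hidden_size * bytes_per_elem

ctx_len = 256        # QaicConfig ctx_len
batch_size = 16      # QaicConfig full_batch_size
num_layers = 32      # assumption
hidden_size = 4096   # assumption

fp16 = kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, 2)
mxint8 = kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, 1)

print(f"FP16 KV cache:   {fp16 / 2**30:.2f} GiB")    # 2.00 GiB
print(f"MXINT8 KV cache: {mxint8 / 2**30:.2f} GiB")  # 1.00 GiB
```

Under these assumptions, enabling `mxint8_kv_cache` roughly halves the cache footprint, which leaves more on-card memory for larger batch sizes or context lengths.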
Upload Model for QAIC AI Acceleration via the Wallaroo SDK Example
The following demonstrates uploading a model for deployment with QAIC AI acceleration. The input and output schemas are optional depending on the model runtime. For more details, see Model Upload.
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

model = wl.upload_model(
    model_name,
    model_file_name,
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC
)
```
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

# Set the QAIC acceleration parameters. This is an **optional** step.
qaic_config = wallaroo.engine_config.QaicConfig(
    num_devices=4,
    full_batch_size=16,
    ctx_len=256,
    prefill_seq_len=128,
    mxfp6_matmul=True,
    mxint8_kv_cache=True
)

model = wl.upload_model(
    "sample-model-name",
    "sample-model-file.zip",
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC.with_config(qaic_config)
)
```
Deploy Models for QAIC AI Acceleration via the Wallaroo SDK
Models are added to a pipeline as pipeline steps. Models are then deployed through the `wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig] = None)` method.
When deploying a model in a Wallaroo Ops instance, the deployment configuration inherits the model acceleration setting. Other settings, such as the number of CPUs, can be changed without modifying the acceleration setting.
The deployment configuration sets what resources are allocated to the model. For this example, the model is allocated the following:
- CPUs: 4
- RAM: 12 Gi
- GPUs: 4. For Wallaroo deployment configurations for QAIC, the `gpu` parameter specifies the number of SoCs allocated.
- Deployment label: Specifies the node with the QAIC SoCs.
```python
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=2) \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
```
To change the acceleration settings for model deployment, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure.
The following demonstrates deploying a generic AI/ML model with the acceleration set to QAIC. For this example, the model is deployed with a pre-determined deployment configuration saved to `deployment_config`.
```python
# create the pipeline
pipeline = wl.build_pipeline("sample_pipeline")

# set the model as a pipeline step
pipeline.add_model_step(model)

# deploy the pipeline with the deployment configuration
pipeline.deploy(deployment_config)
```
Troubleshooting
The specified model optimization configuration is not available
- If the model acceleration option is set to `QAIC` but the architecture is set to an incompatible architecture (anything other than `X86`), the upload, deployment, and publish operations fail with the following error message:
  "The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help."
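A small preflight check in the upload script can surface this misconfiguration before the operation fails. The helper below is a hypothetical sketch, not a Wallaroo SDK function; it simply encodes the rule stated above, that QAIC acceleration requires the default `X86` architecture.

```python
# Hypothetical preflight check: QAIC acceleration requires the X86 architecture.
# This helper is illustrative only; it is not part of the Wallaroo SDK.
def check_qaic_compatibility(accel: str, arch: str) -> None:
    if accel == "QAIC" and arch != "X86":
        raise ValueError(
            "The specified model optimization configuration is not available: "
            f"QAIC acceleration requires the X86 architecture, got {arch!r}."
        )

check_qaic_compatibility("QAIC", "X86")  # compatible: passes silently

try:
    check_qaic_compatibility("QAIC", "ARM")  # incompatible: raises
except ValueError as e:
    print(e)
```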