Inference on Qualcomm QAIC AI Acceleration
AI/ML models can be deployed in centralized Wallaroo Ops instances and on edge devices across a variety of infrastructures and processors. The CPU architecture and AI acceleration type are set during the model upload and packaging stage.
Wallaroo supports Qualcomm QAIC, which provides x86 compatible processing with high performance AI acceleration at a low power cost. This increases the performance of LLMs while lowering energy requirements.
For details on using QAIC with Wallaroo and setting up a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
QAIC AI Acceleration Features
QAIC AI Acceleration delivers an x86 compatible architecture with AI acceleration at a low power cost. The following Wallaroo features are supported for LLMs deployed with QAIC AI acceleration:
- OpenAI API Compatibility: Provides OpenAI API client compatible inference requests with optional token streaming.
- Replica autoscaling: Spin up or down replicas based on utilization criteria to optimize resource allocation and minimize costs.
- Continuous Batching: Improves throughput by dynamically grouping incoming requests in real time to optimize inference processing.
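Because the deployed endpoint is OpenAI API compatible, it can be queried with a standard chat-completions style request. The sketch below only builds the request payload; the endpoint URL, bearer token, and model name are placeholders, not actual Wallaroo values.

```python
# Sketch of an OpenAI-compatible chat completions request payload.
# The endpoint URL, token, and model name below are placeholders.
import json

ENDPOINT = "https://example.wallaroo.ai/v1/chat/completions"  # placeholder
TOKEN = "<YOUR_TOKEN>"  # placeholder

payload = {
    "model": "sample-model-name",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize QAIC acceleration in one sentence."}
    ],
    "max_tokens": 128,
    "stream": True,  # optional token streaming
}

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

body = json.dumps(payload)
# A real request would POST `body` with these headers to the endpoint,
# e.g. with `requests.post(ENDPOINT, headers=headers, data=body)`.
```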
Model Packaging and Deployment Prerequisites for QAIC
To upload and package a model for Wallaroo Ops or multicloud edge deployments, the following prerequisites must be met.
- Wallaroo Ops
- At least one QAIC node deployed in the cluster.
AI Workloads for QAIC via the Wallaroo SDK
The Wallaroo SDK provides QAIC support for models uploaded for Wallaroo Ops.
Upload Models for QAIC via the Wallaroo SDK
Models are uploaded to Wallaroo via the `wallaroo.client.upload_model` method. For QAIC support, the following architecture and acceleration settings are used:
- The AI acceleration is set with the `accel` parameter. For QAIC, this accepts `wallaroo.engine_config.Acceleration.QAIC`.
- (Optional) Set the acceleration configuration options to fine-tune hardware performance.
Note that QAIC processors are x86 compatible, so no changes are needed to the model upload default architecture of `X86`.
The method `wallaroo.client.Client.upload_model` takes the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | *string* (Required) | The name of the model. Model names are unique per workspace. Models uploaded with the same name are assigned as a new version of the model. |
| `path` | *string* (Required) | The path to the model file being uploaded. |
| `framework` | *string* (Required) | The framework of the model from `wallaroo.framework.Framework`. For native vLLM, this framework is `wallaroo.framework.Framework.VLLM`. |
| `input_schema` | *pyarrow.lib.Schema* (Required) | The input schema in Apache Arrow schema format. |
| `output_schema` | *pyarrow.lib.Schema* (Required) | The output schema in Apache Arrow schema format. |
| `framework_config` | *wallaroo.framework.VLLMConfig* (Optional) | Sets the vLLM framework configuration options. |
| `accel` | `wallaroo.engine_config.Acceleration.QAIC` (Required) or `wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)` (Optional) | The AI hardware accelerator used. Submitting with the `with_config(QaicConfig)` parameters overrides the hardware performance defaults. |
| `convert_wait` | *bool* (Optional) | When `True` (the default), waits for model conversion to complete before returning. |
QAIC hardware performance is configurable at model upload with `wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig)`. This provides additional hardware fine-tuning. If no acceleration parameters are defined, the default values are applied.
`wallaroo.engine_config.QaicConfig` takes the following parameters.
| Parameter | Type | Description |
|---|---|---|
| `num_cores` | *Integer* (Default: `16`) | Number of cores used to compile the model. |
| `num_devices` | *Integer* (Default: `1`) | Number of System-on-Chips (SoCs) in a given card to compile the model for. |
| `ctx_len` | *Integer* (Default: `128`) | Maximum context length that the compiled model remembers. |
| `prefill_seq_len` | *Integer* | The length of the prefill prompt. |
| `full_batch_size` | *Integer* (Default: `None`) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
| `mxfp6_matmul` | *Boolean* (Default: `False`) | Enables compilation for MXFP6 precision. |
| `mxint8_kv_cache` | *Boolean* (Default: `False`) | Compresses the Present/Past KV cache to MXINT8. |
| `aic_enable_depth_first` | *Boolean* (Default: `False`) | Enables depth-first search (DFS) with the default memory size. |
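When choosing values for `ctx_len`, `full_batch_size`, and `mxint8_kv_cache`, it can help to estimate the KV cache footprint they imply. The sketch below is a rough back-of-the-envelope calculation; the layer count and hidden size are assumed illustrative model dimensions, not values from this guide, and MXINT8 is approximated as halving the per-element size versus FP16.

```python
# Rough KV cache size estimate for illustrative model dimensions.
# num_layers and hidden_size are assumptions (roughly a 7B-class LLM),
# not values taken from this document.
def kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, bytes_per_elem):
    # 2x for the separate Key and Value tensors stored per layer.
    return 2 * num_layers * batch_size * ctx_len * hidden_size * bytes_per_elem

ctx_len = 256        # QaicConfig ctx_len
batch_size = 16      # QaicConfig full_batch_size
num_layers = 32      # assumption
hidden_size = 4096   # assumption

fp16 = kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, 2)
mxint8 = kv_cache_bytes(ctx_len, batch_size, num_layers, hidden_size, 1)

print(f"FP16 KV cache:   {fp16 / 2**30:.2f} GiB")    # 2.00 GiB
print(f"MXINT8 KV cache: {mxint8 / 2**30:.2f} GiB")  # 1.00 GiB
```

Under these assumptions, enabling `mxint8_kv_cache` roughly halves the cache footprint, which leaves more on-card memory for larger batch sizes or context lengths.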
Upload Model for QAIC AI Acceleration via the Wallaroo SDK Example
The following demonstrates uploading a model for deployment with QAIC AI acceleration. The input and output schemas are optional depending on the model runtime. For more details, see Model Upload.
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

model = wl.upload_model(
    model_name,
    model_file_name,
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC
)
```
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

# Set the QAIC acceleration parameters. This is an **optional** step.
qaic_config = wallaroo.engine_config.QaicConfig(
    num_devices=4,
    full_batch_size=16,
    ctx_len=256,
    prefill_seq_len=128,
    mxfp6_matmul=True,
    mxint8_kv_cache=True
)

model = wl.upload_model(
    "sample-model-name",
    "sample-model-file.zip",
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC.with_config(qaic_config)
)
```
Deploy Models for QAIC AI Acceleration via the Wallaroo SDK
Models are added to a pipeline as pipeline steps. Models are then deployed through the `wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig] = None)` method.
When deploying a model in a Wallaroo Ops instance, the deployment configuration inherits the model acceleration setting. Other settings, such as the number of CPUs, can be changed without modifying the acceleration setting.
The deployment configuration sets what resources are allocated to the model. For this example, the model is allocated the following:
- CPUs: 4
- RAM: 12 Gi
- GPUs: 4. For Wallaroo deployment configurations for QAIC, the `gpu` parameter specifies the number of SoCs allocated.
- Deployment label: Specifies the node with the QAIC SoCs.
```python
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=2) \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
```
To change the acceleration settings for model deployment, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure.
The following demonstrates deploying a generic AI/ML model with the acceleration set to QAIC. For this example, the model is deployed with a pre-determined deployment configuration saved to `deployment_config`.
```python
# create the pipeline
pipeline = wl.build_pipeline("sample_pipeline")

# set the model as a pipeline step
pipeline.add_model_step(model)

# deploy the pipeline with the deployment configuration
pipeline.deploy(deployment_config)
```
Troubleshooting
The specified model optimization configuration is not available
- If the model acceleration option is set to `QAIC` but the architecture is set to an incompatible architecture (anything other than `X86`), the upload, deployment, and publish operations fail with the following error message:
  "The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help."
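A small preflight check in the upload script can surface this misconfiguration before the operation fails. The helper below is a hypothetical sketch, not a Wallaroo SDK function; it simply encodes the rule stated above, that QAIC acceleration requires the default `X86` architecture.

```python
# Hypothetical preflight check: QAIC acceleration requires the X86 architecture.
# This helper is illustrative only; it is not part of the Wallaroo SDK.
def check_qaic_compatibility(accel: str, arch: str) -> None:
    if accel == "QAIC" and arch != "X86":
        raise ValueError(
            "The specified model optimization configuration is not available: "
            f"QAIC acceleration requires the X86 architecture, got {arch!r}."
        )

check_qaic_compatibility("QAIC", "X86")  # compatible: passes silently

try:
    check_qaic_compatibility("QAIC", "ARM")  # incompatible: raises
except ValueError as e:
    print(e)
```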