AI/ML models can be deployed in centralized Wallaroo Ops instances and on edge devices across a variety of infrastructures and processors. The CPU architecture and AI acceleration type are set during the model upload and packaging stage.
Wallaroo supports Qualcomm QAIC, which provides high-performance, x86-compatible processing with AI acceleration at low power cost. This improves LLM performance while lowering energy requirements.
For details on using QAIC with Wallaroo and setting up a demonstration:
QAIC AI Acceleration delivers an x86-compatible architecture with AI acceleration at low power cost. The following Wallaroo features are supported for LLMs deployed in Wallaroo with QAIC AI acceleration:
To upload and package a model for Wallaroo Ops or multicloud edge deployments, the following prerequisites must be met.
The Wallaroo SDK provides QAIC support for models uploaded for Wallaroo Ops.
Models are uploaded to Wallaroo via the wallaroo.client.Client.upload_model method. For QAIC support, the acceleration is set through the accel parameter, which accepts wallaroo.engine_config.Acceleration.QAIC. Note that QAIC processors are x86 compatible, so no changes are needed to the model upload default architecture of X86.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For native vLLM, this framework is wallaroo.framework.Framework.VLLM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.VLLMConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration (Required) | The AI hardware accelerator used. For QAIC, set to wallaroo.engine_config.Acceleration.QAIC. To override the hardware performance defaults, submit wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig). |
convert_wait | bool (Optional) | Whether to wait for the model conversion and packaging process to complete before returning. |
QAIC hardware performance is configurable at model upload with wallaroo.engine_config.Acceleration.QAIC.with_config(wallaroo.engine_config.QaicConfig). This provides additional hardware fine-tuning. If no acceleration parameters are defined, the default values are applied.
wallaroo.engine_config.QaicConfig
takes the following parameters.
Parameters | Type | Description |
---|---|---|
num_cores | Integer (Default: 16 ) | Number of cores used to compile the model. |
num_devices | Integer (Default: 1 ) | Number of System-on-Chip (SoC) in a given card to compile the model for. |
ctx_len | Integer (Default: 128 ) | Maximum context that the compiled model remembers. |
prefill_seq_len | Integer | The length of the Prefill prompt. |
full_batch_size | Integer (Default: None ) | Maximum number of sequences per iteration. Set to enable continuous batching mode. |
mxfp6_matmul | Boolean (Default: False ) | Enable compilation for MXFP6 precision. |
mxint8_kv_cache | Boolean (Default: False ) | Compress Present/Past KV to MXINT8. |
aic_enable_depth_first | Boolean (Default: False ) | Enables DFS with default memory size. |
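To illustrate how unspecified parameters fall back to the defaults in the table above, the following is a plain-Python sketch (not Wallaroo SDK code; the function and dictionary here are hypothetical) of that merge behavior:

```python
# Default values taken from the QaicConfig parameter table above.
# This dictionary and function are illustrative only, not part of the SDK.
QAIC_DEFAULTS = {
    "num_cores": 16,
    "num_devices": 1,
    "ctx_len": 128,
    "full_batch_size": None,
    "mxfp6_matmul": False,
    "mxint8_kv_cache": False,
    "aic_enable_depth_first": False,
}

def resolve_qaic_config(**overrides):
    """Return the effective configuration: defaults overlaid with user overrides."""
    unknown = set(overrides) - set(QAIC_DEFAULTS) - {"prefill_seq_len"}
    if unknown:
        raise ValueError(f"Unknown QAIC parameters: {sorted(unknown)}")
    return {**QAIC_DEFAULTS, **overrides}

# Any parameter not supplied keeps its default value.
config = resolve_qaic_config(num_devices=4, ctx_len=256, mxfp6_matmul=True)
```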
The following demonstrates uploading a model for deployment with QAIC AI acceleration. The input and output schemas are optional depending on the model runtime. For more details, see Model Upload.
The following shows uploading the LLM with QAIC AI acceleration enabled without the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

model = wl.upload_model(
    model_name,
    model_file_name,
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC
)
```
The following demonstrates uploading the LLM with QAIC AI acceleration enabled with the acceleration configuration options.
```python
import wallaroo
from wallaroo.engine_config import Acceleration

# set the Wallaroo client
wl = wallaroo.Client()

# Set the QAIC acceleration parameters. This is an **optional** step
qaic_config = wallaroo.engine_config.QaicConfig(
    num_devices=4,
    full_batch_size=16,
    ctx_len=256,
    prefill_seq_len=128,
    mxfp6_matmul=True,
    mxint8_kv_cache=True
)

model = wl.upload_model(
    "sample-model-name",
    "sample-model-file.zip",
    framework=framework,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.QAIC.with_config(qaic_config)
)
```
Models are added to a pipeline as pipeline steps. Models are then deployed through the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig] = None) method.

When deploying a model in a Wallaroo Ops instance, the deployment configuration inherits the model acceleration setting. Other settings, such as the number of CPUs, can be changed without modifying the acceleration setting.
The deployment configuration sets what resources are allocated for the model. For this example, the engine is allocated 1 CPU and 1 Gi of RAM, and the model is allocated 4 CPUs, 12 Gi of RAM, and 4 QAIC SoCs; the gpu parameter specifies the number of SoCs allocated.

```python
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=1, maximum=2) \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
```
To change the acceleration settings for model deployment, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure.
The following demonstrates deploying a generic AI/ML model with the acceleration set to QAIC. For this example, the model is deployed with a pre-determined deployment configuration saved to deployment_config.

```python
# create the pipeline
pipeline = wl.build_pipeline("sample_pipeline")

# set the model with QAIC acceleration as the pipeline model step
pipeline.add_model_step(model)

# deploy the pipeline with the deployment configuration
pipeline.deploy(deployment_config)
```
Wallaroo supports deploying models on edge and multi-cloud environments through publishing the Wallaroo pipeline with the model and deployment configuration to an Open Container Initiative (OCI) registry. These pipeline publishes are deployed on devices with QAIC hardware.
Wallaroo pipelines are published to an OCI registry with the model's architecture and acceleration settings; for this use case, the acceleration is QAIC. Pipelines are published as images to the edge registry set in the Enable Wallaroo Edge Registry settings with the wallaroo.pipeline.Pipeline.publish method.
When a pipeline is published, the containerized pipeline with its models and the inference engine for the architecture and acceleration are uploaded to the OCI registry. Once published, the publish is deployed on edge locations with Docker, Podman, or helm based deployments. For more details, see Edge and Multi-cloud Pipeline Publish.
The wallaroo.pipeline.Pipeline.publish
method takes the following parameters. The containerized pipeline will be pushed to the Edge registry service with the model, pipeline configurations, and other artifacts needed to deploy the pipeline.
Parameter | Type | Description |
---|---|---|
deployment_config | wallaroo.deployment_config.DeploymentConfig (Optional) | Sets the pipeline deployment configuration. For more information on pipeline deployment configuration, see the Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration. |
replaces | List[wallaroo.pipeline_publish] (Optional) | The pipeline publish(es) to replace. |
The following publish fields are displayed with the method IPython.display.
Field | Type | Description |
---|---|---|
ID | Integer | The numerical ID of the publish. |
Pipeline Name | String | The pipeline the publish was generated from. |
Pipeline Version | String | The pipeline version the publish was generated from, in UUID format. |
Status | String | The status of the publish. Values include:
|
Workspace Id | Integer | The numerical id of the workspace the publish is associated with. |
Workspace Name | String | The name of the workspace the publish is associated with. |
Edges | List[String] | A list of edges associated with this publish. If no edges exist, this field will be empty. |
Engine URL | String | The OCI Registry URL for the inference engine. |
Pipeline URL | String | The OCI Registry URL of the containerized pipeline. |
Helm Chart URL | String | The OCI Registry URL of the Helm chart. |
Helm Chart Reference | String | The OCI Registry URL of the Helm Chart reference. |
Helm Chart Version | String | The Helm Chart Version. |
Engine Config | Dict | The details of the wallaroo.engine_config used for the publish. Unless specified, it will use the same engine config for the pipeline, which inherits its arch and accel settings from the model upon upload. See Wallaroo SDK Essentials Guide: Model Uploads and Registrations for more details. |
User Images | List | Any user images used with the deployment. |
Created By | String | The user name, typically the email address, of the user that created the publish. |
Created At | DateTime | The DateTime when the publish was created. |
Updated At | DateTime | The DateTime when the publish was last updated. |
Replaces | List | A list of the publishes that were replaced by this one with the following attributes. Note that each variable represents the value displayed:
|
Docker Run Command | String | The Docker Run commands for the publish. The following variables must be set before executing the command.
Additional options are detailed in the DevOps - Pipeline Edge Deployment |
Podman Run Command | String | The Podman run commands for each edge location for the publish. The following variables must be set before executing the command.
Additional options are detailed in the DevOps - Pipeline Edge Deployment |
Helm Install Command | String | The Helm Install or Upgrade commands for each location or replaced locations for the pipeline. For replaced publishes, the helm upgrade command is shown for performing in-line model updates. The following variables must be set before executing the command.
|
The following fields are available from the PipelinePublish
object.
Field | Type | Description |
---|---|---|
id | Integer | Numerical Wallaroo id of the published pipeline. |
pipeline_name | String | The name of the pipeline the publish is generated from. |
pipeline_version_id | Integer | Numerical Wallaroo id of the pipeline version published. |
status | String | The status of the pipeline publication. Values include:
|
engine_url | String | The URL of the published pipeline engine in the edge registry. |
pipeline_url | String | The URL of the published pipeline in the edge registry. |
pipeline_version_name | String | The pipeline version in UUID format. |
helm | Dict | The details used for a helm based deployment with the following attributes:
|
additional_properties | Dict | Any additional properties for the publish. |
docker_run_variables | Dict | The Docker Run variables used for Docker based deployments. This includes:
| |
list_edges() | wallaroo.edge.EdgeList | A List of wallaroo.edge.Edge associated with the publish. |
engine_url | String | The URL for the inference engine used for the edge deployment. |
user_images | List | A List of custom images used for the edge deployment. |
created_by | String | The unique identifier of the user that created the publish, in UUID format. |
error | String | Any errors associated with the publish. |
engine_config | wallaroo.deployment_config.DeploymentConfig | The pipeline configuration included with the published pipeline. |
created_at | DateTime | When the published pipeline was created. |
updated_at | DateTime | When the published pipeline was updated. |
created_on_version | String | The version of Wallaroo the publish was generated from. |
replaces | List(Integer) | List of other publishes that were replaced by this one. |
The following example shows how to publish a pipeline to the edge registry service associated with the Wallaroo instance for an Edge device with QAIC AI accelerator chips. In this example, the engine is allocated 1 CPU and 1 Gi of RAM, and the model is allocated 4 CPUs, 12 Gi of RAM, and 4 QAIC SoCs; the gpu parameter specifies the number of System-on-Chips (SoCs) allocated.

```python
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '12Gi') \
    .sidekick_gpus(model, 4) \
    .deployment_label("kubernetes.io/os:linux") \
    .build()
```
The pipeline is then published with the wallaroo.pipeline.Pipeline.publish(deployment_config) command.

```python
pipeline.publish(deployment_config=deployment_config)
```
Waiting for pipeline publish... It may take up to 600 sec.
....................................................................................... Published.
ID | 20 | |
Pipeline Name | llamaqaicopenaiedge | |
Pipeline Version | a0db44db-cc58-437e-9e73-2da3e3ae45e9 | |
Status | Published | |
Workspace Id | 9 | |
Workspace Name | younes@wallaroo.ai - Default Workspace | |
Edges | ||
Engine URL | us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261 | |
Pipeline URL | us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 | |
Helm Chart URL | oci://us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts/llamaqaicopenaiedge | |
Helm Chart Reference | us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/charts@sha256:bf9847efbca0c798d823afb11820b7e74f802233d0762d60b409b2023ab04d2e | |
Helm Chart Version | 0.0.1-a0db44db-cc58-437e-9e73-2da3e3ae45e9 | |
Engine Config | {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '1Gi'}, 'requests': {'cpu': 1.0, 'memory': '1Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'llama-qaic-openai-113': {'resources': {'limits': {'cpu': 4.0, 'memory': '12Gi'}, 'requests': {'cpu': 4.0, 'memory': '12Gi'}, 'accel': {'qaic': {'aic_enable_depth_first': False, 'ctx_len': 1024, 'full_batch_size': 16, 'mxfp6_matmul': True, 'mxint8_kv_cache': True, 'num_cores': 16, 'num_devices': 4, 'prefill_seq_len': 128}}, 'arch': 'x86', 'gpu': False}}}}} | |
User Images | [] | |
Created By | sample.user@wallaroo.ai | |
Created At | 2025-07-16 18:51:02.396949+00:00 | |
Updated At | 2025-07-16 18:51:02.396949+00:00 | |
Replaces | ||
Docker Run Command |
Note: Please set the EDGE_PORT , OCI_USERNAME , and OCI_PASSWORD environment variables. | |
Podman Run Command |
Note: Please set the EDGE_PORT , OCI_USERNAME , and OCI_PASSWORD environment variables. | |
Helm Install Command |
Note: Please set the HELM_INSTALL_NAME , HELM_INSTALL_NAMESPACE ,
OCI_USERNAME , and OCI_PASSWORD environment variables. |
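The Engine Config field shown above is a nested dictionary. As a plain-Python illustration (dictionary navigation only, not a Wallaroo SDK call), the QAIC acceleration settings that were applied at model upload can be read back out of that structure like this:

```python
# A trimmed copy of the 'engine' portion of the Engine Config shown above.
engine_config = {
    "engine": {
        "resources": {
            "limits": {"cpu": 1.0, "memory": "1Gi"},
            "requests": {"cpu": 1.0, "memory": "1Gi"},
            "accel": {
                "qaic": {
                    "aic_enable_depth_first": False,
                    "ctx_len": 1024,
                    "full_batch_size": 16,
                    "mxfp6_matmul": True,
                    "mxint8_kv_cache": True,
                    "num_cores": 16,
                    "num_devices": 4,
                    "prefill_seq_len": 128,
                }
            },
            "arch": "x86",
            "gpu": False,
        }
    }
}

# Navigate to the QAIC settings to confirm the acceleration configuration.
qaic_settings = engine_config["engine"]["resources"]["accel"]["qaic"]
num_soc = qaic_settings["num_devices"]
```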
By default, pipeline publishes include deployment commands for Docker, Podman, or Helm based deployments. The following examples cover Docker based deployments. For more details, see Edge and Multi-cloud Deployment and Inference.
Deploying ML models on Qualcomm QAIC hardware in edge and multi-cloud environments via docker run requires additional parameters. For QAIC deployments via docker or podman, the additional parameters depend on the devices used.
For All Devices: For all devices on the edge deployment, the parameter --privileged is required. The following is a sample command deploying a Wallaroo pipeline published in an OCI registry on an edge device with QAIC AI accelerators.
Docker example:
```shell
docker run -it -p 8080:8080 \
  -e DEBUG=true \
  -e CONFIG_CPUS=16 \
  -e OCI_USERNAME=oauth2accesstoken \
  -e OCI_PASSWORD=$tok \
  -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
  --privileged \
  --cpuset-cpus 0-15 --cpus 16 \
  us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
```
For specific devices, each device is specified via the --device parameter. The following example specifies devices accel4 through accel7.
Docker Example
```shell
docker run -p 8080:8080 \
  -e CONFIG_CPUS=16 \
  -e OCI_USERNAME=oauth2accesstoken \
  -e OCI_PASSWORD=$tok \
  -e PIPELINE_URL=us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/pipelines/llamaqaicopenaiedge:a0db44db-cc58-437e-9e73-2da3e3ae45e9 \
  --device=/dev/accel/accel4 \
  --device=/dev/accel/accel5 \
  --device=/dev/accel/accel6 \
  --device=/dev/accel/accel7 \
  --cpuset-cpus 0-15 \
  --cpus 16 \
  us-west1-docker.pkg.dev/wallaroo-dev-253816/testqaic/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-qaic-vllm:v2025.1.0-6261
```
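Assembling the --device flags by hand is error-prone when many accelerators are allocated. The following is a hypothetical helper (not part of Wallaroo or Docker) that generates the flags for a contiguous range of QAIC accelerator device nodes, matching the accel4 through accel7 example above:

```python
def qaic_device_flags(first: int, count: int) -> list[str]:
    """Build docker/podman --device flags for QAIC accelerator nodes.

    QAIC SoCs are exposed as /dev/accel/accelN device nodes; this returns
    one --device flag per allocated node, starting at accel{first}.
    """
    return [f"--device=/dev/accel/accel{first + i}" for i in range(count)]

# Flags for accel4 through accel7, as in the docker run example above.
flags = qaic_device_flags(first=4, count=4)
fragment = " \\\n".join(flags)  # ready to splice into a docker run command
```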
If a model is uploaded with the acceleration set to QAIC, but the architecture is set to an incompatible architecture (anything other than X86), the following error is returned:

“The specified model optimization configuration is not available. Please try this operation again using a different configuration or contact Wallaroo at support@wallaroo.ai for questions or help.”