Inference With Hardware Accelerators
Wallaroo supports deploying models with hardware accelerators that increase inference speed and performance.
Deploying models with AI hardware accelerators through Wallaroo uses the following process, depending on whether the model is deployed in Wallaroo or in an edge or multi-cloud environment.
Prerequisites
The following prerequisites must be met before uploading and deploying models with hardware accelerators.
- Hardware Availability: Hardware accelerators must be available in the environment that the model is deployed in.
- For instructions on adding GPU to Kubernetes clusters, see Create GPU Nodepools for Kubernetes Clusters.
- For details on adding ARM nodes to a cluster, see Create ARM Nodepools for Kubernetes Clusters.
Supported Accelerators
The following accelerators are supported:
Accelerator | ARM Support | X64/X86 Support | Intel GPU | Nvidia GPU | Description |
---|---|---|---|---|---|
None | N/A | N/A | N/A | N/A | The default acceleration, used for all scenarios and architectures. |
AIO | √ | X | X | X | AIO acceleration for Ampere Optimized trained models, only available with ARM processors. |
Jetson | √ | X | X | √ | Nvidia Jetson acceleration used with edge deployments with ARM processors. |
CUDA | √ | √ | X | √ | Nvidia CUDA acceleration supported by both ARM and X64/X86 processors. Intended for deployment with Nvidia GPUs. See Nvidia Jetson Deployment Scenario for additional requirements. |
OpenVINO | X | √ | √ | X | Intel OpenVINO acceleration, an AI accelerator from Intel compatible with x86/64 architectures. Aimed at edge and multi-cloud deployments either with or without Intel GPUs. |
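In the Wallaroo SDK, these accelerator options correspond to members of the `wallaroo.engine_config.Acceleration` enum. The following is a minimal sketch that lists the available members; the exact member names (for example, `_None` for the default) are assumptions to verify against the installed SDK version.

from wallaroo.engine_config import Acceleration

# List the available acceleration options. Member names such as
# `_None` (the default) are assumptions to verify against the
# installed SDK version.
for accel in Acceleration:
    print(accel)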
Upload Model
Accelerators are set during model upload and packaging. To change the accelerator used with a model, re-upload the model with the new accelerator setting; this ensures maximum compatibility and support.
Models uploaded to Wallaroo have the accelerator set via the `wallaroo.client.upload_model` method's optional `accel: wallaroo.engine_config.Acceleration` parameter. For more details on model uploads and other parameters, see Automated Model Packaging.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from `wallaroo.framework`. |
input_schema | `pyarrow.lib.Schema` | The input schema in Apache Arrow schema format. |
output_schema | `pyarrow.lib.Schema` | The output schema in Apache Arrow schema format. |
arch | `wallaroo.engine_config.Architecture` (Optional) | The architecture the model is deployed to. If a model is intended for deployment to an architecture other than X86, it must be specified during this step. Values include: X86 (the default) and ARM. |
accel | `wallaroo.engine_config.Acceleration` (Optional) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. Values include: None (the default), AIO, Jetson, CUDA, and OpenVINO. |
The following is a generic template for uploading models with the Wallaroo SDK.
model = wl.upload_model(name={Model Name},
                        path={Model File Path},
                        framework=Framework.{Wallaroo Framework},
                        arch=wallaroo.engine_config.Architecture.{X86 | ARM}, # defaults to X86
                        accel=wallaroo.engine_config.Acceleration.{Accelerator}) # defaults to None
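For example, the following sketch uploads a hypothetical ONNX model targeting ARM processors with CUDA acceleration; the model name and file path are placeholders for illustration.

import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture, Acceleration

wl = wallaroo.Client()

# The model name and file path below are hypothetical.
model = wl.upload_model(
    name="resnet50-accel",
    path="./models/resnet50.onnx",
    framework=Framework.ONNX,
    arch=Architecture.ARM,        # deploy to ARM processors
    accel=Acceleration.CUDA,      # use Nvidia GPU acceleration
)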
Deploy and Infer in the Cloud
Deploying models in the cloud in the Wallaroo Ops center takes the following steps:
- Create a pipeline and add the model as a pipeline step.
- Deploy the model. This allocates resources for the model's use.
When deploying a model, the deployment configuration inherits the model's accelerator and architecture settings. Other settings, such as the number of CPUs, the amount of RAM, etc., can be changed through the deployment configuration without modifying the accelerator setting; for full details, see Pipeline Deployment Configuration. If no deployment configuration is provided, the default resource allocations are used.
To change the accelerator settings, re-upload the model as either a new model or a new model version for maximum compatibility with the hardware infrastructure, as shown in the sketch below. For more information on uploading models or new model versions, see Model Accelerator at Model Upload.
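Continuing the upload sketch above, re-uploading under an existing model name registers a new model version carrying the new accelerator setting (names, paths, and enum members are illustrative):

# Re-uploading with an existing model name creates a new model
# version; here the accelerator is switched to OpenVINO
# (illustrative values; verify enum member names against your SDK).
model_v2 = wl.upload_model(
    name="resnet50-accel",            # existing name -> new version
    path="./models/resnet50.onnx",
    framework=Framework.ONNX,
    arch=Architecture.X86,
    accel=Acceleration.OpenVINO,
)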
The following is a generic template for deploying models in Wallaroo with a deployment configuration.
# create the pipeline
pipeline = wl.build_pipeline(pipeline_name={Pipeline Name})
# add the model as a pipeline step
pipeline.add_model_step(model)
# create a deployment configuration
deployment_config = wallaroo.DeploymentConfigBuilder() \
.cpus({Number of CPUs}) \
.memory("{Amount of RAM}") \
.build()
# deploy the model with the deployment configuration
pipeline.deploy(deployment_config = deployment_config)
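Once deployed, inference requests can be submitted directly to the pipeline. The following is a minimal sketch, assuming a pandas DataFrame whose columns match the model's input schema:

import pandas as pd

# The column name and values are illustrative; they must match the
# model's input schema.
input_df = pd.DataFrame({"inputs": [[1.0, 2.0, 3.0, 4.0]]})

# Submit the inference request and display the results.
results = pipeline.infer(input_df)
display(results)

# Undeploy when finished to return the allocated resources.
pipeline.undeploy()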
Deploy in Edge and Multi-cloud Environments
Deploying a model in an Edge and Multi-cloud environment takes two steps after uploading the model:
- Publish the Model: The model with its AI Accelerator settings is published to an OCI (Open Container Initiative) Registry.
- Deploy the Model on Edge: The model is deployed in an environment with hardware matching the AI Accelerator and architecture via `docker` or `helm`.
Publish the Model
Publishing the pipeline uses the method `wallaroo.pipeline.publish(deployment_config)`.
This requires that the Wallaroo Ops has Edge Registry Services enabled.
The deployment configuration for the pipeline publish inherits the model's accelerator. Options such as the number of CPUs, the amount of memory, etc., can be adjusted without impacting the model's accelerator settings.
A deployment configuration must be included with the pipeline publish, even if no changes to the CPUs, memory, etc. are made. For more detail on deployment configurations, see Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration.
Pipelines do not need to be deployed in the Wallaroo Ops Center before publishing the pipeline. This is useful in multi-cloud deployments to edge devices with different hardware accelerators than the Wallaroo Ops instance.
To change the model acceleration settings, upload the model as a new model or model version with the new acceleration settings.
For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.
The Wallaroo SDK publish output includes the `docker run` and `helm install` fields. These provide "copy and paste" scripts for deploying the model in edge and multi-cloud environments.
The following template shows publishing the model to the OCI (Open Container Initiative) Registry associated with the Wallaroo Ops Center, and the abbreviated output.
# create a default deployment configuration; a configuration must be
# provided even when no resource changes are made
deployment_config = wallaroo.DeploymentConfigBuilder().build()
publish = pipeline.publish(deployment_config=deployment_config)
display(publish)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID | 87 |
...(additional rows) | |
Docker Run Command | Note: Please set the `EDGE_PORT`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
Helm Install Command | Note: Please set the `HELM_INSTALL_NAME`, `HELM_INSTALL_NAMESPACE`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
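The same fields are available programmatically on the returned publish object. As a sketch, assuming attribute names such as `pipeline_url` (verify against the installed SDK version):

# Attribute names here are assumptions to verify against the
# installed Wallaroo SDK version.
print(publish.pipeline_url)  # OCI path of the published pipeline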
Model Deployment on Edge
Models published via the Wallaroo SDK return `docker run` and `helm install` templates for deploying the model on edge and multi-cloud environments. The `docker run` commands require the following environment variables to be set before executing:
- `$PERSISTENT_VOLUME_DIR` (Optional: Used with Edge and Multi-cloud Observability): The directory path for the model deployment's persistent volume. This stores session and other data for connecting back to the Wallaroo instance for inference logs and other uses.
- `$EDGE_PORT`: The network port used to submit inference requests to the deployed model.
- `$OCI_USERNAME`: The user name or identifier to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
- `$OCI_PASSWORD`: The password or token to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
The following `docker run` template is returned by Wallaroo during pipeline publish for deploying the model in an edge or multi-cloud environment.
docker run -v $PERSISTENT_VOLUME_DIR:/persist \
-p $EDGE_PORT:8080 \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
{Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
The following `helm install` template is returned by Wallaroo during pipeline publish for deploying the model in an edge or multi-cloud environment.
helm install --atomic $HELM_INSTALL_NAME \
{HELM CHART REGISTRY URL} \
--namespace $HELM_INSTALL_NAMESPACE \
--version {HELM VERSION} \
--set ociRegistry.username=$OCI_USERNAME \
--set ociRegistry.password=$OCI_PASSWORD
Intel OpenVINO Deployment Scenario with Intel GPUs
Deploying ML models with Intel OpenVINO hardware with Intel GPUs in edge and multi-cloud environments via `docker run` requires additional parameters.
For more details, see:
- Running in privileged Docker deployments: Runtime privilege and Linux capabilities
- Running in unprivileged Docker deployments: Additional groups
For ML models deployed on OpenVINO hardware with Intel GPUs, `docker run` must include the following options:
--rm -it --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* ) --ulimit nofile=262144:262144 --cap-add=sys_nice
For example, the following `docker run` template demonstrates deploying a Wallaroo published model on OpenVINO hardware with Intel GPUs:
docker run -v $PERSISTENT_VOLUME_DIR:/persist \
--rm -it --device /dev/dri \
--group-add=$(stat -c "%g" /dev/dri/render* ) \
--ulimit nofile=262144:262144 --cap-add=sys_nice \
-p $EDGE_PORT:8080 \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
{Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
Nvidia Jetson Deployment Scenario
ML models published to OCI registries via the Wallaroo SDK are provided with the Docker Run Command: a sample `docker` script for deploying the model on edge and multi-cloud environments. For more details, see Edge and Multicloud Model Publish and Deploy.
The following deployment prerequisites must be met on the target device where the models are deployed.
- Hardware Requirements:
- Software Requirements:
  - JetPack 6 or later. This includes CUDA 12.2.
Set up Docker access with the following commands before deploying:
sudo usermod -aG docker $USER
newgrp docker
For ML models deployed on Jetson accelerated hardware via Docker, the `docker run` command must include the options `--runtime nvidia --privileged --gpus all`. For details on installing the Nvidia Container Toolkit, see Installing the NVIDIA Container Toolkit. For example:
docker run --runtime nvidia --privileged --gpus all \
-v $PERSISTENT_VOLUME_DIR:/persist \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/sample-edge-deploy:446aeed9-2d52-47ae-9e5c-f2a05ef0d4d6 \
-e EDGE_BUNDLE=abc123 \
ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-jetson:v2024.4.0-5849
See Nvidia Specialized Configurations with Docker for further details.
Model Accelerator Deployment Troubleshooting
If the specified hardware accelerator or infrastructure is not available in the Wallaroo Ops cluster during deployment, an error message is displayed indicating that the requested resources are unavailable.
Tutorials
The following examples are available to demonstrate uploading and publishing models with hardware accelerator support.