Inference With Hardware Accelerators
Wallaroo supports deploying models with hardware accelerators that increase inference speed and performance.
Deploying models with AI hardware accelerators through Wallaroo uses the following process, depending on whether the model is deployed in Wallaroo or in an edge or multi-cloud environment.
Prerequisites
The following prerequisites must be met before uploading and deploying models with hardware accelerators.
- Hardware Availability: Hardware accelerators must be available in the environment that the model is deployed in.
- For instructions on adding GPU to Kubernetes clusters, see Create GPU Nodepools for Kubernetes Clusters.
- For details on adding ARM nodes to a cluster, see Create ARM Nodepools for Kubernetes Clusters.
Supported Accelerators
The following accelerators are supported:
Accelerator | ARM Support | X64/X86 Support | Intel GPU | Nvidia GPU | Description |
---|---|---|---|---|---|
None | N/A | N/A | N/A | N/A | The default acceleration, used for all scenarios and architectures. |
AIO | √ | X | X | X | AIO acceleration for Ampere Optimized trained models, only available with ARM processors. |
Jetson | √ | X | X | √ | Nvidia Jetson acceleration used with edge deployments with ARM processors. |
CUDA | √ | √ | X | √ | Nvidia CUDA acceleration supported by both ARM and X64/X86 processors. Intended for deployment with Nvidia GPUs. See Nvidia Jetson Deployment Scenario for additional requirements. |
OpenVINO | X | √ | √ | X | Intel OpenVINO acceleration, an AI accelerator from Intel compatible with x86/64 architectures. Aimed at edge and multi-cloud deployments either with or without Intel GPUs. |
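In the Wallaroo SDK, these accelerator options correspond to members of the `wallaroo.engine_config.Acceleration` enum. The following is a minimal sketch that lists the available members; the exact member names (for example, `_None` for the default) are assumptions to verify against the installed SDK version.

from wallaroo.engine_config import Acceleration

# List the available acceleration options. Member names such as
# `_None` (the default) are assumptions to verify against the
# installed SDK version.
for accel in Acceleration:
    print(accel)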
Upload Model
Accelerators are set during model upload and packaging. To change the accelerator used with a model, re-upload the model with the new accelerator setting; this ensures maximum compatibility and support.
Models uploaded to Wallaroo have the accelerator set via the `wallaroo.client.upload_model` method's optional `accel: wallaroo.engine_config.Acceleration` parameter. For more details on model uploads and other parameters, see Automated Model Packaging.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from `wallaroo.framework`. |
input_schema | `pyarrow.lib.Schema` | The input schema in Apache Arrow schema format. |
output_schema | `pyarrow.lib.Schema` | The output schema in Apache Arrow schema format. |
arch | `wallaroo.engine_config.Architecture` (Optional) | The architecture the model is deployed to. If a model is intended for deployment to an architecture other than X86, it must be specified during this step. Values include: X86 (the default) and ARM. |
accel | `wallaroo.engine_config.Acceleration` (Optional) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. Values include: None (the default), AIO, Jetson, CUDA, and OpenVINO. |
The following is a generic template for uploading models with the Wallaroo SDK.
model = wl.upload_model(name={Model Name},
                        path={Model File Path},
                        framework=Framework.{Wallaroo Framework},
                        arch=wallaroo.engine_config.Architecture.{X86 | ARM}, # defaults to X86
                        accel=wallaroo.engine_config.Acceleration.{Accelerator}) # defaults to None
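For example, the following sketch uploads a hypothetical ONNX model targeting ARM processors with CUDA acceleration; the model name and file path are placeholders for illustration.

import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture, Acceleration

wl = wallaroo.Client()

# The model name and file path below are hypothetical.
model = wl.upload_model(
    name="resnet50-accel",
    path="./models/resnet50.onnx",
    framework=Framework.ONNX,
    arch=Architecture.ARM,        # deploy to ARM processors
    accel=Acceleration.CUDA,      # use Nvidia GPU acceleration
)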
Deploy and Infer in the Cloud
Deploying models in the cloud in the Wallaroo Ops center takes the following steps:
- Create a pipeline and add the model as a pipeline step.
- Deploy the model. This allocates resources for the model's use.
When deploying a model, the deployment configuration inherits the model's accelerator and architecture settings. Other settings, such as the number of CPUs, the amount of RAM, etc., can be changed through the deployment configuration without modifying the accelerator setting; for full details, see Pipeline Deployment Configuration. If no deployment configuration is provided, the default resource allocations are used.
To change the accelerator settings, re-upload the model as either a new model or a new model version for maximum compatibility with the hardware infrastructure, as shown in the sketch below. For more information on uploading models or new model versions, see Model Accelerator at Model Upload.
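Continuing the upload sketch above, re-uploading under an existing model name registers a new model version carrying the new accelerator setting (names, paths, and enum members are illustrative):

# Re-uploading with an existing model name creates a new model
# version; here the accelerator is switched to OpenVINO
# (illustrative values; verify enum member names against your SDK).
model_v2 = wl.upload_model(
    name="resnet50-accel",            # existing name -> new version
    path="./models/resnet50.onnx",
    framework=Framework.ONNX,
    arch=Architecture.X86,
    accel=Acceleration.OpenVINO,
)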
The following is a generic template for deploying models in Wallaroo with a deployment configuration.
# create the pipeline
pipeline = wl.build_pipeline(pipeline_name={Pipeline Name})
# add the model as a pipeline step
pipeline.add_model_step(model)
# create a deployment configuration
deployment_config = wallaroo.DeploymentConfigBuilder() \
.cpus({Number of CPUs}) \
.memory("{Amount of RAM}") \
.build()
# deploy the model with the deployment configuration
pipeline.deploy(deployment_config = deployment_config)
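Once deployed, inference requests can be submitted directly to the pipeline. The following is a minimal sketch, assuming a pandas DataFrame whose columns match the model's input schema:

import pandas as pd

# The column name and values are illustrative; they must match the
# model's input schema.
input_df = pd.DataFrame({"inputs": [[1.0, 2.0, 3.0, 4.0]]})

# Submit the inference request and display the results.
results = pipeline.infer(input_df)
display(results)

# Undeploy when finished to return the allocated resources.
pipeline.undeploy()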
Deploy in Edge and Multi-cloud Environments
Deploying a model in an Edge and Multi-cloud environment takes two steps after uploading the model:
- Publish the Model: The model with its AI Accelerator settings is published to an OCI (Open Container Initiative) Registry.
- Deploy the Model on Edge: The model is deployed in an environment with hardware matching the AI Accelerator and architecture via `docker` or `helm`.
Publish the Model
Publishing the pipeline uses the method `wallaroo.pipeline.publish(deployment_config)`.
This requires that the Wallaroo Ops has Edge Registry Services enabled.
The deployment configuration for the pipeline publish inherits the model's accelerator. Options such as the number of CPUs, the amount of memory, etc., can be adjusted without impacting the model's accelerator settings.
A deployment configuration must be included with the pipeline publish, even if no changes to the CPUs, memory, etc. are made. For more detail on deployment configurations, see Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration.
Pipelines do not need to be deployed in the Wallaroo Ops Center before publishing the pipeline. This is useful in multi-cloud deployments to edge devices with different hardware accelerators than the Wallaroo Ops instance.
To change the model acceleration settings, upload the model as a new model or model version with the new acceleration settings.
For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.
The Wallaroo SDK publish output includes the `docker run` and `helm install` fields. These provide "copy and paste" scripts for deploying the model in edge and multi-cloud environments.
The following template shows publishing the model to the OCI (Open Container Initiative) Registry associated with the Wallaroo Ops Center, and the abbreviated output.
# create a default deployment configuration; a configuration must be
# provided even when no resource changes are made
deployment_config = wallaroo.DeploymentConfigBuilder().build()
publish = pipeline.publish(deployment_config=deployment_config)
display(publish)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID | 87 |
...(additional rows) | |
Docker Run Command | Note: Please set the `EDGE_PORT`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
Helm Install Command | Note: Please set the `HELM_INSTALL_NAME`, `HELM_INSTALL_NAMESPACE`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
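The same fields are available programmatically on the returned publish object. As a sketch, assuming attribute names such as `pipeline_url` (verify against the installed SDK version):

# Attribute names here are assumptions to verify against the
# installed Wallaroo SDK version.
print(publish.pipeline_url)  # OCI path of the published pipeline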
Model Deployment on Edge
Models published via the Wallaroo SDK return `docker run` and `helm install` templates for deploying the model on edge and multi-cloud environments. The `docker run` commands require the following environment variables to be set before executing:
- `$PERSISTENT_VOLUME_DIR` (Optional: Used with Edge and Multi-cloud Observability): The directory path for the model deployment's persistent volume. This stores session and other data for connecting back to the Wallaroo instance for inference logs and other uses.
- `$EDGE_PORT`: The network port used to submit inference requests to the deployed model.
- `$OCI_USERNAME`: The user name or identifier to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
- `$OCI_PASSWORD`: The password or token to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
The following `docker run` template is returned by Wallaroo during pipeline publish for deploying the model in an edge or multi-cloud environment.
docker run -v $PERSISTENT_VOLUME_DIR:/persist \
-p $EDGE_PORT:8080 \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
{Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
The following `helm install` template is returned by Wallaroo during pipeline publish for deploying the model in an edge or multi-cloud environment.
helm install --atomic $HELM_INSTALL_NAME \
{HELM CHART REGISTRY URL} \
--namespace $HELM_INSTALL_NAMESPACE \
--version {HELM VERSION} \
--set ociRegistry.username=$OCI_USERNAME \
--set ociRegistry.password=$OCI_PASSWORD
Intel OpenVINO Deployment Scenario with Intel GPUs
Deploying ML models with Intel OpenVINO hardware with Intel GPUs in edge and multi-cloud environments via `docker run` requires additional parameters.
For more details, see:
- Running in privileged Docker deployments: Runtime privilege and Linux capabilities
- Running in unprivileged Docker deployments: Additional groups
For ML models deployed on OpenVINO hardware with Intel GPUs, `docker run` must include the following options:
--rm -it --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* ) --ulimit nofile=262144:262144 --cap-add=sys_nice
For example, the following `docker run` template demonstrates deploying a Wallaroo published model on OpenVINO hardware with Intel GPUs:
docker run -v $PERSISTENT_VOLUME_DIR:/persist \
--rm -it --device /dev/dri \
--group-add=$(stat -c "%g" /dev/dri/render* ) \
--ulimit nofile=262144:262144 --cap-add=sys_nice \
-p $EDGE_PORT:8080 \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
{Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
Nvidia Jetson Deployment Scenario
ML models published to OCI registries via the Wallaroo SDK are provided with the Docker Run Command: a sample `docker` script for deploying the model on edge and multi-cloud environments. For more details, see Edge and Multicloud Model Publish and Deploy.
The following deployment prerequisites must be met on the target device where the models are deployed.
- Hardware Requirements:
- Software Requirements:
  - JetPack 6 or later. This includes CUDA 12.2.
Set up Docker access with the following commands before deploying:
sudo usermod -aG docker $USER
newgrp docker
For ML models deployed on Jetson accelerated hardware via Docker, the `docker run` command must include the options `--runtime nvidia --privileged --gpus all`. For details on installing the Nvidia Container Toolkit, see Installing the NVIDIA Container Toolkit. For example:
docker run --runtime nvidia --privileged --gpus all \
-v $PERSISTENT_VOLUME_DIR:/persist \
-e OCI_USERNAME=$OCI_USERNAME \
-e OCI_PASSWORD=$OCI_PASSWORD \
-e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/sample-edge-deploy:446aeed9-2d52-47ae-9e5c-f2a05ef0d4d6 \
-e EDGE_BUNDLE=abc123 \
ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-jetson:v2024.4.0-5849
See Nvidia Specialized Configurations with Docker for further details.
Model Accelerator Deployment Troubleshooting
If the specified hardware accelerator or infrastructure is not available in the Wallaroo Ops cluster during deployment, an error message is displayed indicating that the requested resources are unavailable.
Tutorials
The following examples are available to demonstrate uploading and publishing models with hardware accelerator support.