Inference with Acceleration Libraries

How to package models to run with hardware accelerators

Wallaroo supports deploying models that run with hardware accelerators that increase the inference speed and performance.

Deploying models with AI hardware accelerators through Wallaroo uses the following process, depending on whether the model is deployed in Wallaroo or in an edge or multi-cloud environment.

Prerequisites

The following prerequisites must be met before uploading and deploying models with hardware accelerators.

Supported Accelerators

The following accelerators are supported:

| Accelerator | ARM Support | X64/X86 Support | Intel GPU | Nvidia GPU | Description |
|---|---|---|---|---|---|
| None | N/A | N/A | N/A | N/A | The default acceleration, used for all scenarios and architectures. |
| AIO | Yes | No | No | No | AIO acceleration for Ampere Optimized trained models, only available with ARM processors. |
| Jetson | Yes | No | No | Yes | Nvidia Jetson acceleration used with edge deployments with ARM processors. |
| CUDA | Yes | Yes | No | Yes | Nvidia CUDA acceleration supported by both ARM and X64/X86 processors. Intended for deployment with Nvidia GPUs. |
| OpenVINO | No | Yes | Yes | No | Intel OpenVINO acceleration. AI Accelerator from Intel compatible with x86/64 architectures. Aimed at edge and multi-cloud deployments either with or without Intel GPUs. |
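
The support matrix above can be encoded as a small lookup so that an accelerator and architecture pairing is checked before a model is uploaded. The following is a minimal sketch in plain Python and is not part of the Wallaroo SDK: the SUPPORTED dictionary and the is_supported helper are hypothetical names, and the string keys stand in for the SDK's enum values.

```python
# Hypothetical lookup mirroring the support table above; not part of the Wallaroo SDK.
# Each accelerator maps to the set of processor architectures it supports.
SUPPORTED = {
    "None": {"ARM", "X86"},      # default acceleration, all architectures
    "AIO": {"ARM"},              # Ampere Optimized models, ARM only
    "Jetson": {"ARM"},           # Nvidia Jetson edge devices, ARM only
    "CUDA": {"ARM", "X86"},      # Nvidia GPUs on either architecture
    "OpenVINO": {"X86"},         # Intel OpenVINO, x86/64 only
}


def is_supported(accel: str, arch: str) -> bool:
    """Return True when the table lists `arch` as supported for `accel`."""
    if accel not in SUPPORTED:
        raise ValueError(f"unknown accelerator: {accel!r}")
    return arch in SUPPORTED[accel]


if __name__ == "__main__":
    for accel, archs in SUPPORTED.items():
        print(accel, sorted(archs))
```

A check such as is_supported("OpenVINO", "ARM") returning False flags an unsupported pairing before any upload is attempted.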

Upload Model

Accelerators are set during model upload and packaging. To change the accelerator used with a model, re-upload the model with the new accelerator setting for maximum compatibility and support.

Models uploaded to Wallaroo have the accelerator set via the wallaroo.client.upload_model method’s optional accel: wallaroo.engine_config.Acceleration parameter. For more details on model uploads and other parameters, see Automated Model Packaging.

| Parameter | Type | Description |
|---|---|---|
| name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
| path | string (Required) | The path to the model file being uploaded. |
| framework | string (Required) | The framework of the model from wallaroo.framework. |
| input_schema | pyarrow.lib.Schema (Native Wallaroo Runtimes: Optional; Non-Native Wallaroo Runtimes: Required) | The input schema in Apache Arrow schema format. |
| output_schema | pyarrow.lib.Schema (Native Wallaroo Runtimes: Optional; Non-Native Wallaroo Runtimes: Required) | The output schema in Apache Arrow schema format. |
| arch | wallaroo.engine_config.Architecture (Optional) | The architecture the model is deployed to. If a model is intended for deployment to an ARM architecture, it must be specified during this step. |
| accel | wallaroo.engine_config.Acceleration (Optional) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. |

Values for arch:

  • X86 (Default): x86 based architectures.
  • ARM: ARM based architectures.

Values for accel:

  • wallaroo.engine_config.Acceleration._None (Default): No accelerator is assigned. This works for all infrastructures.
  • wallaroo.engine_config.Acceleration.AIO: AIO acceleration for Ampere Optimized trained models, only available with ARM processors.
  • wallaroo.engine_config.Acceleration.Jetson: Nvidia Jetson acceleration used with edge deployments with ARM processors.
  • wallaroo.engine_config.Acceleration.CUDA: Nvidia CUDA acceleration supported by both ARM and X64/X86 processors. This is intended for deployment with Nvidia GPUs.
  • wallaroo.engine_config.Acceleration.OpenVINO: Intel OpenVINO acceleration. AI Accelerator from Intel compatible with x86/64 architectures. Aimed at edge and multi-cloud deployments either with or without Intel GPUs.

The following is a generic template for uploading models with the Wallaroo SDK.

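# imports and client connection assumed by this template
import wallaroo
from wallaroo.framework import Framework
wl = wallaroo.Client()

model = wl.upload_model(name={Model Name},
                        path={Model File Path},
                        framework=Framework.{Wallaroo Framework},
                        arch=wallaroo.engine_config.Architecture.{X86 | ARM}, # defaults to X86
                        accel=wallaroo.engine_config.Acceleration.{Accelerator}) # defaults to _None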
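
As a sketch of how the template might be wrapped for reuse, the helper below forwards only the parameters described in the table above to upload_model. The function name upload_with_accelerator is hypothetical and not part of the Wallaroo SDK. The client argument is assumed to be an already connected Wallaroo client (wl in the template); any object exposing upload_model with these keyword arguments works.

```python
from typing import Any, Optional


def upload_with_accelerator(
    client: Any,
    name: str,
    path: str,
    framework: Any,
    arch: Any,
    accel: Any,
    input_schema: Optional[Any] = None,
    output_schema: Optional[Any] = None,
) -> Any:
    """Hypothetical wrapper: upload a model with explicit arch and accel settings."""
    kwargs = {
        "name": name,
        "path": path,
        "framework": framework,
        "arch": arch,
        "accel": accel,
    }
    # Schemas are only required for non-native runtimes, so pass them only when given.
    if input_schema is not None:
        kwargs["input_schema"] = input_schema
    if output_schema is not None:
        kwargs["output_schema"] = output_schema
    return client.upload_model(**kwargs)
```

Making arch and accel required arguments in the wrapper forces the caller to choose them at upload time, which matches the guidance that changing an accelerator means re-uploading the model.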

Deploy and Infer in the Cloud

Models deployed in the cloud in the Wallaroo Ops center take the following steps:

  • Create a pipeline and add the model as a pipeline step.
  • Deploy the model. This allocates resources for the model’s use.

When deploying a model, the deployment configuration inherits the model’s accelerator and architecture settings. Other settings, such as the number of CPUs and the amount of RAM, are set with the deployment configuration and can be changed without modifying the accelerator setting; for full details, see Pipeline Deployment Configuration. If no deployment configuration is provided, then the default resource allocations are used.

To change the accelerator settings, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure. For more information on uploading models or new model versions, see Model Accelerator at Model Upload.

The following is a generic template for deploying models in Wallaroo with a deployment configuration.

# create the pipeline

pipeline = wl.build_pipeline(pipeline_name={Pipeline Name})

# add the model as a pipeline step

pipeline.add_model_step(model)

# create a deployment configuration

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus({Number of CPUs}) \
    .memory("{Amount of RAM}") \
    .build()

# deploy the model with the deployment configuration
pipeline.deploy(deployment_config = deployment_config)

Deploy in Edge and Multi-cloud Environments

Deploying a model in an Edge and Multi-cloud environment takes two steps after uploading the model:

  • Publish the Model: The model with its AI Accelerator settings is published to an OCI (Open Container Initiative) Registry.
  • Deploy the Model on Edge: The model is deployed in an environment with hardware matching the AI Accelerator and architecture via docker or helm.

Publish the Model

Publishing the pipeline uses the method wallaroo.pipeline.publish(deployment_config).

This requires that the Wallaroo Ops instance has Edge Registry Services enabled.

The deployment configuration for the pipeline publish inherits the model’s accelerator. Options such as the number of CPUs and the amount of memory can be adjusted without impacting the model’s accelerator settings.

A deployment configuration must be included with the pipeline publish, even if no changes to the CPUs, memory, etc. are made. For more detail on deployment configurations, see Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration.

Pipelines do not need to be deployed in the Wallaroo Ops Center before publishing the pipeline. This is useful in multicloud deployments to edge devices with different hardware accelerators than the Wallaroo Ops instance.

To change the model acceleration settings, upload the model as a new model or model version with the new acceleration settings.

For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.

The Wallaroo SDK publish output includes the Docker Run Command and Helm Install Command fields. These provide “copy and paste” scripts for deploying the model in edge and multi-cloud environments.

The following template shows publishing the model to the OCI (Open Container Initiative) Registry associated with the Wallaroo Ops Center, and the abbreviated output.

# default deployment configuration
publish = pipeline.publish(deployment_config=deployment_config)
display(publish)
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID: 87
...(additional rows)
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL={PIPELINE URL} \
    -e CONFIG_CPUS=1 {ENGINE URL}

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    {HELM CHART REGISTRY URL} \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version {HELM VERSION} \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

Model Deployment on Edge

Models published via the Wallaroo SDK return docker run and helm install templates for deploying the model in edge and multi-cloud environments. The docker run commands require the following environment variables to be set before executing:

  • $PERSISTENT_VOLUME_DIR (Optional: Used with Edge and Multi-cloud Observability): The directory path for the model deployment’s persistent volume. This stores session and other data for connecting back to the Wallaroo instance for inference logs and other uses.
  • $EDGE_PORT: The network port used to submit inference requests to the deployed model.
  • $OCI_USERNAME: The user name or identifier to authenticate to the OCI (Open Container Initiative) Registry where the model was published.
  • $OCI_PASSWORD: The password or token to authenticate to the OCI (Open Container Initiative) Registry where the model was published.

The following docker run template is returned by Wallaroo during pipeline publish for deploying the model in an edge and multi-cloud environment.

docker run -v $PERSISTENT_VOLUME_DIR:/persist \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
    {Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
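
Because the template fails in unhelpful ways when a variable is unset, it can help to assemble the command programmatically and check the variables first. The following is a minimal sketch under stated assumptions: build_docker_run and REQUIRED_VARS are hypothetical names, and pipeline_url and engine_url are assumed to be the full values (including version tags) copied from the publish output.

```python
import os
from typing import List, Mapping, Optional

# Variables the docker run template requires; PERSISTENT_VOLUME_DIR is optional.
REQUIRED_VARS = ("EDGE_PORT", "OCI_USERNAME", "OCI_PASSWORD")


def build_docker_run(
    pipeline_url: str,
    engine_url: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Assemble the docker run arguments from the template, failing early on unset variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"unset environment variables: {', '.join(missing)}")

    args = ["docker", "run"]
    # The persistent volume is optional (used with edge observability).
    volume_dir = env.get("PERSISTENT_VOLUME_DIR")
    if volume_dir:
        args += ["-v", f"{volume_dir}:/persist"]
    args += [
        "-p", f"{env['EDGE_PORT']}:8080",
        "-e", f"OCI_USERNAME={env['OCI_USERNAME']}",
        "-e", f"OCI_PASSWORD={env['OCI_PASSWORD']}",
        "-e", f"PIPELINE_URL={pipeline_url}",
        engine_url,
    ]
    return args
```

The returned list can be passed to subprocess.run(args, check=True). Passing a list rather than a shell string avoids quoting problems with credentials that contain special characters.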

The following helm install template is returned by Wallaroo during pipeline publish for deploying the model in an edge and multi-cloud environment.

helm install --atomic $HELM_INSTALL_NAME \
    {HELM CHART REGISTRY URL} \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version {HELM VERSION} \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Intel OpenVINO Deployment Scenario with Intel GPUs

Deploying ML models on Intel OpenVINO hardware with Intel GPUs in edge and multi-cloud environments via docker run requires additional parameters.

For more details, see:

For ML models deployed on OpenVINO hardware with Intel GPUs, docker run must include the following options:

--rm -it --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* ) --ulimit nofile=262144:262144 --cap-add=sys_nice

For example, the following docker run template demonstrates deploying a Wallaroo published model on OpenVINO hardware with Intel GPUs:

docker run -v $PERSISTENT_VOLUME_DIR:/persist \
    --rm -it --device /dev/dri \
    --group-add=$(stat -c "%g" /dev/dri/render* ) \
    --ulimit nofile=262144:262144 --cap-add=sys_nice \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL={PIPELINE_URL}:{PIPELINE_VERSION} \
    {Wallaroo_Engine_URL}:{WALLAROO_ENGINE_VERSION}
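
When the docker run command is assembled from a script instead of a shell, the stat lookup has to be done explicitly. The following is a minimal sketch, assuming the render devices live under /dev/dri as in the options above; intel_gpu_docker_options is a hypothetical helper name, and it reads each device's group id with os.stat as the Python counterpart of stat -c "%g".

```python
import glob
import os
from typing import List


def intel_gpu_docker_options(render_glob: str = "/dev/dri/render*") -> List[str]:
    """Return the extra docker run options listed above for Intel GPU hosts."""
    # Python equivalent of: stat -c "%g" /dev/dri/render*
    group_ids = sorted({os.stat(path).st_gid for path in glob.glob(render_glob)})
    if not group_ids:
        raise FileNotFoundError(f"no render devices match {render_glob}")

    options = ["--rm", "-it", "--device", "/dev/dri"]
    # One --group-add per distinct group, in case several render nodes exist.
    options += [f"--group-add={gid}" for gid in group_ids]
    options += ["--ulimit", "nofile=262144:262144", "--cap-add=sys_nice"]
    return options
```

The options are returned as a list so they can be spliced into an argument list right after "docker", "run". Raising an error when no render device is found surfaces a missing GPU driver before the container starts.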

Nvidia Jetson Deployment Scenario

ML models published to OCI registries via the Wallaroo SDK are provided with the Docker Run Command: a sample docker script for deploying the model on edge and multicloud environments. For more details, see Edge and Multicloud Model Publish and Deploy.

To deploy the model to a Jetson device, the following prerequisites on the Jetson target device are required:

For ML models deployed on Jetson accelerated hardware via Docker, the docker application is replaced by the nvidia-docker application. For details on installing nvidia-docker, see Installing the NVIDIA Container Toolkit. For example:

nvidia-docker run -v $PERSISTENT_VOLUME_DIR:/persist \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/sample-edge-deploy:446aeed9-2d52-47ae-9e5c-f2a05ef0d4d6 \
    -e EDGE_BUNDLE=abc123 \
    ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:2024.3

See Nvidia Specialized Configurations with Docker for further details.

Model Accelerator Deployment Troubleshooting

If the specified hardware accelerator or infrastructure is not available in the Wallaroo Ops cluster during deployment, the following error message is displayed:

Tutorials

The following examples are available to demonstrate uploading and publishing models with hardware accelerator support.