Inference with Acceleration Libraries
Wallaroo supports deploying models that run with hardware accelerators that increase the inference speed and performance.
Hardware accelerators are set during model upload; this setting determines what hardware is allocated when the model is deployed in Wallaroo Ops and multicloud deployments.
Prerequisites
The following prerequisites must be met before uploading and deploying models with hardware accelerators.
- Hardware Availability: Hardware accelerators must be available in the environment that the model is deployed in.
- For instructions on adding GPUs to Kubernetes clusters, see Create GPU Nodepools for Kubernetes Clusters.
- For details on adding ARM nodes to a cluster, see Create ARM Nodepools for Kubernetes Clusters.
Supported Accelerators
The following accelerators are supported:
Accelerator | ARM Support | X64/X86 Support | Description |
---|---|---|---|
None | √ | √ | The default acceleration, used for all scenarios and architectures. |
AIO | √ | X | AIO acceleration for Ampere Optimized trained models, only available with ARM processors. |
Jetson | √ | X | NVIDIA Jetson acceleration is a CUDA-based accelerator used with edge deployments on ARM processors. |
CUDA | √ | √ | NVIDIA CUDA acceleration, supported by both ARM and X64/X86 processors. This is intended for deployment with GPUs. |
Set Accelerator via the Wallaroo SDK
Accelerators are set during model upload and packaging. To change the accelerator used with a model, re-upload the model with the new accelerator setting; this ensures maximum compatibility and support.
Model Accelerator at Model Upload
Models uploaded to Wallaroo have the accelerator set via the `wallaroo.client.upload_model` method's optional `accel: wallaroo.engine_config.Acceleration` parameter. For more details on model uploads and other parameters, see Automated Model Packaging.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework. |
input_schema | pyarrow.lib.Schema | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema | The output schema in Apache Arrow schema format. |
convert_wait | bool (Optional) | If True, the upload waits for model conversion to complete before returning. |
arch | wallaroo.engine_config.Architecture (Optional) | The architecture the model is deployed to. If a model is intended for deployment to an ARM architecture, it must be specified during this step. Values include wallaroo.engine_config.Architecture.X86 (the default) and wallaroo.engine_config.Architecture.ARM. |
accel | wallaroo.engine_config.Acceleration (Optional) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. Values correspond to the supported accelerators above: None (the default), AIO, Jetson, and CUDA. |
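As a quick sanity check of the compatibility rules in the table above, the helper below picks an accelerator name from a target hardware description. This is purely illustrative: `pick_accelerator` is a hypothetical name, not part of the Wallaroo SDK, and real uploads pass `wallaroo.engine_config.Acceleration` enum values to `upload_model`'s `accel` parameter.

```python
# Illustrative helper: maps target hardware to an accelerator name from the
# table above. NOT part of the Wallaroo SDK; real uploads pass
# wallaroo.engine_config.Acceleration values to upload_model's accel parameter.
def pick_accelerator(arch: str, has_gpu: bool = False,
                     ampere_optimized: bool = False,
                     edge_jetson: bool = False) -> str:
    """Return the accelerator name for the given target hardware."""
    if ampere_optimized:
        if arch != 'arm':
            raise ValueError('AIO is only available with ARM processors')
        return 'aio'
    if edge_jetson:
        if arch != 'arm':
            raise ValueError('Jetson acceleration targets ARM edge devices')
        return 'jetson'
    if has_gpu:
        return 'cuda'   # supported on both ARM and X64/X86
    return 'none'       # default acceleration, all scenarios and architectures

print(pick_accelerator('arm', ampere_optimized=True))  # aio
print(pick_accelerator('x86', has_gpu=True))           # cuda
```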
Deploy Models with Accelerator Settings
Models are added to a pipeline as pipeline steps. Models are then deployed through the `wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig] = None)` method. For full details, see Pipeline Deployment Configuration.
When deploying a model, the deployment configuration inherits the model's accelerator setting; no additional changes are needed to set the accelerator when deploying the model. Other settings, such as the number of CPUs or the amount of memory, can be changed without modifying the accelerator setting.
To change the accelerator settings, models should be re-uploaded as either a new model or a new model version for maximum compatibility with the hardware infrastructure. For more information on uploading models or new model versions, see Model Accelerator at Model Upload.
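To make the inheritance concrete, the resulting engine configuration combines the resources set in the deployment configuration with the `accel` and `arch` values carried by the model. The following is a sketch of the shape of such an engine config (values here follow the ARM/AIO example in this guide; the exact fields may vary by Wallaroo version):

```json
{
  "engine": {
    "resources": {
      "limits":   {"cpu": 1.0, "memory": "2Gi"},
      "requests": {"cpu": 1.0, "memory": "2Gi"}
    },
    "accel": "aio",
    "arch": "arm"
  }
}
```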
Model Accelerator for Pipeline Publish
Publishing the pipeline uses the `wallaroo.pipeline.publish(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])` method. This requires that the Wallaroo Ops instance has Edge Registry Services enabled.
The deployment configuration for the pipeline publish inherits the model's accelerator. Options such as the number of CPUs, the amount of memory, etc., can be adjusted without impacting the model's accelerator settings.
A deployment configuration must be included with the pipeline publish, even if no changes to the CPUs, memory, etc., are made. For more detail on deployment configurations, see Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration.
Pipelines do not need to be deployed in the Wallaroo Ops instance before publishing the pipeline. This is useful in multicloud deployments to edge devices with different hardware accelerators than the Wallaroo Ops instance.
To change the model acceleration settings, upload the model as a new model or model version with the new acceleration settings.
For more information, see Wallaroo SDK Essentials Guide: Pipeline Edge Publication.
NOTICE FOR JETSON ACCELERATED MODEL DEPLOYMENT
ML models published to OCI registries via the Wallaroo SDK are provided with the Docker Run Command: a sample `docker` script for deploying the model on edge and multicloud environments. For more details, see Edge and Multicloud Model Publish and Deploy.
For ML models deployed on Jetson accelerated hardware via Docker, the `docker` application is replaced by the `nvidia-docker` application. For details on installing `nvidia-docker`, see Installing the NVIDIA Container Toolkit. For example:
```shell
nvidia-docker run -v $PERSISTENT_VOLUME_DIR:/persist \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=ghcr.io/wallaroolabs/doc-samples/pipelines/sample-edge-deploy:446aeed9-2d52-47ae-9e5c-f2a05ef0d4d6 \
    -e EDGE_BUNDLE=abc123 \
    ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:2024.1.0-5097
```
Model Deployment with Accelerators via the Wallaroo SDK: Examples
The following examples demonstrate how to:
- Upload two models: a Resnet50 computer vision model set to `ARM` with the `AIO` accelerator, and a Hugging Face summarizer model set to the `CUDA` accelerator.
- Create a pipeline for each model and add the model as a pipeline step. Deploy each pipeline, and demonstrate that the deployment configuration inherits the model's acceleration settings.
- Publish each pipeline and demonstrate that the deployment configuration for the pipeline publish inherits the model's acceleration settings.
Resnet50 Model for ARM and AIO
For this example, we will use a Resnet50 computer vision model. We upload the model and set the architecture to `ARM` and the accelerator to `AIO`.
```python
model_name_arm = 'computer-vision-resnet50-arm'
model_file_name = './models/frcnn-resnet.pt.onnx'

arm_model = wl.upload_model(model_name_arm,
                            model_file_name,
                            framework=Framework.ONNX,
                            arch=wallaroo.engine_config.Architecture.ARM,
                            accel=wallaroo.engine_config.Acceleration.AIO)
display(arm_model)
```
Name | computer-vision-resnet50-arm |
---|---|
Version | 47743b5f-c88a-4150-a37f-9ad591eb4ee3 |
File Name | frcnn-resnet.pt.onnx |
SHA | 43326e50af639105c81372346fb9ddf453fea0fe46648b2053c375360d9c1647 |
Status | ready |
Image Path | None |
Architecture | arm |
Acceleration | aio |
Updated At | 2024-03-Apr 22:13:40 |
We then build the pipeline, add the model as our model step, and deploy it with a deployment configuration that allocates 1 CPU and 2Gi of RAM. We then show the deployment configuration inherited the model’s acceleration settings.
```python
pipeline_arm = wl.build_pipeline('acceleration-demonstration-arm')
pipeline_arm.clear()
pipeline_arm.add_model_step(arm_model)

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(1) \
    .memory("2Gi") \
    .build()

pipeline_arm.deploy(deployment_config=deployment_config)
display(pipeline_arm)
```
name | acceleration-demonstration-arm |
---|---|
created | 2024-04-01 18:36:26.347071+00:00 |
last_updated | 2024-04-03 22:14:42.912284+00:00 |
deployed | True |
arch | arm |
accel | aio |
tags | |
versions | 18329c99-4b9c-4a15-bc93-42e4d6b93fff, 2f1aa87e-edc2-4af7-8821-00ba54abf18e, 4c8ab1b1-f9c8-49d9-846a-54cad3a18b56, cbc520f2-5755-4f6b-8e89-b4374cb95fdf, 59ff6719-67f1-4359-a6b3-5565b9f6dc09, 39b91147-3b73-4f1a-a25f-500ef648bd6a, 45c0e8ba-b35d-4139-9675-aa7ffcc04dfc, 2d561d88-31f6-43c3-a84d-38cc1cd53fb8, ef9e2394-e29f-46dc-aaa4-eda0a304a71e, fe2b6f05-3623-4440-8258-9e5828bc7eaf, aa86387c-813a-40de-b07a-baf388e20d67 |
steps | computer-vision-resnet50-arm |
published | True |
We now publish the pipeline. Note that the `Engine Config` inherited the acceleration from the model.
```python
# default deployment configuration
pub_arm = pipeline_arm.publish(deployment_config=wallaroo.DeploymentConfigBuilder().build())
display(pub_arm)
```
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID | 87 |
---|---|
Pipeline Name | acceleration-demonstration-arm |
Pipeline Version | 890b56ee-2a0e-4ed1-ae96-c021ca801a7e |
Status | Published |
Engine URL | registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-aarch64:v2024.1.0-main-4870 |
Pipeline URL | registry.example.com/uat/pipelines/acceleration-demonstration-arm:890b56ee-2a0e-4ed1-ae96-c021ca801a7e |
Helm Chart URL | oci://registry.example.com/uat/charts/acceleration-demonstration-arm |
Helm Chart Reference | registry.example.com/uat/charts@sha256:15c50483f2010e2691d32d32ded595f20993fa7b043474962b0fa2b509b61510 |
Helm Chart Version | 0.0.1-890b56ee-2a0e-4ed1-ae96-c021ca801a7e |
Engine Config | {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '512Mi'}, 'requests': {'cpu': 1.0, 'memory': '512Mi'}, 'accel': 'aio', 'arch': 'arm', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none'}, 'images': {}}} |
User Images | [] |
Created By | john.hummel@wallaroo.ai |
Created At | 2024-04-03 22:17:03.122597+00:00 |
Updated At | 2024-04-03 22:17:03.122597+00:00 |
Replaces | |
Docker Run Command | Note: Please set the `EDGE_PORT`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
Helm Install Command | Note: Please set the `HELM_INSTALL_NAME`, `HELM_INSTALL_NAMESPACE`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
HF Summarizer with CUDA
For this example, we will use a Hugging Face summarizer model. When uploaded, this model is converted to the Wallaroo Containerized Runtime. We upload the model and set the accelerator to `CUDA`.
```python
model = wl.upload_model('hf-summarization',
                        './models/hf-summarisation-bart-large-samsun.zip',
                        framework=Framework.HUGGING_FACE_SUMMARIZATION,
                        input_schema=input_schema,
                        output_schema=output_schema,
                        accel=wallaroo.engine_config.Acceleration.CUDA)
display(model)
```
Name | hf-summarization |
---|---|
Version | 47743b5f-c88a-4150-a37f-9ad591eb4ee3 |
File Name | hf-summarisation-bart-large-samsun.zip |
SHA | ee71d066a83708e7ca4a3c07caf33fdc528bb000039b6ca2ef77fa2428dc6268 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.1.0-main-4921 |
Architecture | x86 |
Acceleration | cuda |
Updated At | 2024-03-Apr 22:13:40 |
We then build the pipeline, add the model as our model step, and deploy it with a deployment configuration that allocates 4 CPUs, 8Gi of RAM, and 1 GPU to the Containerized Runtime. We then show that the deployment configuration inherited the model's acceleration settings without their being specified.
```python
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('1Gi') \
    .sidekick_gpus(model, 1) \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '8Gi') \
    .deployment_label('wallaroo.ai/accelerator: a100') \
    .build()
display(deployment_config)
```
```python
{'engine': {'cpu': 1,
            'resources': {'limits': {'cpu': 1, 'memory': '1Gi'},
                          'requests': {'cpu': 1, 'memory': '1Gi'}},
            'node_selector': 'wallaroo.ai/accelerator: a100',
            'arch': 'x86',
            'accel': 'cuda'},
 'enginelb': {},
 'engineAux': {'images': {'hf-summarization-yns-65': {'resources': {'limits': {'nvidia.com/gpu': 1,
                                                                               'cpu': 4,
                                                                               'memory': '8Gi'},
                                                                    'requests': {'nvidia.com/gpu': 1,
                                                                                 'cpu': 4,
                                                                                 'memory': '8Gi'}}}}},
 'node_selector': {}}
```
We now publish the pipeline. Note that the `Engine Config` inherited the acceleration from the model.
```python
# default deployment configuration
pub_arm = pipeline.publish(deployment_config)
display(pub_arm)
```
Waiting for pipeline publish... It may take up to 600 sec.
Pipeline is publishing....... Published.
ID | 1 |
---|---|
Pipeline Name | hf-summarization-pipeline |
Pipeline Version | 6d453276-a4cf-4b01-90d7-78e9da1dd72a |
Status | Published |
Engine URL | ghcr.io/wallaroolabs/doc-samples/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2024.1.0-main-4921 |
Pipeline URL | ghcr.io/wallaroolabs/doc-samples/pipelines/hf-summarization-pipeline:6d453276-a4cf-4b01-90d7-78e9da1dd72a |
Helm Chart URL | oci://ghcr.io/wallaroolabs/doc-samples/charts/hf-summarization-pipeline |
Helm Chart Reference | ghcr.io/wallaroolabs/doc-samples/charts@sha256:a9406689f7429c16758447780c860ee41c78dc674280754eb2b377da1a9efbf4 |
Helm Chart Version | 0.0.1-6d453276-a4cf-4b01-90d7-78e9da1dd72a |
Engine Config | {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '512Mi'}, 'requests': {'cpu': 1.0, 'memory': '512Mi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none'}, 'images': {}}} |
User Images | [] |
Created By | john.hansarick@wallaroo.ai |
Created At | 2024-04-17 19:49:32.922418+00:00 |
Updated At | 2024-04-17 19:49:32.922418+00:00 |
Replaces | |
Docker Run Command | Note: Please set the `EDGE_PORT`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
Helm Install Command | Note: Please set the `HELM_INSTALL_NAME`, `HELM_INSTALL_NAMESPACE`, `OCI_USERNAME`, and `OCI_PASSWORD` environment variables. |
Model Accelerator Deployment Troubleshooting
If the specified hardware accelerator or infrastructure is not available in the Wallaroo Ops cluster during deployment, the following error message is displayed:
Tutorials
The following examples are available to demonstrate uploading and publishing models with hardware accelerator support.