Inference with GPUs

How to deploy packaged ML models to run on GPUs

ML models uploaded to Wallaroo can be deployed with or without GPUs by specifying the deployment configuration. The following is a condensed guide for deploying ML models with GPU support through the Wallaroo SDK. For full details on model deployment configurations, see the Model Deploy guide.

Model Deployments Prerequisites for GPU Support

Before deploying an ML model with GPU support, a nodepool of GPU-enabled VMs must be available. The Platform Administration guide Create GPU Nodepools for Kubernetes Clusters details how to create GPU-enabled nodepools for different clouds.

The GPU-enabled nodepool must include the following Kubernetes taint and labels.

Taint:
  • wallaroo.ai/pipelines=true:NoSchedule
Labels:
  • wallaroo.ai/node-purpose: pipelines
  • {custom label}, for example: wallaroo/gpu:true

The custom label is required for deploying ML models with GPU support; this allows Wallaroo to select the correct nodepool that has the GPU hardware the ML model requires. For more details on Kubernetes taints and labels for ML model deployment in Wallaroo, see the Taints and Tolerations Guide.

Model Deployment for GPU Support

ML model deployments in Wallaroo via the Wallaroo SDK use the following steps:

Create Deployment Configuration

Setting a deployment configuration follows this process:

  1. Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
  2. Once the configuration options are set, the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
  3. The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method, as sketched below.
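
The following is a minimal sketch of that flow end to end; the pipeline name sample-pipeline and model name sample-model are placeholders, and the CPU and memory values are illustrative only.

import wallaroo

# create and save the client connection to Wallaroo
wl = wallaroo.Client()

# step 1: create the deployment configuration builder and set the resource options
deployment_config_builder = wallaroo.DeploymentConfigBuilder() \
    .cpus(2) \
    .memory('1Gi')

# step 2: finalize the deployment configuration
deployment_config = deployment_config_builder.build()

# step 3: apply the deployment configuration during model deployment
pipeline = wl.build_pipeline('sample-pipeline')   # placeholder pipeline name
model = wl.get_model('sample-model')              # placeholder model name
pipeline.add_model_step(model)
pipeline.deploy(deployment_config)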

Create GPU Deployment Configuration

Deployment configurations with GPU support are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.

GPUs can only be allocated in whole integer units from the GPU-enabled nodepools. gpus(1), gpus(2), etc. are valid values, while gpus(0.25) is not.

Organizations should be aware of how many GPUs are available in the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment fails and returns an error message.

GPU Support

Wallaroo 2023.2.1 and above support Kubernetes nodepools with NVIDIA CUDA GPUs.

See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.

Deployment configurations default to the following*.

Runtime                             CPUs    Memory    GPUs
Wallaroo Native Runtime**           4       3 Gi      0
Wallaroo Containerized Runtime***   2       1 Gi      0

*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.

The DeploymentConfigBuilder provides the following methods to specify what resources are allocated. Note that there are two separate runtimes based on the model:

  • Wallaroo Native Runtime: ML models including ONNX and TensorFlow that are deployed natively to Wallaroo. Models deployed in this runtime share all resources allocated to the runtime.
  • Wallaroo Containerized Runtime: ML models including Hugging Face, BYOP (Bring Your Own Predict), etc. Each model in this runtime is allocated its own number of GPUs, amount of RAM, and other resources.

For more details on ML model upload and packaging with Wallaroo, see the Model Upload guide.

The following represents the essential parameters for setting the deployment configuration with GPU-enabled ML models. For full details on setting deployment configurations for ML model deployment, see the Deployment Configuration guide.

  • deployment_label(label: string): Applies to the Native and Containerized runtimes. Label used to match the nodepool label used for the deployment. Required if gpus are set, and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo.
  • gpus(core_count: int): Applies to the Native runtime. Sets the number of GPUs to allocate for the native runtime. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must also be called and must match the GPU nodepool of the cluster hosting the Wallaroo instance.
  • sidekick_gpus(model: wallaroo.model.Model, core_count: int): Applies to the Containerized runtime. Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If called, then deployment_label must also be called and must match the GPU nodepool of the cluster hosting the Wallaroo instance.
  • cpus(core_count: float): Applies to the Native runtime. Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions.
  • memory(memory_spec: str): Applies to the Native runtime. Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format "{size as number}{unit value}". The accepted unit values are:
    • KiB (kibibytes)
    • MiB (mebibytes)
    • GiB (gibibytes)
    • TiB (tebibytes)
    The values are similar to the Kubernetes memory resource units format.
  • sidekick_cpus(model: wallaroo.model.Model, core_count: float): Applies to the Containerized runtime. Sets the number of CPUs to be used for the model's sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows:
    • Model model: The sidekick model to configure.
    • float core_count: Number of CPU cores to use in this sidekick.
  • sidekick_memory(model: wallaroo.model.Model, memory_spec: str): Applies to the Containerized runtime. Sets the memory available to the model's sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows:
    • Model model: The sidekick model to configure.
    • memory_spec: The amount of memory to allocate, as memory unit values. The accepted unit values are:
      • KiB (kibibytes)
      • MiB (mebibytes)
      • GiB (gibibytes)
      • TiB (tebibytes)
      The values are similar to the Kubernetes memory resource units format.

Once the configuration options are set, the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.

Deploy ML Model with GPUs Example

The following demonstrates deploying an ML model with GPU support through Wallaroo.

This assumes that the model has already been uploaded. We retrieve it by name via the wallaroo.client.Client.get_model method.

import wallaroo

# create and save the client connection to Wallaroo
wl = wallaroo.Client()

# retrieve the model reference
gpu_model = wl.get_model('sample-gpu-model')
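
If the ML model has not been uploaded yet, it can be uploaded through the Wallaroo SDK first. The following is a minimal sketch assuming an ONNX model stored at a hypothetical local path; the framework argument should match the actual model. See the Model Upload guide for full details.

from wallaroo.framework import Framework

# upload the model if it is not already in the workspace;
# './models/sample_gpu_model.onnx' is a hypothetical path and
# Framework.ONNX is an assumption about the model's framework
gpu_model = wl.upload_model(
    'sample-gpu-model',
    './models/sample_gpu_model.onnx',
    framework=Framework.ONNX
)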

The model is added as a pipeline step to a Wallaroo pipeline. For details, see the Model Deploy guide.

# create the Wallaroo pipeline

pipeline = wl.build_pipeline('sample-gpu-pipeline')

# add the ML model as a pipeline step

pipeline.add_model_step(gpu_model)

Depending on which Wallaroo runtime the model is deployed in, one of the following deployment configurations can be used. Each assigns a GPU to the appropriate runtime.

All models deployed in the Wallaroo Native Runtime are assigned the same resources. The following example creates a deployment configuration for the ML model saved to the gpu_model reference.

For this example, the deployment will assign:

  • Wallaroo Native Runtime
    • 4 CPUs
    • 3 Gi RAM
    • 1 GPU

Note that the deployment_label must match the Kubernetes label assigned to the nodepool containing the GPU-enabled VMs.

gpu_deployment_configuration = wallaroo.DeploymentConfigBuilder() \
                                .cpus(4) \
                                .memory('3Gi') \
                                .gpus(1) \
                                .deployment_label('doc-gpu-label:true') \
                                .build()

Each model deployed in the Wallaroo Containerized Runtime is assigned its own resources. The following example creates a deployment configuration for the model referenced by the gpu_model variable.

For this example, the deployment configuration assigns minimal resources to the Wallaroo Native Runtime, since no models are deployed to that environment, and assigns the GPU to the ML model deployed in the Containerized Runtime environment.

  • Wallaroo Native Runtime
    • 0.25 CPU
    • 1 Gi RAM
    • 0 GPUs
  • Wallaroo Containerized Runtime
    • 1 GPU
    • 4 CPUs
    • 3 Gi RAM

Note that the deployment_label must match the Kubernetes label assigned to the nodepool containing the GPU-enabled VMs.

gpu_deployment_configuration = wallaroo.DeploymentConfigBuilder() \
                                .cpus(0.25) \
                                .memory('1Gi') \
                                .gpus(0) \
                                .deployment_label('doc-gpu-label:true') \
                                .sidekick_gpus(gpu_model, 1) \
                                .sidekick_cpus(gpu_model, 4) \
                                .sidekick_memory(gpu_model, '3Gi') \
                                .build()

With the deployment configuration finalized, we deploy the model with the wallaroo.pipeline.Pipeline.deploy(deployment_configuration) method, passing the gpu_deployment_configuration created above. This allocates the specified resources for the ML model. Note that if the resources are not available, the request returns an error, as illustrated after the deploy call below.

pipeline.deploy(gpu_deployment_configuration)
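
If the requested resources cannot be allocated, the deploy call fails. The following is a minimal sketch of handling that case; a broad exception catch is used because the exact exception type raised depends on the Wallaroo SDK version.

# if the cluster cannot satisfy the requested resources (for example, no free
# GPUs remain in the GPU nodepool), deployment fails with an error
try:
    pipeline.deploy(gpu_deployment_configuration)
except Exception as err:
    print(f"Deployment failed: {err}")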

Once the deployment is complete, the ML model is ready for inference requests, as sketched below. For more details on submitting inference requests to deployed ML models, see Model Inference.
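
The following is a minimal sketch of submitting an inference request with the pipeline's infer method; the input column name and values are placeholders, since the actual input schema depends on the model. Undeploying the pipeline afterwards releases the GPU back to the nodepool.

import pandas as pd

# placeholder input data; the real column names and shapes depend on the
# model's input schema
input_data = pd.DataFrame({'input': [[1.0, 2.0, 3.0, 4.0]]})

# submit the inference request to the deployed pipeline
results = pipeline.infer(input_data)
print(results)

# undeploy the pipeline when finished to release the GPU back to the nodepool
pipeline.undeploy()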