Inference with NVIDIA GPUs
ML models uploaded to Wallaroo can be deployed with or without GPUs by specifying the deployment configuration. The following is a condensed guide for deploying ML models with GPU support through the Wallaroo SDK. For full details on model deployment configurations, see the Model Deploy guide.
Model Deployment Prerequisites for GPU Support
Before deploying an ML model with GPU support, a nodepool of GPU-equipped VMs must be available. The Platform Administration Guide Create GPU Nodepools for Kubernetes Clusters details how to create nodepools with GPU support for different clouds.
The nodepool with GPU must include the following Kubernetes taints and labels.
| Taint | Label | 
|---|---|
| wallaroo.ai/pipelines=true:NoSchedule | wallaroo.ai/node-purpose: pipelines and {custom label}, for example: wallaroo/gpu:true |
The custom label is required for deploying ML models with GPU support; this allows Wallaroo to select the correct nodepool that has the GPU hardware the ML model requires. For more details on Kubernetes taints and labels for ML model deployment in Wallaroo, see the Taints and Tolerations Guide.
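The taint and label requirements above can be checked before attempting a deployment. The following is a minimal sketch in plain Python, not part of the Wallaroo SDK: the function name `missing_gpu_nodepool_settings` and the sample nodepool values are hypothetical, and the labels and taints are assumed to have been gathered by the administrator from the cluster.

```python
# Taint and label that the GPU nodepool must carry, per the table above.
REQUIRED_TAINT = "wallaroo.ai/pipelines=true:NoSchedule"
REQUIRED_LABEL = ("wallaroo.ai/node-purpose", "pipelines")


def missing_gpu_nodepool_settings(labels, taints, custom_label):
    """Return a list of problems; an empty list means nothing is missing.

    labels: dict of the nodepool's Kubernetes labels
    taints: iterable of taint strings in key=value:effect form
    custom_label: (key, value) tuple for the organization's own GPU label
    """
    problems = []
    if REQUIRED_TAINT not in taints:
        problems.append(f"missing taint {REQUIRED_TAINT}")
    for key, value in (REQUIRED_LABEL, custom_label):
        if labels.get(key) != value:
            problems.append(f"missing label {key}: {value}")
    return problems


# Hypothetical nodepool description used only for illustration.
nodepool_labels = {"wallaroo.ai/node-purpose": "pipelines", "wallaroo/gpu": "true"}
nodepool_taints = ["wallaroo.ai/pipelines=true:NoSchedule"]

print(missing_gpu_nodepool_settings(nodepool_labels, nodepool_taints, ("wallaroo/gpu", "true")))
```

An empty list indicates that the taint, the node-purpose label, and the custom label are all present on the described nodepool.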
Model Deployment for GPU Support
ML model deployments in Wallaroo via the Wallaroo SDK use the following steps:
Create Deployment Configuration
Setting a deployment configuration follows this process:
- Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
- Once the configuration options are set, the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
- The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.
Create GPU Deployment Configuration
Deployment configurations with GPU support are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
GPUs can only be allocated in whole integer units from the GPU enabled nodepools. gpus(1), gpus(2), etc. are valid values, while gpus(0.25) is not.
Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
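The two rules above (whole GPUs only, and no more than the cluster has free) can be expressed as a small pre-flight check. This is a sketch under stated assumptions: `check_gpu_request` is a hypothetical helper, not a Wallaroo SDK call, and the total and already-allocated GPU counts are assumed to be tracked by the organization. Wallaroo itself reports the actual error at deployment time.

```python
def check_gpu_request(requested, total_gpus, allocated_gpus):
    """Return the GPUs left after this request, or raise ValueError if it cannot fit."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int) or requested < 0:
        raise ValueError(f"GPUs must be a whole non-negative number, got {requested!r}")
    free = total_gpus - allocated_gpus
    if requested > free:
        raise ValueError(f"requested {requested} GPUs but only {free} are free")
    return free - requested


# Hypothetical cluster: 4 GPUs in the nodepool, 2 already used by other deployments.
print(check_gpu_request(1, total_gpus=4, allocated_gpus=2))
```

A request such as `check_gpu_request(0.25, 4, 2)` raises `ValueError`, mirroring the rule that `gpus(0.25)` is not a valid value.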
GPU Support
Wallaroo 2023.2.1 and above supports Kubernetes nodepools with NVIDIA CUDA GPUs.
See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.
IMPORTANT NOTE
If allocating GPUs to a Wallaroo pipeline, the deployment_label configuration option must be used. For example:
import wallaroo
# create the deployment configuration with 4 CPUs, 3 Gi RAM, and 1 GPU with the deployment label
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .cpus(4) \
    .memory('3Gi') \
    .gpus(1) \
    .deployment_label('doc-gpu-label:true') \
    .build()
Deployment configurations default to the following*.
| Runtime | CPUs | Memory | GPUs | 
|---|---|---|---|
| Wallaroo Native Runtime** | 4 | 3 Gi | 0 | 
| Wallaroo Containerized Runtime*** | 2 | 1 Gi | 0 | 
*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
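The effect of these defaults on a deployment's total request can be worked through with simple arithmetic. The sketch below is illustrative only: `total_request` is a hypothetical helper, and it assumes a single replica in which each containerized model receives its own default allocation.

```python
# Defaults from the table above, as (CPUs, memory in Gi).
NATIVE_DEFAULT = (4.0, 3.0)
CONTAINERIZED_DEFAULT = (2.0, 1.0)


def total_request(containerized_models, native=NATIVE_DEFAULT):
    """Sum CPUs and memory (Gi) requested by one deployment.

    The native runtime allocation is always counted, even when every
    model runs in the containerized runtime.
    """
    cpus = native[0] + containerized_models * CONTAINERIZED_DEFAULT[0]
    memory_gi = native[1] + containerized_models * CONTAINERIZED_DEFAULT[1]
    return cpus, memory_gi


# One containerized model with the native defaults left in place.
print(total_request(1))
# The same deployment with the native runtime reduced to 0.25 CPU and 1 Gi.
print(total_request(1, native=(0.25, 1.0)))
```

Comparing the two results shows why reducing the native runtime allocation is recommended when a pipeline contains only containerized models.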
The DeploymentConfigBuilder provides the following methods to specify what resources are allocated. Note that there are two separate runtimes based on the model:
- Wallaroo Native Runtime: ML models including ONNX and TensorFlow that are deployed natively to Wallaroo. Models deployed in this runtime share all resources allocated to the runtime.
- Wallaroo Containerized Runtime: ML models including Hugging Face, BYOP (Bring Your Own Predict), etc. These models are specified with the number of GPUs, amount of RAM, and other values.
For more details on ML model upload and packaging with Wallaroo, see the Model Upload guide.
The following represents the essential parameters for setting the deployment configuration with GPU enabled ML models. For full details on setting deployment configurations for ML model deployment, see the Deployment Configuration guide.
| Method | Parameters | Description | Runtime | 
|---|---|---|---|
| deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. | Native and Containerized |
| gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must also be called and match the GPU nodepool for the Wallaroo cluster hosting the Wallaroo instance. | Native |
| sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If sidekick_gpus is called, then deployment_label must also be called and match the GPU nodepool for the Wallaroo cluster hosting the Wallaroo instance. | Containerized |
| cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. | Native |
| memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”, for example 3Gi. See the Deployment Configuration guide for the accepted unit values. | Native |
| sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. | Containerized |
| sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. | Containerized |
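The “{size as number}{unit value}” format of memory_spec can be illustrated with a small parser. This is a sketch, not Wallaroo's own validation: `memory_spec_to_bytes` is a hypothetical helper, and it assumes only the Kubernetes-style binary suffixes Ki, Mi, Gi, and Ti, which cover the values such as '3Gi' used in this guide.

```python
import re

# Kubernetes-style binary suffixes. Assumption: only these four are handled here;
# this is not the authoritative list of units accepted by Wallaroo.
_UNIT_BYTES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_SPEC = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)$")


def memory_spec_to_bytes(memory_spec):
    """Convert a string such as '3Gi' into a number of bytes."""
    match = _SPEC.match(memory_spec)
    if match is None:
        raise ValueError(f"unrecognized memory spec: {memory_spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNIT_BYTES[unit])


print(memory_spec_to_bytes("3Gi"))
```

A check like this can catch a malformed string, for example '3 GB', before it is passed to the memory or sidekick_memory methods.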
Once the configuration options are set, the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
Deploy ML Model with GPUs Example
The following demonstrates deploying an ML model with GPU support through Wallaroo.
This assumes that the model has already been uploaded. We retrieve it via the wallaroo.client.Client.get_model method, which retrieves the model based on its name.
import wallaroo
# create and save the client connection to Wallaroo
wl = wallaroo.Client()
# retrieve the model reference
gpu_model = wl.get_model('sample-gpu-model')
The model is added as a pipeline step to a Wallaroo pipeline. For details, see the Model Deploy guide.
# create the Wallaroo pipeline
pipeline = wl.build_pipeline('sample-gpu-pipeline')
# add the ML model as a pipeline step
pipeline.add_model_step(gpu_model)
Depending on the Wallaroo Runtime the model is deployed in, the following deployments could be used. Each will assign a GPU to the appropriate runtime.
All models deployed in the Wallaroo Native Runtime are assigned the same resources. The following example shows deploying the ML model saved to the gpu_model reference.
For this example, the deployment will assign:
- Wallaroo Native Runtime
  - 4 CPUs
  - 3 Gi RAM
  - 1 GPU
Note that the deployment_label must match the Kubernetes label assigned to the nodepool containing the GPU enabled VMs.
gpu_deployment_configuration = wallaroo.DeploymentConfigBuilder() \
                                .cpus(4) \
                                .memory('3Gi') \
                                .gpus(1) \
                                .deployment_label('doc-gpu-label:true') \
                                .build()
All models deployed in the Wallaroo Containerized Runtime are assigned specific resources. The following example shows creating a deployment configuration for the model referenced by the gpu_model variable.
For this example, the deployment will assign minimal resources to the Wallaroo Native Runtime, since no models will be deployed to that environment, and set the GPU for the ML model deployed in the Containerized Runtime environment.
- Wallaroo Native Runtime
  - 0.25 CPU
  - 1 Gi RAM
  - 0 GPUs
- Wallaroo Containerized Runtime
  - 1 GPU
  - 4 CPUs
  - 3 Gi RAM
Note that the deployment_label must match the Kubernetes label assigned to the nodepool containing the GPU enabled VMs.
gpu_deployment_configuration = wallaroo.DeploymentConfigBuilder() \
                                .cpus(0.25) \
                                .memory('1Gi') \
                                .gpus(0) \
                                .deployment_label('doc-gpu-label:true') \
                                .sidekick_gpus(gpu_model, 1) \
                                .sidekick_cpus(gpu_model, 4) \
                                .sidekick_memory(gpu_model, '3Gi') \
                                .build()
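The two configurations above differ only in which method receives the GPU count, so the choice can be wrapped in one function. The following is a hedged sketch: `gpu_deployment_config` is a hypothetical helper, and `RecordingBuilder` is a stand-in that only records calls so the example runs without a Wallaroo instance. The only builder methods used are those described in this guide (deployment_label, gpus, sidekick_gpus, build).

```python
def gpu_deployment_config(builder, label, gpus=1, containerized_model=None):
    """Apply GPU settings to a DeploymentConfigBuilder-like object and build it.

    With no containerized_model the GPUs go to the native runtime;
    otherwise they go to that model's containerized runtime.
    """
    if not label:
        raise ValueError("deployment_label is required whenever GPUs are allocated")
    builder = builder.deployment_label(label)
    if containerized_model is None:
        builder = builder.gpus(gpus)
    else:
        builder = builder.gpus(0).sidekick_gpus(containerized_model, gpus)
    return builder.build()


class RecordingBuilder:
    """Stand-in for illustration only: records each chained call and its arguments."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Called only for names not defined on the class, i.e. the chained setters.
        def record(*args):
            self.calls.append((name, args))
            return self
        return record

    def build(self):
        return list(self.calls)


print(gpu_deployment_config(RecordingBuilder(), "doc-gpu-label:true"))
print(gpu_deployment_config(RecordingBuilder(), "doc-gpu-label:true",
                            containerized_model="gpu_model placeholder"))
```

In real use, the builder argument would be wallaroo.DeploymentConfigBuilder() and containerized_model would be the model reference such as gpu_model. CPU and memory settings can be chained on the builder before it is passed in.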
With the deployment configuration finalized, we deploy the model with the wallaroo.pipeline.Pipeline.deploy(deployment_configuration) method, using the deployment configuration gpu_deployment_configuration. This allocates the specified resources from the deployment configuration for the ML model. Note that if the resources are not available, this request will return an error.
pipeline.deploy(deployment_config=gpu_deployment_configuration)
Once the deployment is complete, the ML model is ready for inference requests. For more details on submitting inference requests to deployed ML models, see Model Inference.