Create GPU Nodepools for Kubernetes Clusters

How to create GPU nodepools for Kubernetes clusters.

Wallaroo provides support for ML models that use GPUs. The following templates demonstrate how to create a nodepool in different cloud providers, then assign that nodepool to an existing cluster. These steps can be used in conjunction with the Wallaroo Enterprise Install Guides.

Note that deploying pipelines with GPU support is only available for Wallaroo Enterprise.

The following script creates an Azure AKS nodepool with an NVIDIA Tesla K80 GPU using the Standard_NC6 machine type and autoscales from 0-3 nodes. Each node has one GPU in this example, so the maximum .gpu() that can be requested by a pipeline step is 1.

For detailed steps on adding GPUs to a cluster, see the Microsoft Azure guide Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS).

Note that the labels are required as part of the Wallaroo pipeline deployment with GPU support.

RESOURCE_GROUP="YOUR RESOURCE GROUP"
CLUSTER_NAME="YOUR CLUSTER NAME"
GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"

az extension add --name aks-preview

az extension update --name aks-preview

az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"

az provider register -n Microsoft.ContainerService

az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$GPU_NODEPOOL_NAME" \
    --node-count 0 \
    --node-vm-size Standard_NC6 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 3 \
    --labels doc-gpu-label=true
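
Preview feature registration can take several minutes, and because the nodepool starts at zero nodes there may be no GPU node to inspect right after the script finishes. As a minimal sketch, assuming the variables from the script above are still set in the shell, the following helper functions query the feature registration state and the nodepool settings. The function names are illustrative and are not part of the Azure CLI.

```shell
# Illustrative helpers; the function names are hypothetical.
# Assumes RESOURCE_GROUP, CLUSTER_NAME, and GPU_NODEPOOL_NAME are set as above.

# Print the registration state of the GPU preview feature.
gpu_feature_state() {
    az feature show \
        --namespace "Microsoft.ContainerService" \
        --name "GPUDedicatedVHDPreview" \
        --query "properties.state" \
        --output tsv
}

# Print the VM size, labels, taints, and autoscaler bounds of the GPU nodepool.
show_gpu_nodepool() {
    az aks nodepool show \
        --resource-group "$RESOURCE_GROUP" \
        --cluster-name "$CLUSTER_NAME" \
        --name "$GPU_NODEPOOL_NAME" \
        --query "{vmSize:vmSize, labels:nodeLabels, taints:nodeTaints, minCount:minCount, maxCount:maxCount}" \
        --output json
}
```

Calling gpu_feature_state before az aks nodepool add shows whether registration has completed. Calling show_gpu_nodepool afterwards should list the doc-gpu-label label and the sku=gpu:NoSchedule taint set by the script.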

The following script creates a Google GKE nodepool that uses NVIDIA T4 GPUs and autoscales from 0-3 nodes. Each node has one GPU in this example, so the maximum .gpu() that can be requested by a pipeline step is 1.

Google GKE automatically adds the following taint to the created nodepool.

NO_SCHEDULE nvidia.com/gpu present

Note that the labels are required as part of the Wallaroo pipeline deployment with GPU support.

GCP_PROJECT="YOUR GCP PROJECT"
GCP_CLUSTER="YOUR CLUSTER NAME"
GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"
REGION="YOUR REGION"

gcloud beta container \
    --project "$GCP_PROJECT" \
    node-pools create "$GPU_NODEPOOL_NAME" \
    --cluster "$GCP_CLUSTER" \
    --region "$REGION" \
    --node-version "1.25.8-gke.500" \
    --machine-type "n1-standard-1" \
    --accelerator "type=nvidia-tesla-t4,count=1" \
    --image-type "COS_CONTAINERD" \
    --disk-type "pd-balanced" \
    --disk-size "100" \
    --node-labels doc-gpu-label=true \
    --metadata disable-legacy-endpoints=true \
    --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
    --num-nodes "3" \
    --enable-autoscaling \
    --min-nodes "0" \
    --max-nodes "3" \
    --location-policy "BALANCED" \
    --enable-autoupgrade \
    --enable-autorepair \
    --max-surge-upgrade 1 \
    --max-unavailable-upgrade 0
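
To confirm that the accelerator, label, and automatically added taint are in place, the nodepool can be inspected after creation. The following is a minimal sketch, assuming the variables from the script above are still set and that kubectl is already configured for the cluster. The function names are illustrative and are not part of gcloud or kubectl.

```shell
# Illustrative helpers; the function names are hypothetical.
# Assumes GCP_PROJECT, GCP_CLUSTER, GPU_NODEPOOL_NAME, and REGION are set as above.

# Show the accelerator, label, taint, and autoscaling settings of the nodepool.
describe_gpu_nodepool() {
    gcloud container node-pools describe "$GPU_NODEPOOL_NAME" \
        --project "$GCP_PROJECT" \
        --cluster "$GCP_CLUSTER" \
        --region "$REGION" \
        --format="yaml(config.accelerators,config.labels,config.taints,autoscaling)"
}

# List the labeled GPU nodes together with the keys of their taints.
# Assumes kubectl's current context points at the same cluster.
list_gpu_node_taints() {
    kubectl get nodes \
        --selector "doc-gpu-label=true" \
        --output "custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key"
}
```

If the nodepool was created as described, list_gpu_node_taints should show the nvidia.com/gpu taint key on each labeled node that is currently running.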

The following steps create an AWS EKS nodepool with GPU nodes.

  • Prerequisites: An existing AWS (Amazon Web Services) EKS (Elastic Kubernetes Service) cluster. See Wallaroo Enterprise Comprehensive Install Guide: Environment Setup Guides for a sample creation of an AWS EKS cluster for hosting a Wallaroo Enterprise instance.
  • eksctl: Command line tool for installing and updating EKS clusters.
  • Administrator access to the EKS cluster and the capability to run kubectl commands.
  1. Create the nodepool with the following configuration file. Note that the labels are required as part of the Wallaroo pipeline deployment with GPU support. The sample configuration file below uses the AWS instance type g5.2xlarge. Modify as required.

    eksctl create nodegroup --config-file=<path>
    

    Sample config file:

# aws-gpu-nodepool.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: YOUR CLUSTER NAME HERE # This must match the name of the existing cluster
  region: YOUR REGION HERE

managedNodeGroups:
- name: YOUR NODEPOOL NAME HERE
  instanceType: g5.2xlarge
  minSize: 1
  maxSize: 3
  labels:
    wallaroo.ai/gpu: "true"
    doc-gpu-label: "true"
  taints:
    - key: wallaroo.ai/engine
      value: "true"
      effect: NoSchedule
  tags:
    k8s.io/cluster-autoscaler/node-template/label/k8s.dask.org/node-purpose: engine
    k8s.io/cluster-autoscaler/node-template/taint/k8s.dask.org/dedicated: "true:NoSchedule"
  iam:
    withAddonPolicies:
      autoScaler: true
  containerRuntime: containerd
  amiFamily: AmazonLinux2
  availabilityZones:
    - INSERT YOUR ZONE HERE
  volumeSize: 100
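
The create step and a follow-up check can be sketched as below, assuming the sample configuration was saved as aws-gpu-nodepool.yaml (the name in its first comment) and that kubectl is already configured for the cluster. The function names and the EKS_CLUSTER and AWS_REGION variables are illustrative and do not come from the original instructions.

```shell
# Illustrative helpers; function names, EKS_CLUSTER, and AWS_REGION are hypothetical.

# Create the nodegroup from a config file (defaults to the sample file name).
create_gpu_nodegroup() {
    eksctl create nodegroup --config-file="${1:-aws-gpu-nodepool.yaml}"
}

# List the nodegroups of the cluster to confirm the new one exists.
list_nodegroups() {
    eksctl get nodegroup --cluster "$EKS_CLUSTER" --region "$AWS_REGION"
}

# Show how many GPUs each labeled node advertises to Kubernetes.
# A value appears only once the NVIDIA device plugin is running on the node.
list_gpus_per_node() {
    kubectl get nodes \
        --selector "doc-gpu-label=true" \
        --output 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

The GPUS column reported by list_gpus_per_node is the per-node GPU count, which bounds how many GPUs a single pipeline step can request on that node.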