Large Language Models Infrastructure Requirements

The following details how to set up the Kubernetes infrastructure for Large Language Model (LLM) packaging and deployments.

For LLMs, the main infrastructure considerations are:

  • Ephemeral Storage: Ephemeral storage is the temporary node-local storage needed for the initial LLM upload and packaging for deployment. Typically this is assigned to the nodepool labeled general.
    • Recommended capacity: 250 Gi.
  • Instance Type with GPU and AI Accelerators: The nodes the LLMs are deployed to typically include GPUs or other AI accelerators to deliver inference performance in near real time.
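To illustrate where the ephemeral storage figure applies, the sketch below shows how an ephemeral-storage request appears in a standard Kubernetes container resource stanza. This is illustrative only: Wallaroo manages the equivalent values through its own deployment configuration, so this is not a stanza you add by hand.

```yaml
# Illustration only: a Kubernetes container spec requesting the
# recommended ephemeral storage. Wallaroo configures equivalent
# values internally.
resources:
  requests:
    ephemeral-storage: "250Gi"
  limits:
    ephemeral-storage: "250Gi"
```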

The following configuration options are based on the Wallaroo Infrastructure Configuration Guides, with modifications for the heavier resource requirements of LLMs.

For full details on setting up a Kubernetes cluster and installing Wallaroo, see the Wallaroo Install guides.

Nodepools Explained

The Kubernetes cluster hosting the Wallaroo instance typically has the following nodepools:

  • general: The general nodepool is where most Wallaroo services run, including:
    • Model packaging
    • Wallaroo dashboard services
  • Other nodepools: Other nodepools can be configured to run specific Wallaroo services. For deploying LLMs, the provided Cloud Configurations detail how to create nodepools optimized for LLM deployments.

Quantized LLM Nodepool Required Taints and Labels

Nodepools hosting LLM deployments require the following Kubernetes taints and labels.

  • Taint: wallaroo.ai/pipelines=true:NoSchedule
  • Label: wallaroo.ai/node-purpose: pipelines

The examples provided in Cloud Configurations include these taints as part of the configuration details.
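For context on the Kubernetes mechanics, a pod scheduled onto such a nodepool needs a matching toleration and node selector; the sketch below shows what those look like in a pod spec. This is illustrative only: Wallaroo applies the equivalent scheduling settings to pipeline pods, so these are not edited by hand.

```yaml
# Illustration only: the toleration and node selector a pod needs to
# land on a nodepool carrying the taint and label above.
tolerations:
  - key: wallaroo.ai/pipelines
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  wallaroo.ai/node-purpose: pipelines
```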

GPU Enabled LLM Nodepool Taints and Labels

Nodepools set up for LLM model deployment include the following taints and labels to ensure the models are deployed to the correct nodepool and provided with the best resources for their service requirements. For nodepools with GPUs, custom deployment labels are a required part of the model's deployment configuration.

  • Taint: wallaroo.ai/pipelines=true:NoSchedule
  • Labels:
    • wallaroo.ai/node-purpose: pipelines (Required)
    • {custom-label}: At least one custom label is required. For example: wallaroo/gpu:true

The examples provided in Cloud Configurations include sample labels as part of the configuration details.

Cloud Configurations

The following details how to configure a Wallaroo installation for different cloud platforms for LLM deployments. For the general nodepool, these are modifications to the Wallaroo Install guides, and are best performed before installing Wallaroo.

GPU-based nodepools can be added at any time, though it is recommended to add them during the initial install process. Note that GPU nodepools require two labels:

  • wallaroo.ai/node-purpose: pipelines (Required)
  • Any custom label; these are used when deploying LLMs to specify the target nodepool.

For details on how to set deployment configurations for model deployment, see the Wallaroo Deployment Configuration guide.
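Once a nodepool is created on any of the platforms below, its labels and taints can be checked with kubectl. A quick sanity check might look like the following (the node name is a placeholder):

```shell
# List nodes carrying the required Wallaroo pipelines label.
kubectl get nodes -l wallaroo.ai/node-purpose=pipelines

# Inspect the taints on one of those nodes (replace the placeholder name).
kubectl describe node YOUR-NODE-NAME-HERE | grep -A 1 Taints
```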

Amazon Web Services

For deployments of LLMs in Amazon Web Services (AWS), the following configuration is recommended. Modify it depending on your particular requirements.

  • Ephemeral Storage: 250 GB
  • Recommended Instance Type: p3.16xlarge

The following GPU nodepool and general nodepool configurations are for Amazon eksctl deployments.

AWS GPU Nodepool Sample

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: YOUR CLUSTER NAME HERE # This must match the name of the existing cluster
      region: YOUR REGION HERE
    
    managedNodeGroups:
    - name: YOUR NODEPOOL NAME HERE
      instanceType: p3.16xlarge
      minSize: 0
      maxSize: 1
      labels:
        wallaroo.ai/node-purpose: "pipelines" # required label
        wallaroo.ai/accelerator: "a100" # custom label - at least one custom label is required
      taints:
        - key: wallaroo.ai/pipelines
          value: "true"
          effect: NoSchedule
      tags:
        k8s.io/cluster-autoscaler/node-template/label/wallaroo.ai/node-purpose: pipelines
        k8s.io/cluster-autoscaler/node-template/taint/wallaroo.ai/pipelines: "true:NoSchedule"
      iam:
        withAddonPolicies:
          autoScaler: true
      containerRuntime: containerd
      amiFamily: AmazonLinux2
      availabilityZones:
        - INSERT YOUR ZONE HERE
      volumeSize: 100
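Assuming the configuration above is saved as cluster.yaml (a filename chosen here for illustration), the nodepool can be added to the existing cluster with eksctl:

```shell
# Create the GPU nodegroup from the config file; eksctl matches it to
# the existing cluster by the metadata.name field.
eksctl create nodegroup --config-file=cluster.yaml
```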

    AWS General Nodepool Sample

    - name: general
      instanceType: m5.2xlarge
      desiredCapacity: 3
      volumeSize: 250
      containerRuntime: containerd
      amiFamily: AmazonLinux2
      availabilityZones:
        - us-east-1a
      labels:
        wallaroo.ai/node-purpose: general
    

Microsoft Azure

For deployments of LLMs in Microsoft Azure, the following configuration is recommended. Modify it depending on your particular requirements.

  • Ephemeral Storage: 250 GB
  • Recommended Instance Type: Standard_NC24ads_A100_v4

The following GPU nodepool and mainpool configurations are based on using the Azure Command-Line Interface (CLI).

Azure GPU Nodepool Sample

    RESOURCE_GROUP="YOUR RESOURCE GROUP"
    
    CLUSTER_NAME="YOUR CLUSTER NAME"
    GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"
    
    az extension add --name aks-preview
    
    az extension update --name aks-preview
    
    az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
    
    az provider register -n Microsoft.ContainerService
    
    az aks nodepool add \
        --resource-group $RESOURCE_GROUP \
        --cluster-name $CLUSTER_NAME \
        --name $GPU_NODEPOOL_NAME \
        --node-count 0 \
        --node-vm-size Standard_NC24ads_A100_v4 \
        --node-taints wallaroo.ai/pipelines=true:NoSchedule \
        --aks-custom-headers UseGPUDedicatedVHD=true \
        --enable-cluster-autoscaler \
        --min-count 0 \
        --max-count 1 \
        --labels wallaroo.ai/node-purpose=pipelines {add custom label here} # node-purpose is a required label; custom label - at least one custom label is required
    

Azure Mainpool Nodepool Sample

    az aks create \
    --resource-group $WALLAROO_RESOURCE_GROUP \
    --name $WALLAROO_CLUSTER \
    --node-count 3 \
    --generate-ssh-keys \
    --vm-set-type VirtualMachineScaleSets \
    --load-balancer-sku standard \
    --node-vm-size $WALLAROO_VM_SIZE \
    --node-osdisk-size 250 \
    --nodepool-name general \
    --attach-acr $WALLAROO_CONTAINER_REGISTRY \
    --kubernetes-version=1.30 \
    --zones 1 \
    --location $WALLAROO_GROUP_LOCATION \
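After the nodepools exist, their registered labels can be confirmed through the Azure CLI, for example (using the same variables as the GPU nodepool sample above):

```shell
# Show each nodepool's name and Kubernetes node labels for the cluster.
az aks nodepool list \
    --resource-group $RESOURCE_GROUP \
    --cluster-name $CLUSTER_NAME \
    --query "[].{name:name, labels:nodeLabels}" \
    --output table
```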
    --nodepool-labels wallaroo.ai/node-purpose=general

Google Cloud Platform

The following GPU nodepool and mainpool configurations are based on using the Google Cloud gcloud Command Line Interface (CLI).

GCP GPU Nodepool Sample

Before setting up the GCP GPU nodepool, install the NVIDIA drivers on the Kubernetes cluster with the following command.

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
    
    GCP_PROJECT="YOUR GCP PROJECT"
    GCP_CLUSTER="YOUR CLUSTER NAME"
    GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"
    REGION="YOUR REGION"
    
    gcloud beta container \
        --project $GCP_PROJECT \
        node-pools create $GPU_NODEPOOL_NAME \
        --cluster $GCP_CLUSTER \
        --region $REGION \
        --node-version "1.25.8-gke.500" \
        --machine-type "a2-ultragpu-1g" \
        --accelerator "type=nvidia-tesla-a100,count=1" \
        --image-type "COS_CONTAINERD" \
        --disk-type "pd-balanced" \
        --disk-size "100" \
        --node-labels "wallaroo.ai/node-purpose=pipelines,{add custom label here}" \
        --node-taints=wallaroo.ai/pipelines=true:NoSchedule \
        --metadata disable-legacy-endpoints=true \
        --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
        --num-nodes "1" \
        --enable-autoscaling \
        --min-nodes "0" \
        --max-nodes "1" \
        --location-policy "BALANCED" \
        --enable-autoupgrade \
        --enable-autorepair \
        --max-surge-upgrade 1 \
        --max-unavailable-upgrade 0
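After the nodepool scales up a node, you can confirm the driver DaemonSet applied earlier is running; the DaemonSet name below assumes the manifest's defaults, which install it into kube-system.

```shell
# Check the NVIDIA driver installer DaemonSet (name per the manifest's
# defaults).
kubectl get daemonset nvidia-driver-installer -n kube-system
```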
    

GCP Mainpool Nodepool Sample

    gcloud container clusters \
    create $WALLAROO_CLUSTER \
    --region $WALLAROO_GCP_REGION \
    --node-locations $WALLAROO_NODE_LOCATION \
    --machine-type $DEFAULT_VM_SIZE \
    --disk-size 250 \
    --network $WALLAROO_GCP_NETWORK_NAME \
    --create-subnetwork name=$WALLAROO_GCP_SUBNETWORK_NAME \
    --enable-ip-alias \
    --node-labels=wallaroo.ai/node-purpose=general \
    --cluster-version=1.23
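As with the other platforms, the created nodepools can be listed afterward to confirm the setup, for example:

```shell
# List nodepools for the cluster to confirm the general pool exists.
gcloud container node-pools list \
    --cluster $WALLAROO_CLUSTER \
    --region $WALLAROO_GCP_REGION
```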