Large Language Models Infrastructure Requirements


The following details how to set up the Kubernetes infrastructure for Large Language Model (LLM) packaging and deployments.

For access to these sample models and a demonstration on using LLMs with Wallaroo:

For LLMs, the main infrastructure considerations are:

  • Ephemeral Storage: Ephemeral storage is the node-local disk space needed for the initial LLM upload and packaging for deployment. Typically this is assigned to the nodepool labeled general.
    • Recommended size: 250 Gi.
  • Instance Type with GPU and AI Accelerators: The nodes the LLMs are deployed to typically require GPUs or other AI accelerators to deliver inference performance in near real time.
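
For orientation, Kubernetes exposes ephemeral storage as a per-container resource named ephemeral-storage. The fragment below is a minimal sketch of how a workload could reserve it on the general nodepool. It is not a manifest that Wallaroo generates; the pod name, container name, image, and the 200Gi figure are illustrative assumptions.

```yaml
# Illustrative fragment only; Wallaroo sets its own resource requests.
apiVersion: v1
kind: Pod
metadata:
  name: packaging-example                # hypothetical name
spec:
  nodeSelector:
    wallaroo.ai/node-purpose: general    # label used by the general nodepool
  containers:
    - name: packager                     # hypothetical container
      image: registry.example.com/packager:latest   # placeholder image
      resources:
        requests:
          ephemeral-storage: "200Gi"     # assumed value; must fit within the node's allocatable storage
        limits:
          ephemeral-storage: "200Gi"     # the pod is evicted if it writes more than this
```

A request larger than the node's allocatable ephemeral storage leaves the pod unschedulable, which is one reason to provision the general nodepool with a disk of the recommended size.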

The following configuration options are based on the Wallaroo Infrastructure Configuration Guides with modifications for the heavy requirements LLMs can bring.

For full details on setting up a Kubernetes cluster and installing Wallaroo, see the Wallaroo Install guides.

Nodepools Explained

The Kubernetes cluster hosting the Wallaroo instance typically has the following nodepools:

  • general: The general nodepool is where most Wallaroo services run, including:
    • Model packaging
    • Wallaroo dashboard services
  • Other nodepools: Other nodepools can be configured to run specific Wallaroo services. For deploying LLM models, the provided Cloud Configurations detail how to create specific nodepools optimized for LLM model deployments.

Quantized LLM Nodepools Required Taints and Tolerations

Nodepools hosting LLM deployments require the following Kubernetes taints and labels.

  • Taint: wallaroo.ai/pipelines=true:NoSchedule
  • Label: wallaroo.ai/node-purpose: pipelines

The examples provided in Cloud Configurations include these taints as part of the configuration details.
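
To show what the taint and label mean for scheduling, here is a generic Kubernetes pod fragment that would be admitted to such a nodepool. It is a sketch under stated assumptions: Wallaroo manages scheduling for its own deployments, and the pod name, container name, and image below are placeholders rather than anything Wallaroo produces.

```yaml
# Illustrative fragment only, not a Wallaroo-generated manifest.
apiVersion: v1
kind: Pod
metadata:
  name: llm-pipeline-example             # hypothetical name
spec:
  # Restrict scheduling to nodes carrying the required label.
  nodeSelector:
    wallaroo.ai/node-purpose: pipelines
  # Tolerate the taint; pods without this toleration are kept off the nodepool.
  tolerations:
    - key: wallaroo.ai/pipelines
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: example                      # hypothetical container
      image: registry.example.com/example:latest   # placeholder image
```

The taint keeps unrelated workloads off the nodepool, while the label lets pipeline workloads select it. Both are needed for the nodes to be reserved for LLM deployments.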

GPU Enabled LLM Nodepool Taints and Labels

Nodepools set up for LLM model deployment include the following taints and labels to ensure that models are deployed to the correct nodepool and receive the resources their service requirements call for. For nodepools with GPUs, custom deployment labels are a required part of the model’s deployment configuration.

  • Taint: wallaroo.ai/pipelines=true:NoSchedule
  • Labels:
    • wallaroo.ai/node-purpose: pipelines (Required)
    • {custom-label}: At least one custom label is required. For example: wallaroo/gpu:true

The examples provided in Cloud Configurations include sample labels as part of the configuration details.
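
As a rough check once a GPU nodepool exists, a node's labels and taints can be inspected, for example with kubectl get node <node-name> -o yaml. A correctly configured node would be expected to carry entries along the lines of this excerpt. The node name is hypothetical, and the custom label is the one used in the AWS sample in this guide.

```yaml
# Excerpt of a Node object as it might appear for a node in a GPU nodepool.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-example                 # hypothetical node name
  labels:
    wallaroo.ai/node-purpose: pipelines  # required label
    wallaroo.ai/accelerator: a100        # custom label, taken from the AWS sample
spec:
  taints:
    - key: wallaroo.ai/pipelines
      value: "true"
      effect: NoSchedule
```

If the custom label is missing, a deployment configuration that names it has no node to match.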

    Cloud Configurations

    The following details how to configure a Wallaroo installation for different cloud platforms for LLM deployments. For the general nodepool, these are modifications to the Wallaroo Install guides, and are best performed before installing Wallaroo.

    GPU-based nodepools can be added at any time, though adding them during the initial install process is recommended. Note that GPU nodepools require two labels:

    • wallaroo.ai/node-purpose: pipelines (Required)
    • At least one custom label; the LLM's deployment configuration uses it to specify which nodepool to deploy to.

    For details on how to set deployment configurations for model deployment, see the Wallaroo Deployment Configuration guide.

    Amazon Web Services

    For deployments of LLMs in Amazon Web Services (AWS), the following configuration is recommended. Modify depending on your particular requirements.

    • Ephemeral Storage: 250 GB
    • Recommended Instance Type: p3.16xlarge

    The following GPU nodepool and general configurations are for Amazon eksctl deployments.

    AWS GPU Nodepool Sample

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: YOUR CLUSTER NAME HERE # This must match the name of the existing cluster
      region: YOUR REGION HERE
    
    managedNodeGroups:
    - name: YOUR NODEPOOL NAME HERE
      instanceType: p3.16xlarge
      minSize: 0
      maxSize: 1
      labels:
        wallaroo.ai/node-purpose: "pipelines" # required label
        wallaroo.ai/accelerator: "a100" # custom label - at least one custom label is required
      taints:
        - key: wallaroo.ai/pipelines
          value: "true"
          effect: NoSchedule
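      tags:
        # Autoscaler node-template tags mirror the label and taint above so the pool can scale up from zero nodes.
        k8s.io/cluster-autoscaler/node-template/label/wallaroo.ai/node-purpose: pipelines
        k8s.io/cluster-autoscaler/node-template/taint/wallaroo.ai/pipelines: "true:NoSchedule"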
      iam:
        withAddonPolicies:
          autoScaler: true
      containerRuntime: containerd
      amiFamily: AmazonLinux2
      availabilityZones:
        - INSERT YOUR ZONE HERE
      volumeSize: 100

    AWS General Nodepool Sample

    - name: general
      instanceType: m5.2xlarge
      desiredCapacity: 3
      volumeSize: 250
      containerRuntime: containerd
      amiFamily: AmazonLinux2
      availabilityZones:
        - us-east-1a
      labels:
        wallaroo.ai/node-purpose: general
    

    Microsoft Azure

    For deployments of LLMs in Microsoft Azure, the following configuration is recommended. Modify depending on your particular requirements.

    • Ephemeral Storage: 250 GB
    • Recommended Instance Type: Standard_NC24ads_A100_v4

    The following GPU nodepool and mainpool configurations are based on using the Azure Command-Line Interface (CLI).

    Azure GPU Nodepool Sample

    RESOURCE_GROUP="YOUR RESOURCE GROUP"
    
    CLUSTER_NAME="YOUR CLUSTER NAME"
    GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"
    
    az extension add --name aks-preview
    
    az extension update --name aks-preview
    
    az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
    
    az provider register -n Microsoft.ContainerService
    
    az aks nodepool add \
        --resource-group $RESOURCE_GROUP \
        --cluster-name $CLUSTER_NAME \
        --name $GPU_NODEPOOL_NAME \
        --node-count 0 \
        --node-vm-size Standard_NC24ads_A100_v4 \
        --node-taints wallaroo.ai/pipelines=true:NoSchedule \
        --aks-custom-headers UseGPUDedicatedVHD=true \
        --enable-cluster-autoscaler \
        --min-count 0 \
        --max-count 1 \
        --labels wallaroo.ai/node-purpose=pipelines {add custom label here} # node-purpose is a required label; custom label - at least one custom label is required
    

    Azure Mainpool Nodepool Sample

    az aks create \
    --resource-group $WALLAROO_RESOURCE_GROUP \
    --name $WALLAROO_CLUSTER \
    --node-count 3 \
    --generate-ssh-keys \
    --vm-set-type VirtualMachineScaleSets \
    --load-balancer-sku standard \
    --node-vm-size $WALLAROO_VM_SIZE \
    --node-osdisk-size 250 \
    --nodepool-name general \
    --attach-acr $WALLAROO_CONTAINER_REGISTRY \
    --kubernetes-version=1.30 \
    --zones 1 \
    --location $WALLAROO_GROUP_LOCATION \
    --nodepool-labels wallaroo.ai/node-purpose=general

    Google Cloud Platform

    The following GPU nodepool and mainpool configurations are based on using the Google gcloud Command Line Interface (CLI).

    GCP GPU Nodepool Sample

    Before setting up the GCP GPU nodepool, install the NVIDIA drivers on the Kubernetes cluster with the following command.

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
    
    GCP_PROJECT="YOUR GCP PROJECT"
    GCP_CLUSTER="YOUR CLUSTER NAME"
    GPU_NODEPOOL_NAME="YOUR GPU NODEPOOL NAME"
    REGION="YOUR REGION"
    
    # node-purpose is a required label; at least one custom label is also required.
    # Node labels are passed as a comma-separated list.
    gcloud container \
        --project $GCP_PROJECT \
        node-pools create $GPU_NODEPOOL_NAME \
        --cluster $GCP_CLUSTER \
        --region $REGION \
        --node-version "1.25.8-gke.500" \
        --machine-type "a2-ultragpu-1g" \
        --accelerator "type=nvidia-tesla-a100,count=1,gpu-driver-version=default" \
        --image-type "COS_CONTAINERD" \
        --disk-type "pd-balanced" \
        --disk-size "100" \
        --node-labels wallaroo.ai/node-purpose=pipelines,{add custom label here} \
        --node-taints=wallaroo.ai/pipelines=true:NoSchedule \
        --metadata disable-legacy-endpoints=true \
        --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
        --num-nodes "1" \
        --enable-autoscaling \
        --min-nodes "0" \
        --max-nodes "1" \
        --location-policy "BALANCED" \
        --enable-autoupgrade \
        --enable-autorepair \
        --max-surge-upgrade 1 \
        --max-unavailable-upgrade 0
    

    GCP Mainpool Nodepool Sample

    gcloud container clusters \
    create $WALLAROO_CLUSTER \
    --region $WALLAROO_GCP_REGION \
    --node-locations $WALLAROO_NODE_LOCATION \
    --machine-type $DEFAULT_VM_SIZE \
    --disk-size 250 \
    --network $WALLAROO_GCP_NETWORK_NAME \
    --create-subnetwork name=$WALLAROO_GCP_SUBNETWORK_NAME \
    --enable-ip-alias \
    --node-labels=wallaroo.ai/node-purpose=general \
    --cluster-version=1.23