Deployment Configuration with the Wallaroo Dashboard

How to manage deployment configurations using the Wallaroo Dashboard

Deployment Configuration via the Wallaroo Dashboard

Pipeline deployment configurations are modified through the Wallaroo Dashboard Pipeline Details page. The following preconditions must be met before editing the deployment configuration through the user interface:

  • The pipeline was previously deployed through the Wallaroo SDK or the Wallaroo MLOps API.
  • The pipeline is currently undeployed.

Editing Deployment Configuration Steps

The following steps update a pipeline’s deployment configuration through the Wallaroo Dashboard.

  1. From the Wallaroo Dashboard, select the workspace the target pipeline is associated with.
  2. Select View Pipelines.
  3. Select the pipeline to update.
  4. From the Details page, verify that the pipeline is Undeployed - the Deploy/Undeploy button will display Deploy if the pipeline is currently undeployed.
  5. Scroll down to Deployment Configuration and select Edit.
  6. Edit each field as required. It is highly recommended to only edit existing settings when possible and make major modifications through the Wallaroo SDK or Wallaroo MLOps API.
  7. When finished, select Save and Deploy. The pipeline will be deployed as a new version with the new deployment configuration.

Edit Deployment Configuration Examples

Edit Native Runtime Deployment Configuration Example

The following demonstrates editing the deployment configuration for a Wallaroo Native Runtime deployment.

Edit Containerized Runtime Deployment Configuration Example

The following demonstrates editing the deployment configuration for a Wallaroo Containerized Runtime deployment.

Deployment Configuration Parameters

The following deployment configuration parameters are available for editing. Before starting, note the following conditions:

  • Deployment configurations are only available to previously deployed pipelines, whether they were deployed through the Wallaroo SDK or the Wallaroo MLOps API.
  • Deployment configurations are only editable through the Wallaroo Dashboard when the pipeline is undeployed.
  • Field values must match the type each parameter expects. For example: string values for labels, integer values for gpus, etc.

The tables later in this guide show the deployment configuration parameters for Wallaroo Native Runtimes and Wallaroo Containerized Runtimes.
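The type-matching condition above can be checked before pasting values into the Dashboard. The following is a minimal sketch, not part of Wallaroo: the function name and the EXPECTED_TYPES table are assumptions made for illustration, and the table only covers a few engine fields that appear in the sample configurations in this guide.

```python
# Illustrative type table (an assumption, not an official schema).
# It lists a few "engine" fields and the Python type each is expected to hold.
EXPECTED_TYPES = {
    "gpu": int,
    "replicas": int,
    "arch": str,
    "accel": str,
}


def find_type_mismatches(engine: dict) -> list[str]:
    """Return one message per engine field whose value has an unexpected type."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in engine:
            continue  # absent fields are not checked
        value = engine[key]
        if not isinstance(value, expected):
            problems.append(
                f"engine.{key}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    # "gpu" is given as a string here, so it is reported.
    print(find_type_mismatches({"gpu": "1", "replicas": 5, "arch": "x86"}))
```

An empty list means every checked field has the expected type; fields outside the table are ignored rather than rejected.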

Deployment configuration parameters fall under the following elements:

  • engine: These elements are specific to Wallaroo Native Runtimes.
  • engineAux: These elements are specific to Wallaroo Containerized Runtimes.

The following elements are not editable from the Wallaroo Dashboard Pipeline Details page:

  • workspace_id
  • engine_lb

The following examples show different deployment parameters based on the Runtime and configurations.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "4Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "4Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {}
  },
  "workspace_id": 9,
  "node_selector": {}
}
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {
      "clip-vit-2": {
        "arch": "x86",
        "accel": "none",
        "resources": {
          "limits": {
            "cpu": 2,
            "memory": "4Gi"
          },
          "requests": {
            "cpu": 2,
            "memory": "4Gi"
          }
        }
      }
    }
  },
  "workspace_id": 10,
  "node_selector": {}
}
  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
  "engine": {
    "cpu": 4,
    "arch": "x86",
    "accel": "none",
    "replicas": 5,
    "resources": {
      "limits": {
        "cpu": 4,
        "memory": "3Gi"
      },
      "requests": {
        "cpu": 4,
        "memory": "3Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {}
  },
  "workspace_id": 9,
  "node_selector": {}
}
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "replicas": 5,
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {
      "clip-vit-2": {
        "arch": "x86",
        "accel": "none",
        "resources": {
          "limits": {
            "cpu": 4,
            "memory": "3Gi"
          },
          "requests": {
            "cpu": 4,
            "memory": "3Gi"
          }
        }
      }
    }
  },
  "workspace_id": 10,
  "node_selector": {}
}
  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
  "engine": {
    "cpu": "0.5",
    "gpu": 1,
    "arch": "x86",
    "accel": "none",
    "replicas": 5,
    "resources": {
      "limits": {
        "cpu": "0.5",
        "nvidia.com/gpu":1,
        "memory": "2Gi"
      },
      "requests": {
        "cpu": "0.5",
        "nvidia.com/gpu":1,
        "memory": "2Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {}
  },
  "workspace_id": 10,
  "node_selector":"wallaroo.ai/accelerator: t4"
}
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "replicas": 5,
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {
      "llama-cpp-sdk-3": {
        "arch": "x86",
        "accel": "none",
        "resources": {
          "limits": {
            "cpu": 4,
            "nvidia.com/gpu":1,
            "memory": "10Gi"
          },
          "requests": {
            "cpu": 4,
            "nvidia.com/gpu":1,
            "memory": "10Gi"
          }
        }
      }
    }
  },
  "workspace_id": 10,
  "node_selector":"wallaroo.ai/accelerator: t4"
}
  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "autoscale":{
      "type":"cpu",
      "replica_max": 5,
      "replica_min": 0,
      "cpu_utilization": 75
    },
    "replicas": 2,
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {}
  },
  "workspace_id": 10,
  "node_selector": {}
}
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "autoscale":{
      "type":"cpu",
      "replica_max": 5,
      "replica_min": 0,
      "cpu_utilization": 75
    },
    "replicas": 2,
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {
      "clip-vit-2": {
        "arch": "x86",
        "accel": "none",
        "resources": {
          "limits": {
            "cpu": 2,
            "memory": "4Gi"
          },
          "requests": {
            "cpu": 2,
            "memory": "4Gi"
          }
        }
      }
    }
  },
  "workspace_id": 10,
  "node_selector": {}
}

When autoscaling with GPU, the recommended parameters are scale_up_queue_depth, scale_down_queue_depth and autoscaling_window. For more details, see Wallaroo Deployment via the Wallaroo SDK: Deployment Replicas and Autoscale.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
  "engine": {
    "cpu": 0.25,
    "gpu": 1,
    "arch": "x86",
    "accel": "none",
    "autoscale":{
      "type": "queue",
      "replica_max": 2,
      "replica_min": 0,
      "autoscaling_window": 60,
      "scale_up_queue_depth": 5,
      "scale_down_queue_depth": 1
    },
    "resources": {
      "limits": {
        "cpu": 0.25,
        "nvidia.com/gpu":1,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "nvidia.com/gpu":1,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {}
  },
  "workspace_id": 10,
  "node_selector":"wallaroo.ai/accelerator: t4"
}
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
  "engine": {
    "cpu": 0.25,
    "arch": "x86",
    "accel": "none",
    "autoscale":{
      "type": "queue",
      "replica_max": 2,
      "replica_min": 0,
      "autoscaling_window": 60,
      "scale_up_queue_depth": 5,
      "scale_down_queue_depth": 1
    },
    "resources": {
      "limits": {
        "cpu": 0.25,
        "memory": "1Gi"
      },
      "requests": {
        "cpu": 0.25,
        "memory": "1Gi"
      }
    }
  },
  "enginelb": {},
  "engineAux": {
    "images": {
      "clip-vit-2": {
        "arch": "x86",
        "accel": "none",
        "resources": {
          "limits": {
            "cpu": 2,
            "nvidia.com/gpu":1,
            "memory": "4Gi"
          },
          "requests": {
            "cpu": 2,
            "nvidia.com/gpu":1,
            "memory": "4Gi"
          }
        }
      }
    }
  },
  "workspace_id": 10,
  "node_selector":"wallaroo.ai/accelerator: t4"
}
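
Because these configurations are edited as JSON, a missing comma or brace is easy to introduce. The following is a minimal sketch, assuming the configuration is available as JSON text outside the Dashboard; the function name summarize_config is hypothetical. It parses the text with Python's standard json module and reports the settings that distinguish the examples above.

```python
import json


def summarize_config(text: str) -> dict:
    """Parse a deployment configuration and summarize its key settings.

    json.loads raises json.JSONDecodeError when the text is malformed,
    for example when a comma is missing between two fields.
    """
    config = json.loads(text)
    engine = config.get("engine", {})
    images = config.get("engineAux", {}).get("images", {})
    return {
        "engine_cpu": engine.get("cpu"),
        "replicas": engine.get("replicas"),
        "autoscale_type": engine.get("autoscale", {}).get("type"),
        # Models listed under engineAux.images run in the Containerized Runtime.
        "containerized_models": sorted(images),
    }


if __name__ == "__main__":
    sample = """
    {
      "engine": {"cpu": 0.25, "autoscale": {"type": "queue", "replica_max": 2}},
      "engineAux": {"images": {"clip-vit-2": {"arch": "x86"}}}
    }
    """
    print(summarize_config(sample))
```

An empty containerized_models list corresponds to the Native Runtime examples, where engineAux.images is empty.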

Deployment Replicas and Autoscale Parameters

The following parameters are available for controlling replicas and autoscaling options. Note that certain options are mutually exclusive: for example, engine.replicas is mutually exclusive with engine.autoscale.replica_max and engine.autoscale.replica_min. For more details, see Wallaroo Deployment via the Wallaroo SDK: Deployment Replicas and Autoscale.

Replica and autoscale settings apply to both Native and Containerized Runtimes.
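
The mutual-exclusion rule described above can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the function name replica_conflicts is hypothetical, and it covers only the one rule named in this section (engine.replicas versus engine.autoscale.replica_min and engine.autoscale.replica_max), not any other constraints the platform may enforce.

```python
def replica_conflicts(engine: dict) -> list[str]:
    """Report autoscale bounds that are set alongside a fixed replica count."""
    conflicts = []
    autoscale = engine.get("autoscale", {})
    if "replicas" in engine:
        for key in ("replica_min", "replica_max"):
            if key in autoscale:
                conflicts.append(
                    f"engine.replicas conflicts with engine.autoscale.{key}"
                )
    return conflicts


if __name__ == "__main__":
    engine = {
        "replicas": 2,
        "autoscale": {"type": "cpu", "replica_min": 0, "replica_max": 5},
    }
    for message in replica_conflicts(engine):
        print(message)
```

An empty result means the configuration uses either a fixed replica count or autoscale bounds, but not both.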

| Parameter | Type | Description | Related Parameters |
| --- | --- | --- | --- |
| engine.replicas | Integer | The number of replicas to deploy. Deploying multiple replicas of the same models increases inference throughput through parallelization. | None |
| engine.autoscale.type | String | The type of autoscaling. Defaults to cpu. Valid options are cpu and queue. | None |
| engine.autoscale.replica_max | Integer | The maximum number of replicas to scale up to. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | None |
| engine.autoscale.replica_min | Integer | The minimum number of replicas to scale down to. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | None |
| engine.autoscale.cpu_utilization | Float | Sets the average CPU percentage metric for when to load or unload another replica. | None |
| engine.autoscale.scale_up_queue_depth | Integer | The queue trigger for autoscaling additional replicas up. This requires the deployment configuration parameter replica_autoscale_min_max to be set. | None |
| engine.autoscale.scale_down_queue_depth | Integer (Default: 1) | The queue trigger for autoscaling replicas down. Only applies when scale_up_queue_depth is configured. | None |
| engine.autoscale.autoscaling_window | Integer (Default: 300, Minimum allowed: 60) | The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured. | None |
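
The queue-based autoscale constraints in the table (a minimum autoscaling_window of 60, and two settings that only apply when scale_up_queue_depth is configured) can be expressed as a small check. This is a sketch for illustration: the function name check_queue_autoscale is hypothetical, and it validates only the constraints stated in this table.

```python
def check_queue_autoscale(autoscale: dict) -> list[str]:
    """Check an engine.autoscale dictionary against the queue-related rules."""
    issues = []

    # autoscaling_window has a documented minimum of 60.
    window = autoscale.get("autoscaling_window")
    if window is not None and window < 60:
        issues.append("autoscaling_window is below the minimum of 60")

    # These two settings only take effect alongside scale_up_queue_depth.
    has_scale_up = "scale_up_queue_depth" in autoscale
    for key in ("scale_down_queue_depth", "autoscaling_window"):
        if key in autoscale and not has_scale_up:
            issues.append(f"{key} only applies when scale_up_queue_depth is set")

    return issues


if __name__ == "__main__":
    print(check_queue_autoscale({"autoscaling_window": 30, "scale_up_queue_depth": 5}))
```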

Native Runtime Configuration Parameters

The following parameters are available for Wallaroo Native Runtime deployments. Note that resources assigned to the Wallaroo Native Runtime are shared with all models that run in the Native Runtime.

Related Parameters must be edited together. For example, the engine.cpu setting must match the settings for engine.resources.limits.cpu and engine.resources.requests.cpu.
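
The requirement that related CPU settings agree can be verified before saving. The following is a minimal sketch, assuming the nesting shown in the sample configurations in this guide, where the CPU limits and requests sit under engine.resources; the function name cpu_mismatches is hypothetical.

```python
def cpu_mismatches(engine: dict) -> list[str]:
    """Report resource CPU values that differ from engine.cpu."""
    expected = engine.get("cpu")
    resources = engine.get("resources", {})
    mismatches = []
    for section in ("limits", "requests"):
        actual = resources.get(section, {}).get("cpu")
        if actual != expected:
            mismatches.append(
                f"engine.resources.{section}.cpu is {actual!r}, "
                f"engine.cpu is {expected!r}"
            )
    return mismatches


if __name__ == "__main__":
    engine = {
        "cpu": 4,
        "resources": {"limits": {"cpu": 4}, "requests": {"cpu": 2}},
    }
    for message in cpu_mismatches(engine):
        print(message)
```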

The following is a sample Native Runtime Configuration, followed by a table of the Native Runtime Configuration parameters.

Native Runtime Deployment Config Edit
| Parameter | Type | Description | Related Parameters |
| --- | --- | --- | --- |
| engine.cpu | Float | The fractional number of cpus assigned to the Wallaroo Native Runtime per replica. | engine.resources.limits.cpu, engine.resources.requests.cpu |
| engine.gpu | Bool | Whether to assign a GPU to the Wallaroo Native Runtime. For GPU configurations the default is NVIDIA when no acceleration is specified. For other GPU configurations please see Inference with Acceleration Libraries during model upload. | |
| engine.resources.limits.memory | String | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format; the sample configurations in this guide use the short form, such as 4Gi. | engine.resources.requests.memory |
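
To compare memory settings numerically, the memory string can be converted to bytes. This is a sketch under assumptions: the function name memory_to_bytes is hypothetical, the units are treated as binary multiples (1 Ki = 1024 bytes) as in Kubernetes, and both the short form used in the sample configurations (4Gi) and the long form listed in the table (4GiB) are accepted. It does not reflect how Wallaroo itself parses the value.

```python
import re

# Binary multipliers, following the Kubernetes convention (assumption).
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
# A number followed by Ki/Mi/Gi/Ti, with an optional trailing "B".
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)B?$")


def memory_to_bytes(spec: str) -> int:
    """Convert a memory string such as "4Gi" or "512MiB" to a byte count."""
    match = _PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized memory value: {spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNITS[unit])


if __name__ == "__main__":
    # Check that limits and requests describe the same amount of memory.
    print(memory_to_bytes("4Gi") == memory_to_bytes("4GiB"))
```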

Containerized Runtime Configuration Parameters

The following editable parameters are available for Wallaroo Containerized Runtime deployments. Note that resources assigned to the Wallaroo Containerized Runtime are specific to each model. For example, one model may have more cpus and memory assigned than another model, and those resources are exclusive to each model in the Containerized Runtime.

Related Parameters must be edited together. For Containerized Runtime settings, the resources are assigned to each model, so each setting is in the format engineAux.images.{model name}.parameter. For example, the number of cpus assigned to a model named sample-llm would be engineAux.images.sample-llm.resources.limits.cpu and engineAux.images.sample-llm.resources.requests.cpu.
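
Since the limits and requests values for a model must change together, it can help to update both in one step. The following is a minimal sketch, assuming the configuration has been loaded into a Python dictionary with the nesting shown in the sample configurations; the function name set_model_cpu is hypothetical, and sample-llm is the example model name used above.

```python
import copy


def set_model_cpu(config: dict, model_name: str, cpus: float) -> dict:
    """Return a copy of config with the model's CPU limits and requests set.

    Raises KeyError when the model is not listed under engineAux.images.
    """
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    resources = updated["engineAux"]["images"][model_name]["resources"]
    resources["limits"]["cpu"] = cpus
    resources["requests"]["cpu"] = cpus
    return updated


if __name__ == "__main__":
    config = {
        "engineAux": {
            "images": {
                "sample-llm": {
                    "resources": {
                        "limits": {"cpu": 2, "memory": "4Gi"},
                        "requests": {"cpu": 2, "memory": "4Gi"},
                    }
                }
            }
        }
    }
    updated = set_model_cpu(config, "sample-llm", 4)
    print(updated["engineAux"]["images"]["sample-llm"]["resources"])
```

Working on a copy keeps the original values available for comparison before the edited configuration is saved.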

The following is a sample Containerized Runtime Configuration, followed by a table of the Containerized Runtime Configuration parameters.

Containerized Runtime Deployment Config Edit
| Parameter | Type | Description | Related Parameters |
| --- | --- | --- | --- |
| engineAux.images.{model_name}.resources.limits.cpu | Float | The fractional number of cpus assigned to the model per replica. | engineAux.images.{model_name}.resources.requests.cpu |
| engineAux.images.{model_name}.resources.gpu | Bool | Whether to assign a GPU to the model. For GPU configurations the default is NVIDIA when no acceleration is specified. For other GPU configurations please see Inference with Acceleration Libraries during model upload. | |
| engineAux.images.{model_name}.resources.limits.memory | String | Sets the amount of RAM to allocate to the model. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format; the sample configurations in this guide use the short form, such as 4Gi. | engineAux.images.{model_name}.resources.requests.memory |

Troubleshooting

Uneditable Fields

The following fields cannot be edited through the Wallaroo Dashboard Pipeline Details page:

  • workspace_id
  • engine_lb

Retrieve Inference Metrics and Logs Via the Wallaroo Dashboard

Inference logs from Wallaroo deployments are available on the Pipeline Metrics page of the Wallaroo Dashboard. This page shows how deployment configurations affect inference performance.

Pipeline Metrics Overview

The Pipeline Metrics page contains the following elements.

Wallaroo Pipeline Metrics Overview
  • Pipeline Name and Identifier (A): The pipeline’s assigned name and unique identifier in UUID format.

  • Filter Edges (B): By default, all locations are displayed. Filter Edges provides a list of available edge deployments. Selecting one or more filters from the list limits the metrics and logs displayed to only the selected locations.

    Filter Edges
  • Status (C): The status of the pipeline. The status only applies to the pipeline’s status in the Wallaroo Ops instance. Options are:

    • Active: The pipeline is deployed.
    • Inactive: The pipeline is not deployed.
  • Tags (D): Any tags applied to the pipeline.

  • Inference Urls (E): The internal and external inference URLs for the deployed pipeline in the Wallaroo Ops instance.

  • Date Filter (F): Sets the date and time range of the inference requests included in the metrics.

  • Deploy/Undeploy the Pipeline (G): Deploy an inactive pipeline, or undeploy an active pipeline, in the Wallaroo Ops instance.

  • Requests per Second (H): The number of inference requests per second in the filtered date period. This chart data can be downloaded and shared with other users.

  • Cluster inference rate (I): The rate of inference requests completed in the filtered date period. This chart data can be downloaded and shared with other users.

  • Inference Latency (J): The latency between when an inference request is received versus when it is completed in the filtered date period. This chart data can be downloaded and shared with other users.

  • Engine Replicas (K): Tracks the number of replicas deployed for the Wallaroo Native Runtime. In the displayed example, the replica count started at 0, then went to 3 replicas when the pipeline was deployed. For details on deployment configurations, see Deployment Configuration with the Wallaroo SDK.

  • EngineAux Replicas (L): Tracks the number of replicas deployed for the Wallaroo Containerized Runtime. In the displayed example, the replica count started at 0, then went to 3 replicas when the pipeline was deployed. For details on deployment configurations, see Deployment Configuration with the Wallaroo SDK.

  • Activity (M): Comments left by users.

  • Audit Log (N): The inference audit logs. The logs are filtered by the Filter Edges settings and Date Filter.

    • If an inference result output is greater than 100k for any field, that field's results show NULL in the Wallaroo Dashboard:

      Filtered Log Entry

      Full inference results are always returned with either the Wallaroo SDK or the MLOps API. Large inference results are filtered only in the display.

  • Anomaly Log (O): The anomaly audit logs. These are included when anomaly detection validation rules are triggered. The logs are filtered by the Filter Edges settings and Date Filter.