Deployment Configuration with the Wallaroo Dashboard
Deployment Configuration via the Wallaroo Dashboard
Pipeline deployment configurations are modified through the Wallaroo Dashboard Pipeline Details page. The following preconditions must be met before editing the deployment configuration through the user interface:
- The pipeline was previously deployed through the Wallaroo SDK or the Wallaroo MLOps API.
- The pipeline is currently undeployed.
Editing Deployment Configuration Steps
The following steps update a pipeline’s deployment configuration through the Wallaroo Dashboard.
- From the Wallaroo Dashboard, select the workspace the target pipeline is associated with.
- Select View Pipelines.
- Select the pipeline to update.
- From the Details page, verify that the pipeline is Undeployed - the Deploy/Undeploy button will display Deploy if the pipeline is currently undeployed.
- Scroll down to Deployment Configuration and select Edit.
- Edit each field as required. It is highly recommended to only edit existing settings when possible and make major modifications through the Wallaroo SDK or Wallaroo MLOps API.
- When finished, select Save and Deploy. The pipeline will be deployed as a new version with the new deployment configuration.
Edit Configuration Deployment Examples
Edit Native Runtime Deployment Configuration Example
The following demonstrates editing the deployment configuration for a Wallaroo Native Runtime deployment.
Edit Containerized Runtime Deployment Configuration Example
The following demonstrates editing the deployment configuration for a Wallaroo Containerized Runtime deployment.
Deployment Configuration Parameters
The following deployment configuration parameters are available for editing. Before starting, note the following conditions:
- Deployment configurations are only available for previously deployed pipelines, whether they were deployed through the Wallaroo SDK or the Wallaroo MLOps API.
- Deployment configurations are only editable through the Wallaroo Dashboard when the pipeline is undeployed.
- Field and value types must match the types the deployment configuration expects. For example: string values for labels, integer values for gpus, etc. The following tables show the deployment configuration parameters for Wallaroo Native Runtimes and Wallaroo Containerized Runtimes.
Deployment configuration parameters fall under the following elements:
engine
: These elements are specific to Wallaroo Native Runtimes.
engineAux
: These elements are specific to Wallaroo Containerized Runtimes.
The following elements are not editable from the Wallaroo Dashboard Pipeline Details page:
workspace_id
engine_lb
The following examples show different deployment parameters based on the Runtime and configurations.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 0.25,
"memory": "4Gi"
},
"requests": {
"cpu": 0.25,
"memory": "4Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {}
},
"workspace_id": 9,
"node_selector": {}
}
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {
"clip-vit-2": {
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 2,
"memory": "4Gi"
},
"requests": {
"cpu": 2,
"memory": "4Gi"
}
}
}
}
},
"workspace_id": 10,
"node_selector": {}
}
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
"engine": {
"cpu": 4,
"arch": "x86",
"accel": "none",
"replicas": 5,
"resources": {
"limits": {
"cpu": 4,
"memory": "3Gi"
},
"requests": {
"cpu": 4,
"memory": "3Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {}
},
"workspace_id": 9,
"node_selector": {}
}
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"replicas": 5,
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {
"clip-vit-2": {
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 4,
"memory": "3Gi"
},
"requests": {
"cpu": 4,
"memory": "3Gi"
}
}
}
}
},
"workspace_id": 10,
"node_selector": {}
}
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
"engine": {
"cpu": "0.5",
"gpu": 1,
"arch": "x86",
"accel": "none",
"replicas": 5,
"resources": {
"limits": {
"cpu": "0.5",
"nvidia.com/gpu":1,
"memory": "2Gi"
},
"requests": {
"cpu": "0.5",
"nvidia.com/gpu":1,
"memory": "2Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {}
},
"workspace_id": 10,
"node_selector":"wallaroo.ai/accelerator: t4"
}
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"replicas": 5,
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {
"llama-cpp-sdk-3": {
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 4,
"nvidia.com/gpu":1,
"memory": "10Gi"
},
"requests": {
"cpu": 4,
"nvidia.com/gpu":1,
"memory": "10Gi"
}
}
}
}
},
"workspace_id": 10,
"node_selector":"wallaroo.ai/accelerator: t4"
}
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"autoscale":{
"type":"cpu",
"replica_max": 5,
"replica_min": 0,
"cpu_utilization": 75
},
"replicas": 2,
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {}
},
"workspace_id": 10,
"node_selector": {}
}
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"autoscale":{
"type":"cpu",
"replica_max": 5,
"replica_min": 0,
"cpu_utilization": 75
},
"replicas": 2,
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {
"clip-vit-2": {
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 2,
"memory": "4Gi"
},
"requests": {
"cpu": 2,
"memory": "4Gi"
}
}
}
}
},
"workspace_id": 10,
"node_selector": {}
}
When autoscaling with GPU, the recommended parameters are scale_up_queue_depth, scale_down_queue_depth and autoscaling_window. For more details, see Wallaroo Deployment via the Wallaroo SDK: Deployment Replicas and Autoscale.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
{
"engine": {
"cpu": 0.25,
"gpu": 1,
"arch": "x86",
"accel": "none",
"autoscale":{
"type": "queue",
"replica_max": 2,
"replica_min": 0,
"autoscaling_window": 60,
"scale_up_queue_depth": 5,
"scale_down_queue_depth": 1
},
"resources": {
"limits": {
"cpu": 0.25,
"nvidia.com/gpu":1,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"nvidia.com/gpu":1,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {}
},
"workspace_id": 10,
"node_selector":"wallaroo.ai/accelerator: t4"
}
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
{
"engine": {
"cpu": 0.25,
"arch": "x86",
"accel": "none",
"autoscale":{
"type": "queue",
"replica_max": 2,
"replica_min": 0,
"autoscaling_window": 60,
"scale_up_queue_depth": 5,
"scale_down_queue_depth": 1
},
"resources": {
"limits": {
"cpu": 0.25,
"memory": "1Gi"
},
"requests": {
"cpu": 0.25,
"memory": "1Gi"
}
}
},
"enginelb": {},
"engineAux": {
"images": {
"clip-vit-2": {
"arch": "x86",
"accel": "none",
"resources": {
"limits": {
"cpu": 2,
"nvidia.com/gpu":1,
"memory": "4Gi"
},
"requests": {
"cpu": 2,
"nvidia.com/gpu":1,
"memory": "4Gi"
}
}
}
}
},
"workspace_id": 10,
"node_selector":"wallaroo.ai/accelerator: t4"
}
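The condition that field and value types must match can be checked before a configuration is pasted into the Edit form. The following is a minimal sketch, not part of the Wallaroo SDK: the `EXPECTED_TYPES` map and the helper names `lookup` and `type_errors` are hypothetical, and the paths and types are assumptions taken from the sample configurations above rather than from an official schema.

```python
from typing import Any, Dict, List

# Hypothetical type map for illustration only. Paths and types follow the
# sample configurations in this guide, not an official schema.
EXPECTED_TYPES = {
    "engine.replicas": int,
    "engine.autoscale.type": str,
    "engine.autoscale.replica_min": int,
    "engine.autoscale.replica_max": int,
    "engine.resources.limits.memory": str,
    "engine.resources.requests.memory": str,
}

_MISSING = object()  # sentinel so that absent fields are skipped, not flagged


def lookup(config: Dict[str, Any], path: str) -> Any:
    """Return the value at a dotted path, or _MISSING if any key is absent."""
    node: Any = config
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def type_errors(config: Dict[str, Any]) -> List[str]:
    """List every present field whose value has an unexpected type."""
    errors: List[str] = []
    for path, expected in EXPECTED_TYPES.items():
        value = lookup(config, path)
        if value is _MISSING:
            continue
        # bool is a subclass of int in Python, so reject it for integer fields.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if is_bool_as_int or not isinstance(value, expected):
            errors.append(
                f"{path}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors
```

Calling `type_errors({"engine": {"replicas": "5"}})` reports the string where an integer is expected, while a configuration with no mismatches returns an empty list.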
Deployment Replicas and Autoscale Parameters
The following parameters are available for controlling replicas and autoscaling options. Note that certain options are mutually exclusive. For example, engine.replicas is mutually exclusive with engine.autoscale.replica_max and engine.autoscale.replica_min. For more details, see Wallaroo Deployment via the Wallaroo SDK: Deployment Replicas and Autoscale.
Replica and autoscale settings apply to both Native and Containerized Runtimes.
Parameters | Type | Description | Related Parameters |
---|---|---|---|
engine.replicas | Integer | The number of replicas to deploy. Deploying multiple replicas of the same models increases inference throughput through parallelization. | None |
engine.autoscale.type | String | The type of autoscaling. Defaults to cpu. Valid options include cpu and queue, the two values used in the sample configurations above. | None |
engine.autoscale.replica_max | Integer | The maximum number of replicas the deployment can scale up to. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | None |
engine.autoscale.replica_min | Integer | The minimum number of replicas the deployment scales down to. Replicas scale between replica_min and the maximum number of replicas as resource needs change. | None |
engine.autoscale.cpu_utilization | Float | Sets the average CPU percentage metric for when to load or unload another replica. | None |
engine.autoscale.scale_up_queue_depth | Integer | The queue trigger for autoscaling additional replicas up. This requires that the deployment configuration parameter replica_autoscale_min_max is set. | None |
engine.autoscale.scale_down_queue_depth | Integer (Default: 1) | Only applies when scale_up_queue_depth is configured. The queue trigger for autoscaling replicas down. | None |
engine.autoscale.autoscaling_window | Integer (Default: 300, Minimum allowed: 60) | The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured. | None |
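The constraints in the table above can be checked mechanically before saving an edited configuration. The following is a minimal sketch under stated assumptions: `autoscale_problems` is a hypothetical helper, not a Wallaroo function, and it assumes that the SDK parameter replica_autoscale_min_max corresponds to the replica_min and replica_max fields shown in the sample configurations.

```python
from typing import Any, Dict, List


def autoscale_problems(engine: Dict[str, Any]) -> List[str]:
    """Check the "engine" element against the replica and autoscale rules."""
    problems: List[str] = []
    autoscale = engine.get("autoscale", {})
    replica_min = autoscale.get("replica_min")
    replica_max = autoscale.get("replica_max")

    # engine.replicas is mutually exclusive with the autoscale replica bounds.
    if "replicas" in engine and (replica_min is not None or replica_max is not None):
        problems.append(
            "engine.replicas is mutually exclusive with autoscale replica_min/replica_max"
        )
    if replica_min is not None and replica_max is not None and replica_min > replica_max:
        problems.append("replica_min is greater than replica_max")

    if "scale_up_queue_depth" in autoscale:
        # Assumption: replica_autoscale_min_max maps to replica_min/replica_max.
        if replica_min is None or replica_max is None:
            problems.append("scale_up_queue_depth requires replica_min and replica_max")
        # Default 300, minimum allowed 60, per the table above.
        if autoscale.get("autoscaling_window", 300) < 60:
            problems.append("autoscaling_window must be at least 60")
    else:
        # These two fields only apply when scale_up_queue_depth is configured.
        for name in ("scale_down_queue_depth", "autoscaling_window"):
            if name in autoscale:
                problems.append(f"{name} only applies when scale_up_queue_depth is set")
    return problems
```

An empty list means none of these rules is violated. The function only inspects the structure of the configuration and does not contact a Wallaroo instance.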
Native Runtime Configuration Parameters
The following parameters are available for Wallaroo Native Runtime deployments. Note that resources assigned to the Wallaroo Native Runtime are shared with all models that run in the Native Runtime.
Related Parameters must be edited together. For example, engine.cpu settings must match the ones for engine.resources.limits.cpu and engine.resources.requests.cpu.
The following is a sample Native Runtime Configuration, followed by a table of the Native Runtime Configuration parameters.
Parameters | Type | Description | Related Parameters |
---|---|---|---|
engine.cpu | Float | The fractional number of cpus assigned to the Wallaroo Native Runtime per replica. | engine.resources.limits.cpu, engine.resources.requests.cpu |
engine.gpu | Bool | Whether to assign a GPU to the Wallaroo Native Runtime. For GPU configurations the default is NVIDIA when no acceleration is specified. For other GPU configurations please see Inference with Acceleration Libraries during model upload. | |
engine.resources.limits.memory | String | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”, for example 1Gi or 4Gi. | engine.resources.requests.memory |
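Because related parameters must be edited together, it can help to prepare the edited values with a small helper so that no field is missed. The following is a minimal sketch: `set_engine_resources` is a hypothetical name, and the function only rewrites a local copy of the configuration using the field layout from the Native Runtime samples above.

```python
import copy
from typing import Any, Dict, Optional, Union


def set_engine_resources(
    config: Dict[str, Any],
    cpus: Optional[Union[int, float]] = None,
    memory: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of config with the related engine fields updated together."""
    updated = copy.deepcopy(config)  # leave the caller's configuration untouched
    engine = updated.setdefault("engine", {})
    resources = engine.setdefault("resources", {})
    limits = resources.setdefault("limits", {})
    requests = resources.setdefault("requests", {})

    if cpus is not None:
        # engine.cpu, limits.cpu and requests.cpu must all hold the same value.
        engine["cpu"] = cpus
        limits["cpu"] = cpus
        requests["cpu"] = cpus
    if memory is not None:
        # limits.memory and requests.memory are related parameters.
        limits["memory"] = memory
        requests["memory"] = memory
    return updated
```

For example, `set_engine_resources(config, cpus=1, memory="2Gi")` returns a new dictionary with all five related fields updated, which can then be serialized with `json.dumps` and copied into the Edit form.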
Containerized Runtime Configuration Methods
The following editable parameters are available for Wallaroo Containerized Runtime deployments. Note that resources assigned to the Wallaroo Containerized Runtime are specific per model. For example, one model may have more cpus and memory assigned than another model, and those resources are exclusive to each model in the Containerized Runtime.
Related Parameters must be edited together. For Containerized Runtime settings, the resources are assigned to each model, so each setting is in the format engineAux.images.{model name}.parameter. For example, the number of cpus assigned to a model named sample-llm would be engineAux.images.sample-llm.resources.limits.cpu and engineAux.images.sample-llm.resources.requests.cpu.
The following is a sample Containerized Runtime Configuration, followed by a table of the Containerized Runtime Configuration parameters.
Parameters | Type | Description | Related Parameters |
---|---|---|---|
engineAux.images.{model_name}.resources.limits.cpu | Float | The fractional number of cpus assigned to the model per replica. | engineAux.images.{model_name}.resources.requests.cpu |
engineAux.images.{model_name}.resources.gpu | Bool | Whether to assign a GPU to the model. For GPU configurations the default is NVIDIA when no acceleration is specified. For other GPU configurations please see Inference with Acceleration Libraries during model upload. | |
engineAux.images.{model_name}.resources.limits.memory | String | Sets the amount of RAM to allocate to the model. The memory_spec string is in the format “{size as number}{unit value}”, for example 4Gi or 10Gi. | engineAux.images.{model_name}.resources.requests.memory |
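Since Containerized Runtime resources are set per model, an edit can easily update limits for one model while leaving its requests behind. The following is a minimal sketch that reports such cases. `mismatched_model_resources` is a hypothetical helper, and it assumes that each model's limits and requests are meant to hold the same values, as they do in the sample configurations above.

```python
from typing import Any, Dict, List


def mismatched_model_resources(config: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each model name to the resource keys whose limits and requests differ."""
    mismatches: Dict[str, List[str]] = {}
    images = config.get("engineAux", {}).get("images", {})
    for model_name, settings in images.items():
        resources = settings.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        # Compare every key that appears on either side, e.g. cpu, memory.
        differing = sorted(
            key
            for key in set(limits) | set(requests)
            if limits.get(key) != requests.get(key)
        )
        if differing:
            mismatches[model_name] = differing
    return mismatches
```

An empty dictionary means every model's limits and requests agree. A result such as `{"sample-llm": ["cpu"]}` points at the model and field to revisit before selecting Save and Deploy.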
Troubleshooting
Uneditable Fields
The following fields cannot be edited through the Wallaroo Dashboard Pipeline Details page:
workspace_id
engine_lb
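When an edited configuration is prepared outside the Dashboard, a quick comparison against the original can confirm that none of the uneditable fields changed. The following is a minimal sketch: `changed_uneditable_fields` is a hypothetical helper, and because the sample configurations spell the load balancer key as enginelb while this list uses engine_lb, the sketch checks both spellings as an assumption.

```python
from typing import Any, Dict, List

# Both spellings are checked: the prose uses engine_lb and the samples use
# enginelb. Treating them as the same field is an assumption.
UNEDITABLE_FIELDS = ("workspace_id", "engine_lb", "enginelb")


def changed_uneditable_fields(
    original: Dict[str, Any], edited: Dict[str, Any]
) -> List[str]:
    """Return the uneditable top-level fields whose values differ."""
    return [
        field
        for field in UNEDITABLE_FIELDS
        if original.get(field) != edited.get(field)
    ]
```

An empty list means the edit left the uneditable fields as they were.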
Retrieve Inference Metrics and Logs Via the Wallaroo Dashboard
Inference logs from Wallaroo deployments are available in the Wallaroo Dashboard on the Pipeline Metrics page. This page shows how deployment configurations affect inference performance.
Pipeline Metrics Overview
The Pipeline Metrics page contains the following elements.
Pipeline Name and Identifier (A): The pipeline’s assigned name and unique identifier in UUID format.
Filter Edges (B): By default, all locations are displayed. Filter Edges provides a list of available edge deployments. Selecting one or more filters from the list limits the displayed metrics and logs to the selected locations.
Status (C): The status of the pipeline. The status only applies to the pipeline’s status in the Wallaroo Ops instance. Options are:
- Active: The pipeline is deployed.
- Inactive: The pipeline is not deployed.
Tags (D): Any tags applied to the pipeline.
Inference Urls (E): The internal and external inference URLs for the deployed pipeline in the Wallaroo Ops instance.
Date Filter (F): Filter date and time to specify the period of time for inference requests to collect for the metrics.
Deploy/Undeploy the Pipeline (G): Deploy an inactive pipeline, or undeploy an active pipeline in the Wallaroo Ops instance.
Requests per Second (H): The number of inference requests per second in the filtered date period. This chart data can be downloaded and shared with other users.
Cluster inference rate (I): The rate of inference requests completed in the filtered date period. This chart data can be downloaded and shared with other users.
Inference Latency (J): The latency between when an inference request is received versus when it is completed in the filtered date period. This chart data can be downloaded and shared with other users.
Engine Replicas (K): Tracks the number of replicas deployed for the Wallaroo Native Runtime. In the displayed example, the replica count started at 0, then went to 3 replicas when the pipeline was deployed. For details on deployment configurations, see Deployment Configuration with the Wallaroo SDK.
EngineAux Replicas (L): Tracks the number of replicas deployed for the Wallaroo Containerized Runtime. In the displayed example, the replica count started at 0, then went to 3 replicas when the pipeline was deployed. For details on deployment configurations, see Deployment Configuration with the Wallaroo SDK.
Activity (M): Comments left by users.
Audit Log (N): The inference audit logs. The logs are filtered by the Filter Edges settings and Date Filter.
If any field of an inference result output is greater than 100k, that field shows NULL in the Wallaroo Dashboard. Full inference results are always returned with either the Wallaroo SDK or the MLOps API. Large inference results are filtered only in the display.
Anomaly Log (O): The anomaly audit logs. These are included when anomaly detection validation rules are triggered. The logs are filtered by the Filter Edges settings and Date Filter.