Deployment Configuration with the Wallaroo SDK

How to manage deployment configurations using the Wallaroo SDK

Deployment Configuration Introduction

Deployment configurations tailor a model deployment to an organization’s and model’s requirements. A deployment may require more memory, CPU cores, or GPUs to run all of its steps efficiently. Deployment configurations also allow multiple replicas of a model in a deployment to provide scalability.

Deployment Resource Configurations

Deployment configurations deal with two major components:

Native Runtimes: Models that are deployed “as is” with the Wallaroo engine (ONNX, etc).
Containerized Runtimes: Models that are packaged into a container then deployed as a container with the Wallaroo engine (MLFlow, etc).

These configurations can be mixed: native runtimes and containerized runtimes can be deployed together, with resources allocated to each runtime in different configurations.

GPU and CPU Allocation

CPUs are allocated in fractions of total CPU power similar to the Kubernetes CPU definitions. cpus(0.25), cpus(1.0), etc are valid values.

GPUs can only be allocated in whole integer units from the GPU-enabled nodepools. gpus(1), gpus(2), etc are valid values, while gpus(0.25) is not.

Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
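The allocation rules above can be checked before a deployment configuration is built. The following is a minimal sketch in plain Python. `check_cpus` and `check_gpus` are hypothetical helpers, not part of the Wallaroo SDK, and the number of free GPUs is assumed to be known by the caller. Rejecting a CPU value of 0 or less is also an assumption of this sketch.

```python
from numbers import Real


def check_cpus(core_count):
    """Accept whole or fractional CPU values such as 0.25 or 1.0."""
    if isinstance(core_count, bool) or not isinstance(core_count, Real):
        raise TypeError("cpus must be a number")
    if core_count <= 0:
        # Assumption for this sketch: a deployment needs some CPU.
        raise ValueError("cpus must be greater than 0")
    return float(core_count)


def check_gpus(gpu_count, gpus_free):
    """Accept only whole GPU units that the cluster can still provide."""
    if isinstance(gpu_count, bool) or not isinstance(gpu_count, int):
        raise TypeError("gpus must be a whole number, for example 1 or 2")
    if gpu_count < 0:
        raise ValueError("gpus cannot be negative")
    if gpu_count > gpus_free:
        raise ValueError(
            f"requested {gpu_count} GPUs but only {gpus_free} are free"
        )
    return gpu_count
```

Running a check like this before `build()` surfaces a fractional or oversized GPU request locally instead of as a failed deployment.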

GPU Support

Wallaroo 2023.2.1 and above support Kubernetes nodepools with NVIDIA CUDA GPUs.

See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.

IMPORTANT NOTE

If allocating GPUs to a Wallaroo pipeline, the deployment_label configuration option must be used. For example:

import wallaroo

# create the deployment configuration with 4 cpus, 3 Gi RAM, and 1 GPU with the deployment label
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .build()
)

Architecture Support

Wallaroo supports x86 and ARM architecture CPUs. For example, Azure supports Ampere® Altra® Arm-based processor included with the following virtual machines:

Model Deployment Architecture Inheritance

Deployment configurations inherit the model’s architecture setting. This is set during model upload by specifying the arch parameter. Models uploaded to Wallaroo default to the x86 architecture.

The following model operations inherit the model’s architecture setting.

Model Deployment: The model deployment and its deployment configuration inherit the model’s architecture. No specification of the architecture is required for model deployment.
Pipeline Publishing: The Wallaroo engine set when a pipeline is containerized and published to an Open Container Initiative (OCI) Registry inherits the model’s architecture setting.

The following example shows uploading a model with the architecture set to ARM, and how the deployment inherits that architecture without additional deployment configuration changes. For this example, an ONNX model is uploaded.

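import wallaroo
from wallaroo.engine_config import Architecture
from wallaroo.framework import Framework

# connect to the Wallaroo instance
wl = wallaroo.Client()

# model_name_arm and model_file_name are defined earlier in the session
housing_model_control_arm = (
    wl.upload_model(
        model_name_arm,
        model_file_name,
        framework=Framework.ONNX,
        arch=Architecture.ARM,
    )
    .configure(tensor_fields=["tensor"])
)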

display(housing_model_control_arm)

Name	house-price-estimator-arm
Version	163ff0a9-0f1a-4229-bbf2-a19e4385f10f
File Name	rf_model.onnx
SHA	e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6
Status	ready
Image Path	None
Architecture	arm
Acceleration	None
Updated At	2024-04-Mar 20:34:00

Note that no architecture is specified in the deployment configuration settings. When pipeline_arm is displayed, the arch setting shows that the deployment inherited the model’s arch setting.

pipeline_arm = wl.build_pipeline(arm_pipeline_name)

# set the model step with the ARM targeted model
pipeline_arm.add_model_step(housing_model_control_arm)

# minimum deployment config for this model
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(1).memory("1Gi").build()

pipeline_arm.deploy(deployment_config = deploy_config)

    Waiting for deployment - this will take up to 45s .......... ok

display(pipeline_arm)

name	architecture-demonstration-arm
created	2024-03-04 20:34:08.895396+00:00
last_updated	2024-03-04 21:52:01.894671+00:00
deployed	True
arch	arm
accel	None
tags
versions	55d834b4-92c8-4a93-b78b-6a224f17f9c1, 98821b85-401a-4ab5-af8e-1b3126727069, 74571863-9eb0-47aa-8b5a-3bdaa7aa9f03, b72fb0db-e4b4-4936-a7cb-3d0fb7827a6f, 3ae70818-10f3-4f61-a998-dee5e2f00daf
steps	house-price-estimator-arm
published	True

Deployment Configuration Defaults

Deployment configurations default to the following*.

Runtime	CPUs	Memory	GPUs
Wallaroo Native Runtime**	4	3 Gi	0
Wallaroo Containerized Runtime***	2	1 Gi	0

*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use only Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
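To make the footnotes concrete, the following sketch adds up the default resources for one replica. `default_footprint` is a hypothetical helper, not an SDK call. Counting the containerized default once per containerized model is an assumption of this sketch, since the table does not state it.

```python
# Defaults taken from the table above (CPUs, memory in Gi, GPUs).
NATIVE_DEFAULT = {"cpus": 4, "memory_gi": 3, "gpus": 0}
CONTAINERIZED_DEFAULT = {"cpus": 2, "memory_gi": 1, "gpus": 0}


def default_footprint(containerized_models=0):
    """Total default resources for one replica.

    The native engine is always counted, even with no native models.
    Assumption: the containerized default applies once per containerized model.
    """
    total = dict(NATIVE_DEFAULT)
    for key, value in CONTAINERIZED_DEFAULT.items():
        total[key] += value * containerized_models
    return total


print(default_footprint(0))  # native engine only
print(default_footprint(1))  # native engine plus one containerized model
```

Under these assumptions a deployment with a single containerized model and no native models still reserves 6 CPUs and 4 Gi by default, which is why lowering `cpus` and `memory` for the native engine is recommended in that case.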

Deployment Configurations via the Wallaroo SDK

The following details how to set the deployment configuration via the Wallaroo SDK.

The following resource configurations are available through the wallaroo.deployment_config object.

These settings can be edited in the Wallaroo Dashboard after the initial deployment. For more details, see Deployment Configuration via the Wallaroo Dashboard.

Create Deployment Configuration

Setting a deployment configuration follows this process:

Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
Once the configuration options are set, the deployment configuration is generated with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.

The following example shows a model deployment configuration that allocates 1 replica, 1 cpu, and 2Gi of memory to the deployment. We start by:

Importing the DeploymentConfigBuilder class
Setting the deployment configuration settings
Building the deployment configuration and saving it to a variable
Applying that deployment configuration when we deploy the pipeline

from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = (
    DeploymentConfigBuilder()
    .replica_count(1)
    .cpus(1)
    .memory("2Gi")
    .build()
)

pipeline.deploy(deployment_config=deployment_config)

Deployment resources can be configured with autoscaling. Autoscaling allows the user to define how many engines a deployment starts with, the minimum number of engines a deployment uses, and the maximum number of engines a deployment can scale to. The deployment scales up and down based on the average CPU utilization across the engines in a given deployment as the user’s workload increases and decreases.
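As a rough illustration of scaling on average CPU utilization, the sketch below uses the replica formula from the Kubernetes Horizontal Pod Autoscaler, `ceil(current * current_metric / target_metric)`, clamped to the configured minimum and maximum. Whether Wallaroo applies exactly this formula is an assumption. `desired_replicas` is a hypothetical helper, and the autoscaler's tolerance band is ignored for brevity.

```python
import math


def desired_replicas(current, avg_cpu_percent, target_percent, minimum, maximum):
    """Kubernetes HPA-style estimate, clamped to the autoscale range."""
    raw = math.ceil(current * avg_cpu_percent / target_percent)
    return max(minimum, min(maximum, raw))


# 2 engines averaging 90% CPU against a 75% target, autoscale range 0 to 5
print(desired_replicas(2, 90, 75, 0, 5))  # 3
```

The clamp is what the minimum and maximum engine settings express: the metric can ask for more or fewer engines, but the deployment never leaves the configured range.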

Native Runtime Configuration Methods

Method	Parameters	Description	Enterprise Only Feature
`cpus`	`(core_count: float)`	Sets the number or fraction of CPUs to use for the deployment, for example: `0.25`, `1`, `1.5`, etc. The units are similar to the Kubernetes CPU definitions.
`gpus`	`(core_count: int)`	Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If `gpus` is called, then `deployment_label` must also be called and must match the GPU nodepool for the cluster hosting the Wallaroo instance.	√
`memory`	`(memory_spec: str)`	Sets the amount of RAM to allocate to the deployment. The `memory_spec` string is in the format “{size as number}{unit value}”. The accepted unit values are: KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format.
`deployment_label`	`(label: string)`	Label used to match the nodepool label used for the deployment. Required if `gpus` are set and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo.	√
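The `memory_spec` format described above, a size followed by a unit, can be illustrated with a small parser. This is a sketch only: `memory_spec_to_bytes` is a hypothetical helper, not an SDK function. It accepts both the `Gi` form used in this guide's examples and the `GiB` form listed in the table, which is an assumption about what the SDK accepts.

```python
import re

# Binary multipliers. Accepting both "Gi" and "GiB" spellings is an
# assumption of this sketch, not confirmed SDK behavior.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)B?$")


def memory_spec_to_bytes(memory_spec):
    """Convert a string such as '3Gi' into a number of bytes."""
    match = _PATTERN.match(memory_spec.strip())
    if match is None:
        raise ValueError(f"unrecognized memory spec: {memory_spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNITS[unit])


print(memory_spec_to_bytes("3Gi"))
```

A helper like this is useful for comparing requested memory across deployment configurations in a single unit.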

Containerized Runtime Configuration Methods

Method	Parameters	Description	Enterprise Only Feature
`sidekick_cpus`	`(model: wallaroo.model.Model, core_count: float)`	Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows: `model` (Model): The sidekick model to configure. `core_count` (float): Number of CPU cores to use in this sidekick.
`sidekick_memory`	`(model: wallaroo.model.Model, memory_spec: str)`	Sets the memory available for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows: `model` (Model): The sidekick model to configure. `memory_spec` (str): The amount of memory to allocate, as memory unit values. The accepted unit values are: KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format.
`sidekick_env`	`(model: wallaroo.model.Model, environment: Dict[str, str])`	Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance.
`sidekick_gpus`	`(model: wallaroo.model.Model, core_count: int)`	Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If `sidekick_gpus` is called, then `deployment_label` must also be called and must match the GPU nodepool for the cluster hosting the Wallaroo instance.	√

Deployment Replicas and Autoscale

Wallaroo supports deployment replicas and autoscaling for Wallaroo Enterprise edition. The following parameters are used for different autoscaling options.

Method	Parameters	Description
`replica_count`	`(count: int)`	The number of replicas to deploy. Multiple replicas of the same models increase inference throughput through parallelization.
`replica_autoscale_min_max`	`(maximum: int, minimum: int = 0)`	Allows replicas to be scaled between the minimum (default 0) and the maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs.
`autoscale_cpu_utilization`	`(cpu_utilization_percentage: int)`	Sets the average CPU percentage metric for when to load or unload another replica.
`scale_up_queue_depth`	`(queue_depth: int)`	The queue trigger for autoscaling additional replicas up. This requires that the deployment configuration parameter `replica_autoscale_min_max` is set. `scale_up_queue_depth` is determined by the formula `(number of requests in the queue + requests being processed) / (The number of available replicas set over the autoscaling_window)`. This field overrides the deployment configuration parameter `autoscale_cpu_utilization`. The `scale_up_queue_depth` applies to all resources in the deployment configuration.
`scale_down_queue_depth`	`(queue_depth: int)`, Default: 1	Only applies when `scale_up_queue_depth` is configured. The queue trigger for autoscaling down replicas. The `scale_down_queue_depth` is based on the formula `(number of requests in the queue + requests being processed) / (The number of available replicas set over the autoscaling_window)`.
`autoscaling_window`	`(window_seconds: int)` (Default: 300, Minimum allowed: 60)	The period over which to scale up or scale down resources. Only applies when `scale_up_queue_depth` is configured.
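The queue depth formula in the table can be worked through with a short example. The sketch below is illustrative only: `queue_depth` and `scaling_action` are hypothetical helpers. Scaling up when the depth meets or exceeds the threshold follows the description in this guide, while scaling down when the depth falls below `scale_down_queue_depth` is an assumption about the exact comparison.

```python
def queue_depth(queued, in_flight, available_replicas):
    """(requests in the queue + requests being processed) / available replicas."""
    if available_replicas <= 0:
        # A deployment scaled to zero is outside the scope of this sketch.
        raise ValueError("available_replicas must be positive")
    return (queued + in_flight) / available_replicas


def scaling_action(depth, scale_up_queue_depth, scale_down_queue_depth=1):
    """Return 'up', 'down', or 'hold' for one autoscaling window."""
    if depth >= scale_up_queue_depth:
        return "up"
    if depth < scale_down_queue_depth:
        # Assumption: scale down when the depth is below the threshold.
        return "down"
    return "hold"


# 8 queued and 4 in-flight requests across 2 replicas, scale_up_queue_depth=5
depth = queue_depth(8, 4, 2)
print(depth, scaling_action(depth, 5))  # 6.0 up
```

Because the metric is divided by the number of replicas, adding a replica lowers the depth for the next window, which is what makes the scaling incremental.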

Each replica is allocated the same number of cpus, gpus, and amount of memory. The deployment load balancer distributes inference requests across the replicas to provide consistent service. The number of replicas is set in one of the following mutually exclusive ways:

Replica Type	Description	Parameters
Set number of replicas	Sets the number of replicas to a constant number via the `replica_count(count)` deployment setting. Recommended for use cases where inference requests are steady over time. The number of replicas is changed by creating a new deployment configuration and deploying again with the new deployment configuration.	`replica_count`
Autoscale Triggers by CPU Utilization	The number of replicas changes depending on the cpu utilization. This is recommended when the inference requests increase or decrease, and provides organizations with a method to decrease the resources needed for a deployment and increase as required.	`replica_autoscale_min_max(maximum: int, minimum: int = 0)` `autoscale_cpu_utilization(cpu_utilization_percentage: int)`
Autoscale Triggers by Queue Depth	Autoscale triggers based on the inference queue depth. Recommended for autoscaling replicas for GPUs and where the inference requests typically increase or decrease over user defined intervals.	`replica_autoscale_min_max(maximum: int, minimum: int = 0)` `scale_up_queue_depth` `scale_down_queue_depth` `autoscale_cpu_utilization`

IMPORTANT NOTE: Autoscale to 0 and CPU Utilization

For use cases where replica_autoscale_min_max has the minimum of 0, cpus must be at least 0.25 or baseline CPU activity will prevent scaling down the final replica to 0.

For example, replica_autoscale_min_max(minimum=0,maximum=5).cpus(0.25).

IMPORTANT NOTE: GPU Allocation and Autoscale

Autoscaling replicas for GPUs should be defined by the queue depth based parameters scale_up_queue_depth, scale_down_queue_depth, and autoscaling_window. In this scenario, scaling is incremental.

Deployment settings, such as the number of gpus, affect the autoscaling feature. The following describes how autoscaling performs based on different use cases.

Replica Type	Behavior	Sample Use Case
Static Replicas	The number of replicas remains constant.	Inference requests are constant over time and constant availability is required.
CPU Default Scaling	The number of replicas increases or decreases with the default `autoscale_cpu_utilization` of 50%.	Inference requests fluctuate over time. This allows organizations to optimize their spend and the number of resources allocated to a deployment.
CPU Scaling	The number of replicas increases or decreases with a manually set `autoscale_cpu_utilization`, for example 75%.	Inference requests fluctuate over time, with a manually set CPU utilization before allocating another replica.
Autoscale Trigger for GPU, Autoscale Trigger for CPU	The number of replicas increases or decreases based on the `scale_up_queue_depth`, `scale_down_queue_depth(Default: 1)`, and `autoscaling_window(Default: 300)` settings.	Replicas increase when inference requests meet or exceed the `scale_up_queue_depth` over the `autoscaling_window` period and scale back down based on the `scale_down_queue_depth` over the `autoscaling_window`.

Native Runtime Deployment Configuration (autoscaling with the default CPU utilization): All models run on Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)

Containerized Runtime Deployment Configuration (autoscaling with the default CPU utilization): A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)

Combined Runtimes Deployment Configuration (autoscaling with the default CPU utilization): One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

Native Runtime Deployment Configuration (static replicas): All models run on Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_count(5)
    .build()
)

Containerized Runtime Deployment Configuration (static replicas): A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)

Combined Runtimes Deployment Configuration (static replicas): One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

Native Runtime Deployment Configuration (GPU allocation): All models run on Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_count(5)
    .build()
)

Containerized Runtime Deployment Configuration (GPU allocation): A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime. For this example, 1 gpu is assigned to the Containerized Runtime, and 0 gpus are assigned to the Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)

Combined Runtimes Deployment Configuration (GPU allocation): One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

Note that this configuration allocates 2 gpus to the model deployment per replica: one for the Native Runtime and one for the model deployed in the Containerized Runtime. Only one deployment label is required.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)
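Because every replica receives the same allocation, the GPU total for a deployment is the per-replica GPU count multiplied by the number of replicas. The arithmetic for the combined-runtime configuration above can be sketched as follows. `gpus_required` is a hypothetical helper, not an SDK call.

```python
def gpus_required(replicas, native_gpus, sidekick_gpus_per_model):
    """GPUs needed when every replica receives the same allocation."""
    per_replica = native_gpus + sum(sidekick_gpus_per_model)
    return replicas * per_replica


# Values from the combined-runtime example:
# replica_count(5), gpus(1), sidekick_gpus(model, 1)
print(gpus_required(5, 1, [1]))  # 10
```

Comparing this total against the GPUs available in the labeled nodepool before deploying helps avoid the allocation error described earlier.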

Native Runtime Deployment Configuration (autoscaling at 75% CPU utilization): All models run on Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)

Containerized Runtime Deployment Configuration (autoscaling at 75% CPU utilization): A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)

Combined Runtimes Deployment Configuration (autoscaling at 75% CPU utilization): One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

Resource Allocation	Behavior
Sets resources for the LLM `llm_gpu` with the following allocations: Replica autoscale: 0 to 5; Wallaroo engine per replica: 1 cpu, 2 Gi RAM; `llm_gpu` per replica: 1 gpu, 24 Gi RAM; `scale_up_queue_depth`: 5; Autoscaling window: 600 seconds.	Wallaroo engine and LLM scaling is 1:1. Scaling up occurs when the scale up queue depth is above 5 over 600 seconds.

deployment_with_gpu = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(1)
    .memory('2Gi')
    .sidekick_gpus(llm_gpu, 1)
    .sidekick_memory(llm_gpu, '24Gi')
    .scale_up_queue_depth(5)
    .autoscaling_window(600)
    .build()
)

Create the Wallaroo pipeline and assign the LLM as a pipeline step.

# create the pipeline
llm_gpu_pipeline = wl.build_pipeline('sample-llm-with-gpu-pipeline')

# add the LLM as a pipeline model step
llm_gpu_pipeline.add_model_step(llm_gpu)

The pipeline is deployed with the deployment_with_gpu deployment configuration.

llm_gpu_pipeline.deploy(deployment_config=deployment_with_gpu)

Inference from Zero Scaled Deployments

For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0 and replicas scale down to zero when there is no utilization based on the autoscale parameters. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.

When inferencing in this scenario, a timeout may occur waiting for the first replica to spool up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.

Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling before attempting the initial inference again. Once an inference completes successfully, the inferences proceed as normal.

import time

# repeat until the deployment reports the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference to wake up the first replica
        pipeline.infer(dataframe)
    except Exception:
        # the replica is still starting; ignore the error and retry
        pass
    # wait 10 seconds before attempting the inference again
    time.sleep(10)

# when the inference passes successfully, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)