Deployment Configuration with the Wallaroo SDK

How to manage deployment configurations using the Wallaroo SDK

Deployment Configuration Introduction

Deployment configurations tailor a model deployment to match an organization’s and model’s requirements. A deployment may require more memory, CPU cores, or GPUs to run all of its steps efficiently. Deployment configurations also allow multiple replicas of a model in a deployment to provide scalability.

Deployment Resource Configurations

Deployment configurations deal with two major components:

  • Native Runtimes: Models that are deployed “as is” with the Wallaroo engine (Onnx, etc).
  • Containerized Runtimes: Models that are packaged into a container then deployed as a container with the Wallaroo engine (MLFlow, etc).

These configurations can be mixed: native runtimes and containerized runtimes can be deployed together, with resources allocated to each runtime in different configurations.

GPU and CPU Allocation

CPUs are allocated in fractions of total CPU power, similar to the Kubernetes CPU definitions. cpus(0.25), cpus(1.0), etc. are valid values.

GPUs can only be allocated in whole integer units from the GPU-enabled nodepools. gpus(1), gpus(2), etc. are valid values, while gpus(0.25) is not.

Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
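The allocation rules above can be checked before a deployment is attempted. The following is a minimal sketch in plain Python, not part of the Wallaroo SDK: the helper names check_cpus and check_gpus and the free_gpus argument are hypothetical, and the count of free GPUs is assumed to be known from the cluster administrator.

```python
from numbers import Real


def check_cpus(cpus):
    """Return the CPU request as a float; fractions such as 0.25 are allowed."""
    if isinstance(cpus, bool) or not isinstance(cpus, Real) or cpus <= 0:
        raise ValueError(f"cpus must be a positive number, got {cpus!r}")
    return float(cpus)


def check_gpus(gpus, free_gpus):
    """Return the GPU request if it is a whole number the cluster can satisfy.

    free_gpus is a hypothetical input: the number of GPUs not yet allocated
    to other deployments.
    """
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 0:
        raise ValueError(f"gpus must be a whole number, got {gpus!r}")
    if gpus > free_gpus:
        raise RuntimeError(f"requested {gpus} GPUs but only {free_gpus} are free")
    return gpus
```

For example, check_gpus(0.25, free_gpus=4) raises a ValueError, mirroring the rule that fractional GPUs are not valid.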

GPU Support

Wallaroo 2023.2.1 and above support Kubernetes nodepools with NVIDIA CUDA GPUs.

See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.

Architecture Support

Wallaroo supports x86 and ARM architecture CPUs. For example, Azure supports Ampere® Altra® Arm-based processors included with the following virtual machines:

Model Deployment Architecture Inheritance

Deployment configurations inherit the model’s architecture setting. This is set during model upload by specifying the arch parameter. By default, models uploaded to Wallaroo use the x86 architecture.

The following model operations inherit the model’s architecture setting.

The following example shows uploading a model with the architecture set to ARM, and how the deployment inherits that architecture without additional deployment configuration changes. For this example, an ONNX model is uploaded.

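import wallaroo
from wallaroo.framework import Framework

# `wl` is assumed to be a connected Wallaroo client, for example wl = wallaroo.Client()
# `model_name_arm` and `model_file_name` are assumed to be set earlier

# upload the ONNX model with the architecture set to ARM
housing_model_control_arm = (wl.upload_model(model_name_arm,
                                             model_file_name,
                                             framework=Framework.ONNX,
                                             arch=wallaroo.engine_config.Architecture.ARM)
                             .configure(tensor_fields=["tensor"])
                             )

display(housing_model_control_arm)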
Name          house-price-estimator-arm
Version       163ff0a9-0f1a-4229-bbf2-a19e4385f10f
File Name     rf_model.onnx
SHA           e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6
Status        ready
Image Path    None
Architecture  arm
Acceleration  None
Updated At    2024-04-Mar 20:34:00

Note that no architecture is specified in the deployment configuration settings. When pipeline_arm is displayed, we see that the arch setting is inherited from the model’s arch setting.

pipeline_arm = wl.build_pipeline(arm_pipeline_name)

# set the model step with the ARM targeted model
pipeline_arm.add_model_step(housing_model_control_arm)

#minimum deployment config for this model
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(1).memory("1Gi").build()

pipeline_arm.deploy(deployment_config = deploy_config)

    Waiting for deployment - this will take up to 45s .......... ok

display(pipeline_arm)
name          architecture-demonstration-arm
created       2024-03-04 20:34:08.895396+00:00
last_updated  2024-03-04 21:52:01.894671+00:00
deployed      True
arch          arm
accel         None
tags
versions      55d834b4-92c8-4a93-b78b-6a224f17f9c1, 98821b85-401a-4ab5-af8e-1b3126727069, 74571863-9eb0-47aa-8b5a-3bdaa7aa9f03, b72fb0db-e4b4-4936-a7cb-3d0fb7827a6f, 3ae70818-10f3-4f61-a998-dee5e2f00daf
steps         house-price-estimator-arm
published     True

Deployment Configuration Defaults

Deployment configurations default to the following*.

Runtime | CPUs | Memory | GPUs
--- | --- | --- | ---
Wallaroo Native Runtime** | 4 | 3 Gi | 0
Wallaroo Containerized Runtime*** | 2 | 1 Gi | 0

*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
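To illustrate how the defaults and footnotes above combine, here is a small plain-Python sketch. It does not call the Wallaroo SDK; the function name effective_resources and the dictionary layout are hypothetical, and the default values are copied from the table above.

```python
# Defaults copied from the table above.
NATIVE_DEFAULTS = {"cpus": 4, "memory": "3Gi", "gpus": 0}
CONTAINERIZED_DEFAULTS = {"cpus": 2, "memory": "1Gi", "gpus": 0}


def effective_resources(native=None, containerized=None, has_containerized_model=False):
    """Merge user overrides with the defaults for each runtime.

    Native runtime resources are always allocated, even when no native model
    is deployed. Containerized runtime resources only apply when a
    containerized model is part of the deployment.
    """
    resources = {"native": {**NATIVE_DEFAULTS, **(native or {})}}
    if has_containerized_model:
        resources["containerized"] = {**CONTAINERIZED_DEFAULTS, **(containerized or {})}
    return resources


# A containerized-only deployment that shrinks the always-present native engine.
print(effective_resources(native={"cpus": 0.25, "memory": "1Gi"},
                          has_containerized_model=True))
```

The example shows why the native allocation is worth reducing for containerized-only pipelines: without the override, the native engine would still reserve 4 CPUs and 3 Gi of memory.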

Deployment Configurations via the Wallaroo SDK

The following details how to set the deployment configuration via the Wallaroo SDK.

The following resource configurations are available through the wallaroo.deployment_config object.

These configurations can be edited in the Wallaroo Dashboard after the initial deployment. For more details, see Deployment Configuration via the Wallaroo Dashboard.

Create Deployment Configuration

Setting a deployment configuration follows this process:

  1. Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
  2. Once the configuration options are set, the deployment configuration is built with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
  3. The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.

The following example shows a model deployment configuration with 1 replica, 1 cpu, and 2Gi of memory set to be allocated to the deployment configuration. We start by:

  • Importing the DeploymentConfigBuilder class
  • Setting the deployment configuration settings
  • Building the deployment configuration and saving it to a variable
  • Applying that deployment configuration when we deploy the pipeline
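from wallaroo.deployment_config import DeploymentConfigBuilder

# the parentheses allow the chained builder calls to span multiple lines
deployment_config = (DeploymentConfigBuilder()
                     .replica_count(1)
                     .cpus(1)
                     .memory("2Gi")
                     .build())

# `pipeline` is assumed to be an existing Wallaroo pipeline
pipeline.deploy(deployment_config = deployment_config)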

Deployment resources can be configured with autoscaling. Autoscaling allows the user to define how many engines a deployment starts with, the minimum number of engines a deployment uses, and the maximum number of engines a deployment can scale to. As the user’s workload increases and decreases, the deployment scales up and down based on the average CPU utilization across the engines in the deployment.
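As a rough illustration of scaling on average CPU utilization, the sketch below uses the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler. Whether Wallaroo applies exactly this calculation is an assumption, and the function name desired_replicas is hypothetical. Waking a deployment from zero replicas is not modeled here.

```python
import math


def desired_replicas(current, avg_cpu_percent, target_cpu_percent, minimum, maximum):
    """Estimate the replica count for a given average CPU utilization.

    Assumes the Kubernetes Horizontal Pod Autoscaler formula:
    ceil(current * observed / target), clamped to [minimum, maximum].
    """
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    return max(minimum, min(maximum, wanted))


# Two replicas averaging 90% CPU against a 50% target.
print(desired_replicas(current=2, avg_cpu_percent=90, target_cpu_percent=50,
                       minimum=0, maximum=5))
```

Under this assumption, a higher target percentage means each replica works harder before another one is added.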

Native Runtime Configuration Methods

Method | Parameters | Description | Enterprise Only Feature
--- | --- | --- | ---
cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. |
gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must be called and match the GPU nodepool for the cluster hosting the Wallaroo instance. |
memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format; the examples in this guide pass values such as "2Gi". |
deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set, and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. |
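The memory_spec format described above, a size followed by a binary unit, can be converted to bytes to compare or sum allocations. The sketch below is plain Python and independent of the Wallaroo SDK. The function name memory_spec_to_bytes is hypothetical, and accepting both the short form used in this guide's examples ("2Gi") and the long form ("2GiB") is an assumption about what is convenient, not a statement of what the SDK validates.

```python
import re

# Binary unit multipliers, as in the Kubernetes memory resource units.
_UNIT_BYTES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}
# A number followed by a unit, with an optional trailing "B" (assumption).
_SPEC = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)B?$")


def memory_spec_to_bytes(spec):
    """Convert a spec such as '2Gi' or '512MiB' to a number of bytes."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized memory spec: {spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNIT_BYTES[unit])


print(memory_spec_to_bytes("2Gi"))
```

Decimal units such as "2GB" are rejected by this sketch, since only the binary units are listed in the table.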

Containerized Runtime Configuration Methods

Method | Parameters | Description | Enterprise Only Feature
--- | --- | --- | ---
sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. Parameters: model (Model): the sidekick model to configure; core_count (float): the number of CPU cores to use in this sidekick. |
sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. Parameters: model (Model): the sidekick model to configure; memory_spec: the amount of memory to allocate, as memory unit values. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format. |
sidekick_env | (model: wallaroo.model.Model, environment: Dict[str, str]) | Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance. |
sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If called, then deployment_label must be called and match the GPU nodepool for the cluster hosting the Wallaroo instance. |

Deployment Replicas and Autoscale

Wallaroo supports deployment replicas and autoscaling for Wallaroo Enterprise edition. The following parameters are used for different autoscaling options.

Method | Parameters | Description
--- | --- | ---
replica_count | (count: int) | The number of replicas to deploy. This deploys multiple copies of the same models to increase inference throughput through parallelization.
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Allows replicas to be scaled from 0 to some maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs.
autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU percentage metric for when to load or unload another replica.
scale_up_queue_depth | (queue_depth: int) | The queue trigger for autoscaling additional replicas up. This requires that the deployment configuration parameter replica_autoscale_min_max is set. scale_up_queue_depth is determined by the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window). This field overrides the deployment configuration parameter autoscale_cpu_utilization. The scale_up_queue_depth applies to all resources in the deployment configuration.
scale_down_queue_depth | (queue_depth: int), Default: 1 | Only applies when scale_up_queue_depth is configured. The queue trigger for autoscaling down replicas. The scale_down_queue_depth is based on the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window).
autoscaling_window | (window_seconds: int), Default: 300, Minimum allowed: 60 | The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured.
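The queue depth formula above can be worked through with a short plain-Python sketch. The function name scaling_decision is hypothetical. The comparison rules are assumptions: scale up when the depth meets or exceeds scale_up_queue_depth, scale down when it falls below scale_down_queue_depth, and treat any pending request as a reason to scale up from zero replicas. Averaging over the autoscaling_window is not modeled; the inputs stand for values already observed over that window.

```python
def scaling_decision(queued, in_progress, replicas,
                     scale_up_queue_depth, scale_down_queue_depth=1):
    """Return 'up', 'down', or 'hold' for one evaluation of the queue depth.

    depth = (requests in the queue + requests being processed) / available replicas
    """
    pending = queued + in_progress
    if replicas == 0:
        # Assumption: with no replicas, any pending request triggers a scale up.
        return "up" if pending > 0 else "hold"
    depth = pending / replicas
    if depth >= scale_up_queue_depth:
        return "up"
    if depth < scale_down_queue_depth:
        return "down"
    return "hold"


# 8 queued + 2 in progress across 2 replicas gives a depth of 5.
print(scaling_decision(queued=8, in_progress=2, replicas=2, scale_up_queue_depth=5))
```

With scale_up_queue_depth set to 5, the example reaches the threshold and the sketch reports that another replica is needed.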

Each replica is allocated the same number of cpus and gpus and the same amount of memory. The deployment load balancer distributes inference requests across the replicas to provide consistent service. The number of replicas is set in one of the following mutually exclusive ways:

Replica Type | Description | Parameters
--- | --- | ---
Set number of replicas | Sets the number of replicas to a constant number via the replica_count(count) deployment setting. Recommended for use cases where inference requests are steady over time. The number of replicas is changed by creating a new deployment configuration and deploying again with the new deployment configuration. | replica_count
Autoscale Triggers by CPU Utilization | The number of replicas changes depending on the CPU utilization. This is recommended when the volume of inference requests increases or decreases, and provides organizations with a method to decrease the resources needed for a deployment and increase them as required. | replica_autoscale_min_max(maximum: int, minimum: int = 0), autoscale_cpu_utilization(cpu_utilization_percentage: int)
Autoscale Triggers by Queue Depth | Autoscale triggers based on the inference queue depth. Recommended for autoscaling replicas for GPUs and where the inference requests typically increase or decrease over user-defined intervals. | replica_autoscale_min_max(maximum: int, minimum: int = 0), scale_up_queue_depth, scale_down_queue_depth, autoscaling_window

IMPORTANT NOTE: Autoscale to 0 and CPU Utilization

For use cases where replica_autoscale_min_max has the minimum of 0, cpus must be at least 0.25 or baseline CPU activity will prevent scaling down the final replica to 0.

For example, replica_autoscale_min_max(minimum=0,maximum=5).cpus(0.25).
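The two constraints described above, mutually exclusive replica settings and the 0.25 CPU floor for scaling to zero, can be checked before building a configuration. This is a plain-Python sketch, not an SDK feature; the function name check_replica_settings and its keyword arguments are hypothetical.

```python
def check_replica_settings(cpus, replica_count=None, autoscale_min=None, autoscale_max=None):
    """Return a list of problems found in the replica settings (empty if none)."""
    problems = []
    autoscaling = autoscale_min is not None or autoscale_max is not None
    if replica_count is not None and autoscaling:
        problems.append("replica_count and replica_autoscale_min_max are mutually exclusive")
    if autoscaling:
        low = autoscale_min or 0  # the minimum defaults to 0
        if autoscale_max is None or autoscale_max < low:
            problems.append("autoscale maximum must be set and not below the minimum")
        if low == 0 and cpus < 0.25:
            problems.append("cpus must be at least 0.25 to scale down to 0 replicas")
    return problems


print(check_replica_settings(cpus=0.1, autoscale_min=0, autoscale_max=5))
```

The example reports the CPU floor problem because 0.1 CPUs is below the 0.25 needed for the final replica to scale down.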

IMPORTANT NOTE: GPU Allocation and Autoscale

Autoscaling replicas for GPUs should be defined by the queue depth based parameters scale_up_queue_depth, scale_down_queue_depth, and autoscaling_window. In this scenario, scaling is incremental.

Deployment settings, such as the number of gpus, affect the autoscaling feature. The following describes how autoscaling performs based on different use cases.

Replica Type | Behavior | Sample Use Case
--- | --- | ---
Static Replicas | The number of replicas remains constant. | Inference requests are constant over time, and the replicas provide constant availability.
CPU Default Scaling | The number of replicas increases or decreases with the default autoscale_cpu_utilization of 50%. | Inference requests fluctuate over time. This allows organizations to optimize their spend and the number of resources allocated to a deployment.
CPU Scaling | The number of replicas increases or decreases with the autoscale_cpu_utilization of 75%. | Inference requests fluctuate over time, with a manually set CPU utilization before allocating another replica.
Autoscale Trigger for GPU, Autoscale Trigger for CPU | The number of replicas increases or decreases based on the scale_up_queue_depth, scale_down_queue_depth (Default: 1), and autoscaling_window (Default: 300) settings. | Replicas increase when inference requests meet or exceed the scale_up_queue_depth over the autoscaling_window period, and scale back down based on the scale_down_queue_depth over the autoscaling_window.

The following configurations autoscale between 0 and 5 replicas using the default CPU utilization trigger. The parentheses allow the chained builder calls to span multiple lines.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())

The following configurations use a static count of 5 replicas.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_count(5)
    .build())
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())

The following configurations use a static count of 5 replicas with GPUs allocated.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_count(5)
    .build())
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime. For this example, 1 gpu is assigned to the Containerized Runtime, and 0 gpus are assigned to the Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

Note that this configuration allocates 2 gpus to the model deployment per replica: one for the Native Runtime and one for the model deployed in the Containerized Runtime. Only one deployment label is required.

(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())

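Because every replica receives the same allocation, the cluster-wide request for a configuration is the per-replica total multiplied by the replica count. The plain-Python sketch below works this out for the combined GPU configuration just shown. The function name total_request and the dictionary layout are hypothetical, memory is expressed in Gi as plain numbers, and adding the native and containerized allocations together per replica is an assumption based on the note about 2 gpus per replica.

```python
def total_request(replicas, native, sidekicks):
    """Sum native and sidekick allocations per replica, then scale by replicas."""
    per_replica = {
        key: native[key] + sum(sidekick[key] for sidekick in sidekicks)
        for key in ("cpus", "memory_gi", "gpus")
    }
    return {key: value * replicas for key, value in per_replica.items()}


# Values taken from the combined runtimes GPU configuration above.
native = {"cpus": 2, "memory_gi": 2, "gpus": 1}
sidekick = {"cpus": 3, "memory_gi": 3, "gpus": 1}
print(total_request(replicas=5, native=native, sidekicks=[sidekick]))
```

Under these assumptions the five replicas ask for 10 GPUs in total, which is the number to compare against the GPUs available in the nodepool.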
The following configurations autoscale between 0 and 5 replicas with autoscale_cpu_utilization set to 75%.

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())

The following configuration autoscales by queue depth for an LLM with a GPU.

Behavior: Sets resources to the LLM llm_gpu with the following allocations:
  • Replica autoscale: 0 to 5.
  • Wallaroo engine per replica:
    • 1 cpu, 2 Gi RAM
  • llm_gpu per replica:
    • 1 gpu
    • 24 Gi RAM
  • scale_up_queue_depth: 5
  • Autoscaling window: 600
  • Wallaroo engine and LLM scaling is 1:1.
  • Scaling up occurs when the scale up queue depth is above 5 over 600 seconds.

Resource Allocation:

# `llm_gpu` is assumed to be an LLM already uploaded to Wallaroo
# as noted for sidekick_gpus, a deployment_label matching the GPU nodepool is also required
deployment_with_gpu = (wallaroo.DeploymentConfigBuilder()
                       .replica_autoscale_min_max(minimum=0, maximum=5)
                       .cpus(1).memory('2Gi')
                       .sidekick_gpus(llm_gpu, 1)
                       .sidekick_memory(llm_gpu, '24Gi')
                       .scale_up_queue_depth(5)
                       .autoscaling_window(600)
                       .build())

Create the Wallaroo pipeline and assign the LLM as a pipeline step.

# create the pipeline
llm_gpu_pipeline = wl.build_pipeline('sample-llm-with-gpu-pipeline')

# add the LLM as a pipeline model step
llm_gpu_pipeline.add_model_step(llm_gpu)

The pipeline is deployed with the deployment_with_gpu deployment configuration.

llm_gpu_pipeline.deploy(deployment_with_gpu)

Inference from Zero Scaled Deployments

For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0 and replicas scale down to zero when there is no utilization based on the autoscale parameters. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.

When inferencing in this scenario, a timeout may occur waiting for the first replica to spool up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.

Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling before attempting the initial inference again. Once an inference completes successfully, the inferences proceed as normal.

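import time

# `pipeline`, `dataframe`, `dataframe2`, and `dataframe3` are assumed to be defined earlier

# loop until the deployment reports the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference to wake up the first replica
        pipeline.infer(dataframe)
    except Exception:
        # a timeout is expected while the replica is still starting; ignore it and retry
        pass
    # wait 10 seconds before checking the status again
    time.sleep(10)

# once the deployment is running, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)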
pipeline.infer(dataframe3)