Wallaroo SDK Essentials Guide: Pipeline Deployment Configuration
Deployment Configuration Introduction
Deployment configurations tailor a model deployment to match an organization’s and a model’s requirements. A deployment may require more memory, CPU cores, or GPUs to run all of its steps efficiently. Deployment configurations also allow multiple replicas of a model in a deployment to provide scalability.
Deployment Resource Configurations
Deployment configurations deal with two major components:
- Native Runtimes: Models that are deployed “as is” with the Wallaroo engine (Onnx, etc).
- Containerized Runtimes: Models that are packaged into a container then deployed as a container with the Wallaroo engine (MLFlow, etc).
These configurations can be mixed: native runtimes and containerized runtimes can be deployed together, with different resources allocated to each runtime.
GPU and CPU Allocation
CPUs are allocated in fractions of total CPU power, similar to the Kubernetes CPU definitions. cpus(0.25), cpus(1.0), etc. are valid values.
GPUs can only be allocated in whole integer units from the GPU enabled nodepools. gpus(1), gpus(2), etc. are valid values, while gpus(0.25) is not.
Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
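The allocation rules above can be checked before a deployment is attempted. The following is a minimal sketch of a hypothetical pre-flight helper; `validate_allocation` and its `gpus_available` parameter are assumptions made for illustration and are not part of the Wallaroo SDK.

```python
from numbers import Real


def validate_allocation(cpus: float, gpus: int, gpus_available: int) -> None:
    """Raise ValueError if the requested allocation breaks the rules above.

    Hypothetical helper for illustration; not part of the Wallaroo SDK.
    """
    # CPUs may be fractional (0.25, 1.0, ...), but must be positive.
    if isinstance(cpus, bool) or not isinstance(cpus, Real) or cpus <= 0:
        raise ValueError(f"cpus must be a positive number, got {cpus!r}")
    # GPUs are allocated only in whole units.
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 0:
        raise ValueError(f"gpus must be a non-negative integer, got {gpus!r}")
    # The request cannot be met if the cluster has too few free GPUs.
    if gpus > gpus_available:
        raise ValueError(
            f"requested {gpus} GPUs but only {gpus_available} are available"
        )
```

A call such as validate_allocation(0.25, 1, gpus_available=2) passes silently, while a fractional GPU request or a request larger than the free GPU count raises an error before any deployment is attempted.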
GPU Support
Wallaroo 2023.2.1 and above supports Kubernetes nodepools with Nvidia Cuda GPUs.
See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.
IMPORTANT NOTE
If allocating GPUs to a Wallaroo pipeline, the deployment_label configuration option must be used.
Architecture Support
Wallaroo supports x86 and ARM architecture CPUs. For example, Azure supports Ampere® Altra® Arm-based processor included with the following virtual machines:
Model Deployment Architecture Inheritance
Deployment configurations inherit the model’s architecture setting. This is set during model upload by specifying the arch parameter. By default, models uploaded to Wallaroo use the x86 architecture.
The following model operations inherit the model’s architecture setting.
- Model Deployment: The model deployment and its deployment configuration inherit the model’s architecture. No specification of the architecture is required for model deployment.
- Pipeline Publishing: The Wallaroo engine set when a pipeline is containerized and published to an Open Container Initiative (OCI) Registry inherits the model’s architecture setting.
The following example shows uploading a model with the architecture set to ARM, and how the deployment inherits that architecture without additional deployment configuration changes. For this example, an ONNX model is uploaded.
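import wallaroo
from wallaroo.framework import Framework

# wl is the connected Wallaroo client, for example wl = wallaroo.Client()
# upload the ONNX model and set its architecture to ARM
housing_model_control_arm = (wl.upload_model(model_name_arm,
                                             model_file_name,
                                             framework=Framework.ONNX,
                                             arch=wallaroo.engine_config.Architecture.ARM)
                             .configure(tensor_fields=["tensor"])
                             )
display(housing_model_control_arm)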
Name | house-price-estimator-arm |
Version | 163ff0a9-0f1a-4229-bbf2-a19e4385f10f |
File Name | rf_model.onnx |
SHA | e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6 |
Status | ready |
Image Path | None |
Architecture | arm |
Acceleration | None |
Updated At | 2024-04-Mar 20:34:00 |
Note that no architecture is specified in the deployment configuration settings. When pipeline_arm is displayed, its arch setting shows that it inherited the model’s arch setting.
pipeline_arm = wl.build_pipeline(arm_pipeline_name)
# set the model step with the ARM targeted model
pipeline_arm.add_model_step(housing_model_control_arm)
#minimum deployment config for this model
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(1).memory("1Gi").build()
pipeline_arm.deploy(deployment_config = deploy_config)
Waiting for deployment - this will take up to 45s .......... ok
display(pipeline_arm)
name | architecture-demonstration-arm |
---|---|
created | 2024-03-04 20:34:08.895396+00:00 |
last_updated | 2024-03-04 21:52:01.894671+00:00 |
deployed | True |
arch | arm |
accel | None |
tags | |
versions | 55d834b4-92c8-4a93-b78b-6a224f17f9c1, 98821b85-401a-4ab5-af8e-1b3126727069, 74571863-9eb0-47aa-8b5a-3bdaa7aa9f03, b72fb0db-e4b4-4936-a7cb-3d0fb7827a6f, 3ae70818-10f3-4f61-a998-dee5e2f00daf |
steps | house-price-estimator-arm |
published | True |
Deployment Configuration Defaults
Deployment configurations default to the following*.
Runtime | CPUs | Memory | GPUs |
---|---|---|---|
Wallaroo Native Runtime** | 4 | 3 Gi | 0 |
Wallaroo Containerized Runtime*** | 2 | 1 Gi | 0 |
*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
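To make the footnotes concrete, the sketch below totals the default per-replica resources for a deployment. It is illustrative only: the `default_footprint` helper is hypothetical, and it assumes the containerized defaults apply once per containerized model, which the table does not state explicitly.

```python
# Defaults copied from the table above (memory in Gi).
NATIVE_DEFAULTS = {"cpus": 4, "memory_gi": 3, "gpus": 0}
CONTAINERIZED_DEFAULTS = {"cpus": 2, "memory_gi": 1, "gpus": 0}


def default_footprint(containerized_models: int) -> dict:
    """Total per-replica resources when no settings are overridden."""
    # The native engine is always allocated, even with no native models.
    total = dict(NATIVE_DEFAULTS)
    # Assumption: containerized defaults apply once per containerized model.
    for _ in range(containerized_models):
        for key, value in CONTAINERIZED_DEFAULTS.items():
            total[key] += value
    return total
```

Under these assumptions, a deployment with one containerized model and no overrides requests 6 CPUs and 4 Gi of memory per replica, which is why lowering the Native Runtime allocation is recommended when only Containerized Runtimes are used.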
Deployment Configurations via the Wallaroo SDK
The following details how to set the deployment configuration via the Wallaroo SDK.
The following resource configurations are available through the wallaroo.deployment_config object.
These updates can be edited in the Wallaroo Dashboard after the initial deployment. For more details, see Deployment Configuration via the Wallaroo Dashboard.
Create Deployment Configuration
Setting a deployment configuration follows this process:
- Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
- Once the configuration options are set, the deployment configuration is built with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
- The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.
The following example shows a model deployment configuration with 1 replica, 1 cpu, and 2Gi of memory allocated to the deployment. We start by:
- Importing the DeploymentConfigBuilder class
- Setting the deployment configuration settings
- Building the deployment configuration and saving it to a variable
- Applying that deployment configuration when we deploy the pipeline
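from wallaroo.deployment_config import DeploymentConfigBuilder

# parentheses allow the chained builder calls to span multiple lines
deployment_config = (DeploymentConfigBuilder()
                     .replica_count(1)
                     .cpus(1)
                     .memory("2Gi")
                     .build())

# apply the configuration when deploying the pipeline
pipeline.deploy(deployment_config = deployment_config)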
Deployment resources can be configured with autoscaling. Autoscaling lets the user define how many engines a deployment starts with, the minimum number of engines it uses, and the maximum number of engines it can scale to. The deployment scales up and down based on the average CPU utilization across its engines as the user’s workload increases and decreases.
Native Runtime Configuration Methods
Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25 , 1 , 1.5 , etc. The units are similar to the Kubernetes CPU definitions. | |
gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must be called and match the GPU Nodepool for the Wallaroo Cluster hosting the Wallaroo instance. | √ |
memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”, for example 2Gi. | |
deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. | √ |
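The “{size as number}{unit value}” format of memory_spec can be illustrated with a small parser. This is a sketch only: the accepted unit values are not listed here, so the code assumes the Kubernetes-style binary suffixes Ki, Mi, Gi, and Ti, and `memory_spec_to_bytes` is a hypothetical helper, not an SDK function.

```python
import re

# Assumption: Kubernetes-style binary suffixes; the SDK's accepted units may differ.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_SPEC = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)$")


def memory_spec_to_bytes(memory_spec: str) -> int:
    """Convert a spec such as "2Gi" into a byte count."""
    match = _SPEC.match(memory_spec)
    if match is None:
        raise ValueError(f"unrecognized memory_spec: {memory_spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNITS[unit])
```

Converting specs to bytes this way makes it straightforward to compare a requested allocation against the memory available on a node before deploying.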
Containerized Runtime Configuration Methods
Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. | |
sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. | |
sidekick_env | (model: wallaroo.model.Model, environment: Dict[str, str]) | Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance. | |
sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If sidekick_gpus is called, then deployment_label must be called and match the GPU Nodepool for the Wallaroo Cluster hosting the Wallaroo instance. | √ |
Deployment Replicas and Autoscale
Wallaroo supports deployment replicas and autoscaling for Wallaroo Enterprise edition. The following parameters are used for different autoscaling options.
Method | Parameters | Description |
---|---|---|
replica_count | (count: int) | The number of replicas to deploy. Multiple replicas of the same models are deployed to increase inference throughput through parallelization. |
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Allows replicas to scale between the minimum (default 0) and the maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. |
autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU percentage metric for when to load or unload another replica. |
scale_up_queue_depth | (queue_depth: int) | The queue trigger for autoscaling additional replicas up. This requires that the deployment configuration parameter replica_autoscale_min_max is set. scale_up_queue_depth is determined by the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window). This field overrides the deployment configuration parameter cpu_utilization. The scale_up_queue_depth applies to all resources in the deployment configuration. |
scale_down_queue_depth | (queue_depth: int), Default: 1 | Only applies when scale_up_queue_depth is configured. The queue trigger for autoscaling down replicas. The scale_down_queue_depth is based on the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window). |
autoscaling_window | (window_seconds: int) (Default: 300, Minimum allowed: 60) | The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured. |
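The queue depth formula in the table can be written out directly. The sketch below evaluates a single sample rather than a value measured over the autoscaling_window, and the comparison used for scaling down (at or below scale_down_queue_depth) is an assumption; both function names are hypothetical and not part of the SDK.

```python
def queue_depth(queued: int, in_flight: int, available_replicas: int) -> float:
    """(requests in the queue + requests being processed) / available replicas."""
    if available_replicas <= 0:
        # This sketch does not model a deployment that is scaled to zero.
        raise ValueError("available_replicas must be positive")
    return (queued + in_flight) / available_replicas


def scaling_decision(depth: float, scale_up: int, scale_down: int = 1) -> str:
    """Return "up", "down", or "hold" for one measurement.

    Assumption: scale up when depth meets or exceeds scale_up, and scale
    down when depth is at or below scale_down.
    """
    if depth >= scale_up:
        return "up"
    if depth <= scale_down:
        return "down"
    return "hold"
```

For example, 8 queued requests and 2 in-flight requests across 2 replicas give a depth of 5.0, which meets a scale_up_queue_depth of 5 and would trigger another replica under these assumptions.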
Each replica is allocated the same number of cpus, gpus, and amount of memory. The deployment load balancer distributes inference requests across the replicas to provide consistent service. The number of replicas is set in one of the following mutually exclusive ways:
Replica Type | Description | Parameters |
---|---|---|
Set number of replicas | Sets the number of replicas to a constant number via the replica_count(count) deployment setting. Recommended for use cases where inference requests are steady over time. The number of replicas is changed by creating a new deployment configuration and deploying again with the new deployment configuration. | replica_count |
Autoscale Triggers by CPU Utilization | The number of replicas changes depending on the CPU utilization. This is recommended when inference requests increase or decrease over time, and lets organizations decrease the resources used by a deployment and increase them as required. | replica_autoscale_min_max, autoscale_cpu_utilization |
Autoscale Triggers by Queue Depth | Autoscale triggers based on the inference queue depth. Recommended for autoscaling replicas for GPUs and where the inference requests typically increase or decrease over user defined intervals. | replica_autoscale_min_max, scale_up_queue_depth, scale_down_queue_depth, autoscaling_window |
IMPORTANT NOTE: Autoscale to 0 and CPU Utilization
For use cases where replica_autoscale_min_max has a minimum of 0, cpus must be at least 0.25 or baseline CPU activity will prevent scaling down the final replica to 0.
For example, replica_autoscale_min_max(minimum=0,maximum=5).cpus(0.25).
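The replica rules described so far (a fixed replica count and autoscaling are mutually exclusive, and scaling to 0 needs at least 0.25 cpus) can be expressed as a simple check. This is a hypothetical helper written for illustration; `check_replica_settings` is not part of the Wallaroo SDK.

```python
from typing import List, Optional


def check_replica_settings(
    cpus: float,
    replica_count: Optional[int] = None,
    autoscale_min: Optional[int] = None,
    autoscale_max: Optional[int] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    autoscaling = autoscale_min is not None or autoscale_max is not None
    # A fixed replica count and autoscaling are mutually exclusive.
    if replica_count is not None and autoscaling:
        problems.append("replica_count and autoscaling are mutually exclusive")
    # The autoscale minimum defaults to 0, which requires at least 0.25 cpus.
    if autoscaling and (autoscale_min or 0) == 0 and cpus < 0.25:
        problems.append("cpus must be at least 0.25 when the minimum is 0")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a proposed configuration at once.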
IMPORTANT NOTE: GPU Allocation and Autoscale
Autoscaling replicas for GPUs should be defined by the queue depth based parameters scale_up_queue_depth, scale_down_queue_depth, and autoscaling_window. In this scenario, scaling is incremental.
Deployment settings, such as the number of gpus, affect the autoscaling feature. The following describes how autoscaling performs based on different use cases.
Replica Type | Behavior | Sample Use Case |
---|---|---|
Static Replicas | The number of replicas remains constant. | Inference requests are a constant over time to provide constant availability. |
CPU Default Scaling | The number of replicas increases or decreases with the default autoscale_cpu_utilization of 50%. | Inference requests fluctuate over time. This allows organizations to optimize their spend and the number of resources allocated to a deployment. |
CPU Scaling | The number of replicas increases or decreases with the autoscale_cpu_utilization of 75%. | Inference requests fluctuate over time, with a manually set CPU utilization before allocating another replica. |
Autoscale Trigger for GPU, Autoscale Trigger for CPU | The number of replicas increases or decreases based on the scale_up_queue_depth, scale_down_queue_depth (Default: 1), and autoscaling_window (Default: 300) settings. | Replicas increase when inference requests meet or exceed the scale_up_queue_depth over the autoscaling_window period, and scale back down based on the scale_down_queue_depth over the autoscaling_window. |
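The following examples autoscale between 0 and 5 replicas with the default CPU utilization trigger.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())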
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
The following examples use a static count of 5 replicas.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_count(5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
The following examples use a static count of 5 replicas with GPUs allocated.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_count(5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime. For this example, 1 gpu is assigned to the Containerized Runtime, and 0 gpus assigned to the Native Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
Note that this configuration allocates 2 gpus to the model deployment per replica - one for the Native Runtime and one to the model deployed in the Containerized Runtime. Only one deployment label is required.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
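Because every replica receives the same allocation, the GPU total for a deployment grows with the replica count. The sketch below works through that arithmetic for the GPU examples above; `gpus_required` is a hypothetical helper for capacity planning, not an SDK call.

```python
from typing import Sequence


def gpus_required(replicas: int, native_gpus: int, sidekick_gpus: Sequence[int]) -> int:
    """Total GPUs a deployment requests across all replicas.

    sidekick_gpus holds one entry per containerized model.
    """
    # Each replica gets the native allocation plus every sidekick allocation.
    per_replica = native_gpus + sum(sidekick_gpus)
    return replicas * per_replica
```

For the combined runtimes example, gpus_required(5, 1, [1]) gives 10: two GPUs per replica across five replicas. Comparing that total with the GPUs free in the labeled nodepool indicates whether the deployment can be scheduled.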
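The following examples autoscale between 0 and 5 replicas with autoscale_cpu_utilization set to 75%.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())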
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
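The following example autoscales on queue depth. It sets resources for the LLM llm_gpu with the following allocations and behavior:
- Replicas: autoscale between 0 and 5.
- Native Runtime: 1 cpu and 2Gi of memory.
- llm_gpu: 1 gpu and 24Gi of memory.
- scale_up_queue_depth: 5.
- autoscaling_window: 600 seconds.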
deployment_with_gpu = (wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(1).memory('2Gi')
    .sidekick_gpus(llm_gpu, 1)
    .sidekick_memory(llm_gpu, '24Gi')
    .scale_up_queue_depth(5)
    .autoscaling_window(600)
    .build())
Create the Wallaroo pipeline and assign the LLM as a pipeline step.
# create the pipeline
llm_gpu_pipeline = wl.build_pipeline('sample-llm-with-gpu-pipeline')
# add the LLM as a pipeline model step
llm_gpu_pipeline.add_model_step(llm_gpu)
The pipeline is deployed with the deployment_with_gpu deployment configuration.
llm_gpu_pipeline.deploy(deployment_with_gpu)
Inference from Zero Scaled Deployments
For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0 and replicas scale down to zero when there is no utilization based on the autoscale parameters. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.
When inferencing in this scenario, a timeout may occur waiting for the first replica to spool up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.
Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling before attempting the initial inference again. Once an inference completes successfully, the inferences proceed as normal.
import time

# verify deployment has the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference
        pipeline.infer(dataframe)
    except Exception:
        # ignore errors raised while the first replica is still starting
        pass
    # wait 10 seconds before attempting the inference again
    time.sleep(10)

# when the inference passes successfully, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)
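The wake-up loop above runs until the pipeline reports Running, with no upper bound on how long it waits. As an alternative sketch, the helper below retries an inference callable a bounded number of times. `infer_with_wakeup` and its parameters are hypothetical names chosen for illustration; only the retry-and-wait idea comes from the example above.

```python
import time
from typing import Any, Callable


def infer_with_wakeup(
    infer: Callable[[], Any],
    max_attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call infer() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return infer()
        except Exception as error:  # for example, a timeout while the replica starts
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(delay_seconds)
    raise TimeoutError(
        f"no successful inference after {max_attempts} attempts"
    ) from last_error
```

With a deployed pipeline, the call could look like infer_with_wakeup(lambda: pipeline.infer(dataframe)). The sleep parameter exists so the delay can be replaced in tests; by default it is time.sleep.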