Deployment Configuration
Deployment configurations allow a model deployment to be tailored to an organization’s and model’s requirements. Deployments may require more memory, CPU cores, or GPUs to run all their steps efficiently. Deployment configurations also allow for multiple replicas of a model in a deployment to provide scalability.
Create Deployment Configuration
Setting a deployment configuration follows this process:
- Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
- Once the configuration options are set, the deployment configuration is built with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
- The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.
The following example shows a model deployment configuration with 1 replica, 1 CPU, and 2Gi of memory allocated to the deployment. We start by:
- Importing the DeploymentConfigBuilder class
- Setting the deployment configuration settings
- Building the deployment configuration and saving it to a variable
- Applying that deployment configuration when we deploy the pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder

# chain the settings, then build the configuration
deployment_config = (DeploymentConfigBuilder()
    .replica_count(1)
    .cpus(1)
    .memory("2Gi")
    .build())

# apply the configuration when deploying the pipeline
pipeline.deploy(deployment_config=deployment_config)
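The chained calls above follow the builder pattern. The sketch below is a toy stand-in, not Wallaroo's implementation: SketchConfigBuilder is a hypothetical class that only shows why each setter can be chained and why build() comes last.

```python
class SketchConfigBuilder:
    """Toy stand-in for a chained configuration builder (hypothetical)."""

    def __init__(self):
        self._settings = {}

    def replica_count(self, count):
        self._settings["replica_count"] = count
        return self  # returning self is what makes chaining possible

    def cpus(self, core_count):
        self._settings["cpus"] = core_count
        return self

    def memory(self, memory_spec):
        self._settings["memory"] = memory_spec
        return self

    def build(self):
        # hand back a copy so later builder calls cannot alter it
        return dict(self._settings)


config = SketchConfigBuilder().replica_count(1).cpus(1).memory("2Gi").build()
print(config)
```

Each setter records one option and returns the builder itself, so the options can be supplied in any order before build() produces the final configuration.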
Deployment resources can be configured with autoscaling. Autoscaling allows the user to define how many engines a deployment starts with, the minimum number of engines a deployment uses, and the maximum number of engines a deployment can scale to. The deployment scales up and down based on the average CPU utilization across the engines in a given deployment as the user’s workload increases and decreases.
Deployment Resource Configurations
Deployment configurations deal with two major components:
- Native Runtimes: Models that are deployed “as is” with the Wallaroo engine (ONNX, etc).
- Containerized Runtimes: Models that are packaged into a container then deployed as a container with the Wallaroo engine (MLFlow, etc).
These configurations can be mixed: native runtimes and containerized runtimes can be deployed together, with resources allocated to each runtime in different configurations.
The following resource configurations are available through the wallaroo.deployment_config object.
GPU and CPU Allocation
CPUs are allocated in fractions of total CPU power similar to the Kubernetes CPU definitions. cpus(0.25), cpus(1.0), etc. are valid values.
GPUs can only be allocated by entire integer units from the GPU enabled nodepools. gpus(1), gpus(2), etc. are valid values, while gpus(0.25) is not.
Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
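The allocation rules above can be checked before a configuration is built. The following is a minimal sketch: check_cpus and check_gpus are hypothetical helpers, not part of the Wallaroo SDK, and the requirement that cpus be greater than 0 is an assumption.

```python
from numbers import Real


def check_cpus(core_count):
    """Return core_count if it is a positive whole or fractional number."""
    if isinstance(core_count, bool) or not isinstance(core_count, Real):
        raise TypeError("cpus must be a number")
    if core_count <= 0:
        # assumption: a deployment needs more than 0 CPUs
        raise ValueError("cpus must be greater than 0")
    return core_count


def check_gpus(gpu_count):
    """Return gpu_count if it is a non-negative whole number."""
    if isinstance(gpu_count, bool) or not isinstance(gpu_count, int):
        raise ValueError("gpus must be a whole number, for example 1 or 2")
    if gpu_count < 0:
        raise ValueError("gpus cannot be negative")
    return gpu_count


print(check_cpus(0.25), check_gpus(2))
```

A fractional request such as check_gpus(0.25) raises ValueError, mirroring the rule that GPUs are only allocated in whole units.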
GPU Support
Wallaroo 2023.2.1 and above support Kubernetes nodepools with NVIDIA CUDA GPUs.
See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.
IMPORTANT NOTE
If allocating GPUs to a Wallaroo pipeline, the deployment_label configuration option must be used.
Architecture Support
Wallaroo supports x86 and ARM architecture CPUs. For example, Azure offers virtual machines that include Ampere® Altra® Arm-based processors.
Model Deployment Architecture Inheritance
Deployment configurations inherit the model’s architecture setting. This is set during model upload by specifying the arch parameter. By default, models uploaded to Wallaroo use the x86 architecture.
The following model operations inherit the model’s architecture setting.
- Model Deployment: The model deployment and its deployment configuration inherit the model’s architecture. No specification of the architecture is required for model deployment.
- Pipeline Publishing: The Wallaroo engine set when a pipeline is containerized and published to an Open Container Initiative (OCI) Registry inherits the model’s architecture setting.
The following example shows uploading a model with the architecture set to ARM, and how the deployment inherits that architecture without additional deployment configuration changes. For this example, an ONNX model is uploaded.
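import wallaroo
from wallaroo.framework import Framework

# wl is assumed to be a connected Wallaroo client, for example wl = wallaroo.Client()
# upload the ONNX model with the architecture set to ARM
housing_model_control_arm = (wl.upload_model(model_name_arm,
        model_file_name,
        framework=Framework.ONNX,
        arch=wallaroo.engine_config.Architecture.ARM)
    .configure(tensor_fields=["tensor"])
)
display(housing_model_control_arm)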
Name | house-price-estimator-arm |
Version | 163ff0a9-0f1a-4229-bbf2-a19e4385f10f |
File Name | rf_model.onnx |
SHA | e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6 |
Status | ready |
Image Path | None |
Architecture | arm |
Acceleration | None |
Updated At | 2024-04-Mar 20:34:00 |
Note that no architecture is specified in the deployment configuration settings. When pipeline_arm is displayed, we see that the arch setting is inherited from the model’s arch setting.
pipeline_arm = wl.build_pipeline(arm_pipeline_name)
# set the model step with the ARM targeted model
pipeline_arm.add_model_step(housing_model_control_arm)
#minimum deployment config for this model
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(1).memory("1Gi").build()
pipeline_arm.deploy(deployment_config = deploy_config)
Waiting for deployment - this will take up to 45s .......... ok
display(pipeline_arm)
name | architecture-demonstration-arm |
---|---|
created | 2024-03-04 20:34:08.895396+00:00 |
last_updated | 2024-03-04 21:52:01.894671+00:00 |
deployed | True |
arch | arm |
accel | None |
tags | |
versions | 55d834b4-92c8-4a93-b78b-6a224f17f9c1, 98821b85-401a-4ab5-af8e-1b3126727069, 74571863-9eb0-47aa-8b5a-3bdaa7aa9f03, b72fb0db-e4b4-4936-a7cb-3d0fb7827a6f, 3ae70818-10f3-4f61-a998-dee5e2f00daf |
steps | house-price-estimator-arm |
published | True |
pipeline_arm.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.124.0.45',
'name': 'engine-5d94d89b5d-gbr9h',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'architecture-demonstration-arm',
'status': 'Running'}]},
'model_statuses': {'models': [{'config': {'batch_config': None,
'filter_threshold': None,
'id': 76,
'input_schema': None,
'model_version_id': 43,
'output_schema': None,
'runtime': 'onnx',
'sidekick_uri': None,
'tensor_fields': ['tensor']},
'model_version': {'conversion': {'arch': 'arm',
'framework': 'onnx',
'python_version': '3.8',
'requirements': []},
'file_info': {'file_name': 'rf_model.onnx',
'sha': 'e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6',
'version': '163ff0a9-0f1a-4229-bbf2-a19e4385f10f'},
'id': 43,
'image_path': None,
'name': 'house-price-estimator-arm',
'status': 'ready',
'task_id': None,
'visibility': 'private',
'workspace_id': 62},
'status': 'Running'}]}}],
'engine_lbs': [{'ip': '10.124.0.44',
'name': 'engine-lb-d7cc8fc9c-4s9fc',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': []}
Deployment Configuration Defaults
Deployment configurations default to the following*.
Runtime | CPUs | Memory | GPUs |
---|---|---|---|
Wallaroo Native Runtime** | 4 | 3 Gi | 0 |
Wallaroo Containerized Runtime*** | 2 | 1 Gi | 0 |
*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
Native Runtime Configuration Methods
Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
replica_count | (count: int) | The number of replicas to deploy. Deploying multiple replicas of the same models increases inference throughput through parallelization. | √ |
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Provides replicas to be scaled from 0 to some maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | √ |
autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU percentage metric for when to load or unload another replica. | √ |
disable_autoscale | | Disables autoscaling in the deployment configuration. | |
cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. | |
gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must be called and match the GPU nodepool for the cluster hosting the Wallaroo instance. | √ |
memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are: | |
lb_cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment’s load balancer, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. | |
lb_memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment’s load balancer. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are: | |
deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. | √ |
arch | (architecture: wallaroo.engine_config.Architecture) | Sets the CPU architecture for model deployment. IMPORTANT NOTE: Model architecture should be set during the model upload process. Deployment configurations inherit the model architecture setting by default, and changes from the model’s architecture setting in the deployment configuration are discouraged. For more details, see Model Uploads and Registrations. Available options are: | √ |
accel | (architecture: wallaroo.engine_config.Acceleration) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. | √ |
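The memory_spec format described in the table, “{size as number}{unit value}”, can be split into its two parts before it is passed to memory or lb_memory. This is a minimal sketch: split_memory_spec is a hypothetical helper, not a Wallaroo SDK function, and it does not check the unit against the accepted unit values.

```python
import re

# a number (optionally with a decimal part) followed by a unit made of letters
_MEMORY_SPEC = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]+)$")


def split_memory_spec(memory_spec):
    """Split a spec such as '2Gi' into (size, unit) without validating the unit."""
    match = _MEMORY_SPEC.match(memory_spec.strip())
    if match is None:
        raise ValueError(f"not in '{{size}}{{unit}}' form: {memory_spec!r}")
    size, unit = match.groups()
    return float(size), unit


print(split_memory_spec("2Gi"))
```

A spec with no unit, such as "2", or with the unit first, such as "Gi2", raises ValueError.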
Containerized Runtime Configuration Methods
Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows: | |
sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. The parameters are as follows: | |
sidekick_env | (model: wallaroo.model.Model, environment: Dict[str, str]) | Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance. | |
sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If called, then deployment_label must be called and match the GPU nodepool for the cluster hosting the Wallaroo instance. | √ |
sidekick_arch | (architecture: wallaroo.engine_config.Architecture) | Sets the CPU architecture for the deployment. This defaults to X86. Available options are: | √ |
Deployment Replicas and Autoscale
Wallaroo supports deployment replicas and autoscaling.
Each replica is allocated the same number of CPUs, GPUs, and memory. The deployment load balancer distributes inference requests across the replicas to provide consistent service. The number of replicas is set in one of two mutually exclusive ways:
- Set number of replicas: Sets the number of replicas to a constant number via the replica_count(count) deployment setting. This is recommended for use cases where inference requests are steady over time. The number of replicas can be changed by creating a new deployment configuration and deploying again with the new deployment configuration.
- Autoscale: The number of replicas changes depending on the CPU utilization. This is recommended when the inference requests increase or decrease, and provides organizations with a method to decrease the resources needed for a deployment and increase as required.
Autoscale replicas are controlled via the following deployment configuration settings.
- replica_autoscale_min_max(maximum: int, minimum: int = 0): Sets the minimum and maximum replicas to deploy based on the CPU utilization setting.
- autoscale_cpu_utilization(cpu_utilization_percentage: int) (Default: 50): When the CPU utilization reaches the cpu_utilization_percentage across all replicas in aggregate, a new replica is deployed. If CPU utilization across all replicas remains at that level, another replica is deployed until the maximum value set by replica_autoscale_min_max is reached. The number of replicas remains at this number until CPU utilization goes under the autoscale_cpu_utilization value for 60 minutes, then it is reduced one replica at a time as long as CPU utilization remains under the cpu_utilization_percentage value across all replicas in aggregate.
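The scale-up and scale-down rules above can be expressed as a single decision step. The sketch below is a simplified model of the described behavior, not the actual scaler: next_replica_count and its parameters are hypothetical, and it assumes utilization is evaluated at discrete points in time.

```python
def next_replica_count(current, cpu_utilization, threshold=50,
                       minimum=0, maximum=5, minutes_below=0,
                       cooldown_minutes=60):
    """Return the replica count after one evaluation step (simplified model).

    cpu_utilization: average CPU percentage across all replicas.
    minutes_below: how long utilization has stayed under the threshold.
    """
    if cpu_utilization >= threshold:
        # at or above the threshold: add one replica, capped at the maximum
        return min(current + 1, maximum)
    if minutes_below >= cooldown_minutes:
        # under the threshold long enough: remove one replica, floored at the minimum
        return max(current - 1, minimum)
    # under the threshold, but not yet for the full cooldown: no change
    return current


print(next_replica_count(2, cpu_utilization=80))
print(next_replica_count(3, cpu_utilization=20, minutes_below=60))
```

With the default threshold of 50, the first call scales from 2 to 3 replicas and the second scales from 3 down to 2 after 60 minutes under the threshold.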
IMPORTANT NOTE: Autoscale to 0 and CPU Utilization
For use cases where replica_autoscale_min_max has a minimum of 0, cpus must be at least 0.25 or baseline CPU activity will prevent scaling down the final replica to 0.
For example, replica_autoscale_min_max(minimum=0,maximum=5).cpus(0.25).
IMPORTANT NOTE: GPU Allocation and Autoscale
Allocating GPUs to a deployment configuration allocates either the replica_autoscale_min_max minimum number of replicas or the maximum number of replicas for any inference traffic.
For example, when replica_autoscale_min_max(minimum=0,maximum=5).gpus(1) is applied and any inference request is received, the number of replicas goes from 0 replicas to the replica_autoscale_min_max maximum value of 5.
When all inference requests cease for 60 minutes, the replica allocations drop back down to the replica_autoscale_min_max minimum setting of 0.
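The all-or-nothing behavior described in this note can be modeled in a few lines. This is a sketch of the documented behavior only: gpu_replica_target is a hypothetical function, and passing None for the idle time to mean "no request received yet" is an assumption.

```python
def gpu_replica_target(minimum, maximum, minutes_since_last_request,
                       idle_minutes=60):
    """Return the replica count a GPU autoscale deployment settles on."""
    if minutes_since_last_request is None:
        # assumption: no inference request has been received yet
        return minimum
    if minutes_since_last_request < idle_minutes:
        # any recent inference traffic keeps the deployment at the maximum
        return maximum
    # traffic has ceased for the full idle period
    return minimum


print(gpu_replica_target(0, 5, minutes_since_last_request=1))
print(gpu_replica_target(0, 5, minutes_since_last_request=60))
```

Unlike CPU-based autoscaling, there are no intermediate steps in this model: the target is either the minimum or the maximum.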
Deployment settings, such as the number of gpus, affect the autoscaling feature. The following describes how autoscaling performs based on different use cases.
Replica Type | Behavior | Sample Use Case |
---|---|---|
Static Replicas | The number of replicas remains constant. | Inference requests are a constant over time to provide constant availability. |
CPU Default Scaling | The number of replicas increases or decreases with the default autoscale_cpu_utilization of 50%. | Inference requests fluctuate over time. This allows organizations to optimize their spend and the number of resources allocated to a deployment. |
CPU Scaling | The number of replicas increases or decreases with the autoscale_cpu_utilization of 75%. | Inference requests fluctuate over time, with a manually set CPU utilization before allocating another replica. |
GPU Scaling | The number of replicas goes from the replica_autoscale_min_max minimum value to the maximum value for any inference requests until all inference requests cease for 60 minutes. | Inference requests fluctuate over time. Any inference request increases the number of replicas from the replica_autoscale_min_max minimum value to the replica_autoscale_min_max maximum value to provide full availability during that period. After all inference requests cease for 60 minutes, the number of replicas is reduced to the replica_autoscale_min_max minimum value. |
The following examples autoscale between 0 and 5 replicas with the default autoscale_cpu_utilization of 50%.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
The following examples use a static count of 5 replicas.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_count(5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
The following examples use a static count of 5 replicas with GPUs allocated.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_count(5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime. For this example, 1 GPU is assigned to the Containerized Runtime, and 0 GPUs are assigned to the Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
Note that this configuration allocates 2 GPUs to the model deployment per replica: one for the Native Runtime and one for the model deployed in the Containerized Runtime. Only one deployment label is required.
(wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
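Because every replica receives the same allocation, the GPUs a configuration requests from the nodepool can be estimated ahead of deployment. This is a rough sketch under that assumption: total_gpus is a hypothetical helper, not an SDK call, and it simply multiplies the per-replica GPU count by the number of replicas.

```python
def total_gpus(replicas, native_gpus=0, sidekick_gpus=()):
    """Estimate the GPUs requested across all replicas of a deployment.

    native_gpus: value passed to gpus() for the Native Runtime.
    sidekick_gpus: one value per containerized model, as passed to sidekick_gpus().
    """
    per_replica = native_gpus + sum(sidekick_gpus)
    return replicas * per_replica


# combined runtimes example: 5 replicas, 1 native GPU and 1 containerized GPU each
print(total_gpus(5, native_gpus=1, sidekick_gpus=[1]))
```

Comparing this estimate with the number of GPUs available in the labeled nodepool gives an early indication of whether the deployment is likely to fail for lack of GPUs.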
The following examples autoscale between 0 and 5 replicas with autoscale_cpu_utilization set to 75%.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build())
The following examples autoscale between 0 and 5 replicas with GPUs allocated.
- Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())
- Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .sidekick_gpus(model, 1)
    .deployment_label('doc-gpu-label:true')
    .build())
- Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
Note: This deployment will use 2 GPUs per replica: one for the Native Runtime and one for the Containerized Runtime. Only one deployment label is required.
(wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .gpus(1)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .sidekick_gpus(model, 1)
    .deployment_label('doc-gpu-label:true')
    .build())
Inference from Zero Scaled Deployments
For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0 and replicas scale down to zero when there is no utilization for 60 minutes. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.
When inferencing in this scenario, including within ML Workload Orchestrations, a timeout may occur waiting for the first replica to finish spooling up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.
The following deploys a pipeline with 4 CPUs and 3Gi of RAM per replica, with autoscaling set between 0 and 5 replicas.
Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, an inference request is attempted and the process waits 10 seconds before checking the status again. Once the pipeline is running, inferences proceed as normal.
import time
import wallaroo

# deployment configuration with autoscaling between 0 and 5 replicas
deployment_configuration = (wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build())

# deployment with the deployment configuration
pipeline.deploy(deployment_config=deployment_configuration)

# keep sending a wake-up inference until the deployment has the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference
        pipeline.infer(dataframe)
    except Exception:
        # ignore the error while the first replica is still starting
        pass
    # wait 10 seconds before checking the status again
    time.sleep(10)

# once the deployment is running, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)
...
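The loop above waits indefinitely if the deployment never reaches the Running status. The following sketch bounds the number of attempts. The wake_up helper and the FakePipeline stand-in are hypothetical and exist only so the example runs without a Wallaroo instance.

```python
import time


def wake_up(infer, is_running, wait_seconds=10, max_attempts=30, sleep=time.sleep):
    """Send wake-up inferences until is_running() is true; return attempts made."""
    attempts = 0
    while not is_running():
        if attempts >= max_attempts:
            raise TimeoutError(f"deployment not running after {attempts} attempts")
        try:
            infer()
        except Exception:
            # the first replica may still be starting; try again after the wait
            pass
        attempts += 1
        sleep(wait_seconds)
    return attempts


class FakePipeline:
    """Stand-in that starts 'running' once it has received enough requests."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def infer(self):
        self.calls += 1
        if self.calls < self.ready_after:
            raise RuntimeError("replica still starting")

    def is_running(self):
        return self.calls >= self.ready_after


fake = FakePipeline(ready_after=3)
# skip the real delay in this demonstration
attempts = wake_up(fake.infer, fake.is_running, sleep=lambda seconds: None)
print(attempts)
```

With a deployed pipeline, the two callables could wrap the calls shown earlier, for example lambda: pipeline.infer(dataframe) and lambda: pipeline.status()["status"] == 'Running'.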