Deployment Configuration

How to manage deployment configurations

Deployment configurations tailor a model deployment to an organization’s and a model’s requirements. A deployment may require more memory, CPU cores, or GPUs to run all its steps efficiently. Deployment configurations also allow multiple replicas of a model in a deployment to provide scalability.

Create Deployment Configuration

Setting a deployment configuration follows this process:

  1. Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
  2. Once the configuration options are set, the deployment configuration is built with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
  3. The deployment configuration is applied during model deployment via the wallaroo.pipeline.Pipeline.deploy method.

The following example shows a model deployment configuration that allocates 1 replica, 1 CPU, and 2Gi of memory. We start by:

  • Importing the DeploymentConfigBuilder class
  • Setting the deployment configuration settings
  • Building the deployment configuration and saving it to a variable
  • Applying that deployment configuration when we deploy the pipeline
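from wallaroo.deployment_config import DeploymentConfigBuilder

# wrap the chained calls in parentheses so they can span multiple lines
deployment_config = (
    DeploymentConfigBuilder()
    .replica_count(1)
    .cpus(1)
    .memory("2Gi")
    .build()
)

# `pipeline` is an existing wallaroo.pipeline.Pipeline
pipeline.deploy(deployment_config=deployment_config)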

Deployment resources can be configured with autoscaling. Autoscaling lets the user define how many engines a deployment starts with, the minimum number of engines it uses, and the maximum number of engines it can scale to. The deployment scales up and down based on the average CPU utilization across its engines as the user’s workload increases and decreases.

Deployment Resource Configurations

Deployment configurations deal with two major components:

  • Native Runtimes: Models that are deployed “as is” with the Wallaroo engine (ONNX, etc).
  • Containerized Runtimes: Models that are packaged into a container then deployed as a container with the Wallaroo engine (MLFlow, etc).

These configurations can be mixed: native runtimes and containerized runtimes can be deployed together, with resources allocated to each runtime in different configurations.

The following resource configurations are available through the wallaroo.deployment_config object.

GPU and CPU Allocation

CPUs are allocated in fractions of total CPU power, similar to the Kubernetes CPU definitions. For example, cpus(0.25) and cpus(1.0) are valid values.

GPUs can only be allocated in whole integer units from the GPU enabled nodepools. For example, gpus(1) and gpus(2) are valid values, while gpus(0.25) is not.

Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other deployments, or if there are not enough GPUs to fulfill the request, the deployment will fail and return an error message.
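
The allocation rules above can be checked before a deployment is attempted. The following is a minimal sketch and not part of the Wallaroo SDK: the function name validate_allocation and the available_gpus argument are hypothetical, and the sketch assumes the caller already knows how many GPUs are free in the cluster.

```python
from numbers import Real


def validate_allocation(cpus: float, gpus: int, available_gpus: int) -> None:
    """Raise ValueError if a request breaks the CPU/GPU rules described above.

    Hypothetical helper: `available_gpus` is supplied by the caller; this
    sketch does not query the cluster.
    """
    # CPUs may be fractional (0.25, 1.0, ...) but must be positive.
    if isinstance(cpus, bool) or not isinstance(cpus, Real) or cpus <= 0:
        raise ValueError(f"cpus must be a positive number, got {cpus!r}")
    # GPUs are only allocated in whole units.
    if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 0:
        raise ValueError(f"gpus must be a non-negative integer, got {gpus!r}")
    # A deployment fails when the cluster cannot fulfill the GPU request.
    if gpus > available_gpus:
        raise ValueError(
            f"requested {gpus} GPUs but only {available_gpus} are available"
        )


validate_allocation(cpus=0.25, gpus=1, available_gpus=2)  # passes silently
```

A fractional GPU request such as validate_allocation(1.0, 0.25, 2) raises ValueError in this sketch, mirroring the rule that gpus(0.25) is not a valid value.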

GPU Support

Wallaroo 2023.2.1 and above supports Kubernetes nodepools with NVIDIA CUDA GPUs.

See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU enabled nodepools to a Kubernetes cluster.

Architecture Support

Wallaroo supports x86 and ARM architecture CPUs. For example, Azure supports Ampere® Altra® Arm-based processor included with the following virtual machines:

Model Deployment Architecture Inheritance

Deployment configurations inherit the model’s architecture setting. This is set during model upload by specifying the arch parameter. By default, models uploaded to Wallaroo use the x86 architecture.

The following model operations inherit the model’s architecture setting.

The following example shows uploading a model with the architecture set to ARM, and how the deployment inherits that architecture without additional deployment configuration changes. For this example, an ONNX model is uploaded.

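import wallaroo
from wallaroo.framework import Framework

# connect to the Wallaroo instance
wl = wallaroo.Client()

# model_name_arm and model_file_name hold the model name and the path to the ONNX file
housing_model_control_arm = (
    wl.upload_model(
        model_name_arm,
        model_file_name,
        framework=Framework.ONNX,
        arch=wallaroo.engine_config.Architecture.ARM,
    ).configure(tensor_fields=["tensor"])
)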

display(housing_model_control_arm)
| Field | Value |
|---|---|
| Name | house-price-estimator-arm |
| Version | 163ff0a9-0f1a-4229-bbf2-a19e4385f10f |
| File Name | rf_model.onnx |
| SHA | e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6 |
| Status | ready |
| Image Path | None |
| Architecture | arm |
| Acceleration | None |
| Updated At | 2024-04-Mar 20:34:00 |

Note that no architecture is specified in the deployment configuration settings. When pipeline_arm is displayed, we see that its arch setting is inherited from the model’s arch setting.

pipeline_arm = wl.build_pipeline(arm_pipeline_name)

# set the model step with the ARM targeted model
pipeline_arm.add_model_step(housing_model_control_arm)

#minimum deployment config for this model
deploy_config = wallaroo.DeploymentConfigBuilder().replica_count(1).cpus(1).memory("1Gi").build()

pipeline_arm.deploy(deployment_config = deploy_config)

    Waiting for deployment - this will take up to 45s .......... ok

display(pipeline_arm)
| Field | Value |
|---|---|
| name | architecture-demonstration-arm |
| created | 2024-03-04 20:34:08.895396+00:00 |
| last_updated | 2024-03-04 21:52:01.894671+00:00 |
| deployed | True |
| arch | arm |
| accel | None |
| tags | |
| versions | 55d834b4-92c8-4a93-b78b-6a224f17f9c1, 98821b85-401a-4ab5-af8e-1b3126727069, 74571863-9eb0-47aa-8b5a-3bdaa7aa9f03, b72fb0db-e4b4-4936-a7cb-3d0fb7827a6f, 3ae70818-10f3-4f61-a998-dee5e2f00daf |
| steps | house-price-estimator-arm |
| published | True |
pipeline_arm.status()

    {'status': 'Running',
     'details': [],
     'engines': [{'ip': '10.124.0.45',
       'name': 'engine-5d94d89b5d-gbr9h',
       'status': 'Running',
       'reason': None,
       'details': [],
       'pipeline_statuses': {'pipelines': [{'id': 'architecture-demonstration-arm',
          'status': 'Running'}]},
       'model_statuses': {'models': [{'config': {'batch_config': None,
           'filter_threshold': None,
           'id': 76,
           'input_schema': None,
           'model_version_id': 43,
           'output_schema': None,
           'runtime': 'onnx',
           'sidekick_uri': None,
           'tensor_fields': ['tensor']},
          'model_version': {'conversion': {'arch': 'arm',
            'framework': 'onnx',
            'python_version': '3.8',
            'requirements': []},
           'file_info': {'file_name': 'rf_model.onnx',
            'sha': 'e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6',
            'version': '163ff0a9-0f1a-4229-bbf2-a19e4385f10f'},
           'id': 43,
           'image_path': None,
           'name': 'house-price-estimator-arm',
           'status': 'ready',
           'task_id': None,
           'visibility': 'private',
           'workspace_id': 62},
          'status': 'Running'}]}}],
     'engine_lbs': [{'ip': '10.124.0.44',
       'name': 'engine-lb-d7cc8fc9c-4s9fc',
       'status': 'Running',
       'reason': None,
       'details': []}],
     'sidekicks': []}
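
The inherited architecture can also be read programmatically from the status output. The following is a minimal sketch that assumes the nested dictionary layout shown in the sample output above; the helper name model_architectures is hypothetical, and sample_status is a trimmed copy of that output rather than a live call to pipeline_arm.status().

```python
from typing import Any, Dict, List


def model_architectures(status: Dict[str, Any]) -> List[str]:
    """Collect the `arch` value of every model reported in a status dict.

    Assumes the nested layout shown in the sample output above; missing or
    empty sections are skipped rather than raising.
    """
    archs = []
    for engine in status.get("engines") or []:
        models = (engine.get("model_statuses") or {}).get("models") or []
        for model in models:
            conversion = (model.get("model_version") or {}).get("conversion") or {}
            if "arch" in conversion:
                archs.append(conversion["arch"])
    return archs


# Trimmed copy of the sample status shown above.
sample_status = {
    "status": "Running",
    "engines": [
        {
            "model_statuses": {
                "models": [
                    {"model_version": {"conversion": {"arch": "arm", "framework": "onnx"}}}
                ]
            }
        }
    ],
    "sidekicks": [],
}

print(model_architectures(sample_status))  # ['arm']
```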

Deployment Configuration Defaults

Deployment configurations default to the following*.

| Runtime | CPUs | Memory | GPUs |
|---|---|---|---|
| Wallaroo Native Runtime** | 4 | 3 Gi | 0 |
| Wallaroo Containerized Runtime*** | 2 | 1 Gi | 0 |

*: For Kubernetes limits and requests.
**: Resources are always allocated for the Wallaroo Native Runtime engine even if there are no Wallaroo Native Runtimes included in the deployment, so it is recommended to decrease these resources when pipelines use Containerized Runtimes.
***: Resources for Wallaroo Containerized Runtimes only apply when a Wallaroo Containerized Runtime is part of the deployment.
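
To illustrate the footnotes above, the following sketch adds up the default resources requested per replica. It is a worked example only: the function name is hypothetical, and it assumes that each containerized model receives the containerized default on top of the native engine allocation, which is always present.

```python
# Defaults taken from the table above (per replica).
NATIVE_DEFAULT = {"cpus": 4.0, "memory_gi": 3.0, "gpus": 0}
CONTAINERIZED_DEFAULT = {"cpus": 2.0, "memory_gi": 1.0, "gpus": 0}


def default_request_per_replica(containerized_models: int) -> dict:
    """Sum the default resources for one replica.

    Assumption for illustration: every containerized model adds one
    containerized default allocation; the native engine allocation is
    always included, even with no native models.
    """
    total = dict(NATIVE_DEFAULT)
    for key, value in CONTAINERIZED_DEFAULT.items():
        total[key] += value * containerized_models
    return total


print(default_request_per_replica(0))  # {'cpus': 4.0, 'memory_gi': 3.0, 'gpus': 0}
print(default_request_per_replica(1))  # {'cpus': 6.0, 'memory_gi': 4.0, 'gpus': 0}
```

Under these assumptions, a deployment with a single containerized model and no native models still requests 6 CPUs and 4 Gi per replica unless the native engine resources are lowered.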

Native Runtime Configuration Methods

| Method | Parameters | Description | Enterprise Only Feature |
|---|---|---|---|
| replica_count | (count: int) | The number of replicas to deploy. Deploying multiple replicas of the same models increases inference throughput through parallelization. | |
| replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Allows replicas to scale from 0 up to some maximum number of replicas. Deployments spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | |
| autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU utilization percentage at which another replica is loaded or unloaded. | |
| disable_autoscale | | Disables autoscaling in the deployment configuration. | |
| cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. | |
| gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must also be called and must match the GPU nodepool for the cluster hosting the Wallaroo instance. | |
| memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format. | |
| lb_cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment’s load balancer, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. | |
| lb_memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment’s load balancer. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format. | |
| deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set, and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. | |
| arch | (architecture: wallaroo.engine_config.Architecture) | Sets the CPU architecture for model deployment. IMPORTANT NOTE: Model architecture should be set during the model upload process. Deployment configurations inherit the model architecture setting by default, and changes from the model’s architecture setting in the deployment configuration are discouraged. For more details, see Model Uploads and Registrations. Available options are: wallaroo.engine_config.Architecture.X86 (Default); wallaroo.engine_config.Architecture.ARM. | |
| accel | (architecture: wallaroo.engine_config.Acceleration) | Sets the AI hardware accelerator for model deployment. IMPORTANT NOTE: Model acceleration should be set during the model upload process. Deployment configurations inherit the model acceleration setting by default, and changes from the model’s acceleration setting in the deployment configuration are discouraged. For more details, see Model Uploads and Registrations. Available options are: wallaroo.engine_config.Acceleration._None (Default): no accelerator is assigned, which works for all infrastructures; wallaroo.engine_config.Acceleration.AIO: AIO acceleration for Ampere Optimized trained models, only available with ARM processors; wallaroo.engine_config.Acceleration.Jetson: Nvidia Jetson acceleration used with edge deployments with ARM processors; wallaroo.engine_config.Acceleration.CUDA: Nvidia Cuda acceleration supported by both ARM and X64/X86 processors, intended for deployment with GPUs. | |
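
The “{size as number}{unit value}” format used by memory and lb_memory can be illustrated with a small parser. This is a sketch for illustration, not the SDK’s own validation: the function name is hypothetical, and accepting both the short Kubernetes-style suffix used in this guide’s examples (such as "2Gi") and the longer form listed in the table (such as "2GiB") is an assumption.

```python
import re

# Binary multipliers for each unit prefix.
_UNIT_BYTES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
# A number followed by a unit, with an optional trailing "B" (assumption).
_SPEC = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)B?$")


def memory_spec_to_bytes(memory_spec: str) -> int:
    """Convert a memory spec string such as "2Gi" into a byte count."""
    match = _SPEC.match(memory_spec.strip())
    if match is None:
        raise ValueError(f"unrecognized memory spec: {memory_spec!r}")
    size, unit = match.groups()
    return int(float(size) * _UNIT_BYTES[unit])


print(memory_spec_to_bytes("2Gi"))     # 2147483648
print(memory_spec_to_bytes("512MiB"))  # 536870912
```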

Containerized Runtime Configuration Methods

| Method | Parameters | Description | Enterprise Only Feature |
|---|---|---|---|
| sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. Parameters: model is the sidekick model to configure; core_count is the number of CPU cores to use in this sidekick. | |
| sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. Parameters: model is the sidekick model to configure; memory_spec is the amount of memory to allocate, as a memory unit value. The accepted unit values are KiB (for KiloBytes), MiB (for MegaBytes), GiB (for GigaBytes), and TiB (for TeraBytes). The values are similar to the Kubernetes memory resource units format. | |
| sidekick_env | (model: wallaroo.model.Model, environment: Dict[str, str]) | Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance. | |
| sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have gpus allocated to ensure they do not run out. If there are not enough gpus to allocate to a deployment configuration, an error message is returned during deployment. If called, then deployment_label must also be called and must match the GPU nodepool for the cluster hosting the Wallaroo instance. | |
| sidekick_arch | (architecture: wallaroo.engine_config.Architecture) | Sets the CPU architecture for the deployment. This defaults to X86. Available options are: wallaroo.engine_config.Architecture.X86; wallaroo.engine_config.Architecture.ARM. | |

Deployment Replicas and Autoscale

Wallaroo supports deployment replicas and autoscaling.

Each replica is allocated the same number of cpus, gpus, and amount of memory. The deployment load balancer distributes inference requests across the replicas to provide consistent service. The number of replicas is set in one of two mutually exclusive ways:

  • Set number of replicas: Sets the number of replicas to a constant number via the replica_count(count) deployment setting. This is recommended for use cases where inference requests are steady over time. The number of replicas can be changed by creating a new deployment configuration and deploying again with the new deployment configuration.
  • Autoscale: The number of replicas changes depending on the cpu utilization. This is recommended when the volume of inference requests rises and falls over time, and lets organizations decrease the resources used by a deployment and increase them as required.

Autoscale replicas are controlled via the following deployment configuration settings.

  • replica_autoscale_min_max(maximum: int, minimum: int = 0): Sets the minimum and maximum replicas to deploy based on the cpu utilization setting.
  • autoscale_cpu_utilization(cpu_utilization_percentage: int) (Default: 50): When the CPU utilization across all replicas in aggregate reaches cpu_utilization_percentage, a new replica is deployed. If CPU utilization remains at that level, another replica is deployed, until the replica_autoscale_min_max maximum is reached. The number of replicas stays at this level until CPU utilization remains under the autoscale_cpu_utilization value for 60 minutes; replicas are then removed one at a time as long as CPU utilization remains under cpu_utilization_percentage across all replicas in aggregate.

IMPORTANT NOTE: Autoscale to 0 and CPU Utilization

For use cases where replica_autoscale_min_max has the minimum of 0, cpus must be at least 0.25 or baseline CPU activity will prevent scaling down the final replica to 0.

For example, replica_autoscale_min_max(minimum=0,maximum=5).cpus(0.25).
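
The CPU-based scaling behavior described above can be summarized as a single decision step. The following is a simplified model for illustration, not the autoscaler’s actual implementation: the function and parameter names are hypothetical, and it does not model the scale-up from zero replicas that is triggered by a new inference request.

```python
def next_replica_count(
    current: int,
    cpu_utilization: float,
    minimum: int = 0,
    maximum: int = 5,
    target_utilization: float = 50,
    minutes_below_target: float = 0,
    scale_down_minutes: float = 60,
) -> int:
    """Return the replica count after one scaling decision.

    Simplified model of the behavior described above:
    - at or above the target utilization, add one replica up to `maximum`;
    - after `scale_down_minutes` below the target, remove one replica
      down to `minimum`;
    - otherwise keep the current count.
    """
    if cpu_utilization >= target_utilization:
        return min(current + 1, maximum)
    if minutes_below_target >= scale_down_minutes:
        return max(current - 1, minimum)
    return current


print(next_replica_count(2, cpu_utilization=80, target_utilization=75))  # 3
print(next_replica_count(5, cpu_utilization=90, target_utilization=75))  # 5
print(next_replica_count(3, cpu_utilization=10, target_utilization=75,
                         minutes_below_target=60))  # 2
```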

IMPORTANT NOTE: GPU Allocation and Autoscale

When GPUs are allocated in a deployment configuration, the deployment runs at either the replica_autoscale_min_max minimum number of replicas or the maximum number of replicas: any inference traffic scales it to the maximum.

For example, when replica_autoscale_min_max(minimum=0,maximum=5).gpus(1) is applied and any inference request is received, the number of replicas scales from 0 to the replica_autoscale_min_max maximum value of 5.

When all inference requests cease for 60 minutes, the replica allocations drop back down to the replica_autoscale_min_max’s minimum setting of 0.
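
The GPU behavior described in this note reduces to a two-state rule. The sketch below models it under the assumptions stated in the text; the function name and the minutes_since_last_request argument are hypothetical and are not SDK parameters.

```python
from typing import Optional


def gpu_replica_count(
    minimum: int,
    maximum: int,
    minutes_since_last_request: Optional[float],
    idle_minutes: float = 60,
) -> int:
    """Model the replica count of a GPU deployment with autoscaling.

    Any inference traffic holds the deployment at `maximum`; once requests
    have ceased for `idle_minutes`, it returns to `minimum`.
    `None` means no inference request has been received yet.
    """
    if minutes_since_last_request is None:
        return minimum
    if minutes_since_last_request < idle_minutes:
        return maximum
    return minimum


print(gpu_replica_count(0, 5, minutes_since_last_request=None))  # 0
print(gpu_replica_count(0, 5, minutes_since_last_request=5))     # 5
print(gpu_replica_count(0, 5, minutes_since_last_request=60))    # 0
```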

Deployment settings, such as the number of gpus, affect the autoscaling feature. The following describes how autoscaling performs based on different use cases.

| Replica Type | Behavior | Sample Use Case |
|---|---|---|
| Static Replicas | The number of replicas remains constant. | Inference requests are constant over time, and the deployment provides constant availability. |
| CPU Default Scaling | The number of replicas increases or decreases with the default autoscale_cpu_utilization of 50%. | Inference requests fluctuate over time. This allows organizations to optimize their spend and the number of resources allocated to a deployment. |
| CPU Scaling | The number of replicas increases or decreases with the autoscale_cpu_utilization of 75%. | Inference requests fluctuate over time, with a manually set CPU utilization before allocating another replica. |
| GPU Scaling | The number of replicas goes from the replica_autoscale_min_max minimum value to the maximum value for any inference requests, until all inference requests cease for 60 minutes. | Inference requests fluctuate over time. Any inference request increases the number of replicas from the replica_autoscale_min_max minimum value to its maximum value to provide full availability during that period. After all inference requests cease for 60 minutes, the number of replicas is reduced to the replica_autoscale_min_max minimum value. |

CPU Default Scaling:

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

Static Replicas:

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .replica_count(5)
    .build()
)
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

Static Replicas with GPUs:

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_count(5)
    .build()
)
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime. For this example, 1 gpu is assigned to the Containerized Runtime, and 0 gpus are assigned to the Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

Note that this configuration allocates 2 gpus to the model deployment per replica - one for the Native Runtime and one to the model deployed in the Containerized Runtime. Only one deployment label is required.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .replica_count(5)
    .cpus(2)
    .memory('2Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .sidekick_gpus(model, 1)
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)
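
Because GPUs are allocated per replica, the cluster-wide GPU demand of a configuration like the one above grows with the replica count. The following worked example is a sketch only: the function name is hypothetical, and it assumes every replica receives the full native and containerized GPU allocation.

```python
from typing import Sequence


def total_gpus(replicas: int, native_gpus: int, sidekick_gpus: Sequence[int]) -> int:
    """Total GPUs requested across all replicas.

    Assumes each replica is allocated the native runtime GPUs plus the GPUs
    of every containerized (sidekick) model.
    """
    per_replica = native_gpus + sum(sidekick_gpus)
    return replicas * per_replica


# 5 replicas, 1 GPU for the Native Runtime, 1 GPU for one containerized model.
print(total_gpus(replicas=5, native_gpus=1, sidekick_gpus=[1]))  # 10
```

Under this assumption, the combined configuration above needs 10 GPUs available in the labeled nodepool when all 5 replicas are running.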

CPU Scaling:

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .build()
)
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .build()
)

GPU Scaling:

  • Native Runtime Deployment Configuration: All models run on Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .gpus(1)
    .deployment_label('doc-gpu-label:true')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)
  • Containerized Runtime Deployment Configuration: A model is deployed to the Wallaroo Containerized Runtime with no models deployed to the Wallaroo Native Runtime.
deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .cpus(0.25)
    .memory('1Gi')
    .gpus(0)
    .sidekick_cpus(model, 4)
    .sidekick_memory(model, '3Gi')
    .sidekick_gpus(model, 1)
    .deployment_label('doc-gpu-label:true')
    .build()
)
  • Combined Runtimes Deployment Configuration: One or more models run in the Wallaroo Native Runtime and one or more models run in the Wallaroo Containerized Runtime.

Note: This deployment will use 2 GPUs - one for the native runtime, one for the containerized runtime. Only one deployment label is required.

deployment_config = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .gpus(1)
    .cpus(2)
    .memory('2Gi')
    .sidekick_cpus(model, 3)
    .sidekick_memory(model, '3Gi')
    .sidekick_gpus(model, 1)
    .deployment_label('doc-gpu-label:true')
    .build()
)

Inference from Zero Scaled Deployments

For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0 and replicas scale down to zero when there is no utilization for 60 minutes. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.

When inferencing in this scenario, including within ML Workload Orchestrations, a timeout may occur waiting for the first replica to finish spooling up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.

The following deploys a pipeline with 4 cpus and 3Gi of RAM per replica, with autoscaling set between 0 and 5 replicas and a CPU utilization threshold of 75%.

Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling. Once an inference completes successfully, the inferences proceed as normal.

import time

import wallaroo

# deployment configuration with autoscaling between 0 and 5 replicas
deployment_configuration = (
    wallaroo.DeploymentConfigBuilder()
    .autoscale_cpu_utilization(75)
    .cpus(4)
    .memory('3Gi')
    .replica_autoscale_min_max(minimum=0, maximum=5)
    .build()
)

# deployment with the deployment configuration
pipeline.deploy(deployment_config=deployment_configuration)

# verify deployment has the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference to wake up the pipeline
        pipeline.infer(dataframe)
    except Exception:
        # the first replica may still be scaling up; ignore the error and retry
        pass
    # wait 10 seconds before attempting the inference again
    time.sleep(10)
# when the inference passes successfully, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)
...
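
The wake-up pattern above loops until the pipeline reports Running, with no upper bound on the wait. A bounded variant can be sketched as a small retry helper. This is an illustrative sketch rather than part of the Wallaroo SDK: the helper name, the attempts and delay_seconds defaults, and the injectable sleep argument are all assumptions, and the inference call is passed in as a plain callable.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def infer_with_wake_up(
    infer: Callable[[], T],
    attempts: int = 12,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `infer` until it succeeds or `attempts` is exhausted.

    Hypothetical helper: `infer` is any zero-argument callable that raises
    while the first replica is still scaling up.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return infer()
        except Exception as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay_seconds)
    raise TimeoutError(
        f"no successful inference after {attempts} attempts"
    ) from last_error
```

With the pipeline and dataframe from the example above, a call might look like infer_with_wake_up(lambda: pipeline.infer(dataframe)); later inferences can then call pipeline.infer directly.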