Pipeline deployment configurations allow tailoring of a pipeline’s resources to match an organization’s and model’s requirements. Pipelines may require more memory, CPU cores, or GPUs to run all of their steps efficiently. Pipeline deployment configurations also allow for multiple replicas of a model in a pipeline to provide scalability.
Setting a pipeline deployment configuration follows this process:

1. Create the deployment configuration through the [`wallaroo.deployment_config.DeploymentConfigBuilder()`](https://docs.wallaroo.ai/20230201/wallaroo-developer-guides/wallaroo-sdk-guides/wallaroo-sdk-reference-guide/deployment_config/#DeploymentConfigBuilder) class.
2. Set the configuration options, then generate the deployment configuration with the `deployment_config.build()` method.
3. Apply the deployment configuration when the pipeline is deployed.

The following example shows a pipeline deployment configuration with 1 replica, 1 CPU, and 2Gi of memory allocated to the pipeline.
```python
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(1) \
    .memory("2Gi") \
    .build()

pipeline.deploy(deployment_config=deployment_config)
```
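After deploying, it can be useful to confirm the pipeline reached a running state before sending inferences. As a minimal sketch, assuming the SDK's `pipeline.status()` method, which returns a dictionary describing the deployment state:

```python
# Sketch: verify the deployment is running before inferencing.
status = pipeline.status()
if status["status"] != "Running":
    raise RuntimeError(f"Pipeline is not ready: {status['status']}")
```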
Pipeline resources can be configured with autoscaling. Autoscaling allows the user to define how many engines a pipeline starts with, the minimum number of engines a pipeline uses, and the maximum number of engines a pipeline can scale to. The pipeline scales up and down based on the average CPU utilization across the engines in a given pipeline as the user’s workload increases and decreases.
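For example, the following sketch combines `replica_autoscale_min_max` and `autoscale_cpu_utilization` from the table below. The specific values (0 to 5 replicas, a 60% average CPU target) are placeholder assumptions, not recommendations:

```python
# Sketch: autoscale between 0 and 5 replicas, adding or removing
# replicas when average CPU utilization crosses the 60% target.
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(maximum=5, minimum=0) \
    .autoscale_cpu_utilization(60) \
    .cpus(1) \
    .memory("1Gi") \
    .build()
```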
Pipeline deployment configurations deal with two major components:

* **Native runtimes**: models deployed directly to the Wallaroo engine.
* **Containerized runtimes**: image-based models (e.g. MLFlow models) deployed in their own sidekick containers.

These configurations can be mixed: both native runtimes and containerized runtimes can be deployed to the same pipeline, with resources allocated to each runtime in different configurations, as shown in the final example below.
The following resource configurations are available through the `wallaroo.deployment_config` object.
CPUs are allocated in fractions of total CPU power, similar to the Kubernetes CPU definitions. `cpus(0.25)`, `cpus(1.0)`, etc. are valid values.
GPUs can only be allocated in whole integer units from the GPU-enabled nodepools. `gpus(1)`, `gpus(2)`, etc. are valid values, while `gpus(0.25)` is not.
Organizations should be aware of how many GPUs are allocated to the cluster. If all GPUs are already allocated to other pipelines, or if there are not enough GPUs to fulfill the request, the pipeline deployment will fail and return an error message.
Wallaroo 2023.2.1 and above supports Kubernetes nodepools with NVIDIA CUDA GPUs.
See the Create GPU Nodepools for Kubernetes Clusters guide for instructions on adding GPU-enabled nodepools to a Kubernetes cluster.
When GPUs are allocated, the `deployment_label` configuration option must be used, and must match the GPU nodepool label.

Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
replica_count | (count: int) | The number of replicas of the pipeline to deploy. This allows for multiple deployments of the same models to be deployed to increase inferences through parallelization. | √ |
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Provides replicas to be scaled from 0 to some maximum number of replicas. This allows pipelines to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. | √ |
autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU percentage metric for when to load or unload another replica. | √ |
disable_autoscale | | Disables autoscaling in the deployment configuration. | |
cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the pipeline, for example: `0.25`, `1`, `1.5`, etc. The units are similar to the Kubernetes CPU definitions. | |
gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which pipeline deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to fulfill a pipeline deployment configuration, an error message is returned when the pipeline is deployed. If `gpus` is called, then `deployment_label` must also be called and match the GPU nodepool for the Kubernetes cluster hosting the Wallaroo instance. | √ |
memory | (memory_spec: str) | Sets the amount of RAM to allocate to the pipeline. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are the Kubernetes memory resource units, e.g. `Ki`, `Mi`, `Gi`, `Ti`. | |
lb_cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the pipeline’s load balancer, for example: `0.25`, `1`, `1.5`, etc. The units are similar to the Kubernetes CPU definitions. | |
lb_memory | (memory_spec: str) | Sets the amount of RAM to allocate to the pipeline’s load balancer. The memory_spec string is in the format “{size as number}{unit value}”. The accepted unit values are the Kubernetes memory resource units, e.g. `Ki`, `Mi`, `Gi`, `Ti`. | |
deployment_label | (label: str) | Label used to match the Kubernetes nodepool. Required if `gpus` are set, and must match the GPU nodepool label. | √ |
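The load balancer and autoscaling toggles above do not appear in the worked examples later on this page, so here is a minimal sketch. The resource values (0.25 CPU and 512Mi for the load balancer) are placeholder assumptions to adjust for the actual workload:

```python
# Sketch: a fixed-size deployment with autoscaling disabled and
# explicit load balancer resources alongside the engine resources.
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .disable_autoscale() \
    .cpus(1) \
    .memory("1Gi") \
    .lb_cpus(0.25) \
    .lb_memory("512Mi") \
    .build()
```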
The following configuration options apply to containerized runtimes, which run each image-based model in its own sidekick container.

Method | Parameters | Description | Enterprise Only Feature |
---|---|---|---|
sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. `model`: the sidekick model to configure; `core_count`: the number of CPU cores to use. | |
sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. `model`: the sidekick model to configure; `memory_spec`: the amount of memory to allocate, in Kubernetes memory resource units. | |
sidekick_env | (model: wallaroo.model.Model, environment: Dict[str, str]) | Environment variables submitted to the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. These are used specifically for containerized models that have environment variables that affect their performance. | |
sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which pipeline deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to fulfill a pipeline deployment configuration, an error message is returned when the pipeline is deployed. If called, then `deployment_label` must also be called and match the GPU nodepool for the Kubernetes cluster hosting the Wallaroo instance. | √ |
The following sets the native runtime deployment to one quarter of a CPU with 1Gi of RAM:
```python
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.25).memory('1Gi') \
    .build()
```
This example sets the replica count to 1, then sets the autoscale to vary between 2 and 5 replicas depending on need, with 1 CPU and 1Gi of RAM allocated per replica.
```python
deploy_config = (wallaroo.DeploymentConfigBuilder()
                 .replica_count(1)
                 .replica_autoscale_min_max(minimum=2, maximum=5)
                 .cpus(1)
                 .memory("1Gi")
                 .build()
                 )
```
The following configuration allocates 1 GPU to the pipeline for native runtimes.
```python
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.25) \
    .memory('1Gi') \
    .gpus(1) \
    .deployment_label('doc-gpu-label:true') \
    .build()
```
The following configuration allocates 0.25 CPU and 1Gi RAM to the containerized runtime `sm_model`, and passes that runtime environment variables used for timeout settings.
```python
deployment_config = DeploymentConfigBuilder() \
    .sidekick_cpus(sm_model, 0.25) \
    .sidekick_memory(sm_model, '1Gi') \
    .sidekick_env(sm_model, {"GUNICORN_CMD_ARGS": "--timeout=188 --workers=1"}) \
    .build()
```
This example shows allocating 1 GPU to the containerized runtime model `sm_model`.
```python
deployment_config = DeploymentConfigBuilder() \
    .sidekick_gpus(sm_model, 1) \
    .deployment_label('doc-gpu-label:true') \
    .sidekick_memory(sm_model, '1Gi') \
    .build()
```
The following configuration allocates one GPU to the pipeline for native runtimes and another GPU to the containerized runtime `sm_model`, for a total of two GPUs allocated to the pipeline.
```python
deployment_config = DeploymentConfigBuilder() \
    .cpus(0.25) \
    .memory('1Gi') \
    .gpus(1) \
    .sidekick_gpus(sm_model, 1) \
    .deployment_label('doc-gpu-label:true') \
    .build()
```