Autoscaling for LLMs
Wallaroo deployment configurations set what resources are allocated to LLMs for inference requests. Autoscale triggers provide LLMs greater flexibility by:
- Increasing resources allocated to LLMs based on scale up and down triggers. This decreases inference latency when more requests come in, then spools idle resources back down to save on costs.
- Smoothing the allocation of resources through optional autoscaling windows, which scale up and down over a longer period of time to prevent sudden resource spikes and drops.
How Autoscale Triggers Work
Autoscale triggers work through deployment configurations that have minimum and maximum autoscale replicas set by the parameter replica_autoscale_min_max. The default minimum is 0 replicas. Resources are scaled as follows:
- From 0 replicas up: If there are 1 or more inference requests in the queue, 1 replica is spun up to process the requests in the queue. Additional replicas are spun up or down based on the autoscale_cpu_utilization setting: when average cpu utilization across all replicas passes the autoscale_cpu_utilization percentage, replicas are added or removed.
- If scale_up_queue_depth is set: autoscale_cpu_utilization is overridden, and replica scaling is based on scale_up_queue_depth. scale_up_queue_depth is the number of requests in the queue plus the requests currently being processed, divided by the number of available replicas, measured over the autoscaling_window (default: 300 seconds). If this threshold is exceeded, additional replicas are spun up.
For example:
- replica_autoscale_min_max is set to (maximum: 5, minimum: 0), with scale_up_queue_depth set to 5 and autoscaling_window left at the default of 300 seconds.
- At the first inference request, that request is placed into a queue while 1 replica is spun up. As other inference requests are submitted, they are placed into the queue.
- Once the first replica is available, the queued inference requests are processed in FIFO (First In First Out) order.
- While processing inference requests, if the average queue depth exceeds the scale_up_queue_depth of 5 for the length of the autoscaling_window or longer, an additional replica is spun up to process requests (the queue-depth arithmetic is sketched below).
- Additional replicas are spun up as long as the average queue depth continues to exceed scale_up_queue_depth over the autoscaling_window (300 seconds, or 5 minutes) after each replica is scaled up.
- Replicas are spun down when the queue average over the autoscaling_window is less than 1.
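The queue-depth calculation that drives these decisions can be illustrated with a short sketch. This is not Wallaroo's internal implementation; it only demonstrates the arithmetic behind scale_up_queue_depth and scale_down_queue_depth, and sample_queue_measurements is a hypothetical set of request counts sampled over the autoscaling_window.
# illustrative sketch of the queue-depth arithmetic behind autoscale decisions
# sample_queue_measurements is a hypothetical series of (requests in the queue +
# requests being processed) counts sampled over the autoscaling_window
sample_queue_measurements = [12, 14, 10, 13, 11]
available_replicas = 2

# average queue depth per available replica over the autoscaling_window
average_queue_depth = (sum(sample_queue_measurements) / len(sample_queue_measurements)) / available_replicas

scale_up_queue_depth = 5
scale_down_queue_depth = 1  # default

if average_queue_depth > scale_up_queue_depth:
    print("scale up: add a replica, up to the replica_autoscale_min_max maximum")
elif average_queue_depth < scale_down_queue_depth:
    print("scale down: remove a replica, down to the replica_autoscale_min_max minimum")
else:
    print("hold: keep the current number of replicas")
With the values above, the average queue depth is 6.0, which exceeds the scale_up_queue_depth of 5, so an additional replica would be spun up.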
Autoscale Triggers Parameters
Wallaroo autoscale triggers are optional deployment configurations with the following parameters. For a complete list of deployment parameters, see Deployment Configurations.
Parameter | Type | Description |
---|---|---|
scale_up_queue_depth | (queue_depth: int) | The threshold at which additional deployment resources are scaled up. Requires that the deployment configuration parameter replica_autoscale_min_max is set. scale_up_queue_depth is determined by the formula (requests in the queue + requests currently being processed) / (number of available replicas), averaged over the autoscaling_window. This field overrides the deployment configuration parameter autoscale_cpu_utilization. The scale_up_queue_depth applies to all resources in the deployment configuration. |
scale_down_queue_depth | (queue_depth: int), Default: 1 | Only applies when scale_up_queue_depth is configured. Scales down resources based on the formula (requests in the queue + requests currently being processed) / (number of available replicas), averaged over the autoscaling_window. |
autoscaling_window | (window_seconds: int) (Default: 300, Minimum allowed: 60) | The period over which resources are scaled up or down. Only applies when scale_up_queue_depth is configured. |
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Sets the minimum (default 0) and maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. |
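Taken together, these parameters are set through wallaroo.DeploymentConfigBuilder. The following is a minimal sketch that combines the autoscale trigger parameters; the cpus and memory values are placeholders, and the builder methods shown match the examples in the next section.
import wallaroo

# sketch: a deployment configuration combining the autoscale trigger parameters
# (the cpus and memory values are placeholders)
autoscale_config = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .scale_up_queue_depth(5) \
    .scale_down_queue_depth(1) \
    .autoscaling_window(300) \
    .build()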
Autoscale Trigger Examples
Autoscale triggers apply to all resources defined in the deployment configuration: cpus, gpus, memory, etc., for both the Wallaroo Native and Wallaroo Containerized Runtimes. See deployment configurations for a breakdown of each runtime.
The following example demonstrates setting two different deployment configurations with different autoscale triggers.
Autoscale triggers are applied to LLM deployments through the following procedure:
- Set the deployment configuration and specify the resources, number of replicas, etc., along with the autoscale triggers.
- Create a Wallaroo pipeline and add the LLM(s) as pipeline steps.
- Deploy the Wallaroo pipeline with the deployment configuration.
In the following example, two deployment configurations are made:
Resource Allocation | Behavior |
---|---|
Sets resources for the LLM llm_gpu with the following allocations: 0 to 5 replicas, 1 cpu and 2 Gi RAM for the Wallaroo Native Runtime, and 1 gpu and 24 Gi RAM for the LLM in the Wallaroo Containerized Runtime. | Scales from 0 up to 5 replicas when the average queue depth over the 600 second autoscaling_window exceeds the scale_up_queue_depth of 5, then scales back down as the queue empties. |
deployment_with_gpu = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_gpus(llm_gpu, 1) \
    .sidekick_memory(llm_gpu, '24Gi') \
    .scale_up_queue_depth(5) \
    .autoscaling_window(600) \
    .build()
Create the Wallaroo pipeline and assign the LLM as a pipeline step.
# create the pipeline
llm_gpu_pipeline = wl.build_pipeline('sample-llm-with-gpu-pipeline')
# add the LLM as a pipeline model step
llm_gpu_pipeline.add_model_step(llm_gpu)
The pipeline is deployed with the deployment_with_gpu deployment configuration.
llm_gpu_pipeline.deploy(deployment_with_gpu)
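With minimum=0 replicas, the deployment may not report a running replica until the first inference request arrives. The deployment status can be checked with the same pipeline.status() call used in the troubleshooting example later on this page:
# check the deployment status; with minimum=0 replicas the status may not be
# 'Running' until the first inference request spins up a replica
print(llm_gpu_pipeline.status()["status"])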
Resource Allocation | Behavior |
---|---|
Sets resources for the LLM llm through the deployment_cpu_based configuration with the following allocations: 0 to 5 replicas, 1 cpu and 2 Gi RAM for the Wallaroo Native Runtime, and 30 cpus and 10 Gi RAM for the LLM in the Wallaroo Containerized Runtime. | Scales from 0 up to 5 replicas when the average queue depth over the autoscaling_window exceeds the scale_up_queue_depth of 1, and scales back down when it falls below the scale_down_queue_depth of 1. |
deployment_cpu_based = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 30) \
    .sidekick_memory(llm, '10Gi') \
    .scale_up_queue_depth(1) \
    .scale_down_queue_depth(1) \
    .build()
Create the Wallaroo pipeline and assign the LLM as a pipeline step.
# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')
# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm)
The pipeline is deployed with the deployment_cpu_based deployment configuration.
llm_pipeline.deploy(deployment_cpu_based)
Inference from Zero Scaled Deployments
For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0, and replicas scale down to zero when there is no utilization based on the autoscale parameters. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.
When inferencing in this scenario, a timeout may occur waiting for the first replica to spool up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.
Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling before attempting the initial inference again. Once an inference completes successfully, the inferences proceed as normal.
import time

# verify the deployment has the status `Running`
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference to wake up the deployment
        pipeline.infer(dataframe)
    except:
        # if an exception is thrown, pass it
        pass
    # wait 10 seconds before checking the status and attempting the inference again
    time.sleep(10)

# once the deployment is running, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)
Troubleshooting
Resources Required Not Available
When replica_autoscale_min_max is set with minimum=0, the first replica is not spun up until an incoming inference request is received. Once the replica status is running, the inference requests are processed.
While the first replica is spinning up from zero, the following error message is returned:
"The resources required to run this request are not available. Please check the inference endpoint status and try again."
See Inference from Zero Scaled Deployments for details on how to structure inference requests from autoscale replicas that start from zero scaled deployments.
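For inference requests made over the pipeline's inference endpoint, the same wake-up pattern can be applied by retrying until the endpoint accepts the request. The following is a minimal sketch only; the endpoint URL, authorization header, and payload are hypothetical placeholders that depend on your deployment.
import time
import requests

# hypothetical placeholders: replace with your pipeline's inference endpoint URL,
# authorization token, and input payload
INFERENCE_URL = "https://<wallaroo-domain>/v1/api/pipelines/infer/<pipeline-name>"  # hypothetical
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}   # hypothetical
PAYLOAD = [{"text": "wake up request"}]                                             # hypothetical mock input

# retry until the first replica is spun up and the request succeeds; while scaling
# from zero, the endpoint responds with the "resources required to run this request
# are not available" message
while True:
    response = requests.post(INFERENCE_URL, headers=HEADERS, json=PAYLOAD)
    if response.ok:
        break
    time.sleep(10)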