Autoscaling for LLMs

Autoscale triggers reduce latency for LLM inference requests by adding resources as demand increases, then scaling them back down based on the scale up and scale down settings.

Wallaroo deployment configurations set what resources are allocated to LLMs for inference requests. Autoscale triggers provide LLMs greater flexibility by:

  • Increasing the resources allocated to LLMs based on scale up and scale down triggers. This decreases inference latency when more requests come in, then spins idle resources back down to save on costs.
  • Smoothing the allocation of resources through optional autoscaling windows, which spread scaling up and down over a longer period of time to prevent sudden resource spikes and drops.

How Autoscale Triggers Work

Autoscale triggers work through deployment configurations that have minimum and maximum autoscale replicas set by the parameter replica_autoscale_min_max. The default minimum is 0 replicas. Resources are scaled as follows:

  • 0 Replicas up: If there are one or more inference requests in the queue, 1 replica is spun up to process them. Additional replicas are spun up or down based on the autoscale_cpu_utilization setting: when average CPU utilization across all replicas passes the autoscale_cpu_utilization percentage, replicas are added or removed.
  • If scale_up_queue_depth is set: autoscale_cpu_utilization is overridden, and replica scaling is based on scale_up_queue_depth instead. The queue depth is measured as the number of requests in the queue plus the requests currently being processed, divided by the number of available replicas, averaged over the autoscaling_window (default: 300 seconds). When this threshold is exceeded, additional replicas are spun up (a configuration sketch follows this list).
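
For reference, a minimal sketch of a CPU-utilization based configuration is shown below; queue depth based configurations are shown in the examples later in this guide. This sketch assumes the autoscale_cpu_utilization setting referenced above is set through a DeploymentConfigBuilder method of the same name, and that llm is a previously uploaded model; the resource values are illustrative only.

import wallaroo

# CPU-utilization based scaling: replicas are added or removed as average
# CPU utilization across all replicas crosses the given percentage
cpu_trigger_config = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=3) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 4).sidekick_memory(llm, '8Gi') \
    .autoscale_cpu_utilization(75) \
    .build()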

For example:

  1. The replica_autoscale_min_max is set to (maximum: 5, minimum: 0), with scale_up_queue_depth set to 5 and autoscaling_window left at the default of 300 seconds.
  2. The first inference request is put into a queue while 1 replica is spun up. As other inference requests are submitted, they are placed into the queue.
  3. Once the first replica is available, the queued inference requests are processed in FIFO (First In, First Out) order.
  4. While processing inference requests, if the queue depth stays above the scale_up_queue_depth threshold of 5 for the length of the autoscaling_window or longer, an additional replica is spun up to process requests (a worked calculation follows this list).
  5. Additional replicas are spun up, up to the maximum of 5, as long as the queue depth continues to exceed the threshold over each subsequent 5 minute autoscaling_window after a replica is scaled up.
  6. Replicas are spun down when the queue depth averaged over the autoscaling_window is less than 1.
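
To make the queue depth calculation concrete, the helper below reproduces the formula from the example; it is illustrative only and not part of the Wallaroo SDK, and the request counts are hypothetical.

# illustrative helper: queue depth as described above, not a Wallaroo SDK function
def observed_queue_depth(queued: int, in_flight: int, replicas: int) -> float:
    """(requests in the queue + requests being processed) / available replicas."""
    return (queued + in_flight) / max(replicas, 1)

# 8 queued requests plus 2 in flight across 2 replicas gives a depth of 5.0,
# which meets the scale_up_queue_depth of 5 in the example above, so another
# replica is added if this holds for the full 300 second autoscaling_window
print(observed_queue_depth(8, 2, 2))  # 5.0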

Autoscale Triggers Parameters

Wallaroo autoscale triggers are optional deployment configurations with the following parameters. For a complete list of deployment parameters, see Deployment Configurations.

  • scale_up_queue_depth (queue_depth: int): The threshold at which additional deployment resources are scaled up. Requires the deployment configuration parameter replica_autoscale_min_max to be set. The queue depth is determined by the formula (number of requests in the queue + requests being processed) / (number of available replicas), measured over the autoscaling_window. This field overrides the deployment configuration parameter autoscale_cpu_utilization. scale_up_queue_depth applies to all resources in the deployment configuration.
  • scale_down_queue_depth (queue_depth: int), default: 1: Only applies when scale_up_queue_depth is configured. Scales down resources when the formula (number of requests in the queue + requests being processed) / (number of available replicas), measured over the autoscaling_window, falls below this value.
  • autoscaling_window (window_seconds: int), default: 300, minimum allowed: 60: The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured.
  • replica_autoscale_min_max (maximum: int, minimum: int = 0): Sets the range of replicas, from the minimum (default 0) to the maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs.
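
As a quick reference before the full examples below, the following sketch combines the four parameters in a single deployment configuration; llm is assumed to be a previously uploaded model and the resource values are illustrative.

import wallaroo

# combine the autoscale trigger parameters described above
autoscale_config = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 4).sidekick_memory(llm, '8Gi') \
    .scale_up_queue_depth(5) \
    .scale_down_queue_depth(1) \
    .autoscaling_window(300) \
    .build()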

Autoscale Trigger Examples

Autoscale triggers apply to all resources defined in the deployment configuration: CPUs, GPUs, memory, and so on, for both the Wallaroo Native and Wallaroo Containerized Runtimes. See deployment configurations for a breakdown of each runtime.

The following examples demonstrate setting two deployment configurations with different autoscale triggers.

Autoscale triggers are applied to LLM deployments through the following procedure:

  • Set the deployment configuration and specify the resources, number of replicas, and so on, along with the autoscale triggers.
  • Create a Wallaroo pipeline and add the LLM(s) as pipeline steps.
  • Deploy the Wallaroo pipeline with the deployment configuration.

In the following example, two deployment configurations are made:

Resource Allocation:

Sets resources for the LLM llm_gpu with the following allocations:
  • Replica autoscale: 0 to 5.
  • Wallaroo engine per replica: 1 CPU, 2 Gi RAM.
  • llm_gpu per replica: 1 GPU, 24 Gi RAM.
  • scale_up_queue_depth: 5.
  • Autoscaling window: 600 seconds.

Behavior:
  • Wallaroo engine and LLM scaling is 1:1.
  • Scaling up occurs when the scale up queue depth stays above 5 over the 600 second autoscaling window.

deployment_with_gpu = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_gpus(llm_gpu, 1) \
    .sidekick_memory(llm_gpu, '24Gi') \
    .scale_up_queue_depth(5) \
    .autoscaling_window(600) \
    .build()

Create the Wallaroo pipeline and assign the LLM as a pipeline step.

# create the pipeline
llm_gpu_pipeline = wl.build_pipeline('sample-llm-with-gpu-pipeline')

# add the LLM as a pipeline model step
llm_gpu_pipeline.add_model_step(llm_gpu)

The pipeline is deployed with the deployment_with_gpu deployment configuration.

llm_gpu_pipeline.deploy(deployment_with_gpu)
Resource Allocation:

Sets resources for the LLM llm in the deployment configuration deployment_cpu_based with the following allocations:
  • Replica autoscale: 0 to 5.
  • Wallaroo engine per replica: 1 CPU, 2 Gi RAM.
  • llm per replica: 30 CPUs, 10 Gi RAM.
  • scale_up_queue_depth: 1.
  • scale_down_queue_depth: 1.

Behavior:
  • Wallaroo engine and LLM scaling is 1:1.
  • Scaling up occurs when the scale up queue depth is above 1 over the default 300 second autoscaling window.
  • Scaling down is triggered when the 5 minute (300 second) queue average is less than 1.

deployment_cpu_based = wallaroo.DeploymentConfigBuilder() \
    .replica_autoscale_min_max(minimum=0, maximum=5) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 30) \
    .sidekick_memory(llm, '10Gi') \
    .scale_up_queue_depth(1) \
    .scale_down_queue_depth(1) \
    .build()

Create the Wallaroo pipeline and assign the LLM as a pipeline step.

# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')

# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm)

The pipeline is deployed with the deployment_cpu_based deployment configuration.

llm_pipeline.deploy(deployment_cpu_based)

Inference from Zero Scaled Deployments

For deployments that autoscale from 0 replicas, replica_autoscale_min_max is set with minimum=0, and replicas scale down to zero when there is no utilization, as determined by the autoscale parameters. When a new inference request is made, the first replica is scaled up. Once the first replica is ready, inference requests proceed as normal.

When inferencing in this scenario, a timeout may occur waiting for the first replica to spool up. To handle situations where an autoscale deployment scales down to zero replicas, the following code example provides a way to “wake up” the pipeline with an inference request which may use mock or real data. Once the first replica is fully spooled up, inference requests proceed at full speed.

Once deployed, we check the pipeline’s deployment status to verify it is running. If the pipeline is still scaling, the process waits 10 seconds to allow it to finish scaling before attempting the initial inference again. Once an inference completes successfully, the inferences proceed as normal.

import time

# verify the deployment has the status `Running`; `dataframe` holds the inference input
while pipeline.status()["status"] != 'Running':
    try:
        # attempt the inference to trigger the scale up from zero replicas
        pipeline.infer(dataframe)
    except Exception:
        # the replica is not ready yet; ignore the error and retry
        pass
    # wait 10 seconds before checking the status and attempting the inference again
    time.sleep(10)

# once the deployment is running, continue with other inferences as normal
pipeline.infer(dataframe2)
pipeline.infer(dataframe3)
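
If an unbounded wait is undesirable, the same wake-up pattern can be capped. The sketch below is one possible variation under the same assumptions as above (pipeline and dataframe are already defined); the retry limit and sleep interval are arbitrary choices, not Wallaroo settings.

import time

MAX_ATTEMPTS = 30  # arbitrary cap: roughly 5 minutes of retries at 10 second intervals

for attempt in range(MAX_ATTEMPTS):
    try:
        # the first successful inference confirms a replica is up and serving
        result = pipeline.infer(dataframe)
        break
    except Exception:
        # the replica is still spinning up from zero; wait before retrying
        time.sleep(10)
else:
    raise RuntimeError("Deployment did not scale up within the retry window")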

Troubleshooting

Resources Required Not Available

When replica_autoscale_min_max is set with minimum=0, the first replica is not spun up until an incoming inference request is received. Once the replica status is Running, the inference requests are processed.

While the first replica is spinning up from zero, the following error message is returned:

"The resources required to run this request are not available. Please check the inference endpoint status and try again."

See Inference from Zero Scaled Deployments for details on how to structure inference requests from autoscale replicas that start from zero scaled deployments.