Dynamic Batching for LLMs
For access to these sample models and a demonstration of using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
How Dynamic Batching Works
When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those inference requests into one batch that is processed at once. This increases efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
Dynamic Batching Configurations are defined by the following:
- Max batch delay: The amount of time in milliseconds (Default: 10) to wait before sending the accumulated batch to the model for inference.
- Batch size target: Minimum size of a batch (Default: 4) sent to the model.
- Batch size limit (Optional): The maximum size of a batch the model can process (Default: None). This is a guardrail to control the maximum batch size.
Once defined, Dynamic Batching Configurations are assigned through the model configuration when uploading or retrieving the LLM in a Wallaroo workspace. This allows the LLM to be deployed in Wallaroo pipelines with the same Dynamic Batching Configuration applied.
IMPORTANT NOTE
Dynamic batching is not available for Wallaroo pipelines with multiple steps.
Dynamic batching of inferences is triggered when either the max batch delay OR the batch size target is met. When either of those conditions is met, the accumulated inference requests are processed as a single batch.
When dynamic batching is implemented, the following occurs:
- Inference requests are processed in FIFO (First In First Out) order. Inference requests containing batched inputs are not split to accommodate dynamic batching.
- Inference results are returned back to the original clients.
- Inference result logs store results in the order the inferences were processed and batched.
- Dynamic Batching Configurations and target latency are honored and are not impacted by Wallaroo pipeline deployment autoscaling configurations.
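The trigger condition can be made concrete with a short sketch. The following illustrates the dispatch logic described above; it is not the Wallaroo engine implementation, and all names in it (pending_requests, should_dispatch, and so on) are hypothetical.
# Minimal sketch of the dynamic batching dispatch condition -- illustrative
# only, not Wallaroo internals.
import time

MAX_BATCH_DELAY_MS = 10   # max batch delay (default 10 ms)
BATCH_SIZE_TARGET = 4     # batch size target (default 4)

pending_requests: list = []          # requests accumulate in FIFO order
batch_started_at = time.monotonic()  # when the current batch began accumulating

def should_dispatch() -> bool:
    """A batch is sent when EITHER the size target OR the delay is reached."""
    elapsed_ms = (time.monotonic() - batch_started_at) * 1000
    return (len(pending_requests) >= BATCH_SIZE_TARGET
            or elapsed_ms >= MAX_BATCH_DELAY_MS)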
How to Configure Dynamic Batching for LLMs
Dynamic batching is applied to LLMs via the Wallaroo SDK or the Wallaroo MLOps API.
Configure Dynamic Batching via the Wallaroo SDK
Dynamic Batching for LLMs is configured via the Wallaroo SDK through the following steps.
Define the Dynamic Batch Config
The Dynamic Batch Config is configured in the Wallaroo SDK via the wallaroo.dynamic_batching_config.DynamicBatchingConfig object, which takes the following parameters.
Parameter | Type | Description |
---|---|---|
max_batch_delay_ms | Integer (Default: 10) | Set the maximum batch delay in milliseconds. |
batch_size_target | Integer (Default: 4) | Set the target batch size; cannot be less than or equal to zero. |
batch_size_limit | Integer (Default: None) | Set the batch size limit; cannot be less than or equal to zero. This is used to control the maximum batch size. |
With the above set, the Dynamic Batch Config is created via the wallaroo.dynamic_batching_config.DynamicBatchingConfig.build() method.
For example, the following sets the dynamic batch config with:
- Maximum batch delay of 5 milliseconds
- Batch size target of 1
- Batch size limit of 1
This is saved to the variable dynamic_batch_config.
import wallaroo

# Chained builder calls are wrapped in parentheses so they can span
# multiple lines.
dynamic_batch_config = (wallaroo.dynamic_batching_config.DynamicBatchingConfig()
                        .max_batch_delay_ms(5)
                        .batch_size_target(1)
                        .batch_size_limit(1)
                        .build())
Apply Dynamic Batch Config at LLM Upload
Dynamic batch configs are applied at LLM upload via the wallaroo.client.Client.upload_model.configure method.
LLM upload with a Dynamic Batch Configuration proceeds in two parts:
- Define the model upload parameters with wallaroo.client.Client.upload_model.
- Set the model configuration with wallaroo.client.Client.upload_model.configure. This step is only required for LLMs configured for single batch or dynamic batch deployment.
wallaroo.client.Client.upload_model has the following parameters.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework . |
input_schema | pyarrow.lib.Schema | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema | The output schema in Apache Arrow schema format. |
convert_wait | bool (Optional) | If True, the upload waits for the model conversion to complete before returning; if False, the upload proceeds without waiting. |
wallaroo.client.Client.upload_model.configure has the following parameters.
Parameter | Type | Description |
---|---|---|
dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
batch_config | String | Batch config is either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config . If dynamic_batching_config is set, batch_config must be None . If batch_config is set to single and a dynamic_batch_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |
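The examples below assume input_schema and output_schema are already defined. A minimal sketch using pyarrow, with hypothetical field names for a text-in, text-out LLM:
import pyarrow as pa

# Hypothetical schemas for a text-in, text-out LLM; the actual fields
# depend on the model being uploaded.
input_schema = pa.schema([pa.field('text', pa.string())])
output_schema = pa.schema([pa.field('generated_text', pa.string())])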
The following demonstrates applying the dynamic batch config at LLM upload.
llm_model = (wl.upload_model(model_name,
                             model_file_name,
                             framework=framework,
                             input_schema=input_schema,
                             output_schema=output_schema)
             .configure(input_schema=input_schema,
                        output_schema=output_schema,
                        dynamic_batching_config=dynamic_batch_config)
             )
Apply Dynamic Batch Config at LLM Retrieval
LLMs are retrieved via the wallaroo.client.Client.get_model method, which takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | String (Required) | The name of the model to reference in the current workspace. |
version | String (Optional) (Default: None ) | Returns the model version matching the version parameter. By default, returns the most recent model version for the model. |
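For example, the following retrieves a specific version of a previously uploaded model; the version string shown is a hypothetical placeholder.
# Retrieve a specific model version; the version identifier here is a
# hypothetical placeholder for a real version string.
sample_llm = wl.get_model(model_name, version='sample-version-id')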
Once retrieved, a dynamic batching configuration is applied with the wallaroo.client.Client.upload_model.configure method, using the same parameters described in Apply Dynamic Batch Config at LLM Upload above.
The following example demonstrates retrieving a previously uploaded LLM, then applying a dynamic batch configuration to the model.
sample_llm = wl.get_model(model_name)
sample_llm.configure(input_schema=input_schema,
                     output_schema=output_schema,
                     dynamic_batching_config=dynamic_batch_config)
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without one:
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration
The deployment configuration sets what resources are allocated for the LLM's use. For this example, the LLM is allocated 8 CPUs, 10 Gi of RAM, and 1 GPU.
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm_model, 8) \
    .sidekick_memory(llm_model, '10Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
Create Wallaroo Pipeline and Set Model Step
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')
# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)
Deploy Model
With the Deployment Configuration assigned to the model and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig]) method. This allocates resources from the cluster for the LLM's deployment based on the DeploymentConfig settings. If the resources set in the deployment configuration are not available at deployment, an error is returned.
The following example demonstrates deploying the LLM with the previously defined deployment configuration. Note that the Dynamic Batching Configuration is not specified in this step; it was already assigned via the Configure Dynamic Batching via the Wallaroo SDK process.
llm_pipeline.deploy(deployment_config)
Once the deployment is complete, the LLM is ready to accept inference requests.
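As a minimal sketch of an inference request, assuming the hypothetical text-in, text-out schemas from the upload example above:
import pandas as pd

# Build an inference request matching the input schema; the 'text' column
# is the hypothetical field from the sample schema above.
input_df = pd.DataFrame({'text': ['Summarize the following paragraph.']})

# Submit the request; results are returned once the accumulated batch
# is processed.
results = llm_pipeline.infer(input_df)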
How to Update Dynamic Batching for LLMs
To update a Dynamic Batch Configuration for the LLM:
- Define the new Dynamic Batch Configuration.
- Assign the new Dynamic Batch Configuration to the LLM Configuration.
- Clear the pipeline steps, then assign the LLM as the pipeline step. This removes the model step with the previous Dynamic Batch Configuration and applies the new one.
- Deploy the pipeline. Note that undeploying the pipeline is not required before deploying the LLM with the new Dynamic Batch Configuration.
The following example shows creating a new Dynamic Batch Configuration, assigning it to the LLM, and deploying the LLM with the new Dynamic Batch Configuration.
dynamic_batch_config = (wallaroo.dynamic_batching_config.DynamicBatchingConfig()
                        .max_batch_delay_ms(20)
                        .batch_size_target(3)
                        .batch_size_limit(1)
                        .build())
sample_llm.configure(input_schema=input_schema,
                     output_schema=output_schema,
                     dynamic_batching_config=dynamic_batch_config)
llm_pipeline.clear()
llm_pipeline.add_model_step(llm_model)
llm_pipeline.deploy(deployment_config)
Troubleshooting
Exceed batch_size_limit
If the dynamic batching configuration batch_size_limit is exceeded, the following error message is returned:
Maximum batch size exceeded, please configure a higher limit or contact wallaroo for support at support@wallaroo.ai.
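Because inference requests containing batched inputs are not split (see How Dynamic Batching Works above), one way this error can occur is a single request containing more rows than batch_size_limit. A hypothetical illustration, assuming a limit of 1 and the sample 'text' input field:
import pandas as pd

# Hypothetical: with batch_size_limit set to 1, a single two-row request
# exceeds the limit and returns the error above.
oversized_df = pd.DataFrame({'text': ['first prompt', 'second prompt']})
results = llm_pipeline.infer(oversized_df)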
Dynamic Batching with Single Batch Configuration Error
Dynamic Batch Configurations and Single Batch Configurations cannot be combined. If both a Dynamic Batch Configuration and a Single Batch Configuration are added to an LLM's configuration, the following error is returned:
“Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai”