Dynamic Batching for LLMs
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
How Dynamic Batching Works
When multiple inference requests are sent from one or multiple clients, a Dynamic Batching Configuration accumulates those inference requests into one “batch” that is processed at once. This improves efficiency and inference performance by using resources for one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
Dynamic Batching Configurations are defined by the following:
- Max batch delay: The amount of time in milliseconds (Default: 10) to wait before sending each batch to the model for an inference request.
- Batch size target: Minimum size of a batch (Default: 4) sent to the model.
- Batch size limit (Optional): The maximum size of a batch the model can process (Default: None). This is a guardrail to control the maximum batch size.
Once defined, Dynamic Batching Configurations are assigned through the model configuration when uploading or retrieving the LLM in a Wallaroo workspace. This allows the LLM to be deployed in Wallaroo pipelines with the same Dynamic Batching Configuration applied.
IMPORTANT NOTE
Dynamic batching is not available for Wallaroo pipelines with multiple steps.

Dynamic batching of inferences is triggered when either the max batch delay OR the batch size target is met. When either condition is met, the collected inference requests are processed as a single batch.
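The trigger rule above can be illustrated with a small simulation. This is a minimal sketch of the idea only, not Wallaroo's implementation: the function name `simulate_batches` and the choice to start the delay window at the first request of each batch are assumptions made for illustration.

```python
from typing import List


def simulate_batches(
    arrivals_ms: List[int],
    max_batch_delay_ms: int = 10,
    batch_size_target: int = 4,
) -> List[List[int]]:
    """Group request arrival times (in ms) into batches.

    A batch is closed when it reaches batch_size_target requests, OR when
    max_batch_delay_ms has passed since the first request in the batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    window_start = 0
    for t in sorted(arrivals_ms):
        # Delay trigger: the open batch has waited long enough, so send it.
        if current and t - window_start >= max_batch_delay_ms:
            batches.append(current)
            current = []
        if not current:
            window_start = t
        current.append(t)
        # Size trigger: the batch reached the target size, so send it.
        if len(current) >= batch_size_target:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


print(simulate_batches([0, 1, 2, 3, 4, 30, 31]))
# [[0, 1, 2, 3], [4], [30, 31]]
```

The first batch closes on the size target, the second on the delay, and the third holds whatever is still waiting when the input ends.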
When dynamic batching is implemented, the following occurs:
- Inference requests are processed in FIFO (First In First Out) order. Inference requests containing batched inputs are not split to accommodate dynamic batching.
- Inference results are returned back to the original clients.
- Inference result logs store results in the order the inferences were processed and batched.
- Dynamic Batching Configurations and target latency are honored and are not impacted by Wallaroo pipeline deployment autoscaling configurations.
How to Configure Dynamic Batching for LLMs
Dynamic batching is applied to LLMs via the Wallaroo SDK or the Wallaroo MLOps API.
Configure Dynamic Batching via the Wallaroo SDK
Dynamic Batching for LLMs is configured via the Wallaroo SDK through the following steps.
Define the Dynamic Batch Config
The Dynamic Batch Config is configured in the Wallaroo SDK via the wallaroo.dynamic_batching_config.DynamicBatchingConfig object, which takes the following parameters.
Parameter | Type | Description |
---|---|---|
max_batch_delay_ms | Integer (Default: 10) | Set the maximum batch delay in milliseconds. |
batch_size_target | Integer (Default: 4) | Set the target batch size; can not be less than or equal to zero. |
batch_size_limit | Integer (Default: None) | Set the batch size limit; can not be less than or equal to zero. This is used to control the maximum batch size. |
For example, the following sets the dynamic batch config with:
- Maximum batch delay of 20 milliseconds
- Batch size target of 3.
- Batch size limit of 1.
This is saved to the variable dynamic_batch_config.
dynamic_batch_config = wallaroo.dynamic_batching_config.DynamicBatchingConfig(
    max_batch_delay_ms=20,
    batch_size_target=3,
    batch_size_limit=1)
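The parameter constraints in the table above (defaults, and values that can not be less than or equal to zero) can be sketched with a plain dataclass. `BatchSettings` is a hypothetical stand-in written for illustration; it is not the wallaroo class, and the wording of its error messages is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BatchSettings:
    """Hypothetical stand-in mirroring the documented parameters and defaults."""

    max_batch_delay_ms: int = 10
    batch_size_target: int = 4
    batch_size_limit: Optional[int] = None

    def __post_init__(self) -> None:
        # Documented rule: batch_size_target can not be <= 0.
        if self.batch_size_target <= 0:
            raise ValueError("batch_size_target must be greater than zero")
        # Documented rule: batch_size_limit, when set, can not be <= 0.
        if self.batch_size_limit is not None and self.batch_size_limit <= 0:
            raise ValueError("batch_size_limit must be greater than zero")


# Same values as the example above.
settings = BatchSettings(max_batch_delay_ms=20, batch_size_target=3, batch_size_limit=1)
print(settings)
```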
Apply Dynamic Batch Config at LLM Upload
Dynamic batch configs are applied at LLM upload via the wallaroo.client.Client.upload_model.configure method.
LLM uploads with a Dynamic Batch Configuration proceed in two parts:
- Define the model upload parameters with wallaroo.client.Client.upload_model.
- Set the model configuration with wallaroo.client.Client.upload_model.configure. This step is only required for LLMs configured for single batch or dynamic batch deployment.
wallaroo.client.Client.upload_model has the following parameters.
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework . |
input_schema | pyarrow.lib.Schema | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema | The output schema in Apache Arrow schema format. |
convert_wait | bool (Optional) | |
wallaroo.client.Client.upload_model.configure has the following parameters.
Parameter | Type | Description |
---|---|---|
dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
batch_config | String | Batch config is either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config. If dynamic_batching_config is set, batch_config must be None. If batch_config is set to single and a dynamic_batching_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |
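The rules in the table above (schemas required when a dynamic batching config is set, and no combination with single batch mode) can be sketched as a pre-flight check. `check_model_config` is a hypothetical helper, not part of the Wallaroo SDK; the single-batch message is the one quoted in the table, while the missing-schema message is invented for illustration.

```python
from typing import Any, Optional

SINGLE_BATCH_ERROR = (
    "Dynamic batching is not supported with single batch mode. "
    "Please update the model configuration or contact wallaroo for support "
    "at support@wallaroo.ai."
)


def check_model_config(
    dynamic_batching_config: Optional[Any] = None,
    batch_config: Optional[str] = None,
    input_schema: Optional[Any] = None,
    output_schema: Optional[Any] = None,
) -> None:
    """Raise ValueError if the combination of settings breaks the documented rules."""
    if dynamic_batching_config is None:
        return  # nothing to validate without dynamic batching
    # Rule 1: dynamic batching and single batch mode are mutually exclusive.
    if batch_config == "single":
        raise ValueError(SINGLE_BATCH_ERROR)
    # Rule 2: both schemas are required (error wording is hypothetical).
    if input_schema is None or output_schema is None:
        raise ValueError(
            "input_schema and output_schema are required when "
            "dynamic_batching_config is set"
        )


# Passes: dynamic batching with both schemas and no single batch mode.
check_model_config(
    dynamic_batching_config={"batch_size_target": 3},
    input_schema="in-schema",
    output_schema="out-schema",
)
```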
The following demonstrates applying the dynamic batch config at LLM upload.
llm_model = (wl.upload_model(model_name,
                             model_file_name,
                             framework=framework,
                             input_schema=input_schema,
                             output_schema=output_schema)
             .configure(input_schema=input_schema,
                        output_schema=output_schema,
                        dynamic_batching_config=dynamic_batch_config)
             )
Apply Dynamic Batch Config at LLM Retrieval
LLMs are retrieved via the wallaroo.client.Client.get_model method, which takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | String (Required) | The name of the model to reference in the current workspace. |
version | String (Optional) (Default: None ) | Returns the model version matching the version parameter. By default, returns the most recent model version for the model. |
Once retrieved, a dynamic batching configuration is applied with the wallaroo.client.Client.upload_model.configure method, which takes the following parameters.
Parameter | Type | Description |
---|---|---|
dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
batch_config | String | Batch config is either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config. If dynamic_batching_config is set, batch_config must be None. If batch_config is set to single and a dynamic_batching_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |
The following example demonstrates retrieving a previously uploaded LLM, then applying a dynamic batch configuration to the model.
sample_llm = wl.get_model(model_name)
sample_llm.configure(input_schema=input_schema,
                     output_schema=output_schema,
                     dynamic_batching_config=dynamic_batch_config)
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without a Dynamic Batch configuration:
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration
The deployment configuration sets what resources are allocated for the LLM’s use. For this example, the LLM is allocated 8 CPUs, 10 Gi of RAM, and 1 GPU.
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm_model, 8) \
    .sidekick_memory(llm_model, '10Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
Create Wallaroo Pipeline and Set Model Step
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')
# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)
Deploy Model
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig]) method. This allocates resources from the cluster for the LLM's deployment based on the DeploymentConfig settings. If the resources set in the deployment configuration are not available at deployment, an error is returned.
The following example demonstrates deploying the LLM with the previously defined deployment configuration. Note that the Dynamic Batching Configuration is not specified in this step; it was already assigned to the model via the Configure Dynamic Batching via the Wallaroo SDK process.
llm_pipeline.deploy(deployment_config)
Once the deployment is complete, the LLM is ready to accept inference requests.
How to Update Dynamic Batching for LLMs
To update a Dynamic Batch Configuration for the LLM:
- Define the new Dynamic Batch Configuration
- Assign the new Dynamic Batch Configuration to the LLM Configuration
- Clear the pipeline steps, and assign the LLM as the pipeline step. This removes the previous LLM with previous Dynamic Batch Configuration and sets the new Dynamic Batch Configuration.
- Deploy the pipeline. Note that undeploying the pipeline is not required before deploying the LLM with the new Dynamic Batch Configuration.
The following example shows creating a new Dynamic Batch Configuration, assigning it to the LLM, and deploying the LLM with the new Dynamic Batch Configuration.
# define the new dynamic batch configuration
dynamic_batch_config = wallaroo.dynamic_batching_config.DynamicBatchingConfig(
    max_batch_delay_ms=20,
    batch_size_target=3,
    batch_size_limit=1)
# assign the new configuration to the same LLM that is added to the pipeline
llm_model = llm_model.configure(input_schema=input_schema,
                                output_schema=output_schema,
                                dynamic_batching_config=dynamic_batch_config)
# replace the pipeline step and redeploy
llm_pipeline.clear()
llm_pipeline.add_model_step(llm_model)
llm_pipeline.deploy(deployment_config)
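The update steps can also be wrapped in a small helper. This is a sketch, not an SDK function: `redeploy_with_new_batching` is a hypothetical name, it only calls the methods shown in this guide (configure, clear, add_model_step, deploy), and it assumes configure returns the configured model, as the chained upload example suggests.

```python
from typing import Any


def redeploy_with_new_batching(
    model: Any,
    pipeline: Any,
    new_config: Any,
    input_schema: Any,
    output_schema: Any,
    deployment_config: Any,
) -> Any:
    """Apply a new dynamic batch config to a model and redeploy its pipeline."""
    # Step 1: assign the new Dynamic Batch Configuration to the LLM.
    configured = model.configure(
        input_schema=input_schema,
        output_schema=output_schema,
        dynamic_batching_config=new_config,
    )
    # Step 2: drop the previous step so the old configuration is removed.
    pipeline.clear()
    # Step 3: add the reconfigured LLM as the only pipeline step.
    pipeline.add_model_step(configured)
    # Step 4: deploy; undeploying first is not required.
    pipeline.deploy(deployment_config)
    return configured
```

Passing the value returned by configure to add_model_step keeps the configured model and the deployed model from drifting apart when several variables refer to the same LLM.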
Troubleshooting
Exceed batch_size_limit
If the dynamic batching configuration batch_size_limit is exceeded, the following error message is returned:
Maximum batch size exceeded, please configure a higher limit or contact wallaroo for support at support@wallaroo.ai.
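The guardrail can be sketched as a simple size check. This is an illustration only: `check_batch_size` is a hypothetical helper, and comparing the limit against the number of rows in a batch is an assumption about how the limit is measured.

```python
from typing import Optional, Sequence

LIMIT_ERROR = (
    "Maximum batch size exceeded, please configure a higher limit or "
    "contact wallaroo for support at support@wallaroo.ai."
)


def check_batch_size(rows: Sequence, batch_size_limit: Optional[int]) -> None:
    """Raise ValueError when a batch holds more rows than batch_size_limit.

    A limit of None (the documented default) disables the check.
    """
    if batch_size_limit is not None and len(rows) > batch_size_limit:
        raise ValueError(LIMIT_ERROR)


check_batch_size(["row-1"], batch_size_limit=1)      # within the limit
check_batch_size(["row-1", "row-2"], None)           # no limit configured
```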
Dynamic Batching with Single Batch Configuration Error
Dynamic Batch Configurations and Single Batch Configurations cannot be combined. If both a Dynamic Batch Configuration and Single Batch are added to an LLM’s configuration, the following error is returned:
“Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai”