Continuous Batching for LLMs
For additional information and a demonstration of using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Wallaroo’s continuous batching feature using the vLLM runtime provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.
Wallaroo continuous batching is supported with vLLM across two different autopackaging scenarios:
wallaroo.framework.Framework.VLLM
: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.

wallaroo.framework.Framework.CUSTOM
: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict) provide greater flexibility through a lightweight Python interface.
How Continuous Batching Works
Continuous Batching improves throughput by dynamically grouping incoming requests in real time to optimize inference processing. It’s useful for realtime concurrent inference requests when LLM-based or agentic AI applications run at scale, balancing latency, throughput, and resource use.
vLLM runtime (PagedAttention) configurations are managed through Framework Configurations in Wallaroo at LLM upload time.
IMPORTANT NOTE
Framework Configuration Exceptions
LLMs deployed in Wallaroo with continuous batching have the following exceptions:
- Single Batch: Continuous batching cannot be used with a vLLM model configuration that sets the batch type to single.
- Single Inference Request: Inference requests are sent only one batch at a time per user request. Inference requests with multiple batches from the same user are rejected with an error.
How to Configure Continuous Batching
Continuous batching is applied to either native or custom vLLM runtimes, either at LLM upload or after the LLM is uploaded to Wallaroo. The procedures below show how to:
- Upload either a native or custom vLLM runtime to Wallaroo via the Wallaroo SDK or the Wallaroo MLOps API.
- Define and apply a continuous batching configuration to an uploaded native or custom vLLM runtime either at upload or after upload.
When Deploying LLMs Using the Native vLLM Runtime in Wallaroo
Continuous Batching is applied to LLMs in Wallaroo as a model configuration either at model upload or post model upload via the Wallaroo SDK.
Upload Native vLLM Runtime to Wallaroo
LLMs in the native vLLM framework are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API with framework configuration options.
Upload Native LLMs with the vLLM Runtime Via the Wallaroo SDK
Continuous batching is configured for LLMs uploaded via the Wallaroo SDK with the following methods:
- Define the model upload parameters with the wallaroo.client.Client.upload_model method.
- (Optional) Set the upload_model parameter framework_config to specify any vLLM options to increase performance. If no options are specified, the default values are applied.
- (Optional) Set the continuous batching configuration through the configure method, either at upload or after upload, as described below.
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For native vLLM, this framework is wallaroo.framework.Framework.VLLM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.VLLMConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration | The AI hardware accelerator used. For vLLM-based deployments with NVIDIA GPUs, set to wallaroo.engine_config.Acceleration.CUDA . |
convert_wait | bool (Optional) | When True (the default), the script waits until the model upload and conversion process is complete before proceeding; when False, the script continues while the conversion runs in the background. |
wallaroo.framework.VLLMConfig
contains the following parameters.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | (Default: None) |
kv_cache_dtype | (Default: 'auto' ) |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | (Default: None) |
device_group | (Default: None) This setting is ignored for CUDA acceleration. |
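For reference, the following sketch constructs a VLLMConfig with explicit values; the specific settings shown are illustrative examples rather than recommendations, and should be tuned for the target model and GPU.
import wallaroo

# illustrative VLLMConfig values; adjust for the target model and GPU
example_vllm_config = wallaroo.framework.VLLMConfig(
    max_num_seqs=256,             # maximum number of concurrently processed sequences
    max_model_len=2048,           # maximum context length accepted by the engine
    max_seq_len_to_capture=8192,  # maximum sequence length covered by CUDA graphs
    quantization=None,            # optional quantization method supported by vLLM
    kv_cache_dtype='auto',        # data type for the KV cache
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may use
    block_size=None,              # KV cache block size
    device_group=None             # ignored for CUDA acceleration
)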
Upload Example for Native vLLM Frameworks via the Wallaroo SDK
The following demonstrates uploading a Native vLLM Runtime with a framework configuration via the Wallaroo SDK.
# (Optional) set the VLLMConfig values
standard_framework_config = wallaroo.framework.VLLMConfig(
max_num_seqs=max_num_seqs,
max_model_len=max_model_len,
max_seq_len_to_capture=max_seq_len_to_capture,
quantization=quantization,
kv_cache_dtype=kv_cache_dtype,
gpu_memory_utilization=gpu_memory_utilization,
block_size=block_size,
device_group=None
)
# upload the vllm model with the framework configuration values
vllm_model = wl.upload_model(model_name, \
model_file_name, \
framework=wallaroo.framework.Framework.VLLM, \
input_schema=input_schema, \
output_schema=output_schema, \
framework_config=standard_framework_config
)
Upload Native LLMs with the vLLM Runtime Via the Wallaroo MLOps API
Models are uploaded via the Wallaroo MLOps API through the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Native vLLM frameworks, this value is vllm | |
python_version | String (Required) | The version of Python required for the model. For Native vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Native vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. See the framework_config parameters below for further details. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for Containerized Wallaroo Runtime models. | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for non-native runtime models. |
The framework_config
parameter accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
config | Dict | The framework configuration values. The following parameters are members of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) This setting is ignored for CUDA acceleration. | ||
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm". |
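Taken together, the conversion portion of the upload metadata for a native vLLM model might look like the following sketch, mirroring the values used in the curl example below:
# illustrative "conversion" object included in the upload metadata
conversion = {
    "framework": "vllm",
    "python_version": "3.8",
    "requirements": [],
    "framework_config": {
        "config": {
            "gpu_memory_utilization": 0.9,
            "max_model_len": 128
        },
        "framework": "vllm"
    }
}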
Upload Example for Native vLLM Runtime via the MLOps API
The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API.
# import the required libraries
import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# convert the input and output schemas to base64; the encoded strings are
# supplied as the input_schema and output_schema fields of the upload metadata
base64.b64encode(
bytes(input_schema.serialize())
).decode("utf8")
base64.b64encode(
bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-auth-token-here>" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "vllm", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "vllm"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<file path to vllm>" \
https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. This model version is used to apply the optional Continuous Batching Configuration.
# Retrieve the model
vllm_model = wl.get_model("your-model-name")
Set Continuous Batching Configuration for Native vLLM Runtime
The wallaroo.client.Client.upload_model.configure method includes the following parameters for continuous batching configurations. If no continuous batching configuration is set, the default values are applied.
Parameter | Type | Description |
---|---|---|
continuous_batching_config | wallaroo.continuous_batching_config.ContinuousBatchingConfig (Default: None) | Sets the continuous batching configuration to apply to the model. This includes the following parameter: max_concurrent_batch_size (Integer, Default: 256). |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the continuous_batching_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the continuous_batching_config parameter is set. |
The following demonstrates defining the continuous batching configuration. Note that in this example max_concurrent_batch_size is set to 100 to show how to update the value as needed.
# import the continuous batching configuration class
from wallaroo.continuous_batching_config import ContinuousBatchingConfig

# set the continuous batch config max_concurrent_batch_size
# set to 100; defaults to 256
max_concurrent_batch_size = 100
continuous_batch_config = ContinuousBatchingConfig(max_concurrent_batch_size = max_concurrent_batch_size)
The following demonstrates applying the continuous batching configuration during LLM upload via the Wallaroo SDK.
standard_framework_config = wallaroo.framework.VLLMConfig(
max_num_seqs=max_num_seqs,
max_model_len=max_model_len,
max_seq_len_to_capture=max_seq_len_to_capture,
quantization=quantization,
kv_cache_dtype=kv_cache_dtype,
gpu_memory_utilization=gpu_memory_utilization,
block_size=block_size,
device_group=None
)
# upload the vllm model with the framework configuration values
vllm_model = wl.upload_model(model_name, \
model_file_name, \
framework=wallaroo.framework.Framework.VLLM, \
input_schema=input_schema, \
output_schema=output_schema, \
framework_config=standard_framework_config
).configure(input_schema=input_schema, \
output_schema=output_schema, \
continuous_batch_config=continuous_batch_config \
)
The following demonstrates applying the continuous batching config after LLM upload.
vllm_model.configure(input_schema=input_schema, \
output_schema=output_schema, \
continuous_batch_config=continuous_batch_config \
)
Deploy LLMs Using the Native Wallaroo vLLM Runtime with Continuous Batch Configuration
Deploying LLMs with a Continuous Batching configuration has the following steps.
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per model replica.
- Create a Wallaroo pipeline and add the pre-configured LLM with the Continuous Batching configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration for LLMs Deployed Using the Native vLLM Runtime in Wallaroo
The deployment configuration sets what resources are allocated for the LLM. For this example, the LLM is allocated 8 CPUs, 10 Gi RAM, and 1 NVIDIA GPU. The specific AI accelerator is inherited from the upload_model
step.
# import the deployment configuration builder
from wallaroo.deployment_config import DeploymentConfigBuilder

deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(vllm_model, 8) \
.sidekick_memory(vllm_model, '10Gi') \
.sidekick_gpus(vllm_model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
Create Wallaroo Pipeline and Add the LLM with Native vLLM Runtime
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps are used to determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model)
Deploy the LLM pipeline With the Native vLLM Runtime and Continuous Batching Configurations
With the Deployment Configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources requested are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration.
vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment is complete, the pipeline is ready to accept inference requests.
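As a minimal sketch, an inference request against the deployed pipeline follows the prompt/max_tokens input schema defined during upload, and each request carries a single row per the continuous batching requirements; output fields are returned using Wallaroo's out.<field> column convention.
import pandas as pd

# single-row (single batch) inference request
input_data = pd.DataFrame({
    "prompt": ["Describe continuous batching in one sentence."],
    "max_tokens": [128]
})
result = vllm_pipeline.infer(input_data)

# generated output fields are returned as out.<field> columns in the result
print(result["out.generated_text"].values[0])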
When Deploying LLMs Using the Custom vLLM Runtime in Wallaroo
Continuous Batching is applied to Wallaroo Custom Models as a model configuration either at model upload or post model upload via the Wallaroo SDK.
LLMs deployed with the Custom vLLM Runtime leverage the vLLM runtime; using Wallaroo Custom Models provides additional flexibility in customizing LLM behavior (e.g., adding inputs and outputs, defining system prompts, or adding context queries).
Custom vLLM Runtime Requirements
Wallaroo Custom Models include the following artifacts.
Artifact | Type | Description |
---|---|---|
Python interface (.py) scripts with classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder | Python Script | Extend the classes mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder, which are included with the Wallaroo SDK. Note that there are no specific naming requirements for the classes that extend mac.inference.AsyncInference and mac.inference.creation.InferenceBuilder; any qualified class name is sufficient as long as these two classes are extended as defined below. |
requirements.txt | Python requirements file | Sets the Python libraries used for the Custom Model. These libraries should target Python 3.10 compliance. The requirements and library versions should be exactly the same when creating the model and when deploying it in Wallaroo. This ensures that the script and methods function exactly the same as during the model creation process. |
Other artifacts | Files | Other models, files, and other artifacts used in support of this model. |
Custom vLLM Runtime implementations in Wallaroo extend the Wallaroo SDK mac.inference.Inference
and mac.inference.creation.InferenceBuilder
. For Continuous Batching leveraging a custom vLLM runtime implementation, the following additions are required:
- In the requirements.txt file, the vllm library must be included. For optimal performance in Wallaroo, use the version specified below.

  vllm==0.6.6

- Import the following libraries into the Python script that extends mac.inference.Inference and mac.inference.creation.InferenceBuilder:

  from vllm import AsyncLLMEngine, SamplingParams
  from vllm.engine.arg_utils import AsyncEngineArgs

- The class that extends InferenceBuilder must also implement the following methods to support continuous batching configurations:
  - def inference(self) -> AsyncVLLMInference: Specifies the Inference instance used by create.
  - def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference: Creates the Inference subclass and specifies the vLLM engine used with the inference requests.
The following shows an example of extending the inference and create methods for AsyncVLLMInference.
# vllm import libraries
from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

# Wallaroo BYOP classes
from mac.config.inference import CustomInferenceConfig
from mac.inference.creation import InferenceBuilder

# AsyncVLLMInference is the Inference subclass (extending mac.inference.AsyncInference)
# defined elsewhere in this script.


class AsyncVLLMInferenceBuilder(InferenceBuilder):
    """Inference builder class for AsyncVLLMInference."""

    @property
    def inference(self) -> AsyncVLLMInference:  # extends mac.inference.AsyncInference
        """Returns an Inference subclass instance.
        This specifies the Inference instance to be used
        by create() to build additionally needed components."""
        return AsyncVLLMInference()

    def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
        """Creates an Inference subclass and assigns a model to it.

        :param config: Inference configuration

        :return: Inference subclass
        """
        inference = self.inference
        inference.model = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=(config.model_path / "model").as_posix(),
            ),
        )
        return inference
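The AsyncVLLMInference class referenced above is the Inference subclass defined elsewhere in the same script. As a rough, illustrative sketch of how such a class might call the async engine assigned by the builder, the following uses the vLLM AsyncLLMEngine.generate API; the class and method names shown here are hypothetical and do not represent the Wallaroo SDK's required interface.
import uuid
from vllm import SamplingParams

# Illustrative only: how an async Inference subclass might query the engine.
class AsyncVLLMInferenceSketch:
    def __init__(self):
        # AsyncLLMEngine instance assigned by the InferenceBuilder's create() method
        self.model = None

    async def generate_text(self, prompt: str, max_tokens: int) -> str:
        sampling_params = SamplingParams(max_tokens=max_tokens)
        request_id = str(uuid.uuid4())
        generated_text = ""
        # AsyncLLMEngine.generate yields incremental RequestOutput objects
        async for request_output in self.model.generate(prompt, sampling_params, request_id):
            generated_text = request_output.outputs[0].text
        return generated_text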
Upload Custom vLLM Runtime to Wallaroo
LLMs in the custom vLLM framework are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API with framework configuration options.
Upload LLMs with the Custom vLLM Runtime Via the Wallaroo SDK
The method wallaroo.client.Client.upload_model
takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework . For Custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM . |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. |
framework_config | wallaroo.framework.CustomConfig (Optional) | Sets the Custom Model configuration options. See wallaroo.framework.CustomConfig below. |
accel | wallaroo.engine_config.Acceleration | The AI hardware accelerator used. For vLLM-based deployments with NVIDIA GPUs, set to wallaroo.engine_config.Acceleration.CUDA . |
convert_wait | bool (Optional) | When True (the default), the script waits until the model upload and conversion process is complete before proceeding; when False, the script continues while the conversion runs in the background. |
wallaroo.framework.CustomConfig
contains the following parameters.
Parameters | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | (Default: None) |
kv_cache_dtype | (Default: 'auto' ) |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | (Default: None) |
device_group | (Default: None ) This setting is ignored for CUDA acceleration. |
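As with VLLMConfig, a CustomConfig can be constructed with explicit values; the settings below are illustrative only, and any omitted parameters fall back to the defaults listed above.
import wallaroo

# illustrative CustomConfig values; adjust for the target model and GPU
example_custom_config = wallaroo.framework.CustomConfig(
    max_num_seqs=256,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
    device_group=None  # ignored for CUDA acceleration
)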
Upload Example for Custom vLLM Runtime via the Wallaroo SDK
The following demonstrates uploading a custom vLLM Runtime with a framework configuration via the Wallaroo SDK.
# (Optional) set the framework configuration values
custom_framework_config = wallaroo.framework.CustomConfig(
max_num_seqs=max_num_seqs,
max_model_len=max_model_len,
max_seq_len_to_capture=max_seq_len_to_capture,
quantization=quantization,
kv_cache_dtype=kv_cache_dtype,
gpu_memory_utilization=gpu_memory_utilization,
block_size=block_size,
device_group=None
)
# upload the vllm with the model configuration values
custom_vllm_model = wl.upload_model(model_name,
model_file_name,
framework=wallaroo.framework.Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema,
framework_config=custom_framework_config
)
Upload Custom LLMs with the Custom Runtime Via the Wallaroo MLOps API
Models are uploaded via the Wallaroo MLOps API through the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
name | String (Required) | The model name. | |
visibility | String (Required) | Either public or private . | |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. | |
conversion | String (Required) | The conversion parameters that include the following: | |
framework | String (Required) | The framework of the model being uploaded. For Custom vLLM frameworks, this value is custom | |
python_version | String (Required) | The version of Python required for the model. For Custom vLLM frameworks, this value is 3.8 . | |
requirements | String (Required) | Required libraries. For Custom vLLM frameworks, this value is [] . | |
framework_config | Dict (Optional) | The framework configuration. | |
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for Containerized Wallaroo Runtime models. | |
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode . Only required for non-native runtime models. |
The framework_config
parameter accepts the following parameters.
Field | Type | Description | |
---|---|---|---|
config | Dict | The framework configuration values. The following parameters are members of the config field. | |
max_num_seqs | Integer (Default: 256) | ||
max_model_len | Integer (Default: None) | ||
max_seq_len_to_capture | Integer (Default: 8192) | ||
quantization | (Default: None) | ||
kv_cache_dtype | (Default: 'auto' ) | ||
gpu_memory_utilization | Float (Default: 0.9) | ||
block_size | (Default: None) | ||
device_group | (Default: None) This setting is ignored for CUDA acceleration. | ||
framework | String | The framework of the framework_config type. For Custom vLLM frameworks, this value is "custom". |
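As with the native runtime, the conversion portion of the upload metadata for a Custom vLLM model might look like the following sketch, mirroring the curl example below:
# illustrative "conversion" object included in the upload metadata
conversion = {
    "framework": "custom",
    "python_version": "3.8",
    "requirements": [],
    "framework_config": {
        "config": {
            "gpu_memory_utilization": 0.9,
            "max_model_len": 128
        },
        "framework": "custom"
    }
}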
Upload Example for Custom vLLM Frameworks via the MLOps API
The following example demonstrates uploading a Custom vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK. This model version is used to apply the optional Continuous Batching Configuration. If no Continuous Batching Configuration is applied, then the default values are applied.
# import the required libraries
import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# convert the input and output schemas to base64; the encoded strings are
# supplied as the input_schema and output_schema fields of the upload metadata
base64.b64encode(
bytes(input_schema.serialize())
).decode("utf8")
base64.b64encode(
bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-auth-token-here>" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<file path to custom vllm runtime>" \
https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
. This model version is used to apply the optional Continuous Batching Configuration.
# Retrieve the model
custom_vllm_model = wl.get_model("your-model-name")
Set Continuous Batching Configuration for Custom vLLM Runtime
Continuous batching configurations are applied to Custom vLLM Framework models through the Wallaroo SDK wallaroo.client.Client.upload_model.configure method. If no continuous batching configuration is set, the default values are applied.
This includes the following parameters.
Parameter | Type | Description |
---|---|---|
continuous_batching_config | wallaroo.continuous_batching_config.ContinuousBatchingConfig (Default: None) | Sets the continuous batching configuration to apply to the LLM. This includes the following parameter: max_concurrent_batch_size (Integer, Default: 256). |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the continuous_batching_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the continuous_batching_config parameter is set. |
The following demonstrates defining the continuous batching configuration. Note that in this example max_concurrent_batch_size is set to 100 to show how to update the value as needed.
# import the continuous batching configuration class
from wallaroo.continuous_batching_config import ContinuousBatchingConfig

# set the continuous batch config max_concurrent_batch_size
# set to 100; defaults to 256
max_concurrent_batch_size = 100
continuous_batch_config = ContinuousBatchingConfig(max_concurrent_batch_size = max_concurrent_batch_size)
The following demonstrates applying the continuous batching configuration during LLM upload via the Wallaroo SDK.
# (Optional) set the framework configuration values
custom_framework_config = wallaroo.framework.CustomConfig(
max_num_seqs=max_num_seqs,
max_model_len=max_model_len,
max_seq_len_to_capture=max_seq_len_to_capture,
quantization=quantization,
kv_cache_dtype=kv_cache_dtype,
gpu_memory_utilization=gpu_memory_utilization,
block_size=block_size,
device_group=None
)
# upload the vllm with the model configuration values
custom_vllm_model = wl.upload_model(model_name,
model_file_name,
framework=wallaroo.framework.Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema,
framework_config=custom_framework_config
).configure(input_schema=input_schema,
output_schema=output_schema,
continuous_batch_config=continuous_batch_config
)
The following demonstrates applying the continuous batching config after LLM upload.
custom_vllm_model.configure(input_schema=input_schema,
output_schema=output_schema,
continuous_batch_config=continuous_batch_config
)
Deploy LLMs Using the Custom Wallaroo vLLM Runtime with Continuous Batch Configuration
Deploying a Custom LLM with a Continuous Batching configuration has the following steps.
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the Custom Model with the Continuous Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration for LLMs Deployed Using the Custom vLLM Runtime in Wallaroo
The deployment configuration sets what resources are allocated for the Custom Model's use. For this example, the LLM is allocated 8 CPUs, 10 Gi RAM, and 1 NVIDIA GPU. The specific AI accelerator is inherited from the upload_model step.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(custom_vllm_model, 8) \
.sidekick_memory(custom_vllm_model, '10Gi') \
.sidekick_gpus(custom_vllm_model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
Create Wallaroo Pipeline and Add the LLM with Custom vLLM Runtime
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline
method. Pipeline steps are used to determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and adding the LLM as a pipeline step.
# create the pipeline
custom_vllm_pipeline = wl.build_pipeline('sample-vllm-pipeline')
# add the LLM as a pipeline model step
custom_vllm_pipeline.add_model_step(custom_vllm_model)
Deploy the LLM pipeline With the Custom vLLM Runtime and Continuous Batching Configurations
With the deployment configuration defined and the pipeline ready, the pipeline is deployed with the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig])
method. This allocates resources from the cluster for the deployment based on the DeploymentConfig
settings. If the resources set in the deployment configuration are not available at deployment, an error is returned.
The following example demonstrates deploying the pipeline with the previously defined deployment configuration:
custom_vllm_pipeline.deploy(deployment_config=deployment_config)
Once the deployment is complete, the pipeline is ready to accept inference requests.
How to Publish for Edge Deployment
Wallaroo pipelines using either the native or custom vLLM runtime with continuous batching are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config)
command. This uploads the following artifacts to the OCI registry:
- The native or custom LLM with:
- vLLM Framework configurations
- Continuous Batching Config
- If specified, the deployment configuration
- The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.
Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.
For more details, see Edge and Multi-cloud Pipeline Publish.
The following example demonstrates publishing a Wallaroo pipeline with a native vLLM runtime.
pipeline.publish(deployment_config=deployment_config)
Tutorials
The following tutorials demonstrate uploading, deploying and performing sample inferences for LLMs deployed with continuous batching in the vLLM and Custom Model frameworks.
Troubleshooting
The following error messages may appear if the Framework Configuration Exceptions are triggered.
Continuous Batching Not Supported with Single Batch Mode
If the LLM’s configuration sets batch_config="single"
, the following error message will appear during model upload:
"Continuous batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai”
Continuous Batching Not Supported with Multiple Batches
Inference requests to an LLM deployed with continuous batching must include only a single batch per user request. If a multi-batch inference request is made (i.e., an inference request with multiple rows), the following error is returned.
"Continuous batching is not supported with more than one batch at a time, please send requests with a single batch size or contact Wallaroo for support at support@wallaroo.ai"