Continuous Batching for Custom Llama with vLLM
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Wallaroo’s continuous batching feature using the vLLM runtime provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.
Wallaroo continuous batching is supported with vLLM across two different autopackaging scenarios, as sketched below:
- wallaroo.framework.Framework.VLLM: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.
- wallaroo.framework.Framework.CUSTOM: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict), which provide greater flexibility through a lightweight Python interface.
For more details, see Continuous Batching for LLMs.
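The two scenarios differ mainly in the framework value passed at upload time. The following is a minimal sketch only; the model names, file paths, and schemas are placeholders, and the parameters mirror the upload_model call demonstrated later in this tutorial.
import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration

wl = wallaroo.Client()

# Native async vLLM: Wallaroo autopackages the vLLM runtime.
native_vllm = wl.upload_model(
    "sample-native-vllm",        # placeholder model name
    "./sample-llm.zip",          # placeholder model artifact
    framework=Framework.VLLM,
    input_schema=input_schema,   # pyarrow schemas defined as shown later
    output_schema=output_schema,
    accel=Acceleration.CUDA,
)

# Custom async vLLM: a BYOP package wraps the vLLM engine behind a Python interface.
custom_vllm = wl.upload_model(
    "sample-custom-vllm",        # placeholder model name
    "./sample-byop-vllm.zip",    # placeholder BYOP archive
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
    accel=Acceleration.CUDA,
)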
This tutorial demonstrates deploying the Llama V3 Instruct LLM with continuous batching in Wallaroo using the Custom framework with CUDA AI Acceleration. For access to these sample models and for a demonstration of how to use Continuous Batching to improve LLM performance:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload an LLM with the following options:
  - Framework: Custom. The Wallaroo Custom Model for this tutorial includes extensions to enable continuous batching with its deployment.
  - Framework Configuration to specify LLM options.
- Define a Continuous Batching Configuration and apply it to the LLM model configuration.
- Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Framework Configuration is applied at the LLM level, so it is inherited during deployment.
- Demonstrate how to perform a sample inference.
Requirements
This tutorial requires the following:
- Custom Llama vLLM encapsulated in the Wallaroo Custom Model aka BYOP Framework. This is available through a Wallaroo representative.
- Wallaroo version 2025.1 and above.
Tutorial Steps
Library Imports
We start by importing the libraries used for this tutorial, including the Wallaroo SDK. This is provided by default when executing this Jupyter Notebook in the Wallaroo JupyterHub service.
import base64
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import CustomConfig
Connect to the Wallaroo Instance
The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Custom vLLM Framework Requirements
Custom vLLM deployments in Wallaroo use the Custom Model aka BYOP framework. The following is a summary of the requirements for using Continuous Batching with Custom vLLM deployments.
Custom vLLM deployments include Python scripts that extend the Wallaroo SDK mac.inference.Inference and mac.inference.creation.InferenceBuilder classes. For Continuous Batching support, the following additions are required:
- In the requirements.txt file, the vllm library must be included. For optimal performance, use the version specified below:
  vllm==0.6.6
- Import the following libraries into the Python script that extends the mac.inference.Inference and mac.inference.creation.InferenceBuilder classes:
  from vllm import AsyncLLMEngine, SamplingParams
  from vllm.engine.arg_utils import AsyncEngineArgs
- The class that extends InferenceBuilder includes:
  - def inference(self) -> AsyncVLLMInference: Specifies the Inference instance used by create().
  - def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference: Creates the Inference subclass and adds the vLLM engine for use with inference requests.
The following shows an example of extending inference and create for AsyncVLLMInference. The entire code is available as part of this tutorial's artifacts under ./models/main.py.
class AsyncVLLMInferenceBuilder(InferenceBuilder):
    """Inference builder class for AsyncVLLMInference."""

    @property
    def inference(self) -> AsyncVLLMInference:
        """Returns an Inference subclass instance.
        This specifies the Inference instance to be used
        by create() to build additionally needed components."""
        return AsyncVLLMInference()

    def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
        """Creates an Inference subclass and assigns a model to it.

        :param config: Inference configuration

        :return: Inference subclass
        """
        inference = self.inference
        inference.model = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=(config.model_path / "model").as_posix(),
            ),
        )
        return inference
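The builder above assigns an AsyncLLMEngine to the inference instance; the full AsyncVLLMInference class is in ./models/main.py. As a rough illustration only, the generation path inside such a class could drive the async engine like the hypothetical helper below. The helper name and the surrounding integration are assumptions, not the tutorial's actual implementation.
import uuid
from vllm import SamplingParams

# Hypothetical helper showing how an async inference method can drive the
# AsyncLLMEngine assigned by the builder above. The unique request_id lets
# vLLM interleave (continuously batch) many in-flight requests on the GPU.
async def generate_text(engine, prompt: str, max_tokens: int) -> str:
    sampling_params = SamplingParams(max_tokens=max_tokens)
    request_id = str(uuid.uuid4())
    final_output = None
    # AsyncLLMEngine.generate is an async generator that streams partial
    # results; the last item contains the completed generation.
    async for request_output in engine.generate(prompt, sampling_params, request_id):
        final_output = request_output
    return final_output.outputs[0].text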
Upload the Custom vLLM Runtime
Custom vLLM Runtimes are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API. The following procedures demonstrate both methods.
Define Input and Output Schemas
The input and output schemas are defined in Apache pyarrow format.
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
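As an optional sanity check (not part of the upload itself), a sample record can be cast against the input schema with pyarrow before uploading; an exception here means the record does not match the schema.
# Optional: confirm a sample record conforms to the input schema.
sample = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})
pa.Table.from_pandas(sample, schema=input_schema, preserve_index=False)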
Upload Custom vLLM Runtime via the MLOps API
Wallaroo provides the Wallaroo MLOps API. For full details on using the Wallaroo MLOps API, including client connections and endpoints, see the Wallaroo API Guide.
Models are uploaded via the following Wallaroo MLOps API endpoint:
/v1/api/models/upload_and_convert
The parameters for this endpoint include:
- The name assigned to the LLM in Wallaroo.
- The workspace the model is assigned to.
- The input and output schemas.
- Any optional framework configurations to optimize LLM performance.
- The path of the LLM file.
The following code sample demonstrates uploading a Custom vLLM Framework runtime with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.
We start by converting the input and output schemas to base64.
# Encode the input and output schemas to base64 for the upload request metadata.
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
Run the following curl command to upload the model via the Wallaroo MLOps API.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-auth-token-here>" \
-F 'metadata={"name": "byop-vllm-tinyllama-async-fc-v3", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@byop-tinyllama-custom-config.zip;type=application/octet-stream" \
https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert | cat
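For reference, the same multipart upload can be scripted from Python with the requests library. The sketch below mirrors the metadata fields from the curl example; the hostname, token, workspace id, and file path are placeholders for your own environment.
import base64
import json
import requests

# Placeholders for your own environment.
api_url = "https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert"
token = "your-auth-token-here"
workspace_id = 1  # replace with your workspace id

metadata = {
    "name": "byop-vllm-tinyllama-async-fc-v3",
    "visibility": "private",
    "workspace_id": workspace_id,
    "conversion": {
        "framework": "custom",
        "python_version": "3.8",
        "requirements": [],
        "framework_config": {
            "config": {"gpu_memory_utilization": 0.9, "max_model_len": 128},
            "framework": "custom",
        },
    },
    "input_schema": base64.b64encode(bytes(input_schema.serialize())).decode("utf8"),
    "output_schema": base64.b64encode(bytes(output_schema.serialize())).decode("utf8"),
}

with open("byop-tinyllama-custom-config.zip", "rb") as f:
    response = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("byop-tinyllama-custom-config.zip", f, "application/octet-stream"),
        },
        timeout=600,
    )
response.json()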
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model
for additional configuration and deployment options.
# Retrieve the model
custom_framework_model = wl.get_model("byop-vllm-tinyllama-async-fc-v3")
custom_framework_model
Upload the LLM via the Wallaroo SDK
The model is uploaded via the Wallaroo SDK method wallaroo.client.Client.upload_model
which sets the following:
- The name assigned to the LLM in Wallaroo.
- The input and output schemas.
- Any optional framework configurations to optimize LLM performance, defined by the wallaroo.framework.CustomConfig object.
  - Any CustomConfig parameters not defined at model upload are set to the default values.
- The path of the LLM file.
Define CustomConfig
We define the wallaroo.framework.CustomConfig
object and set the values.
For this example, the CustomConfig parameters are set with the following:
- gpu_memory_utilization=0.9
- max_model_len=128
Other parameters not defined here use the default values.
custom_framework_config = CustomConfig(
gpu_memory_utilization=0.9,
max_model_len=128
)
Upload model via the Wallaroo SDK
With our values set, we upload the model with the wallaroo.client.Client.upload_model
method with the following parameters:
- Model name and path to the Custom Llama LLM.
- framework set to Framework.CUSTOM.
- framework_config set to our defined CustomConfig.
- Input and output schemas.
- accel set to wallaroo.engine_config.Acceleration.CUDA.
custom_framework_model = wl.upload_model(
"byop-vllm-tinyllama-ynsv5",
"./byop_tinyllama_vllm_v4.zip",
framework=Framework.CUSTOM,
framework_config=custom_framework_config,
input_schema=input_schema,
output_schema=output_schema,
accel=Acceleration.CUDA
)
custom_framework_model
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.
Model is attempting loading to a container runtime.............................successful
Ready
| Name | byop-vllm-tinyllama-ynsv5 |
|---|---|
| Version | 4b40ba86-8af1-4945-bde6-137245d5e618 |
| File Name | byop_tinyllama_vllm_v4.zip |
| SHA | 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 18:22:35 |
| Workspace id | 60 |
| Workspace name | sample.user@wallaroo.ai - Default Workspace |
Set Continuous Batching Configuration
The model configuration is set either during model upload or after model upload. We define the continuous batching configuration with the max concurrent batch size set to 100, then apply it to the model configuration.
If max_concurrent_batch_size is not specified, it defaults to 256.
When applying a continuous batch configuration to a model configuration, the input and output schemas must be included.
# Define continuous batching for Async vLLM (you can choose the number of connections you want)
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)
custom_framework_with_continuous_batching = custom_framework_model.configure(
input_schema = input_schema,
output_schema = output_schema,
continuous_batching_config = cbc
)
custom_framework_with_continuous_batching
| Name | byop-vllm-tinyllama-ynsv5 |
|---|---|
| Version | 4b40ba86-8af1-4945-bde6-137245d5e618 |
| File Name | byop_tinyllama_vllm_v4.zip |
| SHA | 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132 |
| Architecture | x86 |
| Acceleration | cuda |
| Updated At | 2025-08-May 18:22:35 |
| Workspace id | 60 |
| Workspace name | sample.user@wallaroo.ai - Default Workspace |
Deploy LLMs Using the Custom Wallaroo vLLM Runtime with Continuous Batch Configuration
Models are deployed in Wallaroo via Wallaroo Pipelines through the following process.
- Create a deployment configuration. If no deployment configuration is specified, then the default values are used. For our deployment, we specify that the LLM is assigned the following resources:
  - 1 cpu
  - 10 Gi RAM
  - 1 gpu from the nodepool "wallaroo.ai/accelerator:a100". Wallaroo deployments and pipelines inherit the acceleration setting from the model, so this will be CUDA.
- Create the Wallaroo pipeline.
- Assign the model as a pipeline step to process incoming data and return the inference results.
- Deploy the pipeline with the deployment configuration.
Define the Deployment Configuration
The deployment configuration allocates resources for the LLM’s exclusive use. These resources are used by the LLM until the pipeline is undeployed and the resources are returned.
deployment_config = DeploymentConfigBuilder() \
.cpus(1.).memory('1Gi') \
.sidekick_cpus(custom_framework_with_continuous_batching, 1.) \
.sidekick_memory(custom_framework_with_continuous_batching, '10Gi') \
.sidekick_gpus(custom_framework_with_continuous_batching, 1) \
.deployment_label("wallaroo.ai/accelerator:t4-shared") \
.build()
Deploy the LLM pipeline With the Custom vLLM Runtime and Continuous Batching Configurations
In the next steps, we deploy the model by creating the pipeline, adding the vLLM as a pipeline step, and deploying the pipeline with the deployment configuration.
Once complete, the model is ready to accept inference requests.
pipeline = wl.build_pipeline("byop-tinyllama-cutom-vllm")
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(custom_framework_with_continuous_batching)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.4.7.8',
'name': 'engine-65bc55d64f-mdrnh',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'byop-tinyllama-cutom-vllm',
'status': 'Running',
'version': '95a07681-e434-4108-8e9c-01c052b7b5ec'}]},
'model_statuses': {'models': [{'model_version_id': 434,
'name': 'byop-vllm-tinyllama-ynsv5',
'sha': '5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e',
'status': 'Running',
'version': '4b40ba86-8af1-4945-bde6-137245d5e618'}]}}],
'engine_lbs': [{'ip': '10.4.1.15',
'name': 'engine-lb-5cf49f9d5f-dkvsz',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.4.7.9',
'name': 'engine-sidekick-byop-vllm-tinyllama-ynsv5-434-5cc6f466fc-zqzbk',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Inference
Inference requests are submitted to deployed models as either pandas DataFrames or Apache Arrow tables. The inference data must match the input schemas defined earlier.
Our sample inference request submits a pandas DataFrame with a simple prompt and the max_tokens
field set to 200
. We receive a pandas DataFrame in return with the outputs labeled as out.{variable_name}
, with variable_name
matching the output schemas defined at model upload.
data = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})
pipeline.infer(data, timeout=600)
|  | time | in.max_tokens | in.prompt | out.generated_text | out.num_output_tokens | anomaly.count |
|---|---|---|---|---|---|---|
| 0 | 2025-05-08 18:41:35.436 | 200 | What is Wallaroo.AI? | \n2.2 How does Wallaroo.AI's Asset Composition... | 200 | 0 |
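Since deployed pipelines also accept Apache Arrow tables, the same request can be submitted as an Arrow table built against the input schema; in that case the results are returned as an Arrow table. A minimal sketch:
# The same inference submitted as an Apache Arrow table built from the input schema.
arrow_table = pa.Table.from_pydict(
    {"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]},
    schema=input_schema,
)
pipeline.infer(arrow_table, timeout=600)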
Undeploy
With the tutorial complete, the pipeline is undeployed to return the resources back to the Wallaroo environment.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ..................................... ok
| name | byop-tinyllama-demo-yns-cudafix |
|---|---|
| created | 2025-05-08 18:23:23.012161+00:00 |
| last_updated | 2025-05-08 18:23:23.094326+00:00 |
| deployed | False |
| workspace_id | 60 |
| workspace_name | sample.user@wallaroo.ai - Default Workspace |
| arch | x86 |
| accel | cuda |
| tags |  |
| versions | 2ae66497-d235-44b5-8be5-52a6b83cf945, 2c8d7c28-1702-4e6a-9805-c8f5b918ab36 |
| steps | byop-vllm-tinyllama-ynsv5 |
| published | False |