Continuous Batching for Custom Llama with vLLM


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.


Wallaroo’s continuous batching feature using the vLLM runtime provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.

Wallaroo continuous batching is supported with vLLM across two different autopackaging scenarios:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.
  • wallaroo.framework.Framework.CUSTOM: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict) provide greater flexibility through a lightweight Python interface.

For more details, see Continuous Batching for LLMs.
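
As a quick orientation, the two scenarios differ only in the framework passed at model upload time. The following is a minimal, hedged sketch assuming a connected Wallaroo client wl and the PyArrow input/output schemas defined later in this tutorial; the model names and file paths are illustrative only.

from wallaroo.framework import Framework

# Native async vLLM runtime: Wallaroo packages the vLLM engine directly.
native_vllm_model = wl.upload_model(
    "native-vllm-llm",            # illustrative model name
    "./native_vllm_llm.zip",      # illustrative model artifact
    framework=Framework.VLLM,
    input_schema=input_schema,
    output_schema=output_schema,
)

# Custom (BYOP) async vLLM runtime: a lightweight Python interface wraps the
# vLLM engine. This is the path demonstrated in this tutorial.
custom_vllm_model = wl.upload_model(
    "custom-vllm-llm",            # illustrative model name
    "./byop_vllm_llm.zip",        # illustrative BYOP archive
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)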

This tutorial demonstrates deploying the Llama V3 Instruct LLM with continuous batching in Wallaroo with CUDA AI Acceleration via the Custom Framework. For access to these sample models and for a demonstration of how to use Continuous Batching to improve LLM performance, contact your Wallaroo representative.

Tutorial Overview

This tutorial demonstrates using Wallaroo to:

  • Upload a LLM with the following options:
    • Framework: Custom. The Wallaroo Custom Model for this tutorial includes extensions to enable continuous batching with its deployment.
    • Framework Configuration to specify LLM options.
  • Define a Continuous Batching Configuration and apply it to the LLM model configuration.
  • Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Framework Configuration is applied at the LLM level, so it is inherited during deployment.
  • Demonstrate how to perform a sample inference.

Requirements

This tutorial requires the following:

  • Custom Llama vLLM encapsulated in the Wallaroo Custom Model aka BYOP Framework. This is available through a Wallaroo representative.
  • Wallaroo version 2025.1 and above.

Tutorial Steps

Library Imports

We start by importing the libraries used for this tutorial, including the Wallaroo SDK. This is provided by default when executing this Jupyter Notebook in the Wallaroo JupyterHub service.

import base64
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import CustomConfig

Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Custom vLLM Framework Requirements

Custom vLLM deployments in Wallaroo use the Custom Model aka BYOP framework. The following is a summary of the requirements for using Continuous Batching with Custom vLLM deployments.

Custom vLLM deployments include Python scripts that extend the Wallaroo SDK mac.inference.Inference and mac.inference.creation.InferenceBuilder. For Continuous Batching support, the following additions are required:

  • In the requirements.txt file, the vllm library must be included. For optimal performance, use the version specified below.

    vllm==0.6.6
    
  • Import the following libraries into the Python script that extends the mac.inference.Inference and mac.inference.creation.InferenceBuilder:

    from vllm import AsyncLLMEngine, SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    
  • The class that extends InferenceBuilder implements the following:

    • def inference(self) -> AsyncVLLMInference: Specifies the Inference instance used by create.
    • def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference: Creates the inference subclass and adds the vLLM for use with the inference requests.

The following shows an example of extending inference and create for AsyncVLLMInference. The entire code is available as part of this tutorial's artifacts under ./models/main.py.

class AsyncVLLMInferenceBuilder(InferenceBuilder):
    """Inference builder class for AsyncVLLMInference."""

    @property
    def inference(self) -> AsyncVLLMInference:
        """Returns an Inference subclass instance.
        This specifies the Inference instance to be used
        by create() to build additionally needed components."""
        return AsyncVLLMInference()

    def create(self, config: CustomInferenceConfig) -> AsyncVLLMInference:
        """Creates an Inference subclass and assigns a model to it.

        :param config: Inference configuration

        :return: Inference subclass
        """
        inference = self.inference
        inference.model = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model=(config.model_path / "model").as_posix(),
            ),
        )
        return inference
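
The tutorial's full ./models/main.py is not reproduced here. As a rough, hedged illustration of how an Inference subclass typically drives the engine created above, the generation step might resemble the following sketch; the helper name and its wiring into the class are hypothetical, and only the vLLM calls (AsyncLLMEngine.generate and SamplingParams) are the library's own.

import uuid
from vllm import SamplingParams

async def generate_text(engine, prompt: str, max_tokens: int) -> str:
    """Hypothetical helper: run one prompt through an AsyncLLMEngine."""
    sampling_params = SamplingParams(max_tokens=max_tokens)
    final_output = None
    # AsyncLLMEngine.generate yields incremental RequestOutput objects;
    # the last one yielded contains the completed generation.
    async for request_output in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
        final_output = request_output
    return final_output.outputs[0].text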

Upload the Custom vLLM Runtime

Custom vLLM Runtimes are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API. The following procedures demonstrate both methods.

Define Input and Output Schemas

The input and output schemas are defined in Apache PyArrow format.

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])
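
As an optional sanity check (not part of the original tutorial), a sample record can be converted against the input schema with PyArrow; a record that does not conform raises an error.

# Optional check: confirm a sample record conforms to the input schema.
sample = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})
pa.Table.from_pandas(sample, schema=input_schema, preserve_index=False)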

Upload Custom vLLM Runtime via the MLOps API

Wallaroo provides the Wallaroo MLOps API. For full details on using the Wallaroo MLOps API including client connections, endpoints, etc, see the Wallaroo API Guide.

Models are uploaded via the Wallaroo MLOps API via the following endpoint:

  • /v1/api/models/upload_and_convert

The parameters for this endpoint include:

  • The name assigned to the LLM in Wallaroo.
  • The workspace the model is assigned to.
  • The inputs and output schema.
  • Any optional framework configurations to optimize LLM performance.
  • The path of the LLM file.

The following code sample demonstrates uploading a Custom vLLM Framework runtime with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.

We start by converting the input and output schemas to base64 strings; these values are pasted into the metadata of the upload request below.

encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")

Run the following curl command to upload the model via the Wallaroo MLOps API.

curl --progress-bar -X POST \
    -H "Content-Type: multipart/form-data"  \
    -H "Authorization: Bearer <your-auth-token-here>"  \
    -F 'metadata={"name": "byop-vllm-tinyllama-async-fc-v3", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
    -F "file=@byop-tinyllama-custom-config.zip;type=application/octet-stream" \
   https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert | cat
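
For reference, the same upload can be scripted from Python. The following is a hedged sketch using the requests library with the metadata fields from the curl example above; the token and workspace id are placeholders, and encoded_input_schema / encoded_output_schema are the base64 strings generated earlier.

import json
import requests

AUTH_TOKEN = "<your-auth-token-here>"   # placeholder: your bearer token
WORKSPACE_ID = 0                        # placeholder: your workspace id

metadata = {
    "name": "byop-vllm-tinyllama-async-fc-v3",
    "visibility": "private",
    "workspace_id": WORKSPACE_ID,
    "conversion": {
        "framework": "custom",
        "python_version": "3.8",
        "requirements": [],
        "framework_config": {
            "config": {"gpu_memory_utilization": 0.9, "max_model_len": 128},
            "framework": "custom",
        },
    },
    "input_schema": encoded_input_schema,
    "output_schema": encoded_output_schema,
}

with open("byop-tinyllama-custom-config.zip", "rb") as model_file:
    response = requests.post(
        "https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("byop-tinyllama-custom-config.zip", model_file, "application/octet-stream"),
        },
    )
response.json()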

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model for additional configuration and deployment options.

# Retrieve the model
custom_framework_model = wl.get_model("byop-vllm-tinyllama-async-fc-v3")
custom_framework_model

Upload the LLM via the Wallaroo SDK

The model is uploaded via the Wallaroo SDK method wallaroo.client.Client.upload_model which sets the following:

  • The name assigned to the LLM in Wallaroo.
  • The inputs and output schema.
  • Any optional framework configurations to optimize LLM performance defined by the wallaroo.framework.CustomConfig object.
    • Any CustomConfig parameters not defined at model upload are set to the default values.
  • The path of the LLM file.

Define CustomConfig

We define the wallaroo.framework.CustomConfig object and set the values.

For this example, the CustomConfig parameters are set with the following:

  • gpu_memory_utilization=0.9
  • max_model_len=128

Other parameters not defined here use the default values.

custom_framework_config = CustomConfig(
    gpu_memory_utilization=0.9,
    max_model_len=128
)

Upload model via the Wallaroo SDK

With our values set, we upload the model with the wallaroo.client.Client.upload_model method with the following parameters:

  • Model name and path to the Custom Llama LLM.
  • framework_config set to our defined CustomConfig.
  • Input and output schemas.
  • accel set to wallaroo.engine_config.Acceleration.CUDA.

custom_framework_model = wl.upload_model(
    "byop-vllm-tinyllama-ynsv5", 
    "./byop_tinyllama_vllm_v4.zip",
    framework=Framework.CUSTOM,
    framework_config=custom_framework_config,
    input_schema=input_schema, 
    output_schema=output_schema,
    accel=Acceleration.CUDA
)
custom_framework_model
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.
Model is attempting loading to a container runtime .............................successful.

Ready
Name: byop-vllm-tinyllama-ynsv5
Version: 4b40ba86-8af1-4945-bde6-137245d5e618
File Name: byop_tinyllama_vllm_v4.zip
SHA: 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132
Architecture: x86
Acceleration: cuda
Updated At: 2025-08-May 18:22:35
Workspace id: 60
Workspace name: sample.user@wallaroo.ai - Default Workspace

Set Continuous Batching Configuration

The model configuration is set either during model upload or after the model is uploaded. We define the continuous batching configuration with the maximum concurrent batch size set to 100, then apply it to the model configuration.

If max_concurrent_batch_size is not specified, it defaults to 256.

When applying a continuous batch configuration to a model configuration, the input and output schemas must be included.

# Define continuous batching for Async vLLM (you can choose the number of connections you want)
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)
custom_framework_with_continuous_batching = custom_framework_model.configure(
    input_schema = input_schema,
    output_schema = output_schema,
    continuous_batching_config = cbc
)
custom_framework_with_continuous_batching
Name: byop-vllm-tinyllama-ynsv5
Version: 4b40ba86-8af1-4945-bde6-137245d5e618
File Name: byop_tinyllama_vllm_v4.zip
SHA: 5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132
Architecture: x86
Acceleration: cuda
Updated At: 2025-08-May 18:22:35
Workspace id: 60
Workspace name: sample.user@wallaroo.ai - Default Workspace

Deploy LLMs Using the Custom Wallaroo vLLM Runtime with Continuous Batch Configuration

Models are deployed in Wallaroo via Wallaroo Pipelines through the following process.

  • Create a deployment configuration. If no deployment configuration is specified, then the default values are used. For our deployment, we specify the LLM is assigned the following resources:
    • 1 cpu
    • 10 Gi RAM
    • 1 gpu, requested from the GPU nodepool via the deployment label set in the configuration below ("wallaroo.ai/accelerator:t4-shared"). Wallaroo deployments and pipelines inherit the acceleration settings from the model, so this will be CUDA.
  • Create the Wallaroo pipeline.
  • Assign the model as a pipeline step to process incoming data and return the inference results.
  • Deploy the pipeline with the pipeline configuration.

Define the Deployment Configuration

The deployment configuration allocates resources for the LLM’s exclusive use. These resources are used by the LLM until the pipeline is undeployed and the resources returned.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1.).memory('1Gi') \
    .sidekick_cpus(custom_framework_with_continuous_batching, 1.) \
    .sidekick_memory(custom_framework_with_continuous_batching, '10Gi') \
    .sidekick_gpus(custom_framework_with_continuous_batching, 1) \
    .deployment_label("wallaroo.ai/accelerator:t4-shared") \
    .build()

Deploy the LLM pipeline With the Custom vLLM Runtime and Continuous Batching Configurations

In the next steps, we deploy the model by creating the pipeline, adding the LLM as the pipeline step, and deploying the pipeline with the deployment configuration.

Once complete, the model is ready to accept inference requests.

pipeline = wl.build_pipeline("byop-tinyllama-cutom-vllm")
pipeline.undeploy()
pipeline.clear()

pipeline.add_model_step(custom_framework_with_continuous_batching)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.7.8',
   'name': 'engine-65bc55d64f-mdrnh',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'byop-tinyllama-cutom-vllm',
      'status': 'Running',
      'version': '95a07681-e434-4108-8e9c-01c052b7b5ec'}]},
   'model_statuses': {'models': [{'model_version_id': 434,
      'name': 'byop-vllm-tinyllama-ynsv5',
      'sha': '5e244d5ab73cf718256d1d08b7c0553102215f69c3d70936b2d4b89043499a2e',
      'status': 'Running',
      'version': '4b40ba86-8af1-4945-bde6-137245d5e618'}]}}],
 'engine_lbs': [{'ip': '10.4.1.15',
   'name': 'engine-lb-5cf49f9d5f-dkvsz',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.7.9',
   'name': 'engine-sidekick-byop-vllm-tinyllama-ynsv5-434-5cc6f466fc-zqzbk',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference

Inference requests are submitted to deployed models as either pandas DataFrames or Apache Arrow tables. The inference data must match the input schemas defined earlier.

Our sample inference request submits a pandas DataFrame with a simple prompt and the max_tokens field set to 200. We receive a pandas DataFrame in return with the outputs labeled as out.{variable_name}, with variable_name matching the output schemas defined at model upload.

data = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})
pipeline.infer(data, timeout=600)
time: 2025-05-08 18:41:35.436
in.max_tokens: 200
in.prompt: What is Wallaroo.AI?
out.generated_text: \n2.2 How does Wallaroo.AI's Asset Composition...
out.num_output_tokens: 200
anomaly.count: 0
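
Since deployed pipelines also accept Apache Arrow tables, the same request can be submitted as a pa.Table; the following is a brief illustrative variant of the DataFrame call above.

# The same inference request submitted as an Apache Arrow table.
arrow_data = pa.Table.from_pydict(
    {"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]},
    schema=input_schema,
)
pipeline.infer(arrow_data, timeout=600)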

Undeploy

With the tutorial complete, the pipeline is undeployed to return the resources back to the Wallaroo environment.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ..................................... ok
name: byop-tinyllama-demo-yns-cudafix
created: 2025-05-08 18:23:23.012161+00:00
last_updated: 2025-05-08 18:23:23.094326+00:00
deployed: False
workspace_id: 60
workspace_name: sample.user@wallaroo.ai - Default Workspace
arch: x86
accel: cuda
tags:
versions: 2ae66497-d235-44b5-8be5-52a6b83cf945, 2c8d7c28-1702-4e6a-9805-c8f5b918ab36
steps: byop-vllm-tinyllama-ynsv5
published: False