Continuous Batching for Llama 3.1 8B with vLLM


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Wallaroo’s continuous batching feature using the vLLM runtime provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.

Wallaroo continuous batching is supported with vLLM across two different autopackaging scenarios:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations in Wallaroo compatible with NVIDIA CUDA.
  • wallaroo.framework.Framework.CUSTOM: Custom async vLLM implementations in Wallaroo using BYOP (Bring Your Own Predict) provide greater flexibility through a lightweight Python interface.

For more details on Continuous Batching for vLLMs, see Continuous Batching for LLMs.

This tutorial demonstrates deploying the Llama V3 Instruct LLM with continuous batching in Wallaroo with CUDA AI Acceleration using the Native vLLM Framework. For access to these sample models and for a demonstration of how to use continuous batching to improve LLM performance, contact your Wallaroo representative.

Tutorial Overview

This tutorial demonstrates using Wallaroo to:

  • Upload a LLM with the following options:
    • Framework: vLLM
    • A Framework Configuration to specify LLM options to optimize performance.
  • Define a Continuous Batching Configuration and apply it to the LLM model configuration.
  • Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Framework Configuration is applied at the LLM level, so it is inherited during deployment.
  • Demonstrate how to perform a sample inference.
  • Demonstrate publishing a Wallaroo pipeline to an Open Container Initiative (OCI) registry for deployment in multi-cloud or edge environments.

Requirements

This tutorial requires the following:

  • Llama V3 Instruct vLLM. This is available through a Wallaroo representative.
  • Wallaroo version 2025.1 and above.

Tutorial Steps

Library Imports

We start by importing the libraries used for this tutorial, including the Wallaroo SDK. This is provided by default when executing this Jupyter Notebook in the Wallaroo JupyterHub service.

import base64
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.object import EntityNotFoundError
from wallaroo.framework import VLLMConfig

Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()
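
If connecting from outside the internal JupyterHub service, the client accepts connection options such as the API endpoint. The sketch below is an assumption-labeled example with a placeholder URL, not a definitive reference; see the Client Connection guide for the authoritative parameters.

import wallaroo

# A minimal sketch of an external connection. The endpoint URL is a placeholder
# and the keyword arguments shown here are assumptions -- confirm them against
# the Client Connection guide for your Wallaroo version.
wl = wallaroo.Client(
    api_endpoint="https://wallaroo.example.com",  # placeholder Wallaroo API endpoint
    auth_type="sso",                              # browser-based confirmation flow
)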

Upload Model Native vLLM Runtime

Native vLLM Runtimes are uploaded either via the Wallaroo SDK or the Wallaroo MLOps API. The following procedures demonstrate both methods.

Define Input and Output Schemas

For both the Wallaroo SDK and the Wallaroo MLOps API, the input and output schemas must be defined in Apache Arrow format using the pyarrow library. The following demonstrates defining those schemas.

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64())
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])

Upload Native vLLM Framework via the MLOps API

Wallaroo provides the Wallaroo MLOps API. For full details on using the Wallaroo MLOps API, including client connections and endpoints, see the Wallaroo API Guide.

Models are uploaded through the Wallaroo MLOps API using the following endpoint:

  • /v1/api/models/upload_and_convert

The parameters for this endpoint include:

  • The name assigned to the LLM in Wallaroo.
  • The workspace the model is assigned to.
  • The inputs and output schema.
  • Any optional framework configurations to optimize LLM performance.
  • The path of the LLM file.

The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.

We start by converting the input and output schemas to base64 strings for use in the upload request.

encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")

print(encoded_input_schema)
print(encoded_output_schema)

Run the following curl command to upload the model via the Wallaroo MLOps API.

curl --progress-bar -X POST \
   -H "Content-Type: multipart/form-data" \
   -H "Authorization: Bearer <your-auth-token-here>" \
   -F 'metadata={"name": "vllm-llama31-8b-async-fc-v3", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "vllm", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "vllm"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json'\
   -F "file=@llama-31-8b-instruct.zip;type=application/octet-stream" \
   https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert | cat
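
For illustration only, the following sketch performs the same upload with the Python requests library instead of curl. It mirrors the metadata from the curl example above, reuses the input_schema and output_schema defined earlier, and uses placeholder values for the token, workspace id, and file name.

import base64
import json
import requests

# Placeholder values -- replace with your own token, workspace id, and file path.
auth_token = "<your-auth-token-here>"
workspace_id = 0  # hypothetical placeholder for your workspace id

metadata = {
    "name": "vllm-llama31-8b-async-fc-v3",
    "visibility": "private",
    "workspace_id": workspace_id,
    "conversion": {
        "framework": "vllm",
        "python_version": "3.8",
        "requirements": [],
        "framework_config": {
            "config": {"gpu_memory_utilization": 0.9, "max_model_len": 128},
            "framework": "vllm",
        },
    },
    # The schemas are base64 encoded, matching the curl example above.
    "input_schema": base64.b64encode(bytes(input_schema.serialize())).decode("utf8"),
    "output_schema": base64.b64encode(bytes(output_schema.serialize())).decode("utf8"),
}

with open("llama-31-8b-instruct.zip", "rb") as f:
    response = requests.post(
        "https://benchmarkscluster.wallarooexample.ai/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {auth_token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("llama-31-8b-instruct.zip", f, "application/octet-stream"),
        },
    )

print(response.status_code)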

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model for additional configuration and deployment options.

# Retrieve the model
vllm_model = wl.get_model("vllm-llama31-8b-async-fc-v3")

Upload the LLM via the Wallaroo SDK

The model is uploaded via the Wallaroo SDK method wallaroo.client.Client.upload_model which sets the following:

  • The name assigned to the LLM in Wallaroo.
  • The inputs and output schema.
  • Any optional framework configurations to optimize LLM performance defined by the wallaroo.framework.VLLMConfig object.
    • Any VLLMConfig parameters not defined at model upload are set to the default values.
  • The path of the LLM file.

Define VLLMConfig

We define the wallaroo.framework.VLLMConfig object and set its values.

For this example, the VLLMConfig parameters are set as follows:

  • gpu_memory_utilization=0.9
  • max_model_len=128

Other parameters not defined here use the default values.

vllm_framework_config = VLLMConfig(
    gpu_memory_utilization=0.9,
    max_model_len=128
)

Upload model via the Wallaroo SDK

With our values set, we upload the model with the wallaroo.client.Client.upload_model method with the following parameters:

  • Model name and path to the Llama V3 Instruct LLM.
  • framework_config set to our defined VLLMConfig.
  • Input and output schemas.
  • accel set to wallaroo.engine_config.Acceleration.CUDA.

vllm = wl.upload_model(
    "vllm-llama31-8b-async-demo", 
    "./vLLM_llama-31-8b.zip",
    framework=Framework.VLLM,
    framework_config=vllm_framework_config,
    input_schema=input_schema, 
    output_schema=output_schema,
    accel=Acceleration.CUDA
)
vllm
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.
.............................................successful

Ready

Name             vllm-llama31-8b-async-demo
Version          422d3ad9-1bc7-40c1-99af-0ba109964bfd
File Name        vLLM_llama-31-8b.zip
SHA              62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status           ready
Image Path       proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132
Architecture     x86
Acceleration     cuda
Updated At       2025-08-May 19:24:36
Workspace id     60
Workspace name   sample.user@wallaroo.ai - Default Workspace

Set Continuous Batching Configuration

The model configuration is set either during or after model upload. We define the continuous batching configuration with the maximum concurrent batch size set to 100, then apply it to the model configuration.

If max_concurrent_batch_size is not specified, it defaults to 256.

When applying a continuous batch configuration to a model configuration, the input and output schemas must be included.

# Define continuous batching for Async vLLM (you can choose the number of connections you want)
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)
vllm_with_continuous_batching = vllm.configure(
    input_schema = input_schema,
    output_schema = output_schema,
    continuous_batching_config = cbc
)
vllm_with_continuous_batching
Name             vllm-llama31-8b-async-demo
Version          422d3ad9-1bc7-40c1-99af-0ba109964bfd
File Name        vLLM_llama-31-8b.zip
SHA              62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838
Status           ready
Image Path       proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-main-6132
Architecture     x86
Acceleration     cuda
Updated At       2025-08-May 19:24:36
Workspace id     60
Workspace name   sample.user@wallaroo.ai - Default Workspace

Deploy LLMs Using the Native Wallaroo vLLM Runtime with Continuous Batch Configuration

Models are deployed in Wallaroo via Wallaroo Pipelines through the following process.

  • Create a deployment configuration. If no deployment configuration is specified, then the default values are used. For our deployment, we specify the LLM is assigned the following resources:
    • 1 cpu
    • 10 Gi RAM
    • 1 gpu from the nodepool "wallaroo.ai/accelerator:a100". Wallaroo deployments and pipelines inherit the acceleration settings from the model, so this will be CUDA.
  • Create the Wallaroo pipeline.
  • Assign the model as a pipeline step to process incoming data and return the inference results.
  • Deploy the pipeline with the pipeline configuration.

Define the Deployment Configuration

The deployment configuration allocates resources for the LLM’s exclusive use. These resources are used by the LLM until the pipeline is undeployed and the resources returned.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1.).memory('1Gi') \
    .sidekick_cpus(vllm_with_continuous_batching, 1.) \
    .sidekick_memory(vllm_with_continuous_batching, '10Gi') \
    .sidekick_gpus(vllm_with_continuous_batching, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()

Deploy the LLM pipeline With the Native vLLM Runtime and Continuous Batching Configurations

In the next steps, we deploy the model by creating the pipeline, adding the vLLM as a pipeline step, and deploying the pipeline with the deployment configuration.

Once complete, the model is ready to accept inference requests.

pipeline = wl.build_pipeline("llama-31-8b-vllm-demo")
pipeline.clear()
pipeline.undeploy()

pipeline.add_model_step(vllm_with_continuous_batching)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.8.2',
   'name': 'engine-8558f6576d-8h7pc',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llama-31-8b-vllm-demo',
      'status': 'Running',
      'version': '62806288-5f42-44b8-9345-bb4dfb613801'}]},
   'model_statuses': {'models': [{'model_version_id': 443,
      'name': 'vllm-llama31-8b-async-demo',
      'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
      'status': 'Running',
      'version': '422d3ad9-1bc7-40c1-99af-0ba109964bfd'}]}}],
 'engine_lbs': [{'ip': '10.4.1.17',
   'name': 'engine-lb-5cf49f9d5f-sqr4f',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.8.7',
   'name': 'engine-sidekick-vllm-llama31-8b-async-demo-443-75d58845c-svvll',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}

Inference

Inference requests are submitted to deployed models as either pandas DataFrames or Apache Arrow tables. The inference data must match the input schemas defined earlier.

Our sample inference request submits a pandas DataFrame with a simple prompt and the max_tokens field set to 200. We receive a pandas DataFrame in return with the outputs labeled as out.{variable_name}, with variable_name matching the output schemas defined at model upload.

data = pd.DataFrame({"prompt": ["What is Wallaroo.AI?"], "max_tokens": [200]})
pipeline.infer(data)
   time                     in.max_tokens  in.prompt             out.generated_text                                  out.num_output_tokens  anomaly.count
0  2025-05-08 19:42:06.259  200            What is Wallaroo.AI?  Cloud and AutoML with Python\nWallaroo.AI is a...   122                    0
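
Since inference requests also accept Apache Arrow tables, the following sketch shows the same request built with pyarrow. It assumes the pipeline is still deployed and reuses the input_schema defined at model upload.

import pyarrow as pa

# Build an Arrow table that matches the input schema defined earlier.
arrow_table = pa.Table.from_pydict(
    {
        "prompt": ["What is Wallaroo.AI?"],
        "max_tokens": [200],
    },
    schema=input_schema,
)

# Submit the Arrow table to the deployed pipeline.
result = pipeline.infer(arrow_table)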

Publish Pipeline

Wallaroo pipelines are published to OCI Registries via the wallaroo.pipeline.Pipeline.publish method. This stores the following in the OCI registry:

  • The LLM set as the pipeline step.
  • The Wallaroo engine used to deploy the LLM. The engine used is targeted based on settings inherited from the LLM set during the model upload stage. These settings include:
    • Architecture
    • AI acceleration
    • Framework Configuration
  • The deployment configuration included as a parameter to the publish command.

For more details on publishing, deploying, and inferencing in multi-cloud and edge with Wallaroo, see Edge and Multi-cloud Model Publish and Deploy.

Note that when published to an OCI registry, the publish command returns the docker run and helm install commands used to deploy the specified LLM.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
............................................... Published.
ID                     36
Pipeline Name          llama-31-8b-vllm-demo
Pipeline Version       a5b7a202-9923-4d8d-ba4c-31e22a83cddc
Status                 Published
Workspace Id           60
Workspace Name         sample.user@wallaroo.ai - Default Workspace
Edges
Engine URL             sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-main-6132
Pipeline URL           sample.registry.example.com/uat/pipelines/llama-31-8b-vllm-demo:a5b7a202-9923-4d8d-ba4c-31e22a83cddc
Helm Chart URL         oci://sample.registry.example.com/uat/charts/llama-31-8b-vllm-demo
Helm Chart Reference   sample.registry.example.com/uat/charts@sha256:af38b73f10fbf6d9da318568d86383b762dee766547a35c30dccf5f7907695e1
Helm Chart Version     0.0.1-a5b7a202-9923-4d8d-ba4c-31e22a83cddc
Engine Config          {'engine': {'resources': {'limits': {'cpu': 1.0, 'memory': '1Gi'}, 'requests': {'cpu': 1.0, 'memory': '1Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'vllm-llama31-8b-async-demo-443': {'resources': {'limits': {'cpu': 1.0, 'memory': '10Gi'}, 'requests': {'cpu': 1.0, 'memory': '10Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': True}}}}}
User Images            []
Created By             sample.user@wallaroo.ai
Created At             2025-05-08 19:42:16.092419+00:00
Updated At             2025-05-08 19:42:16.092419+00:00
Replaces
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=sample.registry.example.com/uat/pipelines/llama-31-8b-vllm-demo:a5b7a202-9923-4d8d-ba4c-31e22a83cddc \
    -e CONFIG_CPUS=1.0 --gpus all --cpus=2.0 --memory=11g \
    sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-main-6132

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://sample.registry.example.com/uat/charts/llama-31-8b-vllm-demo \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-a5b7a202-9923-4d8d-ba4c-31e22a83cddc \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

Undeploy

With the tutorial complete, the pipeline is undeployed to return the resources back to the Wallaroo environment.

pipeline.undeploy()
name              llama-31-8b-vllm-ynsv5
created           2025-05-06 12:31:40.360907+00:00
last_updated      2025-05-06 19:51:47.490400+00:00
deployed          False
workspace_id      60
workspace_name    sample.user@wallaroo.ai - Default Workspace
arch              x86
accel             cuda
tags
versions          b82ed30f-e937-4b49-94d5-63e6e798cc4b, b0a4ab4d-28ee-4470-9391-888a486375d2, 47760536-b263-428d-a9eb-f763c84f8920, 632917ff-0ffd-49be-abca-5a69a6432f93, 18cc0cad-cf6c-4abf-9083-ee90c2e704e2
steps             vllm-llama31-8b-async-ynsv5
published         False

This tutorial demonstrated deploying the Llama V3 Instruct LLM with continuous batching in Wallaroo with CUDA AI Acceleration. For access to these sample models and for a demonstration of how to use continuous batching to improve LLM performance, contact your Wallaroo representative.