Dynamic Batching with Llama 3 8B Instruct vLLM Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those inference requests into one batch, which is processed at once. This increases efficiency and inference performance by using resources in one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
The following tutorial demonstrates configuring a Llama 3 8B Instruct vLLM with a Wallaroo Dynamic Batching Configuration.
This example uses the Llama V3 Instruct LLM. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload a LLM
- Define a Dynamic Batching Configuration and apply it to the LLM.
- Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batch Configuration is applied at the LLM level, so it is inherited during deployment.
- Demonstrate how to perform a sample inference.
Requirements
This tutorial requires the following:
- Llama V3 Instruct LLM encapsulated in the Wallaroo Arbitrary Python (BYOP) framework. This is available through a Wallaroo representative.
- Wallaroo version 2024.3 and above.
Tutorial Steps
Import libraries
The first step is to import the required libraries.
import base64
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.object import EntityNotFoundError
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
Next, we connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload Model
For our example, we’ll upload the model using the Wallaroo MLOps API. The method wallaroo.client.Client.generate_upload_model_api_command generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API. The generated curl script is based on the Wallaroo SDK user’s current workspace. This is useful for environments that do not have the Wallaroo SDK installed, or for uploading very large models (10 gigabytes or more).
This method takes the following parameters:
Parameter | Type | Description |
---|---|---|
base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com . |
name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions. |
path | String (Required) | Path to the ML or LLM model file. |
framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models. |
input_schema | String (Required) | The model’s input schema in PyArrow.Schema format. |
output_schema | String (Required) | The model’s output schema in PyArrow.Schema format. |
This outputs a curl command in the following format (indentation added for emphasis). The sections marked with {} represent the variable names that are injected into the script from the above parameters or from the current SDK session:
- {Current Workspace['id']}: The value of the id for the current workspace.
- {Bearer Token}: The bearer token used to authenticate to the Wallaroo MLOps API.
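The following sketch of the command's shape is reconstructed from the generated output shown later in this tutorial; the {} placeholders are filled in by the SDK, and ... elides the remaining conversion fields:
curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer {Bearer Token}" \
  -F 'metadata={"name": "{name}", "visibility": "private", "workspace_id": {Current Workspace['id']}, "conversion": {...}, "input_schema": "{base64 input_schema}", "output_schema": "{base64 output_schema}"};type=application/json' \
  -F "file=@{path};type=application/octet-stream" \
  {base_url}/v1/api/models/upload_and_convert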
Define And Encode the Schemas
We define the input and output schemas in Apache PyArrow format.
input_schema = pa.schema([
pa.field('text', pa.string()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string())
])
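The input_schema and output_schema values embedded in the generated curl command are these schemas serialized and base64-encoded, which is why base64 is imported above. The encoding is handled automatically by generate_upload_model_api_command; a minimal sketch of what it produces, shown for reference only:
# Sketch: how the schemas appear in the generated curl command.
# pa.Schema.serialize() yields an Arrow buffer; its bytes are base64-encoded.
encoded_input_schema = base64.b64encode(input_schema.serialize().to_pybytes()).decode('utf-8')
encoded_output_schema = base64.b64encode(output_schema.serialize().to_pybytes()).decode('utf-8')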
Generate the curl Command
We generate the curl command through generate_upload_model_api_command as follows; replace the base_url with the one used for your Wallaroo instance. Then use the generated curl command to upload the model to the Wallaroo instance.
wl.generate_upload_model_api_command(
base_url='https://example.wallaroo.ai/',
name='llama3-8b-vllm-max-tokens-no-lock-v1',
path='byop-llama-3-80b-instruct.zip',
framework=Framework.CUSTOM,
input_schema=input_schema,
output_schema=output_schema)
curl --progress-bar -X POST -H "Content-Type: multipart/form-data" -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJYNFc1RzNBb19qTnRDSlJNRXhOVTM1X1NrNHYwMVdRQVRabm9WUC1NQ0RZIn0.eyJleHAiOjE3MjgwNTgxNjYsImlhdCI6MTcyODA1ODEwNiwianRpIjoiYWZjYTQwZGMtZTI3Yi00YjgwLThmMmQtMzUyNmM2ODE5YWI3IiwiaXNzIjoiaHR0cHM6Ly9kb2MtdGVzdC53YWxsYXJvb2NvbW11bml0eS5uaW5qYS9hdXRoL3JlYWxtcy9tYXN0ZXIiLCJhdWQiOlsibWFzdGVyLXJlYWxtIiwiYWNjb3VudCJdLCJzdWIiOiJmNzVhODYyOS03MGFiLTQxMDAtOGIzNy0wNGNmNzllNjY3ZWUiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJzZGstY2xpZW50Iiwic2Vzc2lvbl9zdGF0ZSI6ImVjZTIzMzJjLTA0NGItNDU3Ni05MWNjLTRmYzEwYTliMzc1ZiIsImFjciI6IjEiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiY3JlYXRlLXJlYWxtIiwiZGVmYXVsdC1yb2xlcy1tYXN0ZXIiLCJvZmZsaW5lX2FjY2VzcyIsImFkbWluIiwidW1hX2F1dGhvcml6YXRpb24iXX0sInJlc291cmNlX2FjY2VzcyI6eyJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsidmlldy1yZWFsbSIsInZpZXctaWRlbnRpdHktcHJvdmlkZXJzIiwibWFuYWdlLWlkZW50aXR5LXByb3ZpZGVycyIsImltcGVyc29uYXRpb24iLCJjcmVhdGUtY2xpZW50IiwibWFuYWdlLXVzZXJzIiwicXVlcnktcmVhbG1zIiwidmlldy1hdXRob3JpemF0aW9uIiwicXVlcnktY2xpZW50cyIsInF1ZXJ5LXVzZXJzIiwibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hdXRob3JpemF0aW9uIiwibWFuYWdlLWNsaWVudHMiLCJxdWVyeS1ncm91cHMiXX0sImFjY291bnQiOnsicm9sZXMiOlsibWFuYWdlLWFjY291bnQiLCJtYW5hZ2UtYWNjb3VudC1saW5rcyIsInZpZXctcHJvZmlsZSJdfX0sInNjb3BlIjoib3BlbmlkIGVtYWlsIHByb2ZpbGUiLCJzaWQiOiJlY2UyMzMyYy0wNDRiLTQ1NzYtOTFjYy00ZmMxMGE5YjM3NWYiLCJlbWFpbF92ZXJpZmllZCI6dHJ1ZSwiaHR0cHM6Ly9oYXN1cmEuaW8vand0L2NsYWltcyI6eyJ4LWhhc3VyYS11c2VyLWlkIjoiZjc1YTg2MjktNzBhYi00MTAwLThiMzctMDRjZjc5ZTY2N2VlIiwieC1oYXN1cmEtdXNlci1lbWFpbCI6ImpvaG4uaGFuc2FyaWNrQHdhbGxhcm9vLmFpIiwieC1oYXN1cmEtZGVmYXVsdC1yb2xlIjoiYWRtaW5fdXNlciIsIngtaGFzdXJhLWFsbG93ZWQtcm9sZXMiOlsidXNlciIsImFkbWluX3VzZXIiXSwieC1oYXN1cmEtdXNlci1ncm91cHMiOiJ7fSJ9LCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJqb2huLmhhbnNhcmlja0B3YWxsYXJvby5haSIsImVtYWlsIjoiam9obi5oYW5zYXJpY2tAd2FsbGFyb28uYWkifQ.L8JQOjXeNOLzhwSVzcDFUpdKHWzgN_H5DT12kh2CRb-mMJNeFlFzQT99piTgsv6QmYWXbaLU1aM86aQYjJX_5tdXj5CfttvJvM3xVovOiv_yt_7A1VGQa6zILAA-zYws5o6f9HesK6dOatuuWrl_vZyEMvalJom_IH3vQxEjPwKiGsZuirH38XEUmFcoLggxXTnfTtIyho_4Fl_X74U1lL-Xpw7LT8P7B9XzRs_ix4tO8HRTGuuxBj6IdeWxDlP6ZDVf_UTAeZGHAf6xguvX2DUZu1-49qVqcaA34hQzeTW_QnitGhcmQ4eoJYUbQCvM8EPDB1Uj37PgfctBx6xc9A" -F "metadata={"name": "llama3-70b-instruct", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" -F "file=@byop-llama-3-80b-instruct.zip;type=application/octet-stream" https://example.wallaroo.ai//v1/api/models/upload_and_convert
Retrieve the LLM
Once uploaded, we retrieve the LLM with the wallaroo.client.Client.get_model command, specifying the same model name used during the upload command.
llm_model = wl.get_model('llama3-8b-vllm-max-tokens-no-lock-v1')
llm_model
Name | llama3-8b-vllm-max-tokens-no-lock-v1 |
Version | 096963bc-fc88-483a-9cdc-f606e0d57e26 |
File Name | llama3-8b-vllm.zip |
SHA | b86841ca5d0976f6d688342caebcef2bcdbafd4f8d833ed9060f161a6a8b854c |
Status | ready |
Image Path | ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5654 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-18-Sep 16:14:16 |
Workspace id | 1 |
Workspace name | panagiotis.vardanis@wallaroo.ai - Default Workspace |
Define the Dynamic Batching Config
The Dynamic Batch Config is configured in the Wallaroo SDK via the wallaroo.dynamic_batching_config.DynamicBatchingConfig object, which takes the following parameters.
Parameter | Type | Description |
---|---|---|
max_batch_delay_ms | Integer (Default: 10) | Set the maximum batch delay in milliseconds. |
batch_size_target | Integer (Default: 4) | Set the target batch size; cannot be less than or equal to zero. |
batch_size_limit | Integer (Default: None) | Set the batch size limit; cannot be less than or equal to zero. This is used to control the maximum batch size. |
With the above set, the Dynamic Batch Config is created via the wallaroo.dynamic_batching_config.DynamicBatchingConfig.build() command.
For this example, we will configure the LLM with the following Dynamic Batch Config:
- max_batch_delay_ms=1000
- batch_size_target=8
- batch_size_limit=10
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
llm_model = llm_model.configure(input_schema=input_schema,
output_schema=output_schema,
dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000,
batch_size_target=8,
batch_size_limit=10))
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without a Dynamic Batch configuration:
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
The deployment configuration sets what resources are allocated to the LLM upon deployment. For this example, we allocate the following resources:
- cpus: 4
- memory: 15Gi
- gpus: 1
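# Per replica: 1 CPU and 2Gi of RAM for the pipeline engine;
# 4 CPUs, 15Gi of RAM, and 1 GPU for the LLM's container (sidekick);
# the deployment label targets nodes with the A100 accelerator.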
deployment_config = DeploymentConfigBuilder() \
.replica_count(3) \
.cpus(1).memory('2Gi') \
.sidekick_cpus(llm_model, 4) \
.sidekick_memory(llm_model, '15Gi') \
.sidekick_gpus(llm_model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
pipeline_name = "llama-3-8b-vllm-dynamic-1000-8-10-4096-lock-fix"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(llm_model)
With the LLM, deployment configuration, and pipeline ready, we can deploy. Note that the Dynamic Batch Config is not specified during deployment; it is assigned to the LLM, and the deployment inherits those settings.
pipeline.deploy(deployment_config=deployment_config)
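Deployment allocates the requested resources and can take several minutes for large models. Optionally, verify the deployment before inferencing; a minimal sketch using the SDK's pipeline status check:
# Optional: confirm the pipeline reports a 'Running' status before inferencing
pipeline.status()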
Sample Inference
Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.
data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})
results = pipeline.infer(data, timeout=10000)
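Wallaroo returns each output schema field as a column prefixed with out., so the generated text can be read from the results DataFrame. A minimal sketch, assuming that standard column naming:
# Each output schema field is returned as an 'out.<field>' column
print(results['out.generated_text'].values[0])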
Undeploy LLM
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ...................................... ok
name | llama-3-8b-vllm-end-token |
---|---|
created | 2024-09-09 12:32:34.262824+00:00 |
last_updated | 2024-09-09 14:07:15.755138+00:00 |
deployed | False |
workspace_id | 1 |
workspace_name | panagiotis.vardanis@wallaroo.ai - Default Workspace |
arch | x86 |
accel | none |
tags | |
versions | 30d0a13a-3d69-4cb6-83ae-6ef4935f211e, c192a3a7-d5f7-448f-956a-e872c2fc941b, 7d4a2b20-a0bc-47c0-965f-c4940211c0cc |
steps | llama3-8b-vllm-max-tokens-v3 |
published | False |