Dynamic Batching with Llama 3 8B Instruct vLLM Tutorial


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

When multiple inference requests are sent from one or multiple clients, a Dynamic Batching Configuration accumulates those inference requests into one “batch” and processes them at once. This increases efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
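
The accumulate, process once, and return-to-each-client flow described above can be sketched with Python's asyncio. This is only an illustration of the general idea, not how Wallaroo implements it: the names batch_worker, client, and demo, the max_batch_delay_s argument, and the upper-casing stand-in "model" are all assumptions made for the example.

```python
import asyncio


async def batch_worker(queue, model_fn, batch_size_target, max_batch_delay_s, batch_sizes):
    """Collect queued requests into one batch, run the model once, fan results back out."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # wait for the first request of a new batch
        deadline = loop.time() + max_batch_delay_s
        while len(batch) < batch_size_target:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break  # the delay expired before the batch filled up
        outputs = model_fn([text for text, _ in batch])  # one call for the whole batch
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)  # each client receives only its own result
        batch_sizes.append(len(batch))


async def client(queue, text):
    """Submit one request and wait for its individual result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((text, future))
    return await future


async def demo():
    queue = asyncio.Queue()
    batch_sizes = []
    model_fn = lambda texts: [t.upper() for t in texts]  # stand-in for the LLM
    worker = asyncio.create_task(batch_worker(queue, model_fn, 4, 0.05, batch_sizes))
    results = await asyncio.gather(*(client(queue, f"request {i}") for i in range(5)))
    worker.cancel()
    return results, batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch, five simultaneous requests with a target of four are served as one batch of four followed by a batch of one, and every caller still gets back only its own answer.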

The following tutorial demonstrates configuring a Llama 3 8B Instruct vLLM with a Wallaroo Dynamic Batching Configuration.

This example uses the Llama V3 Instruct LLM. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs, contact a Wallaroo representative.

Tutorial Overview

This tutorial demonstrates using Wallaroo to:

  • Upload an LLM.
  • Define a Dynamic Batching Configuration and apply it to the LLM.
  • Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batching Configuration is applied at the LLM level, so it is inherited during deployment.
  • Perform a sample inference.

Requirements

This tutorial requires the following:

  • Llama V3 Instruct LLM encapsulated in the Wallaroo Custom Model (BYOP) Framework. This is available through a Wallaroo representative.
  • Wallaroo version 2024.4 or above.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import base64

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.object import EntityNotFoundError

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload Model

For our example, we’ll upload the model using the Wallaroo MLOps API. The method wallaroo.client.Client.generate_upload_model_api_command generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API. The generated curl script is based on the Wallaroo SDK user’s current workspace. This is useful for environments that do not have the Wallaroo SDK installed, or for uploading very large models (10 gigabytes or more).

This method takes the following parameters:

Parameter | Type | Description
base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com.
name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions.
path | String (Required) | Path to the ML or LLM model file.
framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models.
input_schema | String (Required) | The model’s input schema in PyArrow.Schema format.
output_schema | String (Required) | The model’s output schema in PyArrow.Schema format.
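
The name requirement can be checked locally before generating the command. The sketch below assumes that "DNS naming conventions" means an RFC 1123 label (lowercase letters, digits, and hyphens; starting and ending with a letter or digit; at most 63 characters). The helper is_dns_label is hypothetical, and Wallaroo's own validation may differ.

```python
import re

# One alphanumeric character, optionally followed by up to 61 alphanumerics or
# hyphens and a final alphanumeric, for a maximum length of 63 characters.
_DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_dns_label(name: str) -> bool:
    """Return True if `name` looks like a valid RFC 1123 DNS label."""
    return _DNS_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("llama3-8b-vllm-max-tokens-no-lock-v1", "Llama_3", "-bad-start"):
        print(candidate, is_dns_label(candidate))
```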

This outputs a curl command. The sections marked with {} represent values that are injected into the script from the above parameters or from the current SDK session:

  • {Current Workspace['id']}: The value of the id for the current workspace.
  • {Bearer Token}: The bearer token used to authenticate to the Wallaroo MLOps API.
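
As a rough illustration of where those values land, the sketch below assembles the JSON document carried in the command's metadata form field. The key names mirror the sample command output shown later in this tutorial; the helper build_upload_metadata and its arguments are hypothetical, and the two schema arguments stand for the base64-encoded schemas.

```python
import json


def build_upload_metadata(name, workspace_id, input_schema_b64, output_schema_b64):
    """Return the JSON string for the `metadata` form field (hypothetical helper)."""
    metadata = {
        "name": name,
        "visibility": "private",
        "workspace_id": workspace_id,  # the {Current Workspace['id']} value
        "conversion": {
            "arch": "x86",
            "accel": "none",
            "framework": "custom",
            "python_version": "3.8",
            "requirements": [],
        },
        "input_schema": input_schema_b64,
        "output_schema": output_schema_b64,
    }
    return json.dumps(metadata)


if __name__ == "__main__":
    # Placeholder schema strings; real values are base64-encoded schemas.
    print(build_upload_metadata("sample-model", 6, "<input b64>", "<output b64>"))
```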

Define And Encode the Schemas

We define the input and output schemas in Apache PyArrow format.

input_schema = pa.schema([
    pa.field('text', pa.string()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string())
])

Generate the curl Command

We generate the curl command through the generate_upload_model_api_command method as follows; replace the base_url with the one used for your Wallaroo instance.

Use the curl command to upload the model to the Wallaroo instance.

wl.generate_upload_model_api_command(
    base_url='https://example.wallaroo.ai/',
    name='llama3-8b-vllm-max-tokens-no-lock-v1', 
    path='byop-llama-3-80b-instruct.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema)
    curl --progress-bar -X POST \
        -H "Content-Type: multipart/form-data" \
        -H "Authorization: Bearer {Bearer Token}" \
        -F 'metadata={"name": "llama3-8b-vllm-max-tokens-no-lock-v1", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
        -F "file=@byop-llama-3-80b-instruct.zip;type=application/octet-stream" \
        https://example.wallaroo.ai//v1/api/models/upload_and_convert
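
Before running a generated command, it can be reassuring to confirm that the encoded schemas carry the expected field names. The sketch below uses only the standard library: it base64-decodes the two schema strings copied from the sample output above and searches the raw bytes for a field name. It does not parse the Arrow format, so treat it as a rough check; the helper schema_mentions is hypothetical.

```python
import base64

# Schema strings copied from the sample curl command's metadata field.
INPUT_SCHEMA_B64 = (
    "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAU"
    "AAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQA"
    "BAAEAAAA"
)
OUTPUT_SCHEMA_B64 = (
    "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAU"
    "AAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRf"
    "dGV4dAAABAAEAAQAAAA="
)


def schema_mentions(schema_b64: str, field_name: str) -> bool:
    """Return True if the decoded schema bytes contain `field_name`."""
    raw = base64.b64decode(schema_b64)
    return field_name.encode("utf-8") in raw


if __name__ == "__main__":
    print(schema_mentions(INPUT_SCHEMA_B64, "text"))
    print(schema_mentions(OUTPUT_SCHEMA_B64, "generated_text"))
```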

Retrieve the LLM

Once uploaded, we retrieve the LLM with the wallaroo.client.Client.get_model method, specifying the same model name used during the upload command.

llm_model = wl.get_model('llama3-8b-vllm-max-tokens-no-lock-v1')
llm_model
Name | llama3-8b-vllm-max-tokens-no-lock-v1
Version | 096963bc-fc88-483a-9cdc-f606e0d57e26
File Name | llama3-8b-vllm.zip
SHA | b86841ca5d0976f6d688342caebcef2bcdbafd4f8d833ed9060f161a6a8b854c
Status | ready
Image Path | ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5654
Architecture | x86
Acceleration | none
Updated At | 2024-18-Sep 16:14:16
Workspace id | 1
Workspace name | panagiotis.vardanis@wallaroo.ai - Default Workspace

Define the Dynamic Batching Config

The Dynamic Batch Config is configured in the Wallaroo SDK via the wallaroo.dynamic_batching_config.DynamicBatchingConfig object, which takes the following parameters.

Parameter | Type | Description
max_batch_delay_ms | Integer (Default: 10) | Set the maximum batch delay in milliseconds.
batch_size_target | Integer (Default: 4) | Set the target batch size; cannot be less than or equal to zero.
batch_size_limit | Integer (Default: None) | Set the batch size limit; cannot be less than or equal to zero. This is used to control the maximum batch size.

For this example, we will configure the LLM with the following Dynamic Batch Config:

  • max_batch_delay_ms=1000
  • batch_size_target=8
  • batch_size_limit=10
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
llm_model = llm_model.configure(input_schema=input_schema, 
                                output_schema=output_schema, 
                                dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000, 
                                                                              batch_size_target=8, 
                                                                              batch_size_limit=10))
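
To build intuition for how these three settings interact, here is a simplified offline simulation. It is not Wallaroo's implementation. It assumes that a batch is dispatched as soon as its accumulated rows reach batch_size_target, or once max_batch_delay_ms has elapsed since the batch's first request, and that a request is never added if it would push the batch past batch_size_limit. The function plan_batches and its (arrival time, row count) input format are hypothetical.

```python
from typing import List, Sequence, Tuple

# (dispatch time in ms, indices of the requests placed in the batch)
Batch = Tuple[int, List[int]]


def plan_batches(
    arrivals: Sequence[Tuple[int, int]],
    max_batch_delay_ms: int = 1000,
    batch_size_target: int = 8,
    batch_size_limit: int = 10,
) -> List[Batch]:
    """Group requests, given as (arrival_ms, rows) sorted by time, into batches."""
    batches: List[Batch] = []
    current: List[int] = []
    current_rows = 0
    deadline = 0
    for index, (arrival_ms, rows) in enumerate(arrivals):
        # The open batch timed out before this request arrived.
        if current and arrival_ms > deadline:
            batches.append((deadline, current))
            current, current_rows = [], 0
        # Adding this request would push the open batch past the hard limit.
        if current and current_rows + rows > batch_size_limit:
            batches.append((arrival_ms, current))
            current, current_rows = [], 0
        if not current:
            deadline = arrival_ms + max_batch_delay_ms  # delay starts with the first request
        current.append(index)
        current_rows += rows
        # The target has been reached, so dispatch without waiting for the delay.
        if current_rows >= batch_size_target:
            batches.append((arrival_ms, current))
            current, current_rows = [], 0
    if current:
        batches.append((deadline, current))  # leftovers go out when the delay expires
    return batches


if __name__ == "__main__":
    # Ten single-row requests 10 ms apart, then two more several seconds later.
    sample = [(t, 1) for t in range(0, 100, 10)] + [(5000, 1), (5100, 1)]
    for dispatch_ms, members in plan_batches(sample):
        print(dispatch_ms, members)
```

Under these assumptions and the tutorial's settings, ten single-row requests arriving 10 ms apart produce a first batch of eight dispatched on the eighth arrival, while the two stragglers wait out the full 1000 ms delay before being processed.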

Deploy LLM with Dynamic Batch Configuration

Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without one:

  • Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
  • Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
  • Deploy the Wallaroo pipeline with the deployment configuration.

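The deployment configuration sets what resources are allocated to the LLM upon deployment. For this example, we deploy three replicas and allocate the following resources to the LLM in each replica; the Wallaroo engine receives 1 CPU and 2 Gi of RAM per replica: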

  • cpus: 4
  • memory: 15Gi
  • gpus: 1
deployment_config = DeploymentConfigBuilder() \
    .replica_count(3) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm_model, 4) \
    .sidekick_memory(llm_model, '15Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
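
As a quick sanity check on cluster capacity, the arithmetic behind this configuration can be sketched as follows. It assumes that each replica requests the engine allocation (cpus(1) and memory('2Gi')) plus the LLM allocation (the sidekick_* settings), and that replica_count multiplies both. The helper total_requested is hypothetical.

```python
from typing import Dict, List


def total_requested(replicas: int, containers: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum the per-replica container requests and multiply by the replica count."""
    per_replica = {"cpus": 0, "memory_gi": 0, "gpus": 0}
    for container in containers:
        for key in per_replica:
            per_replica[key] += container.get(key, 0)
    return {key: replicas * value for key, value in per_replica.items()}


# Values taken from the deployment configuration above.
engine = {"cpus": 1, "memory_gi": 2}
llm = {"cpus": 4, "memory_gi": 15, "gpus": 1}
cluster_request = total_requested(3, [engine, llm])

if __name__ == "__main__":
    print(cluster_request)
```

Under these assumptions, the three replicas together request 15 CPUs, 51 Gi of RAM, and 3 GPUs.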

Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.

The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.

pipeline_name = "llama-3-8b-vllm-dynamic-1000-8-10-4096-lock-fix"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(llm_model)

With the LLM, deployment configuration, and pipeline ready, we can deploy. Note that the Dynamic Batch Config is not specified during deployment; it is assigned to the LLM, and the deployment inherits those settings.

pipeline.deploy(deployment_config=deployment_config)

Sample Inference

Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.

For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.

data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})
results = pipeline.infer(data, timeout=10000)
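
A single request does not give the batcher anything to accumulate. One way to exercise Dynamic Batching is to send several requests at the same time, for example from a pool of threads. The sketch below is an assumption-laden illustration: infer_concurrently and fake_infer are hypothetical names, fake_infer stands in for a real request so the example runs without a deployment, and whether the SDK call can safely be issued from several threads should be verified for your environment.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def infer_concurrently(
    infer_one: Callable[[str], object],
    prompts: Sequence[str],
    max_workers: int = 8,
) -> List[object]:
    """Send each prompt as its own request from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `prompts`.
        return list(pool.map(infer_one, prompts))


def fake_infer(prompt: str) -> str:
    """Stand-in for a real inference request."""
    return f"echo: {prompt}"


# With a deployed pipeline, infer_one could instead wrap the tutorial's call, e.g.
#   lambda p: pipeline.infer(pd.DataFrame({'text': [p]}), timeout=10000)
replies = infer_concurrently(fake_infer, ["first question", "second question"])

if __name__ == "__main__":
    print(replies)
```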

Undeploy LLM

With the tutorial complete, we undeploy the LLM and return the resources to the cluster.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ...................................... ok
name | llama-3-8b-vllm-end-token
created | 2024-09-09 12:32:34.262824+00:00
last_updated | 2024-09-09 14:07:15.755138+00:00
deployed | False
workspace_id | 1
workspace_name | panagiotis.vardanis@wallaroo.ai - Default Workspace
arch | x86
accel | none
tags |
versions | 30d0a13a-3d69-4cb6-83ae-6ef4935f211e, c192a3a7-d5f7-448f-956a-e872c2fc941b, 7d4a2b20-a0bc-47c0-965f-c4940211c0cc
steps | llama3-8b-vllm-max-tokens-v3
published | False