Dynamic Batching with Llama 3 8B Instruct vLLM Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those requests into one “batch” and processes them together. This improves efficiency and inference performance because resources are used once for the accumulated batch rather than starting and stopping for each individual request. Once the batch completes, the individual inference results are returned to each client.
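As a rough illustration of why batching helps, the toy cost model below compares handling requests one at a time against handling them as a single batch. The function name, the fixed per-call overhead, and the per-request cost are all made-up values chosen for illustration; they are not measurements from Wallaroo or vLLM.

```python
import math

def total_time_ms(num_requests, batch_size, overhead_ms, per_request_ms):
    """Toy model: every call to the model pays a fixed overhead once,
    plus a per-request cost for each request handled in that call."""
    if num_requests < 0 or batch_size <= 0:
        raise ValueError("num_requests must be >= 0 and batch_size > 0")
    calls = math.ceil(num_requests / batch_size)
    return calls * overhead_ms + num_requests * per_request_ms

# Hypothetical numbers, for illustration only.
unbatched = total_time_ms(num_requests=8, batch_size=1, overhead_ms=50, per_request_ms=10)
batched = total_time_ms(num_requests=8, batch_size=8, overhead_ms=50, per_request_ms=10)
print(unbatched, batched)
```

Under these assumed numbers the fixed overhead is paid eight times without batching and once with it, which is the saving the paragraph above describes.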
The following tutorial demonstrates configuring a Llama 3 8B Instruct vLLM with a Wallaroo Dynamic Batching Configuration.
This example uses the Llama V3 Instruct LLM. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload an LLM.
- Define a Dynamic Batching Configuration and apply it to the LLM.
- Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batching Configuration is applied at the LLM level, so it is inherited during deployment.
- Demonstrate how to perform a sample inference.
Requirements
This tutorial requires the following:
- Llama V3 Instruct LLM encapsulated in the Wallaroo Custom Model (BYOP) Framework. This is available through a Wallaroo representative.
- Wallaroo version 2024.4 or later.
Tutorial Steps
Import libraries
The first step is to import the libraries required.
import base64
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.object import EntityNotFoundError
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
Next, connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection in a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload Model
For our example, we’ll upload the model using the Wallaroo MLOps API. The method wallaroo.client.Client.generate_upload_model_api_command generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API. The generated curl script is based on the Wallaroo SDK user’s current workspace. This is useful for environments that do not have the Wallaroo SDK installed, or for uploading very large models (10 gigabytes or more).
This method takes the following parameters:
Parameter | Type | Description |
---|---|---|
base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com . |
name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions. |
path | String (Required) | Path to the ML or LLM model file. |
framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models. |
input_schema | String (Required) | The model’s input schema in PyArrow.Schema format. |
output_schema | String (Required) | The model’s output schema in PyArrow.Schema format. |
This outputs a curl command in the following format (indentation added for emphasis). The sections marked with {} represent the variable names that are injected into the script from the above parameters or from the current SDK session:
- {Current Workspace['id']}: The value of the id for the current workspace.
- {Bearer Token}: The bearer token used to authenticate to the Wallaroo MLOps API.
Define And Encode the Schemas
We define the input and output schemas in Apache PyArrow format.
input_schema = pa.schema([
    pa.field('text', pa.string()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string())
])
Generate the curl Command
We generate the curl command through generate_upload_model_api_command as follows. Replace the base_url with the one used for your Wallaroo instance, then use the generated curl command to upload the model to the Wallaroo instance.
wl.generate_upload_model_api_command(
    base_url='https://example.wallaroo.ai/',
    name='llama3-8b-vllm-max-tokens-no-lock-v1',
    path='byop-llama-3-80b-instruct.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema)
curl --progress-bar -X POST -H "Content-Type: multipart/form-data" -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJYNFc1RzNBb19qTnRDSlJNRXhOVTM1X1NrNHYwMVdRQVRabm9WUC1NQ0RZIn0.eyJleHAiOjE3MjgwNTgxNjYsImlhdCI6MTcyODA1ODEwNiwianRpIjoiYWZjYTQwZGMtZTI3Yi00YjgwLThmMmQtMzUyNmM2ODE5YWI3IiwiaXNzIjoiaHR0cHM6Ly9kb2MtdGVzdC53YWxsYXJvb2NvbW11bml0eS5uaW5qYS9hdXRoL3JlYWxtcy9tYXN0ZXIiLCJhdWQiOlsibWFzdGVyLXJlYWxtIiwiYWNjb3VudCJdLCJzdWIiOiJmNzVhODYyOS03MGFiLTQxMDAtOGIzNy0wNGNmNzllNjY3ZWUiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJzZGstY2xpZW50Iiwic2Vzc2lvbl9zdGF0ZSI6ImVjZTIzMzJjLTA0NGItNDU3Ni05MWNjLTRmYzEwYTliMzc1ZiIsImFjciI6IjEiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiY3JlYXRlLXJlYWxtIiwiZGVmYXVsdC1yb2xlcy1tYXN0ZXIiLCJvZmZsaW5lX2FjY2VzcyIsImFkbWluIiwidW1hX2F1dGhvcml6YXRpb24iXX0sInJlc291cmNlX2FjY2VzcyI6eyJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsidmlldy1yZWFsbSIsInZpZXctaWRlbnRpdHktcHJvdmlkZXJzIiwibWFuYWdlLWlkZW50aXR5LXByb3ZpZGVycyIsImltcGVyc29uYXRpb24iLCJjcmVhdGUtY2xpZW50IiwibWFuYWdlLXVzZXJzIiwicXVlcnktcmVhbG1zIiwidmlldy1hdXRob3JpemF0aW9uIiwicXVlcnktY2xpZW50cyIsInF1ZXJ5LXVzZXJzIiwibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hdXRob3JpemF0aW9uIiwibWFuYWdlLWNsaWVudHMiLCJxdWVyeS1ncm91cHMiXX0sImFjY291bnQiOnsicm9sZXMiOlsibWFuYWdlLWFjY291bnQiLCJtYW5hZ2UtYWNjb3VudC1saW5rcyIsInZpZXctcHJvZmlsZSJdfX0sInNjb3BlIjoib3BlbmlkIGVtYWlsIHByb2ZpbGUiLCJzaWQiOiJlY2UyMzMyYy0wNDRiLTQ1NzYtOTFjYy00ZmMxMGE5YjM3NWYiLCJlbWFpbF92ZXJpZmllZCI6dHJ1ZSwiaHR0cHM6Ly9oYXN1cmEuaW8vand0L2NsYWltcyI6eyJ4LWhhc3VyYS11c2VyLWlkIjoiZjc1YTg2MjktNzBhYi00MTAwLThiMzctMDRjZjc5ZTY2N2VlIiwieC1oYXN1cmEtdXNlci1lbWFpbCI6ImpvaG4uaGFuc2FyaWNrQHdhbGxhcm9vLmFpIiwieC1oYXN1cmEtZGVmYXVsdC1yb2xlIjoiYWRtaW5fdXNlciIsIngtaGFzdXJhLWFsbG93ZWQtcm9sZXMiOlsidXNlciIsImFkbWluX3VzZXIiXSwieC1oYXN1cmEtdXNlci1ncm91cHMiOiJ7fSJ9LCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJqb2huLmhhbnNhcmlja0B3YWxsYXJvby5haSIsImVtYWlsIjoiam9obi5oYW5zYXJpY2tAd2FsbGFyb28uYWkifQ.L8JQOjXeNOLzhwSVzcDFUpdKHWzgN_H5DT12kh2CRb-mMJNeFlFz
QT99piTgsv6QmYWXbaLU1aM86aQYjJX_5tdXj5CfttvJvM3xVovOiv_yt_7A1VGQa6zILAA-zYws5o6f9HesK6dOatuuWrl_vZyEMvalJom_IH3vQxEjPwKiGsZuirH38XEUmFcoLggxXTnfTtIyho_4Fl_X74U1lL-Xpw7LT8P7B9XzRs_ix4tO8HRTGuuxBj6IdeWxDlP6ZDVf_UTAeZGHAf6xguvX2DUZu1-49qVqcaA34hQzeTW_QnitGhcmQ4eoJYUbQCvM8EPDB1Uj37PgfctBx6xc9A" -F "metadata={"name": "llama3-70b-instruct", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" -F "file=@byop-llama-3-80b-instruct.zip;type=application/octet-stream" https://example.wallaroo.ai//v1/api/models/upload_and_convert
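To make the shape of the generated command easier to read, here is a minimal sketch that assembles a similar curl command from its parts. It is modeled only on the sample output above and is not the SDK's implementation; the function name build_upload_command and its parameters are hypothetical, and the token and schema values are placeholders.

```python
import json
import shlex

def build_upload_command(base_url, name, path, workspace_id, token,
                         input_schema_b64, output_schema_b64,
                         framework="custom", arch="x86", accel="none",
                         python_version="3.8"):
    """Assemble a curl command shaped like the sample output.

    Illustrative stand-in only; field names are copied from that output."""
    metadata = {
        "name": name,
        "visibility": "private",
        "workspace_id": workspace_id,
        "conversion": {
            "arch": arch,
            "accel": accel,
            "framework": framework,
            "python_version": python_version,
            "requirements": [],
        },
        "input_schema": input_schema_b64,
        "output_schema": output_schema_b64,
    }
    # Strip a trailing slash so the URL does not contain "//v1".
    url = base_url.rstrip("/") + "/v1/api/models/upload_and_convert"
    parts = [
        "curl", "--progress-bar", "-X", "POST",
        "-H", "Content-Type: multipart/form-data",
        "-H", f"Authorization: Bearer {token}",
        "-F", f"metadata={json.dumps(metadata)};type=application/json",
        "-F", f"file=@{path};type=application/octet-stream",
        url,
    ]
    # shlex.join quotes each argument so the JSON survives the shell.
    return shlex.join(parts)

command = build_upload_command(
    base_url="https://example.wallaroo.ai/",
    name="llama3-8b-vllm-max-tokens-no-lock-v1",
    path="byop-llama-3-80b-instruct.zip",
    workspace_id=6,
    token="<bearer token>",
    input_schema_b64="<base64 input schema>",
    output_schema_b64="<base64 output schema>",
)
print(command)
```

The sketch shows the two values taken from the SDK session (the workspace id and the bearer token) alongside the parameters passed by the caller. The bearer token grants access to the Wallaroo MLOps API, so a real generated command should be handled like any other credential.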
Retrieve the LLM
Once uploaded, we retrieve the LLM with the wallaroo.client.Client.get_model method, specifying the same model name used during the upload command.
llm_model = wl.get_model('llama3-8b-vllm-max-tokens-no-lock-v1')
llm_model
Name | llama3-8b-vllm-max-tokens-no-lock-v1 |
Version | 096963bc-fc88-483a-9cdc-f606e0d57e26 |
File Name | llama3-8b-vllm.zip |
SHA | b86841ca5d0976f6d688342caebcef2bcdbafd4f8d833ed9060f161a6a8b854c |
Status | ready |
Image Path | ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5654 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-18-Sep 16:14:16 |
Workspace id | 1 |
Workspace name | panagiotis.vardanis@wallaroo.ai - Default Workspace |
Define the Dynamic Batching Config
The Dynamic Batch Config is configured in the Wallaroo SDK via the wallaroo.dynamic_batching_config.DynamicBatchingConfig object, which takes the following parameters.
Parameter | Type | Description |
---|---|---|
max_batch_delay_ms | Integer (Default: 10) | Set the maximum batch delay in milliseconds. |
batch_size_target | Integer (Default: 4) | Set the target batch size; must be greater than zero. |
batch_size_limit | Integer (Default: None) | Set the batch size limit, which controls the maximum batch size; must be greater than zero. |
For this example, we will configure the LLM with the following Dynamic Batch Config:
- max_batch_delay_ms=1000
- batch_size_target=8
- batch_size_limit=10
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
llm_model = llm_model.configure(
    input_schema=input_schema,
    output_schema=output_schema,
    dynamic_batching_config=DynamicBatchingConfig(
        max_batch_delay_ms=1000,
        batch_size_target=8,
        batch_size_limit=10,
    ),
)
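To build intuition for how these three settings interact, the following sketch groups request arrival times into batches under a simplified model. It is not Wallaroo's batching algorithm: the function simulate_batches and its rules are assumptions made for illustration. In particular, this tutorial does not specify how the target is traded off against the limit, so the sketch simply never exceeds either.

```python
def simulate_batches(arrivals_ms, max_batch_delay_ms=10,
                     batch_size_target=4, batch_size_limit=None):
    """Group request arrival times (in ms) into batches under a toy model.

    Assumed rules, for illustration only:
    - a batch opens when its first request arrives;
    - it is dispatched once it holds `cap` requests, where `cap` is the
      smaller of the target and the limit (if a limit is set);
    - it is also dispatched if the next request arrives more than
      max_batch_delay_ms after the batch opened.
    """
    if batch_size_target <= 0:
        raise ValueError("batch_size_target must be greater than zero")
    if batch_size_limit is not None and batch_size_limit <= 0:
        raise ValueError("batch_size_limit must be greater than zero")
    cap = batch_size_target
    if batch_size_limit is not None:
        cap = min(cap, batch_size_limit)

    batches, current = [], []
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived.
        if current and t - current[0] > max_batch_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # The open batch is full.
        if len(current) >= cap:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever is left
    return batches

# The tutorial's settings: 1000 ms delay, target 8, limit 10.
arrivals = [0, 100, 200, 300, 400, 500, 600, 700, 2500, 2600]
batches = simulate_batches(arrivals, max_batch_delay_ms=1000,
                           batch_size_target=8, batch_size_limit=10)
print([len(b) for b in batches])
```

With these hypothetical arrival times, the first eight requests arrive within the 1000 ms window and fill one batch, and the two late requests form a second, smaller batch. A longer delay gives batches more time to fill at the cost of added latency for the first request in each batch.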
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without a Dynamic Batch configuration:
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
The deployment configuration sets what resources are allocated to the LLM upon deployment. For this example, we allocate the following resources:
- cpus: 4
- memory: 15Gi
- gpus: 1
deployment_config = DeploymentConfigBuilder() \
    .replica_count(3) \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm_model, 4) \
    .sidekick_memory(llm_model, '15Gi') \
    .sidekick_gpus(llm_model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()
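Because the configuration above requests three replicas, it can help to estimate the total resources the cluster needs before deploying. The sketch below does that arithmetic under the assumption that every replica runs one engine container (1 CPU, 2Gi) plus one LLM container (4 CPUs, 15Gi, 1 GPU). The helper names are hypothetical, and the estimate ignores any other services the deployment may schedule.

```python
def parse_gi(value):
    """Convert a memory string such as '15Gi' to a number of gibibytes."""
    if not value.endswith("Gi"):
        raise ValueError(f"expected a value ending in 'Gi', got {value!r}")
    return float(value[:-2])

def total_resources(replicas, engine_cpus, engine_memory,
                    model_cpus, model_memory, model_gpus):
    """Rough totals, assuming each replica holds one engine and one model container."""
    return {
        "cpus": replicas * (engine_cpus + model_cpus),
        "memory_gi": replicas * (parse_gi(engine_memory) + parse_gi(model_memory)),
        "gpus": replicas * model_gpus,
    }

# Values taken from the deployment configuration above.
totals = total_resources(replicas=3, engine_cpus=1, engine_memory="2Gi",
                         model_cpus=4, model_memory="15Gi", model_gpus=1)
print(totals)
```

Under that assumption the deployment needs roughly 15 CPUs, 51Gi of memory, and 3 GPUs on nodes matching the deployment label.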
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, and assigning the LLM as a pipeline step.
pipeline_name = "llama-3-8b-vllm-dynamic-1000-8-10-4096-lock-fix"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(llm_model)
With the LLM, deployment configuration, and pipeline ready, we can deploy. Note that the Dynamic Batch Config is not specified during the deployment; it is assigned to the LLM, and the deployment inherits those settings.
pipeline.deploy(deployment_config=deployment_config)
Sample Inference
Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.
data = pd.DataFrame({'text': ['Describe what Wallaroo.AI is']})
results = pipeline.infer(data, timeout=10000)
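A single request like the one above does not give the batcher anything to accumulate. To exercise dynamic batching, several requests need to be in flight at the same time. The sketch below shows one way to issue concurrent calls with a thread pool; run_concurrently is a hypothetical helper, and fake_infer is a stand-in so the example runs without a deployed pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(infer_fn, prompts, max_workers=8):
    """Call infer_fn once per prompt from separate threads and
    return the results in the same order as the prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer_fn, prompts))

# Stand-in for a real inference call so the sketch runs anywhere.
def fake_infer(prompt):
    return f"echo: {prompt}"

prompts = [f"Question {i}" for i in range(4)]
concurrent_results = run_concurrently(fake_infer, prompts)
print(concurrent_results)
```

Against the deployed pipeline, infer_fn could wrap a call such as pipeline.infer(pd.DataFrame({'text': [prompt]}), timeout=10000). Whether a single pipeline object can safely be shared across threads is not covered in this tutorial, so treat that usage as an assumption to verify in your environment.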
Undeploy LLM
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s ...................................... ok
name | llama-3-8b-vllm-end-token |
---|---|
created | 2024-09-09 12:32:34.262824+00:00 |
last_updated | 2024-09-09 14:07:15.755138+00:00 |
deployed | False |
workspace_id | 1 |
workspace_name | panagiotis.vardanis@wallaroo.ai - Default Workspace |
arch | x86 |
accel | none |
tags | |
versions | 30d0a13a-3d69-4cb6-83ae-6ef4935f211e, c192a3a7-d5f7-448f-956a-e872c2fc941b, 7d4a2b20-a0bc-47c0-965f-c4940211c0cc |
steps | llama3-8b-vllm-max-tokens-v3 |
published | False |