Dynamic Batching with Llama 3 8B with Llama.cpp CPUs Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those inference requests into a single "batch" that is processed at once. This increases efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
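As a minimal sketch of the two knobs used later in this tutorial (and assuming the typical accumulate-until-full-or-timeout semantics), a Dynamic Batching Configuration looks like this:

from wallaroo.dynamic_batching_config import DynamicBatchingConfig

# Accumulate incoming requests for up to 1 second, targeting batches of
# 8 requests (the same values applied during model upload below).
dynamic_batch_config = DynamicBatchingConfig(
    max_batch_delay_ms=1000,  # maximum wait while a batch accumulates
    batch_size_target=8       # target number of requests per batch
)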
The following tutorial demonstrates configuring a Llama 3 8B Instruct LLM, quantized with llama-cpp, with a Wallaroo Dynamic Batching Configuration. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload an LLM.
- Define a Dynamic Batching Configuration and apply it to the LLM.
- Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batch Configuration is applied at the LLM level, so it is inherited during deployment.
- Perform a sample inference.
Requirements
This tutorial requires the following:
- Llama V3 8B quantized with llama-cpp, encapsulated in the Wallaroo Arbitrary Python (BYOP) Framework. This is available through a Wallaroo representative.
- Wallaroo version 2024.3 and above.
Tutorial Steps
Import libraries
The first step is to import the required libraries.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
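For connections from outside the internal JupyterHub service, the client is typically given an explicit endpoint. The following is a hedged sketch: the api_endpoint and auth_type parameters reflect the Wallaroo SDK's external-connection pattern, and the URL is a placeholder for your own instance.

wl = wallaroo.Client(
    api_endpoint="https://wallaroo.example.com",  # placeholder URL for your instance
    auth_type="sso"                               # assumed SSO-based login flow
)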
Upload Model
For our example, we'll upload the model via the Wallaroo SDK and the wallaroo.client.Client.upload_model method, which takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | String (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | String (Required) | The path to the model file being uploaded. |
framework | String (Required) | Set as Framework.CUSTOM for models packaged in the Wallaroo BYOP framework. |
input_schema | pyarrow.lib.Schema (Optional) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Optional) | The output schema in Apache Arrow schema format. |
convert_wait | Boolean (Optional) (Default: True) | Not required for native runtimes. |
A dynamic batching configuration is applied with the wallaroo.client.Client.upload_model.configure method with the following parameters.
Parameter | Type | Description |
---|---|---|
dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
batch_config | String | Batch config is either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config . If dynamic_batching_config is set, batch_config must be None . If batch_config is set to single and a dynamic_batch_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |
# Input and output schemas: a text prompt in, generated text out.
input_schema = pa.schema([
    pa.field("text", pa.string())
])
output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])
# Upload the BYOP-packaged llama-cpp model, then attach the dynamic batching
# configuration: accumulate requests for up to 1000 ms, targeting batches of 8.
model = wl.upload_model('llama-cpp-sdk-dynbatch2',
                        'byop_llamacpp.zip',
                        framework=Framework.CUSTOM,
                        input_schema=input_schema,
                        output_schema=output_schema
                        ).configure(input_schema=input_schema,
                                    output_schema=output_schema,
                                    dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000,
                                                                                  batch_size_target=8)
                                    )
model
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..............successful
Ready
Name | llama-cpp-sdk-dynbatch2 |
Version | 0fb39697-c5ee-4c91-8346-3d05783efe19 |
File Name | byop_llamacpp.zip |
SHA | e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5713 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-03-Oct 19:22:28 |
Workspace id | 28 |
Workspace name | younes.amar@wallaroo.ai - Default Workspace |
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without one:
- Define the deployment configuration to set the number of CPUs and amount of RAM per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
The deployment configuration sets what resources are allocated to the LLM upon deployment. For this CPU-based example, we allocate the following resources to the LLM's model container:
- cpus: 4
- memory: 10Gi
# Engine resources: 1 CPU and 2Gi of RAM for the Wallaroo engine itself.
# Sidekick resources: 4 CPUs and 10Gi of RAM for the LLM's model container.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, assigning the LLM as a pipeline step, and deploying it. Note that the Dynamic Batch Configuration is not specified during deployment: it was assigned to the LLM at upload, and the deployment inherits those settings.
pipeline = wl.build_pipeline("llamacpp-pipeyns-dynbatch2")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.4.3.14',
'name': 'engine-6749ff446f-zftzd',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llamacpp-pipeyns-dynbatch2',
'status': 'Running',
'version': '56c41ea8-3a5d-44f4-9513-829ae544ab72'}]},
'model_statuses': {'models': [{'model_version_id': 124,
'name': 'llama-cpp-sdk-dynbatch2',
'sha': 'e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b',
'status': 'Running',
'version': '0fb39697-c5ee-4c91-8346-3d05783efe19'}]}}],
'engine_lbs': [{'ip': '10.4.2.5',
'name': 'engine-lb-6b59985857-qtcfd',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.4.0.5',
'name': 'engine-sidekick-llama-cpp-sdk-dynbatch2-124-74958d9794-cqgsk',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Sample Inference
Once the LLM is deployed, we'll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we'll create a pandas DataFrame with a text query and submit that for our inference request.
data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
" Roland Garros, also known as the French Open, is a major tennis tournament that takes place in Paris, France every June. It is one of the four Grand Slam tennis tournaments held annually around the world, along with the Australian Open, Wimbledon, and the US Open. The tournament is named after the French aviator Roland Garros, who was a pioneer in the field of aircraft design and construction. The tournament was first played in 1891 and has been held continuously ever since, except for a few years during World War I and II. It is one of the most prestigious tennis tournaments in the world and attracts many of the top players from around the globe. The tournament is played on clay courts, which are known for their slow speed and high traction, making it a challenging surface for players to navigate. The Roland Garros tournament typically takes place over a two-week period in late May and early June, with the men's and women's singles competitions being the most highly anticipated events."
Undeploy LLM
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()