Dynamic Batching with Llama 3 8B with Llama.cpp CPUs Tutorial


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those requests into one “batch” and processes them at once. This improves efficiency and inference performance by using resources for one accumulated batch rather than starting and stopping for each individual request. Once the batch is complete, the individual inference results are returned to each client.
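
To make the accumulation idea concrete, here is a small conceptual sketch in plain Python. It is not Wallaroo's implementation: the parameter names batch_size_target and max_batch_delay_ms are borrowed from the DynamicBatchingConfig used later in this tutorial, and their exact semantics inside Wallaroo are an assumption. The Request class and group_into_batches function are hypothetical helpers for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    client_id: str
    arrival_ms: int  # arrival time in milliseconds


def group_into_batches(requests: List[Request],
                       batch_size_target: int,
                       max_batch_delay_ms: int) -> List[List[Request]]:
    """Group requests into batches (conceptual model, assumed semantics).

    A batch closes when it reaches batch_size_target requests, or when the
    next request arrives more than max_batch_delay_ms after the batch opened.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Close the open batch if this request falls outside its delay window.
        if current and req.arrival_ms - current[0].arrival_ms > max_batch_delay_ms:
            batches.append(current)
            current = []
        current.append(req)
        # Close the batch as soon as it reaches the target size.
        if len(current) == batch_size_target:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


requests = [Request("a", 0), Request("b", 100), Request("c", 200),
            Request("d", 1500), Request("e", 1600)]
batches = group_into_batches(requests, batch_size_target=8, max_batch_delay_ms=1000)
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

In this toy run the first three requests share one batch because they arrive within the delay window, and the last two form a second batch. A real server would close a batch on a timer rather than waiting for the next arrival; the sketch only shows how the two limits interact.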

The following tutorial demonstrates configuring a Llama 3 8B Instruct LLM with a Wallaroo Dynamic Batching Configuration.

This example uses the Llama V3 8B LLM quantized with llama.cpp. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs, contact a Wallaroo representative.

Tutorial Overview

This tutorial demonstrates using Wallaroo to:

  • Upload an LLM.
  • Define a Dynamic Batching Configuration and apply it to the LLM.
  • Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batching Configuration is applied at the LLM level, so it is inherited during deployment.
  • Demonstrate how to perform a sample inference.

Requirements

This tutorial requires the following:

  • Llama V3 8B quantized with llama-cpp, encapsulated in the Wallaroo Custom Model (BYOP) framework. This is available through a Wallaroo representative.
  • Wallaroo version 2024.4 or above.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import json
import os

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload Model

For our example, we’ll upload the model via the Wallaroo SDK and the wallaroo.client.Client.upload_model method, which takes the following parameters:

| Parameter | Type | Description |
|---|---|---|
| name | String (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
| path | String (Required) | The path to the model file being uploaded. |
| framework | String (Required) | The model framework. Set to Framework.CUSTOM for this tutorial. |
| input_schema | pyarrow.lib.Schema (Optional) | The input schema in Apache Arrow schema format. |
| output_schema | pyarrow.lib.Schema (Optional) | The output schema in Apache Arrow schema format. |
| convert_wait | Boolean (Optional) (Default: True) | Not required for native runtimes. True: Waits in the script for the model conversion to complete. False: Proceeds with the script without waiting for the model conversion to complete. |

A dynamic batching configuration is applied by calling the configure method on the model returned by wallaroo.client.Client.upload_model, with the following parameters.

| Parameter | Type | Description |
|---|---|---|
| dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
| input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
| output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batching_config parameter is set. |
| batch_config | String | Either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config: if dynamic_batching_config is set, batch_config must be None. If batch_config is set to single and a dynamic_batching_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |

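
The rules in the table above can be expressed as a small pre-flight check. This is a hedged sketch in plain Python, not part of the Wallaroo SDK: check_configure_arguments is a hypothetical helper, the schema and config arguments are stand-in values, and the second error message is invented for illustration. Only the first error message is taken from the table.

```python
from typing import Optional

# Error text quoted from the table above.
SINGLE_BATCH_ERROR = (
    "Dynamic batching is not supported with single batch mode. "
    "Please update the model configuration or contact wallaroo for support "
    "at support@wallaroo.ai."
)


def check_configure_arguments(batch_config: Optional[str],
                              dynamic_batching_config: Optional[object],
                              input_schema: Optional[object],
                              output_schema: Optional[object]) -> None:
    """Hypothetical local check mirroring the documented constraints."""
    if dynamic_batching_config is None:
        return  # nothing to validate without dynamic batching
    if batch_config == "single":
        # batch_config and dynamic_batching_config are mutually exclusive.
        raise ValueError(SINGLE_BATCH_ERROR)
    if input_schema is None or output_schema is None:
        # Illustrative message only; not the SDK's wording.
        raise ValueError("input_schema and output_schema are required "
                         "when dynamic_batching_config is set.")


# Stand-in values so the sketch runs without the Wallaroo SDK.
dynamic_config = {"max_batch_delay_ms": 1000, "batch_size_target": 8}
check_configure_arguments(None, dynamic_config, "input", "output")  # passes silently
```

Running a check like this before uploading could surface a conflicting configuration early, instead of after the model has been packaged.
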
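The following defines the input and output schemas, uploads the model, and applies the Dynamic Batching Configuration.

# The LLM takes a text prompt and returns the generated text.
input_schema = pa.schema([
    pa.field("text", pa.string())
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])

# Upload the BYOP model, then attach the dynamic batching configuration.
model = wl.upload_model('llama-cpp-sdk-dynbatch2',
    'byop_llamacpp.zip',
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
).configure(input_schema=input_schema,
            output_schema=output_schema,
            dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000,
                                                          batch_size_target=8)
            )
model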
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..............successful

Ready
| Field | Value |
|---|---|
| Name | llama-cpp-sdk-dynbatch2 |
| Version | 0fb39697-c5ee-4c91-8346-3d05783efe19 |
| File Name | byop_llamacpp.zip |
| SHA | e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b |
| Status | ready |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5713 |
| Architecture | x86 |
| Acceleration | none |
| Updated At | 2024-03-Oct 19:22:28 |
| Workspace id | 28 |
| Workspace name | younes.amar@wallaroo.ai - Default Workspace |

Deploy LLM with Dynamic Batch Configuration

Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without a Dynamic Batch configuration:

  • Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
  • Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
  • Deploy the Wallaroo pipeline with the deployment configuration.

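The deployment configuration sets what resources are allocated to the LLM upon deployment. For this example, we allocate the following resources to the LLM. No GPUs are allocated, because this tutorial runs the LLM on CPUs.

  • cpus: 4
  • memory: 10Gi

The Wallaroo engine itself is allocated 1 CPU and 2Gi of memory.

# cpus/memory apply to the Wallaroo engine;
# sidekick_cpus/sidekick_memory apply to the container running the LLM.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .build()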
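
As a rough sanity check before deploying, the requested resources can be totaled to see what the cluster needs to have free. The sketch below is plain Python and makes assumptions: parse_gi is a hypothetical helper, the allocations dictionary simply restates the values from the deployment configuration, and the total ignores the load balancer and any platform overhead.

```python
def parse_gi(quantity: str) -> int:
    """Convert a memory string such as '10Gi' to a whole number of gibibytes."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"expected a value ending in 'Gi', got {quantity!r}")
    return int(quantity[:-2])


# Values restated from the deployment configuration in this tutorial.
allocations = {
    "engine": {"cpus": 1, "memory": "2Gi"},
    "llm": {"cpus": 4, "memory": "10Gi"},
}

total_cpus = sum(a["cpus"] for a in allocations.values())
total_memory_gi = sum(parse_gi(a["memory"]) for a in allocations.values())
print(f"Requested: {total_cpus} CPUs, {total_memory_gi}Gi memory")
```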

Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.

The following demonstrates creating a Wallaroo pipeline, assigning the LLM as a pipeline step, and deploying the pipeline with the deployment configuration. Note that the Dynamic Batching Configuration is not specified during deployment: it is assigned to the LLM, and the deployment inherits those settings.

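# Create the pipeline and add the LLM as its only step.
pipeline = wl.build_pipeline("llamacpp-pipeyns-dynbatch2")
pipeline.add_model_step(model)

# Deploy with the deployment configuration, then check the deployment status.
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()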
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.3.14',
   'name': 'engine-6749ff446f-zftzd',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'llamacpp-pipeyns-dynbatch2',
      'status': 'Running',
      'version': '56c41ea8-3a5d-44f4-9513-829ae544ab72'}]},
   'model_statuses': {'models': [{'model_version_id': 124,
      'name': 'llama-cpp-sdk-dynbatch2',
      'sha': 'e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b',
      'status': 'Running',
      'version': '0fb39697-c5ee-4c91-8346-3d05783efe19'}]}}],
 'engine_lbs': [{'ip': '10.4.2.5',
   'name': 'engine-lb-6b59985857-qtcfd',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.0.5',
   'name': 'engine-sidekick-llama-cpp-sdk-dynbatch2-124-74958d9794-cqgsk',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
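
The status dictionary can also be checked programmatically before sending inference requests. The following is a minimal sketch, assuming the dictionary layout shown in the output above; all_running is a hypothetical helper, and sample_status is a trimmed copy of that output so the example runs without a deployment.

```python
def all_running(status: dict) -> bool:
    """Return True when the pipeline and every listed component report 'Running'."""
    if status.get("status") != "Running":
        return False
    for group in ("engines", "engine_lbs", "sidekicks"):
        for component in status.get(group, []):
            if component.get("status") != "Running":
                return False
    return True


# Trimmed copy of the status output shown above.
sample_status = {
    "status": "Running",
    "engines": [{"name": "engine-6749ff446f-zftzd", "status": "Running"}],
    "engine_lbs": [{"name": "engine-lb-6b59985857-qtcfd", "status": "Running"}],
    "sidekicks": [{
        "name": "engine-sidekick-llama-cpp-sdk-dynbatch2-124-74958d9794-cqgsk",
        "status": "Running",
    }],
}

is_ready = all_running(sample_status)
print(is_ready)
```

In the notebook, the same helper could be called as all_running(pipeline.status()) to confirm that the engine, load balancer, and LLM container are all up.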

Sample Inference

Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.

For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.

# Submit a single text prompt with a generous timeout.
data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
" Roland Garros, also known as the French Open, is a major tennis tournament that takes place in Paris, France every June. It is one of the four Grand Slam tennis tournaments held annually around the world, along with the Australian Open, Wimbledon, and the US Open. The tournament is named after the French aviator Roland Garros, who was a pioneer in the field of aircraft design and construction. The tournament was first played in 1891 and has been held continuously ever since, except for a few years during World War I and II. It is one of the most prestigious tennis tournaments in the world and attracts many of the top players from around the globe. The tournament is played on clay courts, which are known for their slow speed and high traction, making it a challenging surface for players to navigate. The Roland Garros tournament typically takes place over a two-week period in late May and early June, with the men's and women's singles competitions being the most highly anticipated events."

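Dynamic batching only has something to accumulate when several requests arrive close together. The sketch below shows one way a client might submit prompts concurrently using Python's standard ThreadPoolExecutor. It is an illustration under assumptions: send_concurrently is a hypothetical helper, and fake_infer stands in for a real inference call so the example runs without a deployed pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def send_concurrently(infer_one: Callable[[str], str],
                      prompts: List[str],
                      max_workers: int = 8) -> List[str]:
    """Submit each prompt on a worker thread and return results in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer_one, prompts))


def fake_infer(prompt: str) -> str:
    # Stand-in for a real inference call (assumption, for illustration only).
    return f"response to: {prompt}"


prompts = ["Describe what roland garros is", "Describe what wimbledon is"]
responses = send_concurrently(fake_infer, prompts)
print(responses)
```

With a deployed pipeline, infer_one could wrap the call used in the sample inference, for example a function that builds pd.DataFrame({'text': [prompt]}), passes it to pipeline.infer, and returns result["out.generated_text"][0]. Whether a single pipeline object can safely be shared across threads is an assumption to verify in your environment.
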
Undeploy LLM

With the tutorial complete, we undeploy the LLM and return its resources to the cluster.

pipeline.undeploy()