Dynamic Batching with Llama 3 8B with Llama.cpp CPUs Tutorial
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
When multiple inference requests are sent from one or more clients, a Dynamic Batching Configuration accumulates those inference requests into a single "batch" that is processed at once. This increases efficiency and inference performance by using resources on one accumulated batch rather than starting and stopping for each individual request. Once complete, the individual inference results are returned to each client.
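As a minimal sketch of the two knobs used later in this tutorial (and assuming the typical accumulate-until-full-or-timeout semantics), a Dynamic Batching Configuration looks like this:

from wallaroo.dynamic_batching_config import DynamicBatchingConfig

# Accumulate incoming requests for up to 1 second, targeting batches of
# 8 requests (the same values applied during model upload below).
dynamic_batch_config = DynamicBatchingConfig(
    max_batch_delay_ms=1000,  # maximum wait while a batch accumulates
    batch_size_target=8       # target number of requests per batch
)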
The following tutorial demonstrates configuring a Llama 3 8B Instruct LLM, quantized with llama-cpp, with a Wallaroo Dynamic Batching Configuration. For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Overview
This tutorial demonstrates using Wallaroo to:
- Upload an LLM.
- Define a Dynamic Batching Configuration and apply it to the LLM.
- Deploy the LLM with a Deployment Configuration that allocates resources to the LLM; the Dynamic Batch Configuration is applied at the LLM level, so it is inherited during deployment.
- Perform a sample inference.
Requirements
This tutorial requires the following:
- Llama V3 8B quantized with llama-cpp, encapsulated in the Wallaroo Arbitrary Python (BYOP) Framework. This is available through a Wallaroo representative.
- Wallaroo version 2024.3 and above.
Tutorial Steps
Import libraries
The first step is to import the required libraries.
import json
import os
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
from wallaroo.dynamic_batching_config import DynamicBatchingConfig
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
The next step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
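For connections from outside the internal JupyterHub service, the client is typically given an explicit endpoint. The following is a hedged sketch: the api_endpoint and auth_type parameters reflect the Wallaroo SDK's external-connection pattern, and the URL is a placeholder for your own instance.

wl = wallaroo.Client(
    api_endpoint="https://wallaroo.example.com",  # placeholder URL for your instance
    auth_type="sso"                               # assumed SSO-based login flow
)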
Upload Model
For our example, we'll upload the model via the Wallaroo SDK and the wallaroo.client.Client.upload_model method, which takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | String (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | String (Required) | The path to the model file being uploaded. |
framework | String (Required) | Set as Framework.CUSTOM for models packaged in the Wallaroo BYOP framework. |
input_schema | pyarrow.lib.Schema (Optional) | The input schema in Apache Arrow schema format. |
output_schema | pyarrow.lib.Schema (Optional) | The output schema in Apache Arrow schema format. |
convert_wait | Boolean (Optional) (Default: True) | Not required for native runtimes. |
A dynamic batching configuration is applied with the wallaroo.client.Client.upload_model.configure method with the following parameters.
Parameter | Type | Description |
---|---|---|
dynamic_batching_config | wallaroo.DynamicBatchingConfig (Default: None) | Sets the dynamic batch config to apply to the model. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. This field is required when the dynamic_batch_config parameter is set. |
batch_config | String | Batch config is either None for multiple-input inferences, or single to accept an inference request with only one row of data. This setting is mutually exclusive with dynamic_batching_config . If dynamic_batching_config is set, batch_config must be None . If batch_config is set to single and a dynamic_batch_config is set, the following error is returned: Dynamic batching is not supported with single batch mode. Please update the model configuration or contact wallaroo for support at support@wallaroo.ai. |
# Input and output schemas: a text prompt in, generated text out.
input_schema = pa.schema([
    pa.field("text", pa.string())
])
output_schema = pa.schema([
    pa.field("generated_text", pa.string())
])
# Upload the BYOP-packaged llama-cpp model, then attach the dynamic batching
# configuration: accumulate requests for up to 1000 ms, targeting batches of 8.
model = wl.upload_model('llama-cpp-sdk-dynbatch2',
                        'byop_llamacpp.zip',
                        framework=Framework.CUSTOM,
                        input_schema=input_schema,
                        output_schema=output_schema
                        ).configure(input_schema=input_schema,
                                    output_schema=output_schema,
                                    dynamic_batching_config=DynamicBatchingConfig(max_batch_delay_ms=1000,
                                                                                  batch_size_target=8)
                                    )
model
Waiting for model loading - this will take up to 10.0min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime..............successful
Ready
Name | llama-cpp-sdk-dynbatch2 |
Version | 0fb39697-c5ee-4c91-8346-3d05783efe19 |
File Name | byop_llamacpp.zip |
SHA | e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5713 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-03-Oct 19:22:28 |
Workspace id | 28 |
Workspace name | younes.amar@wallaroo.ai - Default Workspace |
Deploy LLM with Dynamic Batch Configuration
Deploying an LLM with a Dynamic Batch configuration requires the same steps as deploying an LLM without one:
- Define the deployment configuration to set the number of CPUs and amount of RAM per replica.
- Create a Wallaroo pipeline and add the LLM with the Dynamic Batch configuration as a model step.
- Deploy the Wallaroo pipeline with the deployment configuration.
The deployment configuration sets what resources are allocated to the LLM upon deployment. For this CPU-based example, we allocate the following resources to the LLM's model container:
- cpus: 4
- memory: 10Gi
# Engine resources: 1 CPU and 2Gi of RAM for the Wallaroo engine itself.
# Sidekick resources: 4 CPUs and 10Gi of RAM for the LLM's model container.
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .build()
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps determine how inference data is provided to the LLM. For Dynamic Batching, only one pipeline step is allowed.
The following demonstrates creating a Wallaroo pipeline, assigning the LLM as a pipeline step, and deploying it. Note that the Dynamic Batch Configuration is not specified during deployment: it was assigned to the LLM at upload, and the deployment inherits those settings.
pipeline = wl.build_pipeline("llamacpp-pipeyns-dynbatch2")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.4.3.14',
'name': 'engine-6749ff446f-zftzd',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llamacpp-pipeyns-dynbatch2',
'status': 'Running',
'version': '56c41ea8-3a5d-44f4-9513-829ae544ab72'}]},
'model_statuses': {'models': [{'model_version_id': 124,
'name': 'llama-cpp-sdk-dynbatch2',
'sha': 'e44db803330cdfdb889c79fb6b5297bccd2b81640d5023b05db9b3845b31e91b',
'status': 'Running',
'version': '0fb39697-c5ee-4c91-8346-3d05783efe19'}]}}],
'engine_lbs': [{'ip': '10.4.2.5',
'name': 'engine-lb-6b59985857-qtcfd',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.4.0.5',
'name': 'engine-sidekick-llama-cpp-sdk-dynbatch2-124-74958d9794-cqgsk',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Sample Inference
Once the LLM is deployed, we'll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we'll create a pandas DataFrame with a text query and submit that for our inference request.
data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
" Roland Garros, also known as the French Open, is a major tennis tournament that takes place in Paris, France every June. It is one of the four Grand Slam tennis tournaments held annually around the world, along with the Australian Open, Wimbledon, and the US Open. The tournament is named after the French aviator Roland Garros, who was a pioneer in the field of aircraft design and construction. The tournament was first played in 1891 and has been held continuously ever since, except for a few years during World War I and II. It is one of the most prestigious tennis tournaments in the world and attracts many of the top players from around the globe. The tournament is played on clay courts, which are known for their slow speed and high traction, making it a challenging surface for players to navigate. The Roland Garros tournament typically takes place over a two-week period in late May and early June, with the men's and women's singles competitions being the most highly anticipated events."
Undeploy LLM
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()