Autoscaling with Llama 3 8B and Llama.cpp


This tutorial and its assets can be downloaded as part of the Wallaroo Tutorials repository.

Autoscale Triggers with Llama 3 8B and Llama.cpp Tutorial

Wallaroo deployment configurations set what resources are allocated to LLMs for inference requests. Autoscale triggers provide LLMs greater flexibility by:

  • Increasing resources allocated to LLMs based on scale-up and scale-down triggers. This decreases inference latency when more requests come in, then spins idle resources back down to save on costs.
  • Smoothing the allocation of resources through an optional autoscaling window, which scales up and down over a longer period of time to prevent sudden resource spikes and drops.

Autoscale triggers work through deployment configurations that have minimum and maximum autoscale replicas set by the parameter replica_autoscale_min_max. The default minimum is 0 replicas. Resources are scaled as follows:

  • Scaling up from 0 replicas: If there are one or more inference requests in the queue, one replica is spun up to process the requests in the queue. Additional replicas are then spun up or down based on the autoscale_cpu_utilization setting, when the average CPU utilization across all replicas passes the autoscale_cpu_utilization percentage.
  • If scale_up_queue_depth is set: The queue depth is the number of requests in the queue plus the requests currently being processed, divided by the number of available replicas, measured over the autoscaling_window (default: 300 seconds). If this threshold is exceeded, additional replicas are spun up, as illustrated in the sketch below.
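
As a rough illustration of the trigger logic (a hypothetical sketch, not Wallaroo's internal implementation), the queue depth calculation and the scale-up decision can be expressed in Python as follows:

# Hypothetical sketch of the scale_up_queue_depth calculation.
def queue_depth(queued_requests: int, in_flight_requests: int, available_replicas: int) -> float:
    """Average load per replica, measured over the autoscaling window."""
    return (queued_requests + in_flight_requests) / max(available_replicas, 1)

# Example: 10 queued requests, 2 in flight, 2 available replicas -> depth of 6.0.
# With scale_up_queue_depth set to 5, the threshold is exceeded and another
# replica is spun up (up to the configured maximum).
print(queue_depth(10, 2, 2))  # 6.0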

This tutorial focuses on demonstrating deploying a Llama V3 8B quantized with Llama.cpp LLM with Wallaroo through the following steps:

  • Uploading the LLM to Wallaroo.
  • Defining the autoscale triggers and deploying the LLM with that configuration.
  • Performing sample inferences on the deployed LLM.

For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs, contact your Wallaroo support representative.

Requirements

This tutorial requires the following:

  • Llama V3 8B quantized with llama.cpp, encapsulated in the Wallaroo Arbitrary Python (BYOP) framework. This is available through a Wallaroo representative.
  • Wallaroo version 2024.3 and above.

Tutorial Steps

Import libraries

The first step is to import the libraries required.

import wallaroo
import pyarrow as pa
import pandas as pd

from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.object import EntityNotFoundError

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Create the Workspace

We will create or retrieve the workspace llamacpp-testing, then set it as the current workspace.

# set workspace to `llamacpp-testing`
workspace = wl.get_workspace("llamacpp-testing", create_if_not_exist=True)
wl.set_current_workspace(workspace)
{'name': 'llamacpp-testing', 'id': 37, 'archived': False, 'created_by': 'gabriel.sandu@wallaroo.ai', 'created_at': '2024-10-09T15:47:16.888728+00:00', 'models': [{'name': 'byop-llama3-q2-max-tokens', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 9, 15, 51, 19, 67945, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 9, 15, 51, 19, 67945, tzinfo=tzutc())}, {'name': 'byop-llama3-q2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 9, 16, 1, 16, 48599, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 9, 16, 1, 16, 48599, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 20, 25, 52, 719031, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 20, 25, 52, 719031, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 21, 1, 19, 192231, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 21, 1, 19, 192231, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v3', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 21, 47, 11, 671499, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 21, 47, 11, 671499, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v4', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 22, 9, 25, 155660, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 9, 25, 155660, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v5', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 22, 23, 50, 164895, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 23, 50, 164895, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-instruct-q5', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 22, 39, 0, 340823, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 39, 0, 340823, tzinfo=tzutc())}, {'name': 'byop-llama3-8b-vllm', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 16, 12, 59, 43, 872101, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 16, 12, 59, 43, 872101, tzinfo=tzutc())}], 'pipelines': [{'name': 'scale-test-cp', 'create_time': datetime.datetime(2024, 10, 15, 13, 45, 48, 652485, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-v1', 'create_time': datetime.datetime(2024, 10, 15, 20, 31, 11, 992396, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm', 'create_time': datetime.datetime(2024, 10, 11, 18, 37, 35, 718535, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-v2', 'create_time': datetime.datetime(2024, 10, 15, 20, 42, 26, 447368, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-jb', 'create_time': datetime.datetime(2024, 10, 9, 16, 23, 29, 380756, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm-3', 'create_time': datetime.datetime(2024, 10, 18, 18, 48, 55, 887911, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-t4', 'create_time': datetime.datetime(2024, 10, 15, 20, 46, 52, 471972, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp', 'create_time': datetime.datetime(2024, 10, 15, 22, 11, 44, 69111, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp2', 'create_time': 
datetime.datetime(2024, 10, 15, 22, 25, 0, 995668, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp3', 'create_time': datetime.datetime(2024, 10, 15, 22, 40, 20, 354793, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm-2', 'create_time': datetime.datetime(2024, 10, 14, 15, 48, 31, 681620, tzinfo=tzutc()), 'definition': '[]'}]}

Retrieve Model

In this example, the model is already uploaded to this workspace. We retrieve it with the wallaroo.client.Client.get_model method.
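
If the LLM has not been uploaded yet, a typical upload uses the wallaroo.client.Client.upload_model method with the Custom (BYOP) framework and pyarrow schemas describing the model's text input and generated_text output. The following is a minimal sketch of that step; the schema fields are assumptions to adjust for your packaged model.

# Minimal upload sketch (assumed schemas); skip if the model is already uploaded.
input_schema = pa.schema([pa.field('text', pa.string())])
output_schema = pa.schema([pa.field('generated_text', pa.string())])

model = wl.upload_model(
    "byop-llamacpp-llama3-8b-instruct-q5",
    "byop-llamacpp-llama3-8b-q5.zip",
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)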

model = wl.get_model("byop-llamacpp-llama3-8b-instruct-q5")
model
Name: byop-llamacpp-llama3-8b-instruct-q5
Version: 4511af71-bdcb-4604-85c0-10ef31a2e319
File Name: byop-llamacpp-llama3-8b-q5.zip
SHA: f15edeab3c7fbf08579703cebc415d33085dbfe08eeae2472f8442a2a2124aea
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5739
Architecture: x86
Acceleration: none
Updated At: 2024-15-Oct 22:39:51
Workspace id: 37
Workspace name: llamacpp-testing

Deploy the LLM

The LLM is deployed through the following process:

  • Create a Wallaroo Pipeline and Set the LLM as a Pipeline Step: This sets the process for how inference inputs are passed through the deployed LLM and supporting ML models.
  • Define the Deployment Configuration: This sets what resources are allocated for the LLM’s use from the clusters.
  • Deploy the LLM: This deploys the LLM with the defined deployment configuration and pipeline steps.

Build Pipeline and Set Steps

In this process, we create the pipeline, then assign the LLM as a pipeline step to receive inference data and process it.

pipeline_name = "scale-test"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(model)
name: scale-test
created: 2024-10-23 13:04:03.411687+00:00
last_updated: 2024-10-23 13:32:37.517314+00:00
deployed: False
workspace_id: 37
workspace_name: llamacpp-testing
arch: x86
accel: none
tags:
versions: 60bc4c5d-a4ee-48ae-948e-b3bf3aab1da9, 90bdfec7-7dfe-4cea-b8b1-10b104fb91f8, 725b9957-8e4a-409a-ad46-122b0016a4c9
steps: byop-llama3-8b-vllm
published: False

Define the Deployment Configuration with Autoscaling Triggers

For this step, the following resources are allocated to the LLM through the wallaroo.deployment_config.DeploymentConfigBuilder class. In the configuration below, the LLM's resources are assigned through the sidekick_* options, which allocate resources to the LLM's container, while cpus and memory allocate resources to the Wallaroo engine itself:

  • CPUs: 4
  • Memory: 6 Gi
  • GPUs: 1. When setting GPUs for deployment, the deployment_label must be defined to select the appropriate nodepool with the requested GPU resources.

As part of the deployment configuration, we set the autoscale triggers with the following parameters.

  • scale_up_queue_depth (queue_depth: int): The threshold at which additional deployment resources are scaled up. Requires the deployment configuration parameter replica_autoscale_min_max to be set. scale_up_queue_depth is determined by the formula (number of requests in the queue + requests being processed) / (number of available replicas, measured over the autoscaling_window). This field overrides the deployment configuration parameter cpu_utilization, and applies to all resources in the deployment configuration.
  • scale_down_queue_depth (queue_depth: int, default: 1): Only applies when scale_up_queue_depth is configured. Scales down resources based on the same formula (number of requests in the queue + requests being processed) / (number of available replicas, measured over the autoscaling_window).
  • autoscaling_window (window_seconds: int, default: 300, minimum allowed: 60): The period over which resources are scaled up or down. Only applies when scale_up_queue_depth is configured.
  • replica_autoscale_min_max (maximum: int, minimum: int = 0): Allows replicas to scale from the minimum (default 0) up to a maximum number of replicas. Deployments spin up additional replicas as more resources are required, then spin them back down to save on resources and costs.

For our example:

  • scale_up_queue_depth: 5
  • scale_down_queue_depth: 1
  • autoscaling_window: 60 (seconds)
  • replica_autoscale_min_max: 2 (maximum), 0 (minimum)
  • Resources per replica:
    • CPUs: 4
    • GPUs: 1
    • Memory: 6 Gi

# GPU deployment configuration with autoscale triggers
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '6Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:t4") \
    .replica_autoscale_min_max(2,0) \
    .scale_up_queue_depth(5) \
    .scale_down_queue_depth(1) \
    .autoscaling_window(60) \
    .build()

Deploy the Pipeline

With the parameters set and the deployment configuration with autoscale triggers defined, we deploy the LLM through the pipeline.deploy method and specify the deployment configuration.

pipeline.deploy(deployment_config=deployment_config)

Verify Pipeline Deployment Status

Before submitting inference requests, we verify the pipeline deployment status is Running.

pipeline.status()
{'status': 'Running',
 'details': [],
 'engines': [{'ip': '10.4.7.8',
   'name': 'engine-74c54c9478-7vs6l',
   'status': 'Running',
   'reason': None,
   'details': [],
   'pipeline_statuses': {'pipelines': [{'id': 'scale-test',
      'status': 'Running',
      'version': '68ea1561-f5f9-42b6-b0c4-800922dc27af'}]},
   'model_statuses': {'models': [{'model_version_id': 19,
      'name': 'byop-llamacpp-llama3-8b-instruct-q5',
      'sha': 'f15edeab3c7fbf08579703cebc415d33085dbfe08eeae2472f8442a2a2124aea',
      'status': 'Running',
      'version': '4511af71-bdcb-4604-85c0-10ef31a2e319'}]}}],
 'engine_lbs': [{'ip': '10.4.1.26',
   'name': 'engine-lb-6b59985857-97sdr',
   'status': 'Running',
   'reason': None,
   'details': []}],
 'sidekicks': [{'ip': '10.4.7.9',
   'name': 'engine-sidekick-byop-llamacpp-llama3-8b-instruct-q5-19-85ctstwl',
   'status': 'Running',
   'reason': None,
   'details': [],
   'statuses': '\n'}]}
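
Because replicas scale up and down over the autoscaling window, it can be helpful to poll the deployment over time and count the running engine and sidekick containers as a rough proxy for replica count. The helper below is a hedged sketch built on the pipeline.status() output shown above; the function name, sample count, and polling interval are arbitrary.

import time

# Hypothetical helper: poll pipeline.status() and report how many engine and
# sidekick containers are currently running.
def watch_replicas(pipeline, samples=5, interval=30):
    for _ in range(samples):
        status = pipeline.status()
        engines = len(status.get('engines', []))
        sidekicks = len(status.get('sidekicks', []))
        print(f"{time.strftime('%H:%M:%S')} engines={engines} sidekicks={sidekicks}")
        time.sleep(interval)

watch_replicas(pipeline)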

Sample Inference

Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.

For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.

data = pd.DataFrame({'text': ['Describe what roland garros is']})
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]
' Roland Garros, also known as the French Open, is a prestigious Grand Slam tennis tournament held annually in Paris, France. It\'s one of the four majors in professional tennis and is considered one of the most iconic and challenging tournaments in the sport.\n\nRoland Garros takes place over two weeks in late May and early June on clay courts at the Stade Roland-Garros stadium. The event has a rich history, dating back to 1891, and is often referred to as the "most romantic" Grand Slam due to its unique atmosphere and stunning surroundings.\n\nThe tournament is named after Roland Garros, a French aviator, engineer, and writer who was also an avid tennis player. He was a pioneer in aviation and was credited with being the first pilot to cross the Mediterranean Sea by air.\n\nRoland Garros features five main events: men\'s singles, women\'s singles, men\'s doubles, women\'s doubles, and mixed doubles. The tournament attracts some of the world\'s top tennis players, with many considering it a highlight of their professional careers.\n\nThe French Open is known for its challenging conditions, particularly on the clay courts, which are renowned for their slow pace and high bounce. This requires players to have strong footwork, endurance, and tactical awareness to outmaneuver their opponents.\n\nThroughout the tournament, fans can expect thrilling matches, dramatic upsets, and memorable moments that often define the careers of tennis superstars. Roland Garros is truly a special event in the world of tennis, and its rich history, stunning atmosphere, and iconic status make it an unforgettable experience for players and spectators alike.'
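
To see the autoscale triggers fire, the deployment needs more concurrent requests than a single replica can drain within the 60 second autoscaling window. One way to generate that load from the SDK is to submit several inference requests in parallel; the following is a minimal sketch using the same pipeline.infer call as above, with placeholder prompts.

from concurrent.futures import ThreadPoolExecutor

# Hedged sketch: submit inference requests concurrently so the queue depth
# exceeds scale_up_queue_depth (5) and a second replica is spun up.
prompts = [
    pd.DataFrame({'text': [f'Describe Grand Slam tennis tournament number {i}']})
    for i in range(20)
]

def run_inference(frame):
    return pipeline.infer(frame, timeout=10000)

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_inference, prompts))

print(len(results), "inference results returned")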

Undeploy LLM

With the tutorial complete, we undeploy the LLM and return the resources back to the cluster.

pipeline.undeploy()
Waiting for undeployment - this will take up to 45s .................................... ok
name: scale-test-jb
created: 2024-10-09 16:23:29.380756+00:00
last_updated: 2024-10-11 17:14:39.088831+00:00
deployed: False
workspace_id: 37
workspace_name: llamacpp-testing
arch: x86
accel: none
tags:
versions: a1301693-88cb-4219-a037-ce009e030aa4, 428d18a3-8321-4aef-9d41-b00c97aab6f6, 513ce8ec-0eae-47ee-b1d3-3892d9f0b8f9, 545c1850-3403-41ea-9ed9-5fd820e55f50, 6e612ae7-25ad-4df9-b91c-9e6e11d69506, 2319a92a-14ae-419a-864f-63a2b6911cd4, 8f73fc75-d6f4-432f-a77b-f5eb588f4696, 98336b0a-7503-4ace-ad52-dff236909420
steps: byop-llama3-q2-max-tokens
published: False