Autoscaling with Llama 3 8B and Llama.cpp
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Autoscale Triggers with Llama 3 8B with Llama.cpp Tutorial
Wallaroo deployment configurations set what resources are allocated to LLMs for inference requests. Autoscale triggers provide LLMs greater flexibility by:
- Increasing resources to LLMs based on scale up and scale down triggers. This decreases inference latency when more requests come in, then spins idle resources back down to save on costs.
- Smoothing the allocation of resources through optional autoscaling windows that allow scaling up and down over a longer period of time, preventing sudden resource spikes and drops.
Autoscale triggers work through deployment configurations that have minimum and maximum autoscale replicas set by the parameter replica_autoscale_min_max. The default minimum is 0 replicas. Resources are scaled as follows:
- 0 Replicas up: If there are 1 or more inference requests in the queue, 1 replica is spun up to process the requests in the queue. Additional resources are spun up based on the autoscale_cpu_utilization setting, where additional replicas are spun up or down when average CPU utilization across all replicas passes the autoscale_cpu_utilization percentage.
- If scale_up_queue_depth is set: scale_up_queue_depth is based on the number of requests in the queue plus the requests currently being processed, divided by the number of available replicas. If this threshold is exceeded, then additional replicas are spun up based on the autoscaling_window default of 300 seconds.
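The queue-depth rule above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Wallaroo's autoscaler: the function names are hypothetical, the scale-down comparison (depth falling below scale_down_queue_depth) is an assumption, and the averaging over the autoscaling_window is ignored.

```python
import math


def queue_depth_per_replica(queued: int, in_flight: int, replicas: int) -> float:
    """(requests in the queue + requests being processed) / available replicas."""
    if replicas <= 0:
        # With no replicas, any pending request is treated as unbounded depth
        # so that the first replica is requested.
        return math.inf if (queued + in_flight) > 0 else 0.0
    return (queued + in_flight) / replicas


def scaling_decision(queued: int, in_flight: int, replicas: int,
                     scale_up: int = 5, scale_down: int = 1,
                     min_replicas: int = 0, max_replicas: int = 2) -> str:
    """Return 'scale_up', 'scale_down', or 'hold' for one observation."""
    depth = queue_depth_per_replica(queued, in_flight, replicas)
    if depth > scale_up and replicas < max_replicas:
        return "scale_up"
    # Assumption: scale down once the depth drops below scale_down.
    if depth < scale_down and replicas > min_replicas:
        return "scale_down"
    return "hold"


# 8 queued + 4 in flight on 1 replica gives a depth of 12, above the threshold of 5.
print(scaling_decision(queued=8, in_flight=4, replicas=1))
```

The defaults mirror the values used later in this tutorial (scale up at 5, scale down at 1, 0 to 2 replicas).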
This tutorial demonstrates deploying a Llama V3 8B LLM quantized with Llama.cpp in Wallaroo through the following steps:
- Uploading the LLM to Wallaroo.
- Defining the autoscale triggers and deploying the LLM with that configuration.
- Performing sample inferences on the deployed LLM.
For access to these sample models and for a demonstration of how to use LLM Listener Monitoring to monitor LLM performance and outputs:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Requirements
This tutorial requires the following:
- Llama V3 8B quantized with llama-cpp, encapsulated in the Wallaroo Custom Model (also known as BYOP) Framework. This is available through a Wallaroo representative.
- Wallaroo version 2024.3 and above.
Tutorial Steps
Import libraries
The first step is to import the libraries required.
import wallaroo
import pyarrow as pa
import pandas as pd
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.object import EntityNotFoundError
Connect to the Wallaroo Instance
A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Create the Workspace
We create or retrieve a workspace named llamacpp-testing, then set it as the current workspace.
# set workspace to `llamacpp-testing`
workspace = wl.get_workspace("llamacpp-testing", create_if_not_exist=True)
wl.set_current_workspace(workspace)
{'name': 'llamacpp-testing', 'id': 37, 'archived': False, 'created_by': 'gabriel.sandu@wallaroo.ai', 'created_at': '2024-10-09T15:47:16.888728+00:00', 'models': [{'name': 'byop-llama3-q2-max-tokens', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 9, 15, 51, 19, 67945, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 9, 15, 51, 19, 67945, tzinfo=tzutc())}, {'name': 'byop-llama3-q2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 9, 16, 1, 16, 48599, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 9, 16, 1, 16, 48599, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 20, 25, 52, 719031, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 20, 25, 52, 719031, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 21, 1, 19, 192231, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 21, 1, 19, 192231, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v3', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 21, 47, 11, 671499, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 21, 47, 11, 671499, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v4', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 22, 9, 25, 155660, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 9, 25, 155660, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-q5-v5', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 22, 23, 50, 164895, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 23, 50, 164895, tzinfo=tzutc())}, {'name': 'byop-llamacpp-llama3-8b-instruct-q5', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 15, 
22, 39, 0, 340823, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 15, 22, 39, 0, 340823, tzinfo=tzutc())}, {'name': 'byop-llama3-8b-vllm', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2024, 10, 16, 12, 59, 43, 872101, tzinfo=tzutc()), 'created_at': datetime.datetime(2024, 10, 16, 12, 59, 43, 872101, tzinfo=tzutc())}], 'pipelines': [{'name': 'scale-test-cp', 'create_time': datetime.datetime(2024, 10, 15, 13, 45, 48, 652485, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-v1', 'create_time': datetime.datetime(2024, 10, 15, 20, 31, 11, 992396, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm', 'create_time': datetime.datetime(2024, 10, 11, 18, 37, 35, 718535, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-v2', 'create_time': datetime.datetime(2024, 10, 15, 20, 42, 26, 447368, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-jb', 'create_time': datetime.datetime(2024, 10, 9, 16, 23, 29, 380756, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm-3', 'create_time': datetime.datetime(2024, 10, 18, 18, 48, 55, 887911, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp-t4', 'create_time': datetime.datetime(2024, 10, 15, 20, 46, 52, 471972, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp', 'create_time': datetime.datetime(2024, 10, 15, 22, 11, 44, 69111, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp2', 'create_time': datetime.datetime(2024, 10, 15, 22, 25, 0, 995668, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'llama3-8b-instruct-llamacpp3', 'create_time': datetime.datetime(2024, 10, 15, 22, 40, 20, 354793, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'scale-test-fm-2', 'create_time': datetime.datetime(2024, 10, 14, 15, 48, 31, 681620, tzinfo=tzutc()), 'definition': '[]'}]}
Retrieve Model
In this example, the model is already uploaded to this workspace. We retrieve it with the wallaroo.client.Client.get_model method.
model = wl.get_model("byop-llamacpp-llama3-8b-instruct-q5")
model
Name | byop-llamacpp-llama3-8b-instruct-q5 |
---|---|
Version | 4511af71-bdcb-4604-85c0-10ef31a2e319 |
File Name | byop-llamacpp-llama3-8b-q5.zip |
SHA | f15edeab3c7fbf08579703cebc415d33085dbfe08eeae2472f8442a2a2124aea |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2024.3.0-main-5739 |
Architecture | x86 |
Acceleration | none |
Updated At | 2024-15-Oct 22:39:51 |
Workspace id | 37 |
Workspace name | llamacpp-testing |
Deploy the LLM
The LLM is deployed through the following process:
- Create a Wallaroo Pipeline and Set the LLM as a Pipeline Step: This sets the process for how inference inputs are passed through deployed LLMs and supporting ML models.
- Define the Deployment Configuration: This sets what resources are allocated for the LLM’s use from the clusters.
- Deploy the LLM: This deploys the LLM with the defined deployment configuration and pipeline steps.
Build Pipeline and Set Steps
In this process, we create the pipeline, then assign the LLM as a pipeline step to receive inference data and process it.
pipeline_name = "scale-test"
pipeline = wl.build_pipeline(pipeline_name)
pipeline.add_model_step(model)
name | scale-test |
---|---|
created | 2024-10-23 13:04:03.411687+00:00 |
last_updated | 2024-10-23 13:32:37.517314+00:00 |
deployed | False |
workspace_id | 37 |
workspace_name | llamacpp-testing |
arch | x86 |
accel | none |
tags | |
versions | 60bc4c5d-a4ee-48ae-948e-b3bf3aab1da9, 90bdfec7-7dfe-4cea-b8b1-10b104fb91f8, 725b9957-8e4a-409a-ad46-122b0016a4c9 |
steps | byop-llama3-8b-vllm |
published | False |
Define the Deployment Configuration with Autoscaling Triggers
For this step, the following resources are defined for allocation to the LLM when deployed through the class wallaroo.deployment_config.DeploymentConfigBuilder:
- Cpus: 4
- Memory: 6 Gi
- Gpus: 1. When setting gpus for deployment, the deployment_label must be defined to select the appropriate nodepool with the requested gpu resources.
As part of the deployment configuration, we set the autoscale triggers with the following parameters.
Parameter | Type | Description |
---|---|---|
scale_up_queue_depth | (queue_depth: int) | The threshold at which additional deployment resources are scaled up. This requires that the deployment configuration parameter replica_autoscale_min_max is set. scale_up_queue_depth is determined by the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window). This field overrides the deployment configuration parameter autoscale_cpu_utilization. The scale_up_queue_depth applies to all resources in the deployment configuration. |
scale_down_queue_depth | (queue_depth: int), Default: 1 | Only applies when scale_up_queue_depth is configured. Scales down resources based on the formula (number of requests in the queue + requests being processed) / (the number of available replicas set over the autoscaling_window). |
autoscaling_window | (window_seconds: int) (Default: 300, Minimum allowed: 60) | The period over which to scale up or scale down resources. Only applies when scale_up_queue_depth is configured. |
replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Provides replicas to be scaled from 0 to some maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. |
For our example:
- scale_up_queue_depth: 5
- scale_down_queue_depth: 1
- autoscaling_window: 60 (seconds)
- replica_autoscale_min_max: 2 (maximum), 0 (minimum)
- Resources per replica:
  - Cpus: 4
  - Gpu: 1
  - Memory: 6Gi
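# GPU deployment with autoscale triggers
deployment_config = (
    DeploymentConfigBuilder()
    .cpus(1).memory('2Gi')                           # resources for the Wallaroo engine
    .sidekick_cpus(model, 4)                         # per-replica resources for the LLM container
    .sidekick_memory(model, '6Gi')
    .sidekick_gpus(model, 1)
    .deployment_label("wallaroo.ai/accelerator:t4")  # nodepool that provides the requested GPU
    .replica_autoscale_min_max(2, 0)                 # maximum 2 replicas, minimum 0
    .scale_up_queue_depth(5)
    .scale_down_queue_depth(1)
    .autoscaling_window(60)                          # seconds
    .build()
)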
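The constraints listed in the parameter table can be checked before a configuration is built. The following is a minimal sketch in plain Python, independent of the Wallaroo SDK: the AutoscaleSettings class and its problems method are hypothetical names introduced for illustration, and the check that the minimum does not exceed the maximum is a sanity assumption rather than a documented rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AutoscaleSettings:
    """Hypothetical container mirroring the autoscale parameters in the table."""
    maximum_replicas: int
    minimum_replicas: int = 0
    scale_up_queue_depth: Optional[int] = None
    scale_down_queue_depth: int = 1   # documented default
    autoscaling_window: int = 300     # documented default, in seconds

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means none were found."""
        found = []
        # Sanity assumption: 0 <= minimum <= maximum.
        if not 0 <= self.minimum_replicas <= self.maximum_replicas:
            found.append("replica minimum must be between 0 and the maximum")
        # Documented: minimum allowed window is 60 seconds.
        if self.autoscaling_window < 60:
            found.append("autoscaling_window must be at least 60 seconds")
        # Documented: these two settings only apply with scale_up_queue_depth.
        if self.scale_up_queue_depth is None and (
            self.scale_down_queue_depth != 1 or self.autoscaling_window != 300
        ):
            found.append(
                "scale_down_queue_depth and autoscaling_window only apply "
                "when scale_up_queue_depth is set"
            )
        return found


# The values used in this tutorial pass every check.
print(AutoscaleSettings(2, 0, 5, 1, 60).problems())
```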
Deploy the Pipeline
With the parameters set and the deployment configuration with autoscale triggers defined, we deploy the LLM through the pipeline.deploy method and specify the deployment configuration.
pipeline.deploy(deployment_config=deployment_config)
Verify Pipeline Deployment Status
Before submitting inference requests, we verify the pipeline deployment status is Running.
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.4.7.8',
'name': 'engine-74c54c9478-7vs6l',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'scale-test',
'status': 'Running',
'version': '68ea1561-f5f9-42b6-b0c4-800922dc27af'}]},
'model_statuses': {'models': [{'model_version_id': 19,
'name': 'byop-llamacpp-llama3-8b-instruct-q5',
'sha': 'f15edeab3c7fbf08579703cebc415d33085dbfe08eeae2472f8442a2a2124aea',
'status': 'Running',
'version': '4511af71-bdcb-4604-85c0-10ef31a2e319'}]}}],
'engine_lbs': [{'ip': '10.4.1.26',
'name': 'engine-lb-6b59985857-97sdr',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.4.7.9',
'name': 'engine-sidekick-byop-llamacpp-llama3-8b-instruct-q5-19-85ctstwl',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
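Rather than checking the status by hand, a small polling helper can wait for the Running state. The sketch below assumes only what the output above shows: pipeline.status() returns a dictionary with a top-level status key. The helper name wait_until_running and its timeout and polling defaults are hypothetical choices, not part of the Wallaroo SDK.

```python
import time


def wait_until_running(pipeline, timeout_s: float = 600.0, poll_s: float = 5.0,
                       sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll pipeline.status() until it reports 'Running' or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        status = pipeline.status()
        if status.get("status") == "Running":
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"pipeline not Running after {timeout_s}s; "
                f"last status: {status.get('status')!r}"
            )
        sleep(poll_s)
```

With the pipeline from this tutorial, a call would look like wait_until_running(pipeline). The sleep and clock arguments exist only so the helper can be exercised without real delays.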
Sample Inference
Once the LLM is deployed, we’ll perform an inference with the wallaroo.pipeline.Pipeline.infer method, which accepts either a pandas DataFrame or an Apache Arrow table.
For this example, we’ll create a pandas DataFrame with a text query and submit that for our inference request.
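# single-row DataFrame with the prompt in the `text` column
data = pd.DataFrame({'text': ['Describe what roland garros is']})
# generous timeout because LLM generation can be slow
result = pipeline.infer(data, timeout=10000)
result["out.generated_text"][0]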
' Roland Garros, also known as the French Open, is a prestigious Grand Slam tennis tournament held annually in Paris, France. It\'s one of the four majors in professional tennis and is considered one of the most iconic and challenging tournaments in the sport.\n\nRoland Garros takes place over two weeks in late May and early June on clay courts at the Stade Roland-Garros stadium. The event has a rich history, dating back to 1891, and is often referred to as the "most romantic" Grand Slam due to its unique atmosphere and stunning surroundings.\n\nThe tournament is named after Roland Garros, a French aviator, engineer, and writer who was also an avid tennis player. He was a pioneer in aviation and was credited with being the first pilot to cross the Mediterranean Sea by air.\n\nRoland Garros features five main events: men\'s singles, women\'s singles, men\'s doubles, women\'s doubles, and mixed doubles. The tournament attracts some of the world\'s top tennis players, with many considering it a highlight of their professional careers.\n\nThe French Open is known for its challenging conditions, particularly on the clay courts, which are renowned for their slow pace and high bounce. This requires players to have strong footwork, endurance, and tactical awareness to outmaneuver their opponents.\n\nThroughout the tournament, fans can expect thrilling matches, dramatic upsets, and memorable moments that often define the careers of tennis superstars. Roland Garros is truly a special event in the world of tennis, and its rich history, stunning atmosphere, and iconic status make it an unforgettable experience for players and spectators alike.'
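A single request will not exercise the queue-depth triggers. One way to build up a queue is to submit several requests at once, sketched here with the standard library thread pool. The helper infer_concurrently is a hypothetical name, and whether the Wallaroo client is safe to call from multiple threads is an assumption this tutorial does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def infer_concurrently(infer, payloads, max_workers: int = 8) -> list:
    """Call infer(payload) for each payload on a thread pool.

    Results are returned in the same order as the payloads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer, payloads))


# Stand-in callable so the sketch runs without a deployed pipeline.
print(infer_concurrently(lambda prompt: prompt.upper(), ["a", "b", "c"]))
```

With the deployed pipeline, the callable could be lambda df: pipeline.infer(df, timeout=10000) and the payloads a list of DataFrames such as [data] * 10.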
Undeploy LLM
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 45s .................................... ok
name | scale-test-jb |
---|---|
created | 2024-10-09 16:23:29.380756+00:00 |
last_updated | 2024-10-11 17:14:39.088831+00:00 |
deployed | False |
workspace_id | 37 |
workspace_name | llamacpp-testing |
arch | x86 |
accel | none |
tags | |
versions | a1301693-88cb-4219-a037-ce009e030aa4, 428d18a3-8321-4aef-9d41-b00c97aab6f6, 513ce8ec-0eae-47ee-b1d3-3892d9f0b8f9, 545c1850-3403-41ea-9ed9-5fd820e55f50, 6e612ae7-25ad-4df9-b91c-9e6e11d69506, 2319a92a-14ae-419a-864f-63a2b6911cd4, 8f73fc75-d6f4-432f-a77b-f5eb588f4696, 98336b0a-7503-4ace-ad52-dff236909420 |
steps | byop-llama3-q2-max-tokens |
published | False |