Llama 3 8B Instruct with vLLM
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Llama 3 8B Instruct Inference with vLLM
The following tutorial demonstrates deploying the Llama 3 8B Instruct LLM with vLLM in Wallaroo. This tutorial focuses on:
- Uploading the model.
- Preparing the model for deployment.
- Deploying the model and performing inferences.
For access to these sample models and for a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Context
This BYOP model uses the vLLM library and the Llama 3 8B Instruct LLM. Once deployed, it accepts a text prompt from the user and generates a text response appropriate to the prompt.
What is vLLM?
vLLM is an open-source serving engine designed to improve the efficiency and performance of deploying large language models (LLMs). Its core technique is an attention algorithm known as PagedAttention, which stores attention keys and values in small fixed-size blocks rather than in one contiguous region per request, reducing memory waste and boosting throughput compared to traditional methods.
One of the key advantages of vLLM is its throughput: the vLLM project reports up to 24 times higher throughput than Hugging Face Transformers, a widely used LLM library. This allows serving a larger number of users with fewer computational resources, making vLLM an attractive option for organizations looking to optimize their LLM deployments.
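To make the block-based idea concrete, the following is a toy sketch of the bookkeeping only, not vLLM's implementation. The class name `PagedKVCache`, the block size of 16 tokens, and the method names are assumptions chosen for illustration. Each sequence keeps a block table that maps its logical blocks to physical blocks, and a physical block is allocated only when the previous one is full, so at most one partially filled block per sequence is left unused.

```python
class PagedKVCache:
    """Toy block allocator illustrating paged KV-cache bookkeeping (hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token and return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # Every allocated block is full (or none exists yet): take a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self, seq_id: str) -> int:
        """Number of reserved but unused token slots for a sequence."""
        return len(self.block_tables[seq_id]) * self.block_size - self.lengths[seq_id]


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")
for _ in range(5):
    cache.append_token("request-b")
print(cache.block_tables)
print(cache.wasted_slots("request-a"))
```

In this sketch the two requests share one pool of blocks, and a finished request hands its blocks back with `free`, which is the property that lets more requests fit in the same memory.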
Model Overview
The LLM used in this demonstration has the following attributes.
- Framework: vllm for more optimized model deployment, uploaded to Wallaroo in the Wallaroo Custom Model, aka Bring Your Own Predict (BYOP), framework.
- Artifacts: The original model is the Llama 3 8B Instruct Hugging Face model: Llama 3 8B Instruct
- Input/Output Types: Both the input and outputs are text.
Implementation Details
For our sample vLLM deployment, the original model is encapsulated in the Wallaroo BYOP framework with the following adjustments.
vLLM Library Installation
To run vLLM on CUDA, a specific vLLM Python wheel is used with an extra index to install the proper library. To accommodate this, the following pip install code is executed directly in the BYOP Python script to install vLLM via the subprocess library:
import subprocess
import sys

# Install the CUDA 11.8 build of vLLM 0.5.2, using the PyTorch cu118 index as an extra index.
pip_command = (
    f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
)
subprocess.check_call(pip_command, shell=True)
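As an alternative sketch of the same install step, the command can be passed to pip as an argument list, which avoids `shell=True`, and the install can be skipped when `vllm` is already importable. The helper names `build_pip_command` and `ensure_vllm` are hypothetical; the wheel URL and index URL are the ones used above.

```python
import importlib.util
import subprocess
import sys

VLLM_WHEEL = (
    "https://github.com/vllm-project/vllm/releases/download/v0.5.2/"
    "vllm-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl"
)
TORCH_INDEX = "https://download.pytorch.org/whl/cu118"


def build_pip_command(wheel: str = VLLM_WHEEL, index: str = TORCH_INDEX) -> list:
    """Return the pip invocation as an argument list (no shell parsing needed)."""
    return [sys.executable, "-m", "pip", "install", wheel, "--extra-index-url", index]


def ensure_vllm() -> bool:
    """Install the wheel only if vllm is not importable; return True if pip ran."""
    if importlib.util.find_spec("vllm") is not None:
        return False
    subprocess.check_call(build_pip_command())
    return True
```

Calling `ensure_vllm()` at the top of the BYOP script would then replace the `check_call` line; whether skipping a repeated install is desirable depends on how the script is packaged.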
Model Loading
Loading the model uses vLLM with the original model weights, which are found on the Llama 3 8B Instruct model page.
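from vllm import LLM

def _load_model(self, model_path):
    # Load the original Llama 3 8B Instruct weights from the BYOP artifacts directory.
    llm = LLM(
        model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
    )
    return llm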
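Because a wrong path only surfaces when vLLM tries to read the weights, it can help to check the directory before constructing the engine. The following is a minimal sketch: the helper `resolve_weights_dir` is hypothetical, and the check for a `config.json` file reflects the usual Hugging Face model layout rather than anything Wallaroo requires.

```python
from pathlib import Path


def resolve_weights_dir(model_path: str,
                        subdir: str = "artifacts/Meta-Llama-3-8B-Instruct") -> Path:
    """Return the weights directory under model_path, failing early if it looks wrong."""
    weights_dir = Path(model_path) / subdir
    if not weights_dir.is_dir():
        raise FileNotFoundError(f"weights directory not found: {weights_dir}")
    if not (weights_dir / "config.json").is_file():
        raise FileNotFoundError(f"config.json missing in: {weights_dir}")
    return weights_dir
```

The returned path could then be passed to the engine as `LLM(model=str(resolve_weights_dir(model_path)))`.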
Tutorial Steps
Import Libraries
We start by importing the required libraries. This includes the following:
- Wallaroo SDK: Used to upload and deploy the model in Wallaroo.
- pyarrow: Used to define the input and output schemas of models uploaded to Wallaroo.
- pandas: Data is submitted to models deployed in Wallaroo as either Apache Arrow Table format or pandas Record Format as a DataFrame.
import json
import os
import base64
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
A connection to Wallaroo is set through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Upload the Model
For this example, the model is uploaded via the Wallaroo MLOps API. To save time, we use the Wallaroo SDK method wallaroo.client.Client.generate_upload_model_api_command. This generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API, and takes the following parameters:
Parameter | Type | Description |
---|---|---|
base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com . |
name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions. |
path | String (Required) | Path to the ML or LLM model file. |
framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models. |
input_schema | String (Required) | The model’s input schema in PyArrow.Schema format. |
output_schema | String (Required) | The model’s output schema in PyArrow.Schema format. |
This generates an output similar to the following, used to upload the model via the Wallaroo MLOps API.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer abcdefg" \
-F "metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": 8, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" \
-F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream" \
https://doc-test.wallarooexample.ai/v1/api/models/upload_and_convert
# define the input and output schemas
import wallaroo.framework
input_schema = pa.schema([
    pa.field("text", pa.string()),
])
output_schema = pa.schema([
    pa.field("generated_text", pa.string()),
])
# generate the curl command used to upload the model
wl.generate_upload_model_api_command(
    base_url=wl.api_endpoint,
    name="byop-llama-8b-v2",
    path="byop-llama3-8b-instruct-vllm.zip",
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
'curl --progress-bar -X POST -H "Content-Type: multipart/form-data" -H "Authorization: Bearer abcdefg" -F "metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": 8, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" -F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream" https://doc-test.wallarooexample.ai/v1/api/models/upload_and_convert'
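The input_schema and output_schema strings in the generated metadata appear to be base64 encodings of the serialized PyArrow schemas. The following stdlib-only sketch checks that reading: decoding them yields bytes that begin with the Arrow IPC continuation marker and contain the field names defined in the schemas. The helper name `decode_schema_bytes` is an assumption for illustration. With pyarrow installed, passing these bytes to `pa.ipc.read_schema` should recover the schema objects, though that step is not shown or verified here.

```python
import base64

# Values copied from the metadata of the generated curl command.
INPUT_SCHEMA_B64 = "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA"
OUTPUT_SCHEMA_B64 = "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="


def decode_schema_bytes(encoded: str) -> bytes:
    """Base64-decode a schema string taken from the upload metadata."""
    return base64.b64decode(encoded)


input_bytes = decode_schema_bytes(INPUT_SCHEMA_B64)
output_bytes = decode_schema_bytes(OUTPUT_SCHEMA_B64)

# An encapsulated Arrow IPC message starts with the 0xFFFFFFFF continuation marker.
print(input_bytes[:4] == b"\xff\xff\xff\xff")
# The field names from the schema definitions appear in the decoded bytes.
print(b"text" in input_bytes, b"generated_text" in output_bytes)
```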
Retrieve the Model
Once the model is uploaded, we retrieve it through the wallaroo.client.Client.get_model method.
llm = wl.get_model("byop-llama-8b-v2")
Deploy the LLM
The LLM is deployed through the following process:
- Define the Deployment Configuration: This sets what resources are allocated for the LLM’s use from the clusters.
- Create a Wallaroo Pipeline and Set the LLM as a Pipeline Step: This sets the process for how inference inputs are passed through deployed LLMs and supporting ML models.
- Deploy the LLM: This deploys the LLM with the defined deployment configuration and pipeline steps.
Define the Deployment Configuration
For this step, the following resources are defined for allocation to the LLM when deployed through the class wallaroo.deployment_config.DeploymentConfigBuilder:
- Cpus: 4
- Memory: 10 Gi
- Gpus: 1. When setting gpus for deployment, the deployment_label must be defined to select the appropriate nodepool with the requested gpu resources.
The configuration also sets 1 CPU and 2 Gi of memory through cpus(1) and memory('2Gi'), separate from the LLM's allocation.
deployment_config = wallaroo.deployment_config.DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 4) \
    .sidekick_memory(llm, '10Gi') \
    .sidekick_gpus(llm, 1) \
    .deployment_label("wallaroo.ai/accelerator:a1002") \
    .build()
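The configuration above contains two sets of requests: one made through `cpus(1).memory('2Gi')` and one made for the LLM through the `sidekick_*` methods. As a rough planning aid, the sketch below totals them. It assumes the two sets of requests simply add up when sizing a node, which is an assumption about scheduling rather than documented Wallaroo behavior, and the names `ResourceRequest` and `total` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Resources requested for one part of the deployment (hypothetical helper)."""
    cpus: float
    memory_gi: float
    gpus: int = 0


def total(*requests: ResourceRequest) -> ResourceRequest:
    """Sum CPU, memory, and GPU requests across all parts of the deployment."""
    return ResourceRequest(
        cpus=sum(r.cpus for r in requests),
        memory_gi=sum(r.memory_gi for r in requests),
        gpus=sum(r.gpus for r in requests),
    )


base = ResourceRequest(cpus=1, memory_gi=2)                   # cpus(1).memory('2Gi')
llm_container = ResourceRequest(cpus=4, memory_gi=10, gpus=1)  # sidekick_* settings
print(total(base, llm_container))
```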
Create Pipeline and Steps
In this step, the Wallaroo pipeline is established with the LLM set as the pipeline step.
pipeline = wl.build_pipeline("vllm-pipe-v9")
pipeline.add_model_step(llm)
Deploy the LLM
With the Wallaroo pipeline created and the deployment configuration set, we deploy the LLM and set the deployment configuration to allocate the appropriate resources for the LLM’s exclusive use.
pipeline.deploy(deployment_config=deployment_config)
Inference
Inference requests are submitted to deployed LLMs either as Apache Arrow Tables or pandas DataFrames. For this example, a pandas DataFrame is submitted through the wallaroo.pipeline.Pipeline.infer method.
For this example, the start and end times are collected to determine how long the inference request took.
data = pd.DataFrame({"text": ["Tell me about XAI."]})
import time
start = time.time()
result = pipeline.infer(data, timeout=10000)
end = time.time()
end - start
result["out.generated_text"].values[0]
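The timing pattern above can be wrapped in a small reusable helper. The following is a sketch under stated assumptions: `timed_call` and `fake_infer` are hypothetical names, `fake_infer` is only a stand-in so the example runs without a deployed pipeline, and `time.perf_counter` is used in place of `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_infer(prompt: str) -> str:
    """Stand-in for pipeline.infer so the sketch runs without a deployment."""
    time.sleep(0.01)
    return f"echo: {prompt}"


reply, elapsed = timed_call(fake_infer, "Tell me about XAI.")
print(reply, f"{elapsed:.3f}s")
```

With a deployed pipeline, the same helper could wrap the real call as `timed_call(pipeline.infer, data, timeout=10000)`.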
Undeploy the LLM
With the example completed, the LLM is undeployed and the resources returned to the cluster.
pipeline.undeploy()