Llama 3 8B Instruct with vLLM
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following tutorial demonstrates deploying a Llama 3 8B Instruct LLM with the vLLM
library. This LLM accepts a text prompt from a user and generates a text response.
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
This tutorial shows how to upload the vLLM, deploy it, and perform inference requests through Wallaroo.
Tutorial Steps
Import Libraries
We start by importing the libraries used for the rest of the tutorial. This includes the Wallaroo SDK used to upload, deploy, and infer on LLMs in Wallaroo.
import json
import os
import base64
import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture
import pyarrow as pa
import numpy as np
import pandas as pd
Connect to the Wallaroo Instance
The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client()
command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client()
. For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
BYOP Overview
This BYOP model takes a text prompt and returns a text output generated by the vLLM.
BYOP Implementation Details
The sample LLM is packaged in the Wallaroo Arbitrary Python (aka Bring Your Own Predict, or BYOP) framework, which allows for LLM deployment with customized user parameters and behaviors.
Llama 3 8B Instruct is used for this example of deploying a vLLM.
To run vLLM on CUDA, vLLM is installed using the subprocess library in Python, straight into the Python BYOP code:

import subprocess
import sys

pip_command = (
    f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
)

subprocess.check_call(pip_command, shell=True)
The model is loaded via the BYOP's _load_model method, which sets the model weights found here.

def _load_model(self, model_path):
    llm = LLM(
        model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
    )
    return llm
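The text generation itself happens in the BYOP's prediction method, which is not reproduced in this tutorial. The following is a minimal sketch of what it might look like, assuming the standard vLLM generate API and Wallaroo's BYOP convention of passing inputs and outputs as dictionaries of numpy arrays keyed by the schema field names; the sampling parameters shown are illustrative only.

import numpy as np
from vllm import SamplingParams

def _predict(self, input_data):
    # Assumption: input_data["text"] is a numpy array of prompt strings,
    # matching the "text" field in the input schema defined below.
    prompts = input_data["text"].tolist()

    # Illustrative sampling parameters; tune temperature, max_tokens, etc. as needed.
    sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

    # Assumption: self.model holds the LLM instance returned by _load_model above.
    results = self.model.generate(prompts, sampling_params)
    generated = [r.outputs[0].text for r in results]

    # Return a dictionary keyed by the output schema field name.
    return {"generated_text": np.array(generated)}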
Upload Model
Before uploading, we define the input and output schemas in Apache PyArrow format. For this example, the input and output schemas are converted to base64 in preparation for uploading via the Wallaroo MLOps API.
input_schema = pa.schema([
pa.field("text", pa.string()),
])
base64.b64encode(
bytes(input_schema.serialize())
).decode("utf8")
'/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA'
output_schema = pa.schema([
pa.field("generated_text", pa.string()),
])
base64.b64encode(
bytes(output_schema.serialize())
).decode("utf8")
'/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA='
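As an optional sanity check, the base64 string can be decoded back into a PyArrow schema to confirm the round trip. A minimal sketch, with the encoded variable name introduced only for illustration:

# Round-trip check: encode the schema, decode the base64 string, and
# read it back through PyArrow IPC to confirm it matches the original.
encoded = base64.b64encode(bytes(output_schema.serialize())).decode("utf8")
decoded_schema = pa.ipc.read_schema(pa.py_buffer(base64.b64decode(encoded)))
assert decoded_schema.equals(output_schema)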
Upload via the Wallaroo MLOps API
For this tutorial we upload the LLM using the Wallaroo MLOps API endpoint /v1/api/models/upload_and_convert
, providing the following:
- token: The authentication bearer token.
- hostname: The hostname of the Wallaroo instance the LLM is uploaded to.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer {token}" \
-F 'metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": <your-workspace-id>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream" \
https://{hostname}/v1/api/models/upload_and_convert | cat
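For reference, the same upload can also be sketched with the Python requests library. This mirrors the curl command above; the token, hostname, workspace id, and base64 schema strings are placeholders to fill in for your environment.

import json
import requests

# Placeholder values; replace with your own environment's details.
token = "<your-bearer-token>"
hostname = "<your-wallaroo-hostname>"

metadata = {
    "name": "byop-llama-8b-v2",
    "visibility": "private",
    "workspace_id": "<your-workspace-id>",
    "conversion": {
        "framework": "custom",
        "python_version": "3.8",
        "requirements": []
    },
    "input_schema": "<base64 input schema from above>",
    "output_schema": "<base64 output schema from above>",
}

with open("byop-llama3-8b-instruct-vllm.zip", "rb") as f:
    response = requests.post(
        f"https://{hostname}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": ("byop-llama3-8b-instruct-vllm.zip", f, "application/octet-stream"),
        },
    )

print(response.status_code)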
Retrieve the Model
Once uploaded and ready for deployment, the model is retrieved through the list_models method. The most recent version of the model is saved to the model variable for later use.
model = wl.list_models()[0].versions()[-1]
model
Deploy LLM
Deploying the Llama 3 8B Instruct LLM follows these steps:
- Set the deployment configuration: This sets what resources are allocated from the cluster for the LLM's exclusive use. For this example, the following resources are allocated to the LLM:
  - CPUs: 4
  - RAM: 10 Gi
  - GPUs: 1
- Deploy the LLM: In this phase, the LLM is added to a Wallaroo Pipeline as a model step, then deployed with the deployment configuration.
Once the model is deployed, it is ready for inference requests.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(model, 4) \
.sidekick_memory(model, '10Gi') \
.sidekick_gpus(model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
pipeline = wl.build_pipeline("vllm-pipe-v9")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
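Deployment can take several minutes while the model weights load onto the GPU. A small polling sketch such as the following can wait until the pipeline reports a Running status before sending inference requests; it assumes pipeline.status() returns a dictionary with a top-level 'status' key.

import time

# Poll the deployment status until it reports Running (or give up after ~10 minutes).
for _ in range(60):
    if pipeline.status().get("status") == "Running":
        break
    time.sleep(10)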
Inference Requests
Inference requests are submitted to deployed LLMs in Wallaroo either as pandas DataFrames or Apache Arrow tables.
For this example, a pandas DataFrame is submitted with one column:

- text: The question asked of the LLM.
data = pd.DataFrame({"text": ["Tell me about XAI."]})
The request is submitted to the deployed LLM, and the generated_text
field contains the output.
import time
start = time.time()
result = pipeline.infer(data, timeout=10000)
end = time.time()
end - start
result
result["out.generated_text"].values[0]
Undeploy
With the tutorial complete, we undeploy the LLM and return the resources to the cluster.
pipeline.undeploy()