Llama 3 8B Instruct with vLLM


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

Llama 3 8B Instruct Inference with vLLM

The following tutorial demonstrates deploying the Llama 3 8B Instruct LLM with vLLM in Wallaroo. This tutorial focuses on:

  • Uploading the model.
  • Preparing the model for deployment.
  • Deploying the model and performing inferences.

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo Support Representative.

Context

This BYOP model uses the vLLM library and the Llama 3 8B Instruct LLM. Once deployed, it accepts a text prompt from the user and generates a text response appropriate to the prompt.

What is vLLM?

vLLM is a serving engine designed to enhance the efficiency and performance of deploying large language models (LLMs). It stands out for its novel attention algorithm known as PagedAttention, which organizes attention keys and values into smaller, manageable segments, significantly reducing memory usage and boosting throughput compared to traditional methods.

One of the key advantages of vLLM is its ability to achieve much higher throughput: up to 24 times greater than HuggingFace Transformers, a widely-used LLM library. This capability allows for serving a larger number of users with fewer computational resources, making vLLM an attractive option for organizations looking to optimize their LLM deployments.
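
For reference only, the following is a minimal sketch of using vLLM directly, outside of Wallaroo; the model identifier and sampling settings are illustrative assumptions rather than values from this tutorial.

from vllm import LLM, SamplingParams

# load the model and generate a completion with vLLM's offline API
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumed model identifier
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Tell me about XAI."], params)
print(outputs[0].outputs[0].text)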

Model Overview

The LLM used in this demonstration is the Meta Llama 3 8B Instruct model, served with vLLM.

Implementation Details

For our sample vLLM, the original model is encapsulated in the Wallaroo BYOP framework with the following adjustments.

vLLM Library Installation

To run vLLM on CUDA, a specific vLLM Python wheel is installed, along with an extra index URL for the matching CUDA build of its dependencies. To accommodate this, the following pip install command is executed directly in the BYOP Python script via the subprocess library:

import subprocess
import sys

# install the CUDA 11.8 build of vLLM 0.5.2 from the project's release wheel,
# pulling CUDA-matched dependencies from the PyTorch cu118 index
pip_command = (
    f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
)

# run the command in a shell; check_call raises an error if the install fails
subprocess.check_call(pip_command, shell=True)

Model Loading

Loading the model uses vLLM with the original model weights, which are found on the Llama 3 8B Instruct model page.

def _load_model(self, model_path):
    # vLLM's LLM class loads the original Meta-Llama-3-8B-Instruct weights
    # included in the BYOP artifacts directory
    llm = LLM(
        model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
    )

    return llm
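
The BYOP script also implements the inference entry point, which is not shown in this tutorial. The following is a minimal sketch of what that method might look like, assuming prompts arrive as a numpy array under the text field, that self.model holds the LLM object returned by _load_model, and illustrative sampling parameters; consult the Wallaroo BYOP documentation for the exact interface.

import numpy as np
from vllm import SamplingParams

def _predict(self, input_data):
    # prompts are assumed to arrive as an array of strings under the "text" key
    prompts = [str(p) for p in input_data["text"]]

    # illustrative sampling settings; not taken from the tutorial
    params = SamplingParams(temperature=0.8, max_tokens=256)
    outputs = self.model.generate(prompts, params)

    # each RequestOutput holds one or more completions; keep the first
    generated = np.array([output.outputs[0].text for output in outputs])
    return {"generated_text": generated}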

Tutorial Steps

Import Libraries

We start by importing the required libraries. This includes the following:

  • Wallaroo SDK: Used to upload and deploy the model in Wallaroo.
  • pyarrow: Used to define the model's input and output schemas in Apache Arrow format.
  • pandas: Data is submitted to models deployed in Wallaroo in either Apache Arrow Table format or as a pandas DataFrame in pandas Record format.

import json
import os
import base64

import wallaroo
from wallaroo.pipeline   import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

A connection to Wallaroo is set through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

Upload the Model

For this example, the model is uploaded via the Wallaroo MLOps API. To save time, we use the Wallaroo SDK method wallaroo.client.Client.generate_upload_model_api_command, which generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API and takes the following parameters:

  • base_url (String, Required): The Wallaroo domain name. For example: wallaroo.example.com.
  • name (String, Required): The name to assign the model at upload. This must match DNS naming conventions.
  • path (String, Required): Path to the ML or LLM model file.
  • framework (String, Required): The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models.
  • input_schema (String, Required): The model’s input schema in pyarrow.Schema format.
  • output_schema (String, Required): The model’s output schema in pyarrow.Schema format.

This generates an output similar to the following, which is used to upload the model via the Wallaroo MLOps API.

curl --progress-bar -X POST \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer abcdefg" \
    -F "metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": 8, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" \
    -F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream"
    https://doc-test.wallarooexample.ai/v1/api/models/upload_and_convert
# define the input and output schemas

import wallaroo.framework

input_schema = pa.schema([
    pa.field("text", pa.string()),
])

output_schema = pa.schema([
    pa.field("generated_text", pa.string()),
])

# generate the curl command and execute it

wl.generate_upload_model_api_command(
    base_url=wl.api_endpoint,
    name="byop-llama-8b-v2",
    path="byop-llama3-8b-instruct-vllm.zip",
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
'curl --progress-bar -X POST            -H "Content-Type: multipart/form-data"            -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICI1aUdMclZ1NVluOE1nOU5xSDQtZGdJRXBQQTJqbVRYMHFaWlJQYXZpS2tJIn0.eyJleHAiOjE3Mjk2MTUyOTksImlhdCI6MTcyOTYxNTIzOSwianRpIjoiYTk1MjYzMTAtYzY0Mi00ZTA2LWJkZGMtNDgyM2YwYWI1YWNhIiwiaXNzIjoiaHR0cHM6Ly9kb2MtdGVzdC53YWxsYXJvb2NvbW11bml0eS5uaW5qYS9hdXRoL3JlYWxtcy9tYXN0ZXIiLCJhdWQiOlsibWFzdGVyLXJlYWxtIiwiYWNjb3VudCJdLCJzdWIiOiIzZWVhYjU1NC1mYzJlLTQxYWMtOGI0ZS0wZDc3OGU4YTQ3MWIiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJzZGstY2xpZW50Iiwic2Vzc2lvbl9zdGF0ZSI6ImI0NWI4ZmRjLWNmY2YtNGQ3ZC04NmVhLTU2MTJjNjY2NDBmNSIsImFjciI6IjEiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiY3JlYXRlLXJlYWxtIiwiZGVmYXVsdC1yb2xlcy1tYXN0ZXIiLCJvZmZsaW5lX2FjY2VzcyIsImFkbWluIiwidW1hX2F1dGhvcml6YXRpb24iXX0sInJlc291cmNlX2FjY2VzcyI6eyJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsidmlldy1yZWFsbSIsInZpZXctaWRlbnRpdHktcHJvdmlkZXJzIiwibWFuYWdlLWlkZW50aXR5LXByb3ZpZGVycyIsImltcGVyc29uYXRpb24iLCJjcmVhdGUtY2xpZW50IiwibWFuYWdlLXVzZXJzIiwicXVlcnktcmVhbG1zIiwidmlldy1hdXRob3JpemF0aW9uIiwicXVlcnktY2xpZW50cyIsInF1ZXJ5LXVzZXJzIiwibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hdXRob3JpemF0aW9uIiwibWFuYWdlLWNsaWVudHMiLCJxdWVyeS1ncm91cHMiXX0sImFjY291bnQiOnsicm9sZXMiOlsibWFuYWdlLWFjY291bnQiLCJtYW5hZ2UtYWNjb3VudC1saW5rcyIsInZpZXctcHJvZmlsZSJdfX0sInNjb3BlIjoicHJvZmlsZSBlbWFpbCBvcGVuaWQiLCJzaWQiOiJiNDViOGZkYy1jZmNmLTRkN2QtODZlYS01NjEyYzY2NjQwZjUiLCJlbWFpbF92ZXJpZmllZCI6dHJ1ZSwiaHR0cHM6Ly9oYXN1cmEuaW8vand0L2NsYWltcyI6eyJ4LWhhc3VyYS11c2VyLWlkIjoiM2VlYWI1NTQtZmMyZS00MWFjLThiNGUtMGQ3NzhlOGE0NzFiIiwieC1oYXN1cmEtdXNlci1lbWFpbCI6ImpvaG4uaGFuc2FyaWNrQHdhbGxhcm9vLmFpIiwieC1oYXN1cmEtZGVmYXVsdC1yb2xlIjoiYWRtaW5fdXNlciIsIngtaGFzdXJhLWFsbG93ZWQtcm9sZXMiOlsidXNlciIsImFkbWluX3VzZXIiXSwieC1oYXN1cmEtdXNlci1ncm91cHMiOiJ7fSJ9LCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJqb2huLmhhbnNhcmlja0B3YWxsYXJvby5haSIsImVtYWlsIjoiam9obi5oYW5zYXJpY2tAd2FsbGFyb28uYWkifQ.L4i8bVauByo8eb0j7-KDUyPrvZSUKUm4_smh1SIb3WtyERVwHY0-qcKPewJtFK16KnRheUDfhZ60Z-mPQUasTVQkZMajElEq0cEOAbTsXHAvi9kgQbUYkpoKOaDcBuylrqMwsYe4aACZPav1nGTsq-vWn4mR2YLofm7fd81emCDbm6ufIjjZV38pfFNIVQIH1O_ownjATRoYx2Lt7j1kGpOz3AF4EZsD6gPNBqlnxVubTpq144ymX9J5Etq5zdiIfhaOsYMre_FzhZYllIYItDc9hJ0B6ROpd9vawHmqCXxBj7Mn5O62Q9Qesh4C9t8KN4egIhTTkWeSuJHgi7yLoQ"            -F "metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": 8, "conversion": {"arch": "x86", "accel": "none", "framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json"            -F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream"        https://doc-test.wallarooexample.ai/v1/api/models/upload_and_convert'
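
The generated command is run against the Wallaroo MLOps API to upload the model. A minimal sketch of executing it directly from the notebook, assuming the environment has network access to the Wallaroo instance, follows.

import subprocess

# capture the generated curl command and run it in a shell;
# check=True raises an error if the upload request fails
upload_command = wl.generate_upload_model_api_command(
    base_url=wl.api_endpoint,
    name="byop-llama-8b-v2",
    path="byop-llama3-8b-instruct-vllm.zip",
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema
)
subprocess.run(upload_command, shell=True, check=True)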

Retrieve the Model

Once the model is uploaded, we retrieve it through the wallaroo.client.Client.get_model method.

llm = wl.get_model("byop-llama-8b-v2")

Deploy the LLM

The LLM is deployed through the following process:

  • Define the Deployment Configuration: This sets what resources are allocated for the LLM’s use from the clusters.
  • Create a Wallaroo Pipeline and Set the LLM as a Pipeline Step: This sets the process for how inference inputs are passed through the deployed LLM and any supporting ML models.
  • Deploy the LLM: This deploys the LLM with the defined deployment configuration and pipeline steps.

Define the Deployment Configuration

For this step, the following resources are defined for allocation to the LLM when deployed through the class wallaroo.deployment_config.DeploymentConfigBuilder:

  • CPUs: 4
  • Memory: 10 Gi
  • GPUs: 1. When requesting GPUs for the deployment, the deployment_label must be defined to select the appropriate nodepool with the requested GPU resources.

# the cpus and memory settings apply to the Wallaroo engine; the sidekick_*
# settings allocate resources to the LLM container, including a single GPU
deployment_config = wallaroo.deployment_config.DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(llm, 4) \
    .sidekick_memory(llm, '10Gi') \
    .sidekick_gpus(llm, 1) \
    .deployment_label("wallaroo.ai/accelerator:a1002") \
    .build()

Create Pipeline and Steps

In this step, the Wallaroo pipeline is established with the LLM set as the pipeline step.

pipeline = wl.build_pipeline("vllm-pipe-v9")
pipeline.add_model_step(llm)

Deploy the LLM

With the Wallaroo pipeline created and the deployment configuration set, we deploy the LLM and set the deployment configuration to allocate the appropriate resources for the LLM’s exclusive use.

pipeline.deploy(deployment_config=deployment_config)
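
Deployment allocates the requested resources and loads the LLM, which can take several minutes for a model of this size. Optionally, the deployment state can be verified before submitting inference requests by checking the pipeline's status.

# optional: confirm the deployment reports a Running status before inferencing
pipeline.status()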

Inference

Inference requests are submitted to the deployed LLM as either Apache Arrow Tables or pandas DataFrames. For this example, a pandas DataFrame is submitted through the wallaroo.pipeline.Pipeline.infer method.

For this example, the start and end time is collected to determine how long the inference request took.

import time

data = pd.DataFrame({"text": ["Tell me about XAI."]})

# time the inference request
start = time.time()
result = pipeline.infer(data, timeout=10000)
end = time.time()

# elapsed time in seconds
end - start

# display the generated text returned by the LLM
result["out.generated_text"].values[0]

Undeploy the LLM

With the example completed, the LLM is undeployed and the resources returned to the cluster.

pipeline.undeploy()

For access to these sample models and for a demonstration of how to use an LLM Validation Listener, contact your Wallaroo Support Representative.