Llama 3 8B Instruct with vLLM


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.


The following tutorial demonstrates deploying a Llama 3 8B Instruct LLM with the vLLM library. This LLM accepts a text prompt from a user and generates a text response.

For access to these sample models and for a demonstration, contact your Wallaroo Support Representative or schedule a Wallaroo.AI demo.

This tutorial shows how to upload the vLLM, deploy it, and perform inference requests through Wallaroo.

Tutorial Steps

Import Libraries

We start by importing the libraries used for the rest of the tutorial. This includes the Wallaroo SDK used to upload, deploy, and infer on LLMs in Wallaroo.

import json
import os
import base64

import wallaroo
from wallaroo.pipeline   import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Architecture

import pyarrow as pa
import numpy as np
import pandas as pd

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

BYOP Overview

This BYOP model takes a text prompt and returns a text output generated by the vLLM.

BYOP Implementation Details

The sample LLM is packaged in the Wallaroo Arbitrary Python, aka Bring Your Own Predict (BYOP), framework, which allows LLM deployment with customized user parameters and behaviors.

Llama 3 8B Instruct is used for this example of deploying a vLLM.

  1. To run vLLM on CUDA, vLLM is installed via the subprocess library directly in the Python BYOP code:

    import subprocess
    import sys
    
    pip_command = (
        f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
    )
    
    subprocess.check_call(pip_command, shell=True)
    
  2. The model is loaded via the BYOP's _load_model method, which points to the Meta-Llama-3-8B-Instruct model weights stored in the model's artifacts directory. A sketch of a matching _predict method follows this list.

    def _load_model(self, model_path):
        llm = LLM(
            model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
        )
    
        return llm
    

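For reference, the following is a minimal sketch of a BYOP _predict method that passes incoming prompts to the vLLM and maps the results to the output schema. The SamplingParams values and list handling here are illustrative assumptions rather than the exact tutorial code; the text and generated_text field names match the schemas defined in the next section.

import numpy as np
from vllm import SamplingParams

def _predict(self, input_data):
    # Prompts arrive under the "text" field defined in the input schema
    prompts = [str(p) for p in input_data["text"]]

    # Illustrative sampling settings; adjust temperature and max_tokens as needed
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

    # self.model is the LLM object returned by _load_model
    results = self.model.generate(prompts, sampling_params)

    # Return a dictionary keyed by the output schema field names
    return {"generated_text": np.array([r.outputs[0].text for r in results])}
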
Upload Model

Before uploading, we define the input and output schemas in Apache PyArrow format. For this example, we convert the input and output schemas to base64 in preparation for uploading via the Wallaroo MLOps API.

input_schema = pa.schema([
    pa.field("text", pa.string()),
])

base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
'/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA'
output_schema = pa.schema([
    pa.field("generated_text", pa.string()),
])

base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
'/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA='

Upload via the Wallaroo MLOps API

For this tutorial we upload the LLM using the Wallaroo MLOps API endpoint /v1/api/models/upload_and_convert, providing the following:

  • metadata: The model name, visibility, workspace id, conversion settings (the custom framework for BYOP models, the Python version, and any additional requirements), and the base64-encoded input and output schemas.
  • file: The BYOP zip file byop-llama3-8b-instruct-vllm.zip containing the model code and artifacts.

curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer {token}" \
  -F 'metadata={"name": "byop-llama-8b-v2", "visibility": "private", "workspace_id": <your-workspace-id>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
  -F "file=@byop-llama3-8b-instruct-vllm.zip;type=application/octet-stream" \
  https://{hostname}/v1/api/models/upload_and_convert | cat
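
The same upload can be performed from Python with the requests library. The following is a sketch under the assumption that token, hostname, and workspace_id hold the same values used in the curl command above, and that input_schema and output_schema are the PyArrow schemas defined earlier.

import base64
import json
import requests

# Metadata mirrors the curl example; the schemas are base64 encoded as shown above
metadata = {
    "name": "byop-llama-8b-v2",
    "visibility": "private",
    "workspace_id": workspace_id,  # assumption: your workspace id
    "conversion": {"framework": "custom", "python_version": "3.8", "requirements": []},
    "input_schema": base64.b64encode(bytes(input_schema.serialize())).decode("utf8"),
    "output_schema": base64.b64encode(bytes(output_schema.serialize())).decode("utf8"),
}

response = requests.post(
    f"https://{hostname}/v1/api/models/upload_and_convert",  # assumption: your Wallaroo hostname
    headers={"Authorization": f"Bearer {token}"},             # assumption: your bearer token
    files={
        "metadata": (None, json.dumps(metadata), "application/json"),
        "file": (
            "byop-llama3-8b-instruct-vllm.zip",
            open("byop-llama3-8b-instruct-vllm.zip", "rb"),
            "application/octet-stream",
        ),
    },
)
print(response.json())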

Retrieve the Model

Once the model is uploaded and ready for deployment, it is retrieved through the list_models method. We take the most recent version of the model and save it to the model variable for later use.

model = wl.list_models()[0].versions()[-1]
model

Deploy LLM

Deploying the Llama 3 8B Instruct vLLM follows these steps:

  • Set the deployment configuration: This sets what resources are allocated from the cluster for the LLM's exclusive use. For this example, the following resources are allocated to the LLM:
    • CPUs: 4
    • RAM: 10 Gi
    • GPUs: 1
  • Deploy the LLM: In this phase, the LLM is added to a Wallaroo Pipeline as a model step, then deployed with the deployment configuration.

Once the model is deployed, it is ready for inference requests.

# Allocate engine resources plus dedicated sidekick resources (CPUs, RAM, GPU) for the LLM container
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a1002") \
    .build()

# Build the pipeline, add the LLM as a model step, and deploy with the configuration
pipeline = wl.build_pipeline("vllm-pipe-v9")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()

Inference Requests

Inference requests are submitted to deployed LLMs in Wallaroo either as pandas DataFrames, or Apache Arrow Tables.

For this example, a pandas DataFrame is submitted with a single column:

  • text: The question asked of the LLM.
data = pd.DataFrame({"text": ["Tell me about XAI."]})

The request is submitted to the deployed LLM, and the generated_text field contains the output.

import time

# Time the inference request against the deployed LLM
start = time.time()
result = pipeline.infer(data, timeout=10000)
end = time.time()

end - start
result

# The generated response is returned in the out.generated_text column
result["out.generated_text"].values[0]

Undeploy

With the tutorial complete, we undeploy the LLM and return the resources back to the cluster.

pipeline.undeploy()