Quantized Llava 34B with Llama.cpp


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following tutorial demonstrates deploying LLaVA v1.6 34B GGUF, a powerful image-text-to-text model available on Hugging Face, in Wallaroo.

For access to these sample models and for a demonstration:

Key points to consider:

  • The model is based on the LLaVA (Large Language and Vision Assistant) architecture, which combines language understanding with visual perception capabilities.
  • It is quantized and stored in the GGUF (GPT-Generated Unified Format) format for efficient deployment and a reduced memory footprint.
  • The repository contains multiple quantization variants, allowing for flexibility in deployment based on hardware constraints and performance requirements. We have chosen Q5_K_M, the variant that offers higher-precision quantization for improved output quality.
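To make the trade-off between quantization variants more concrete, the sketch below estimates the size of the weights from the parameter count and an average number of bits per weight. The bits-per-weight figures are rough assumptions for illustration only (k-quant variants mix precisions across tensors, so real files differ), and the function name is hypothetical.

```python
# Rough size estimate for quantized weights: parameters * bits-per-weight / 8.
# The bits-per-weight values are approximate assumptions, not official figures;
# real GGUF files also carry metadata and mix precisions across tensors.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weights_gb(n_params: float, variant: str) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[variant]
    return n_params * bits / 8 / 1e9


for variant in APPROX_BITS_PER_WEIGHT:
    print(f"{variant}: ~{estimate_weights_gb(34e9, variant):.1f} GB")
```

Under these assumptions, a 34B-parameter model at Q5_K_M needs roughly 24 GB for the weights alone, before accounting for the CLIP model and the context cache.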

Tutorial Steps

Import Libraries

We start by importing the libraries used for the rest of the tutorial. This includes the Wallaroo SDK used to upload, deploy, and infer on LLMs in Wallaroo.

import json
import os
import base64

import wallaroo
from wallaroo.pipeline import Pipeline
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework

import pyarrow as pa
import numpy as np
import pandas as pd

from PIL import Image

Connect to the Wallaroo Instance

The first step is to connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and is available through the JupyterHub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client()

BYOP Overview

This BYOP model takes a text prompt and an image in array format, and returns an answer to the prompt based on the image. It uses the latest version of the quantized LLaVA 1.6 model found on Hugging Face, which is loaded via Llama.cpp.

There are two artifacts that are being used by the BYOP model: the actual model and the CLIP model for calculating the image embeddings, which can be found here:

Deploying Llama.cpp with the Wallaroo BYOP framework requires llama-cpp-python. This example uses Llama 70B Instruct Q5_K_M for testing and deploying Llama.cpp.

BYOP Implementation Details

The sample LLM is packaged in the Wallaroo Custom Model framework, also known as Bring Your Own Predict (BYOP), which allows for LLM deployment with customized user parameters and behaviors.

  1. To run llama-cpp-python on a GPU, the package is installed with CUDA support from within the BYOP Python code itself, using the subprocess library:

    import subprocess
    import sys
    
    pip_command = (
        f'CMAKE_ARGS="-DLLAMA_CUDA=on" {sys.executable} -m pip install llama-cpp-python'
    )
    
    subprocess.check_call(pip_command, shell=True)
    
  2. The model is loaded via the BYOP’s _load_model method, which sets the context window to 4096 tokens (n_ctx=4096) and offloads all of the model’s layers to the GPU (n_gpu_layers=-1).

    def _load_model(self, model_path):
        llm = Llama(
            model_path=f"{model_path}/artifacts/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",
            n_ctx=4096,
            n_gpu_layers=-1,
            logits_all=True,
        )
    
        return llm
    
  3. Because the chosen model is an instruct variant, the prompt is constructed as a list of chat messages and passed to create_chat_completion.

    messages = [
        {
            "role": "system",
            "content": "You are a generic chatbot, try to answer questions the best you can.",
        },
        {"role": "user", "content": prompt},
    ]
    
    result = self.model.create_chat_completion(
        messages=messages, max_tokens=1024, stop=["<|eot_id|>"]
    )
    
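The excerpts above show a text-only prompt. For an image-text-to-text model such as LLaVA, llama-cpp-python accepts OpenAI-style messages whose user content mixes an image_url part and a text part. The following is a minimal sketch of building such a message list from raw image bytes. It assumes the BYOP code has already encoded the incoming array to PNG or JPEG bytes; the helper names and system prompt are hypothetical, and this is not necessarily how the tutorial's BYOP code is written.

```python
import base64


def image_bytes_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_messages(prompt: str, image_bytes: bytes) -> list:
    """Build an OpenAI-style chat message list with one image and one question."""
    return [
        {
            "role": "system",
            "content": "You are an assistant who describes images accurately.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_bytes_to_data_uri(image_bytes)},
                },
                {"type": "text", "text": prompt},
            ],
        },
    ]


# Placeholder bytes stand in for a real encoded image.
messages = build_vision_messages("What is the animal in the image?", b"\x89PNG placeholder")
print(messages[1]["content"][1]["text"])
```

A list like this would be passed as the messages argument to create_chat_completion on a Llama instance configured with a LLaVA chat handler (for example, llama-cpp-python's Llava16ChatHandler) and the CLIP model file.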

Upload Model

Before uploading, we define the input and output schemas in Apache PyArrow format. For this example we convert the input and output schemas to base64 in preparation for uploading via the Wallaroo MLOps API.

input_schema = pa.schema([
    pa.field('text', pa.string()),
    pa.field('image', pa.list_(pa.list_(pa.list_(pa.uint8()))))
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string())
])
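# Serialize each schema and base64-encode it for the MLOps API metadata.
input_schema_b64 = base64.b64encode(bytes(input_schema.serialize())).decode("utf8")
output_schema_b64 = base64.b64encode(bytes(output_schema.serialize())).decode("utf8")

print(input_schema_b64)
print(output_schema_b64)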

Upload via the Wallaroo MLOps API

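For this tutorial we upload the LLM using the Wallaroo MLOps API endpoint /v1/api/models/upload_and_convert. In the request below, replace the {token}, workspace ID, and {hostname} placeholders with the values for your Wallaroo instance. The input_schema and output_schema fields carry the base64-encoded schemas generated above.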

curl --progress-bar -X POST \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer {token}" \
  -F 'metadata={"name": "llava-llamacpp-gpu", "visibility": "private", "workspace_id": <your-workspace-id>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": []}, "input_schema": "/////ygBAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAADMAAAABAAAAEz///8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAFAAAAaW1hZ2UAAABA////eP///wAAAQwUAAAAHAAAAAQAAAABAAAAFAAAAAQAAABpdGVtAAAAAGz///+k////AAABDBQAAAAcAAAABAAAAAEAAAAUAAAABAAAAGl0ZW0AAAAAmP///9D///8AAAECEAAAABwAAAAEAAAAAAAAAAQAAABpdGVtAAAGAAgABAAGAAAACAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdGV4dAAAAAAEAAQABAAAAA==", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
  -F "file=@byop_llava.zip;type=application/octet-stream" \
  https://{hostname}/v1/api/models/upload_and_convert | cat
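Writing the metadata field by hand is error-prone because of the nested quoting. As a sketch, the same JSON can be assembled in Python and pasted into the request. The workspace ID and schema strings here are hypothetical placeholders to be replaced with real values.

```python
import json

# Hypothetical placeholder values; substitute the real workspace ID and the
# base64 schema strings generated earlier.
workspace_id = 1
input_schema_b64 = "BASE64_INPUT_SCHEMA"
output_schema_b64 = "BASE64_OUTPUT_SCHEMA"

metadata = {
    "name": "llava-llamacpp-gpu",
    "visibility": "private",
    "workspace_id": workspace_id,
    "conversion": {
        "framework": "custom",
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": input_schema_b64,
    "output_schema": output_schema_b64,
}

# This string is the value of the metadata form field in the upload request.
metadata_json = json.dumps(metadata)
print(metadata_json)
```

Building the payload with json.dumps guarantees valid JSON, including a numeric workspace_id, which is easy to get wrong when editing the string manually.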

Retrieve the Model

Once the model is uploaded and ready for deployment, we retrieve it with the list_models method, select its most recent version, and save it to the model variable for later use. The code below assumes the uploaded model is the first entry returned by list_models.

model = wl.list_models()[0].versions()[-1]
model

Deploy LLM

Deploying the Llama.cpp LLM follows these steps:

  • Set the deployment configuration: This sets what resources are allocated from the cluster for the LLM's exclusive use. For this example, the following resources are allocated to the LLM:
    • CPUs: 8
    • RAM: 30 Gi
    • GPUs: 1
  • Deploy the LLM: In this phase, the LLM is added to a Wallaroo Pipeline as a model step, then deployed with the deployment configuration.

Once the model is deployed, it is ready for inference requests.

# create the deployment configuration
deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 8) \
    .sidekick_memory(model, "30Gi") \
    .sidekick_gpus(model, 1) \
    .build()

# add the LLM to a Wallaroo pipeline
pipeline = wl.build_pipeline("llavacpp-pipeline-v3")
pipeline.add_model_step(model)

# deploy the model with the deployment configuration
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()

Inference Requests

Inference requests are submitted to deployed LLMs in Wallaroo as either pandas DataFrames or Apache Arrow tables.

For this example, a pandas DataFrame is submitted with two columns:

  • text: The question asked of the LLM.
  • image: An image converted into a numpy array.

im = Image.open('bear.jpeg')
image = np.array(im)
data = pd.DataFrame({'text': ['What is the animal in the image?'], 
                     'image': [image.tolist()]})
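Because the input schema declares image as a three-level list of uint8 values, it can help to check the array before building the DataFrame; a grayscale image, for instance, loads as a 2-D array. The helper below is a hypothetical sketch of such a check, using a synthetic array in place of a real image.

```python
import numpy as np


def image_to_nested_list(image: np.ndarray) -> list:
    """Validate an image array and convert it to nested Python lists."""
    if image.ndim != 3:
        raise ValueError(
            f"expected a 3-D array (height, width, channels), got {image.ndim}-D"
        )
    if image.dtype != np.uint8:
        raise ValueError(f"expected dtype uint8, got {image.dtype}")
    return image.tolist()


# A synthetic 2x2 RGB image stands in for np.array(Image.open(...)).
sample = np.zeros((2, 2, 3), dtype=np.uint8)
nested = image_to_nested_list(sample)
print(len(nested), len(nested[0]), len(nested[0][0]))
```

If an image fails the dimensionality check, calling im.convert('RGB') before np.array(im) is one way to obtain three channels.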

The request is submitted to the deployed LLM, and the out.generated_text column of the result contains the output.

res = pipeline.infer(data, timeout=10000)
res["out.generated_text"].values

Undeploy

With the tutorial complete, we undeploy the LLM and return its resources to the cluster.

pipeline.undeploy()