LLM Inference with vLLM and llama.cpp

Large Language Models (LLMs) are the go-to solution for Natural Language Processing (NLP), which drives the need for efficient and scalable deployment solutions. Llama.cpp and vLLM are two versatile tools for optimizing LLM deployments, each addressing different pitfalls of serving LLMs.
  • Llama.cpp is known for its portability and efficiency. It is designed to run well on both CPUs and GPUs without requiring specialized hardware.
  • vLLM stands out for its user-friendliness, rapid inference speeds, and high throughput.
For access to these sample models and a demonstration on using LLMs with Wallaroo:

Key Benefits of vLLM

vLLM delivers the following competitive features:

  • Ease of use: User-friendliness is one of vLLM’s primary design goals, making it accessible to developers with different levels of expertise. vLLM provides a straightforward setup and configuration process for quick development.
  • High Performance: vLLM is optimized for high performance, leveraging advanced techniques such as:
    • PagedAttention, which manages the attention key/value cache in fixed-size blocks to reduce wasted GPU memory and allow larger batches.
    • Tensor Parallelism, which distributes computations across multiple GPUs. This results in faster responses and higher throughput, making vLLM well suited to demanding applications.
  • Scalability: vLLM is built with scalability in mind and can deploy an LLM on a single GPU or across multiple GPUs. This makes it suitable for both small-scale and large-scale deployments.
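
To make the PagedAttention idea more concrete, here is a toy sketch of the bookkeeping behind it: the cache is split into fixed-size blocks, and each sequence keeps a block table that maps it to whichever physical blocks it was given, so memory is claimed on demand instead of being reserved for the maximum length up front. This is an illustration only, not vLLM's implementation; the class name, method names, and block sizes are all hypothetical.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical names, not vLLM's API)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one more token; return its physical block."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")  # three tokens occupy two blocks
cache.append_token("request-b")      # one token occupies one block
print(len(cache.free_blocks))
cache.free("request-a")              # blocks become reusable by other requests
print(len(cache.free_blocks))
```

Because blocks are handed out one at a time, a short request never holds memory it does not use, which is what lets more requests share the same GPU.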

Comparison Between Llama.cpp and vLLM

vLLM and Llama.cpp differ in the following respects.

Performance Metrics

According to benchmark results listed by vLLM, vLLM outperforms Llama.cpp on several key metrics:

  • Iterations: vLLM completes more iterations within the same time frame, indicating higher throughput.
  • Requests Per Minute: vLLM handles more requests per minute, demonstrating its efficiency in processing multiple requests.
  • Latency: vLLM exhibits lower latency, resulting in faster response times.
  • Total Tokens Processed (PP+TG/s): vLLM processes more tokens per second, counting both prompt processing (PP) and text generation (TG).
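
For readers who want to measure the same quantities on their own deployment, the sketch below shows one way to derive them from per-request timing records. The record fields and function name are hypothetical, the sample numbers are made up, and the definitions are one reasonable reading of the metric names rather than necessarily those used in the benchmark.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:  # hypothetical record of one completed request
    start: float           # seconds, when the request was sent
    end: float             # seconds, when the full response arrived
    prompt_tokens: int     # tokens processed from the prompt (PP)
    generated_tokens: int  # tokens generated in the response (TG)


def summarize(records):
    """Compute throughput and latency figures over the benchmark window."""
    window = max(r.end for r in records) - min(r.start for r in records)
    total_tokens = sum(r.prompt_tokens + r.generated_tokens for r in records)
    return {
        "iterations": len(records),
        "requests_per_minute": len(records) / window * 60,
        "mean_latency_s": mean(r.end - r.start for r in records),
        "tokens_per_second": total_tokens / window,  # PP+TG/s
    }


# Made-up sample data: two overlapping requests.
records = [
    RequestRecord(start=0.0, end=2.0, prompt_tokens=50, generated_tokens=150),
    RequestRecord(start=1.0, end=4.0, prompt_tokens=40, generated_tokens=160),
]
print(summarize(records))
```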

Hardware Utilization

Llama.cpp is optimized for CPU usage and can run on consumer-grade hardware; vLLM leverages GPU acceleration to achieve higher performance. This makes vLLM more suitable for environments with access to powerful GPUs, whereas Llama.cpp is ideal for scenarios where GPU resources are limited.

Ease of Setup

vLLM offers a more straightforward setup process compared to Llama.cpp. Its user-friendly configuration and comprehensive API support make it easier for developers to get started and integrate LLM capabilities into their applications.

Customization and Flexibility

Llama.cpp provides extensive customization options, allowing developers to fine-tune various parameters to suit specific needs. In contrast, vLLM focuses on ease of use and performance, offering a more streamlined experience with fewer customization requirements.

Deployment in Wallaroo

Both Llama.cpp and vLLM are deployed in Wallaroo using the Custom Model or Bring Your Own Predict (BYOP) framework.

Deploying Llama.cpp in Wallaroo

Deploying Llama.cpp with the Wallaroo BYOP framework requires llama-cpp-python. This example uses Llama 3 70B Instruct Q5_K_M for testing and deploying Llama.cpp.

Llama.cpp BYOP Implementation Details

  1. To run llama-cpp-python on GPU, the package is installed with CUDA support using Python's subprocess library, directly in the BYOP Python code:

    import subprocess
    import sys
    
    pip_command = (
        f'CMAKE_ARGS="-DLLAMA_CUDA=on" {sys.executable} -m pip install llama-cpp-python'
    )
    
    subprocess.check_call(pip_command, shell=True)
    
  2. The model is loaded via the BYOP’s _load_model method, which sets a 4096-token context window (n_ctx=4096) and offloads all of the model’s layers to the GPU (n_gpu_layers=-1).

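    from llama_cpp import Llama  # importable once the install step above has run

    def _load_model(self, model_path):
        llm = Llama(
            model_path=f"{model_path}/artifacts/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",
            n_ctx=4096,  # context window size, in tokens
            n_gpu_layers=-1,  # -1 offloads every layer to the GPU
            logits_all=True,  # keep logits for every token, not only the last
        )
    
        return llm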
    
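  3. Because the chosen model is an instruct variant, the prompt is built as a list of chat messages (a system message followed by the user's prompt) and passed to create_chat_completion.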

    messages = [
        {
            "role": "system",
            "content": "You are a generic chatbot, try to answer questions the best you can.",
        },
        {"role": "user", "content": prompt},
    ]
    
    result = self.model.create_chat_completion(
        messages=messages, max_tokens=1024, stop=["<|eot_id|>"]
    )
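
The prompt-building step can be factored into two small helpers, sketched below under the assumption that the result follows the OpenAI-style chat completion layout that llama-cpp-python returns (the text sits at choices[0]["message"]["content"]). The helper names are hypothetical, and the sample result is hand-written for illustration rather than real model output.

```python
SYSTEM_PROMPT = "You are a generic chatbot, try to answer questions the best you can."


def build_messages(prompt, system_prompt=SYSTEM_PROMPT):
    """Wrap a raw user prompt in the system/user chat structure."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]


def extract_reply(result):
    """Pull the assistant text out of a chat completion result dict."""
    return result["choices"][0]["message"]["content"].strip()


# Hand-written stand-in for the dict returned by create_chat_completion.
fake_result = {
    "choices": [{"message": {"role": "assistant", "content": " Hello! "}}]
}
print(build_messages("Hi there")[1])
print(extract_reply(fake_result))
```

Keeping these steps separate from the model call makes them easy to unit test without loading a 70B model.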
    

Llama.cpp Deployment Details

The deployment configuration sets the resources allocated to the Llama.cpp LLM. For this example, the Llama.cpp LLM is allocated 8 CPUs, 10 Gi of RAM, and 1 GPU.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 8) \
    .sidekick_memory(model, '10Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()

Deploying vLLM in Wallaroo

vLLM BYOP Implementation Details

Llama 3 8B Instruct is used for this example of deploying an LLM with vLLM.

  1. To run vLLM on CUDA, vLLM is installed using Python's subprocess library, directly in the BYOP Python code:

    import subprocess
    import sys
    
    pip_command = (
        f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
    )
    
    subprocess.check_call(pip_command, shell=True)
    
  2. The model is loaded via the BYOP’s _load_model method, which points vLLM at the directory containing the model weights.

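    from vllm import LLM  # importable once the install step above has run

    def _load_model(self, model_path):
        llm = LLM(
            # directory holding the Meta-Llama-3-8B-Instruct weights
            model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
        )
    
        return llm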
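
Both BYOP examples install their inference library with pip from inside the Python code. One possible refinement, sketched here, is to skip the install when the package is already importable and to pass build flags through the environment instead of a shell string. The function names are hypothetical; the package name and CMAKE_ARGS value in the commented call are taken from the Llama.cpp example.

```python
import importlib.util
import os
import subprocess
import sys


def build_install_command(requirement, extra_args=()):
    """Return the pip command as an argument list (no shell needed)."""
    return [sys.executable, "-m", "pip", "install", requirement, *extra_args]


def ensure_installed(module_name, requirement, extra_env=None, extra_args=()):
    """Install `requirement` only if `module_name` cannot be imported yet."""
    if importlib.util.find_spec(module_name) is not None:
        return False  # already available, nothing to do
    env = {**os.environ, **(extra_env or {})}
    subprocess.check_call(build_install_command(requirement, extra_args), env=env)
    return True


# A standard-library module is always importable, so no install is attempted.
print(ensure_installed("json", "json"))

# For the Llama.cpp example this might be called as:
# ensure_installed("llama_cpp", "llama-cpp-python",
#                  extra_env={"CMAKE_ARGS": "-DLLAMA_CUDA=on"})
```

Passing an argument list avoids shell quoting issues, and the importability check keeps repeated loads of the same code from reinstalling the package.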
    

vLLM Deployment Details

The deployment configuration sets the resources allocated to the vLLM deployment. For this example, vLLM is allocated 4 CPUs, 10 Gi of RAM, and 1 GPU.

deployment_config = DeploymentConfigBuilder() \
    .cpus(1).memory('2Gi') \
    .sidekick_cpus(model, 4) \
    .sidekick_memory(model, '10Gi') \
    .sidekick_gpus(model, 1) \
    .deployment_label("wallaroo.ai/accelerator:a100") \
    .build()

Conclusions

Both Llama.cpp and vLLM, deployed in Wallaroo, have the potential to become industry standards. Llama.cpp brings portability and efficiency, and is designed to run well on CPUs and GPUs without requiring specialized hardware. vLLM brings user-friendliness, rapid inference speeds, and high throughput, making it an excellent choice for projects that prioritize speed and performance.

Tutorials

The following tutorials demonstrate deploying Llama.cpp and vLLM in Wallaroo.
