LLM Inference with vLLM and llama.cpp
- Llama.cpp is known for its portability and efficiency; it is designed to run well on CPUs and GPUs without requiring specialized hardware.
- vLLM shines with its emphasis on user-friendliness, rapid inference speeds, and high throughput.
Key Benefits of vLLM
vLLM delivers the following competitive features:
- Ease of use: One of vLLM’s primary design goals is user-friendliness, making it accessible to developers with different levels of expertise. vLLM provides a straightforward setup and configuration process for quick development.
- High Performance: vLLM is optimized for high performance, leveraging advanced techniques such as:
- PagedAttention to maximize inference speed.
- Tensor Parallelism to distribute computation efficiently across multiple GPUs. Together, these techniques deliver faster responses and higher throughput, making vLLM a strong choice for demanding applications (see the sketch after this list).
- Scalability: vLLM is built with scalability in mind: the same LLM can be deployed on a single GPU or across multiple GPUs, making it suitable for both small-scale and large-scale deployments.
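As a simple illustration of running the same model across multiple GPUs, the sketch below loads a model with vLLM's Python API and shards it via the tensor_parallel_size argument. The model name and GPU count are placeholders, not part of the Wallaroo examples later in this post.

from vllm import LLM, SamplingParams

# Placeholder model and GPU count; adjust to the available hardware.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,  # shard the model's weights across two GPUs
)

# Generate a short completion to confirm the setup works.
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is tensor parallelism?"], params)
print(outputs[0].outputs[0].text)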
Comparison Between Llama.cpp and vLLM
vLLM and Llama.cpp differ in the following respects.
Performance Metrics
Based on benchmark results published by the vLLM project, vLLM outperforms Llama.cpp on several key metrics:
- Iterations: vLLM completes more iterations within the same time frame for higher throughput.
- Requests Per Minute: vLLM handles more requests per minute, demonstrating its efficiency in processing multiple requests.
- Latency: vLLM exhibits lower latency, resulting in faster response times.
- Total Tokens Processed (prompt processing + token generation, tokens per second): vLLM processes more tokens per second, showing stronger performance when handling large volumes of data.
Hardware Utilization
Llama.cpp is optimized for CPU usage and can run on consumer-grade hardware; vLLM leverages GPU acceleration to achieve higher performance. This makes vLLM more suitable for environments with access to powerful GPUs, whereas Llama.cpp is ideal for scenarios where GPU resources are limited.
Ease of Setup
vLLM offers a more straightforward setup process compared to Llama.cpp. Its user-friendly configuration and comprehensive API support make it easier for developers to get started and integrate LLM capabilities into their applications.
Customization and Flexibility
Llama.cpp provides extensive customization options, allowing developers to fine-tune various parameters to suit specific needs. In contrast, vLLM focuses on ease of use and performance, offering a more streamlined experience with fewer customization requirements.
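As an illustration of that flexibility, the snippet below shows a few of the tuning parameters llama-cpp-python exposes when loading a GGUF model. The file path and values here are purely illustrative, not taken from the deployment example later in this post.

from llama_cpp import Llama

# Illustrative values only; the GGUF path is a placeholder.
llm = Llama(
    model_path="./models/example-model.Q5_K_M.gguf",
    n_ctx=4096,       # context window size in tokens
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=20,  # layers offloaded to the GPU (0 = CPU only, -1 = all)
    n_batch=512,      # prompt processing batch size
    seed=42,          # fixed seed for reproducible sampling
)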
Deployment in Wallaroo
Both Llama.cpp and vLLM are deployed in Wallaroo using the Arbitrary Python, also known as Bring Your Own Predict (BYOP), framework.
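At a high level, a BYOP package wraps the model behind a load step and a predict step. The skeleton below is only a simplified sketch of that shape, reusing the _load_model method name from the examples that follow; the actual Wallaroo BYOP interface includes framework classes and schema handling not shown here.

class CustomLLMInference:
    """Simplified, illustrative BYOP-style wrapper (not the full Wallaroo interface)."""

    def __init__(self):
        self.model = None

    def _load_model(self, model_path):
        # Load the LLM from the uploaded artifacts; see the
        # llama-cpp-python and vLLM examples below.
        raise NotImplementedError

    def _predict(self, input_data):
        # Run generation with self.model and return the generated text.
        raise NotImplementedError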
Deploying Llama.cpp in Wallaroo
Deploying Llama.cpp with the Wallaroo BYOP framework requires llama-cpp-python. This example uses Llama 3 70B Instruct (Q5_K_M quantization) for testing and deploying Llama.cpp.
Llama.cpp BYOP Implementation Details
To run llama-cpp-python on GPU, llama-cpp-python is installed using the subprocess library in Python, directly in the Python BYOP code:

import subprocess
import sys

pip_command = (
    f'CMAKE_ARGS="-DLLAMA_CUDA=on" {sys.executable} -m pip install llama-cpp-python'
)
subprocess.check_call(pip_command, shell=True)
The model is loaded via the BYOP’s _load_model method, which sets a 4096-token context window and offloads all of the model’s layers to the GPU:

def _load_model(self, model_path):
    llm = Llama(
        model_path=f"{model_path}/artifacts/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf",
        n_ctx=4096,
        n_gpu_layers=-1,
        logits_all=True,
    )
    return llm
Because the chosen model is an instruct variant, the prompt is constructed as a chat-style message list:

messages = [
    {
        "role": "system",
        "content": "You are a generic chatbot, try to answer questions the best you can.",
    },
    {"role": "user", "content": prompt},
]
result = self.model.create_chat_completion(
    messages=messages, max_tokens=1024, stop=["<|eot_id|>"]
)
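create_chat_completion returns an OpenAI-style response dictionary, so the generated text can be pulled out of the first choice. A brief sketch (variable names are illustrative, not from the original BYOP code):

# The completion is an OpenAI-style dict; the generated text is under
# choices[0]["message"]["content"].
generated_text = result["choices"][0]["message"]["content"]
# The BYOP's predict step would then package this text into its output payload.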
Llama.cpp Deployment Details
The deployment configuration sets what resources are allocated for the Llama.cpp LLM’s use. For this example, the Llama.cpp LLM is allocated 8 CPUs, 10 Gi of RAM, and 1 GPU.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(model, 8) \
.sidekick_memory(model, '10Gi') \
.sidekick_gpus(model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
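The model variable referenced in the configuration above is the object returned when the BYOP package is uploaded through the Wallaroo SDK. The following is a minimal sketch of the surrounding upload-and-deploy flow, assuming a connected client wl; the model name, file name, and schemas are placeholders, not values from this example.

import wallaroo
import pyarrow as pa
from wallaroo.framework import Framework

wl = wallaroo.Client()

# Placeholder schemas; they must describe the BYOP's actual inputs and outputs.
input_schema = pa.schema([pa.field("text", pa.string())])
output_schema = pa.schema([pa.field("generated_text", pa.string())])

# Upload the BYOP package (placeholder names) before building deployment_config.
model = wl.upload_model(
    "llamacpp-llm",
    "byop_llamacpp.zip",
    framework=Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
)

# With deployment_config built as shown above, attach the model to a pipeline and deploy.
pipeline = wl.build_pipeline("llamacpp-pipeline")
pipeline.add_model_step(model)
pipeline.deploy(deployment_config=deployment_config)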
Deploying vLLM in Wallaroo
vLLM BYOP Implementation Details
This example uses Llama 3 8B Instruct to test and deploy vLLM.
To run vLLM on CUDA, vLLM is installed using the subprocess library in Python, directly in the Python BYOP code:

import subprocess
import sys

pip_command = (
    f'{sys.executable} -m pip install https://github.com/vllm-project/vllm/releases/download/v0.5.2/vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118'
)
subprocess.check_call(pip_command, shell=True)
The model is loaded via the BYOP’s _load_model method, pointing vLLM at the Meta-Llama-3-8B-Instruct weights included in the model artifacts:

def _load_model(self, model_path):
    llm = LLM(
        model=f"{model_path}/artifacts/Meta-Llama-3-8B-Instruct/"
    )
    return llm
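For generation, vLLM's offline API takes a list of prompts plus SamplingParams and returns one RequestOutput per prompt. A possible generation step inside the BYOP might look like the sketch below; the method name, sampling values, and field access are illustrative, not taken from the original code.

from vllm import SamplingParams

def _generate(self, prompt):
    # Illustrative sampling settings; tune per workload.
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
    # generate() accepts a list of prompts and returns one RequestOutput per prompt.
    outputs = self.model.generate([prompt], params)
    # Each RequestOutput holds one or more completions; take the first.
    return outputs[0].outputs[0].text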
vLLM Deployment Details
The deployment configuration sets what resources are allocated for the vLLM’s use. For this example, the vLLM is allocated 4 CPUs, 10 Gi of RAM, and 1 GPU.
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('2Gi') \
.sidekick_cpus(model, 4) \
.sidekick_memory(model, '10Gi') \
.sidekick_gpus(model, 1) \
.deployment_label("wallaroo.ai/accelerator:a100") \
.build()
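Once deployed, inference requests go through the pipeline like any other Wallaroo model. A minimal sketch, assuming the vLLM BYOP has been uploaded and deployed to a pipeline as sketched in the Llama.cpp section, and that its schemas use placeholder text and generated_text fields:

import pandas as pd

# Placeholder column name; it must match the BYOP's input schema.
data = pd.DataFrame({"text": ["Describe vLLM in one sentence."]})

result = pipeline.infer(data)
# Output columns are prefixed with "out."; the field name is a placeholder.
print(result["out.generated_text"][0])

# Undeploy to release the GPU when finished.
pipeline.undeploy()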
Conclusions
Both Llama.cpp and vLLM, deployed in Wallaroo, have the potential to become industry standards. Llama.cpp brings portability and efficiency, designed to run well on CPUs and GPUs without requiring specialized hardware. vLLM brings user-friendliness, rapid inference speeds, and high throughput, making it an excellent choice for projects that prioritize speed and performance.
Tutorials
The following tutorials demonstrate deploying Llama.cpp and vLLM in Wallaroo.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today