How to Upload and Deploy vLLM
Wallaroo supports uploading and deploying virtual large language models (vLLMs). vLLMs deployed in Wallaroo provide hardware acceleration support, OpenAI API compatibility, and resource management through Wallaroo’s deployment configurations. The following are examples of models that can take advantage of the Wallaroo vLLM framework. Note that not every scenario has been tested; for specific models and options, consult your Wallaroo support representative.
- Mistral models
- OpenAI
- Llama
- Llama-3.1-8B-Instruct: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
- Llama-3.3-70B-Instruct: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/tree/main
- Llama-4-Scout-17B-16E-Instruct: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
- Deepseek Qwen
- Qwen-VL: https://huggingface.co/Qwen/Qwen-VL
- Qwen3-Coder-Next: https://huggingface.co/Qwen/Qwen3-Coder-Next
Requirements
| Parameter | Description |
|---|---|
| Web Site | https://vllm.ai/ |
| Supported Libraries | vllm==0.15.1 |
| Framework | Framework.VLLM aka vllm |
vLLM models always run in the Wallaroo Containerized Runtime.
vLLM Configurations and Compatibility
AI Acceleration Compatibility
The following AI Acceleration hardware frameworks are supported by Wallaroo’s vLLM framework.
| Accelerator | ARM Support | X64/X86 Support | Intel GPU | Nvidia GPU | Compatible Version | Description |
|---|---|---|---|---|---|---|
| QAIC | X | √ | X | X | vllm==0.8.6 | Qualcomm Cloud AI. AI acceleration compatible with x86/64 architectures. For details on LLM deployment optimizations with QAIC, see LLM Inference with Qualcomm QAIC. |
| CUDA | √ | √ | X | √ | vllm==0.15.1 | NVIDIA CUDA acceleration supported by both ARM and X64/X86 processors. Intended for deployment with Nvidia GPUs. |
AI acceleration is set at model upload.
OpenAI Compatibility
OpenAI compatibility is enabled by default. Additional configuration options are set post model upload as a model configuration setting.
When OpenAI compatibility is disabled, standard Wallaroo PyArrow schemas are supported.
Continuous Batching
Continuous batching is configured at the model level post model upload.
Upload
Models are uploaded to the Wallaroo Ops instance via either the Wallaroo SDK or the Wallaroo MLOps API. During model upload, the following options are available:
- OpenAI API compatibility is enabled by default.
- AI Hardware Accelerator type is set at model upload; to change the hardware accelerator, the model must be uploaded as a new model version.
Upload vLLM Framework Models via the Wallaroo SDK
Models are uploaded via the wallaroo.client.Client.upload_model method.
SDK Upload Model Parameters
wallaroo.client.Client.upload_model has the following parameters.
| Parameter | Type | Description |
|---|---|---|
| name | string (Required) | The name of the model. Model names are unique per workspace. Models uploaded with the same name are assigned as a new version of the model. |
| path | string (Required) | The path to the model file being uploaded. |
| framework | string (Required) | The framework of the model from wallaroo.framework. For vLLM, this is wallaroo.framework.Framework.VLLM. |
| framework_config | wallaroo.framework.VLLMConfig (Optional) | Sets the vLLM framework configuration options. See below for optional parameters. |
| input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. For OpenAI compatibility, the input_schema must be set to pyarrow.schema([]). |
| output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. For OpenAI compatibility, the output_schema must be set to pyarrow.schema([]). |
| convert_wait | bool (Optional) | If True, the upload call waits for the model conversion process to complete before returning. |
| arch | wallaroo.engine_config.Architecture (Optional) | The architecture the model is deployed to. Defaults to X86. |
| accel | wallaroo.engine_config.Acceleration (Optional) | The AI hardware accelerator used. If a model is intended for use with a hardware accelerator, it should be assigned at this step. |
wallaroo.framework.VLLMConfig contains the following parameters.
| Parameter | Type | Description |
|---|---|---|
| gpu_memory_utilization | Float (Default: 0.9) | The fraction of GPU memory to use for the model. Use a smaller value if you see out-of-memory errors. |
| max_model_len | Integer (Default: None) | The maximum sequence length the model will process. If not set, derived from the model’s config. |
| max_num_seqs | Integer (Default: None) | The maximum number of sequences to process in parallel. |
| quantization | Quantization (Default: none) | The quantization method to use; see the Quantization enumeration for valid values. |
| kv_cache_dtype | KvCacheDtype (Default: auto) | The KV cache data type; see the KvCacheDtype enumeration for valid values. |
| block_size | Integer (Default: None) | The number of tokens per block. Recommended to set to 32 for better performance with Acceleration.QAIC. |
| device_group | List[Integer] (Default: None) | List of device IDs to compile the model for. Only supported with Acceleration.QAIC. |
| max_seq_len_to_capture | Integer (Default: None) | The maximum sequence length for CUDA graph capture. Only supported with Acceleration.QAIC. |
SDK Upload Model Returns
wallaroo.client.Client.upload_model returns the model version. The model version refers to the version of the model object in Wallaroo. In Wallaroo, a model version update happens when a new model file (artifact) is uploaded against the same model object name.
SDK Upload Model Example
The following example demonstrates uploading a vLLM using the Wallaroo SDK. This assumes OpenAI compatibility is left enabled. The model is uploaded with the following parameters:
- The model name
- The file path to the model
- The framework set to the Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM
- The input and output schemas, defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas (pyarrow.schema([])).
- The hardware acceleration is set to NVIDIA CUDA.
import wallaroo
import pyarrow as pa
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
# connect to Wallaroo
wl = wallaroo.Client()
# upload the model
model = wl.upload_model(
name = model_name,
path = file_path,
input_schema = pa.schema([]),
output_schema = pa.schema([]),
framework = Framework.VLLM,
accel=Acceleration.CUDA
)
The following example demonstrates uploading the model with customized framework configuration options.
import pyarrow as pa
from wallaroo.framework import Framework, VLLMConfig
from wallaroo.engine_config import Acceleration
model = wl.upload_model(
name=model_name,
path=file_path,
framework=Framework.VLLM,
framework_config=VLLMConfig(
gpu_memory_utilization=0.85,
max_model_len=4096,
quantization="fp8"
),
input_schema=pa.schema([]),
output_schema=pa.schema([]),
accel=Acceleration.CUDA
)
Once uploaded, further model configuration options are available before deployment.
Upload vLLM Framework Models via the Wallaroo MLOps API
The method wallaroo.client.Client.generate_upload_model_api_command generates a curl script for uploading models to Wallaroo via the Wallaroo MLOps API. The generated curl script is based on the Wallaroo SDK user’s current workspace. This is useful for environments that do not have the Wallaroo SDK installed, or for uploading very large models (10 gigabytes or more).
The command assumes that other upload parameters are set to default. For details on uploading models via the Wallaroo MLOps API, see Wallaroo MLOps API Essentials Guide: Model Upload and Registrations.
This method takes the following parameters:
| Parameter | Type | Description |
|---|---|---|
| base_url | String (Required) | The Wallaroo domain name. For example: wallaroo.example.com. |
| name | String (Required) | The name to assign the model at upload. This must match DNS naming conventions. |
| path | String (Required) | Path to the ML or LLM model file. |
| framework | String (Required) | The framework from wallaroo.framework.Framework. For a complete list, see Wallaroo Supported Models. |
| input_schema | String (Required) | The model’s input schema in PyArrow.Schema format. |
| output_schema | String (Required) | The model’s output schema in PyArrow.Schema format. |
This outputs a curl command in the following format (indentations added for emphasis). The sections marked with {} represent variables injected into the script from the above parameters or from the current SDK session:

- {Current Workspace['id']}: The value of the id for the current workspace.
- {Bearer Token}: The bearer token used to authenticate to the Wallaroo MLOps API.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer {Bearer Token}" \
-F "metadata={"name": {name}, "visibility": "private", "workspace_id": {Current Workspace['id']}, "conversion": {"arch": "x86", "accel": "none", "framework": "{framework}", "python_version": "3.8", "requirements": []}, \
"input_schema": "{base64 version of input_schema}", \
"output_schema": "{base64 version of the output_schema}";type=application/json" \
-F "file=@{path};type=application/octet-stream" \
https://{base_url}/v1/api/models/upload_and_convert
Once generated, users can use the script to upload the model via the Wallaroo MLOps API.
The following example shows setting the parameters above and generating the model upload API command.
import wallaroo
import pyarrow as pa
# set the input and output schemas
input_schema = pa.schema([
pa.field("text", pa.string())
])
output_schema = pa.schema([
pa.field("generated_text", pa.string())
])
# use the generate model upload api command
wl.generate_upload_model_api_command(
base_url='https://example.wallaroo.ai/',
name='sample_model_name',
path='llama_byop.zip',
framework={insert framework here},
input_schema=input_schema,
output_schema=output_schema)
The output of this command is:
curl --progress-bar -X POST -H "Content-Type: multipart/form-data" -H "Authorization: Bearer abc123" -F "metadata={"name": "sample_model_name", "visibility": "private", "workspace_id": 20, "conversion": {"arch": "x86", "accel": "none", "framework": "{framework type here}", "python_version": "3.8", "requirements": []}, "input_schema": "/////3AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAABwAAAAEAAAAAAAAAAQAAAB0ZXh0AAAAAAQABAAEAAAA", "output_schema": "/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json" -F "file=@llama_byop.zip;type=application/octet-stream" https://example.wallaroo.ai/v1/api/models/upload_and_convert
LLM Deploy
LLMs are deployed via the Wallaroo SDK through the following process:
- After the model is uploaded, get the LLM model reference from Wallaroo.
- Create or use an existing Wallaroo pipeline and assign the LLM as a pipeline model step.
- Set the deployment configuration to assign the resources including the number of CPUs, amount of RAM, etc for the LLM deployment.
- Deploy the LLM with the deployment configuration.
Retrieve LLM
LLMs previously uploaded to Wallaroo can be retrieved without re-uploading via the Wallaroo SDK method wallaroo.client.Client.get_model(name: String, version: String), which takes the following parameters:

- name: The name of the model.
- version: (Optional) The specific model version to retrieve.

The method retrieves the most recent model version in the current workspace that matches the provided model name unless a specific version is requested. For more details on managing ML models in Wallaroo, see Manage Models.
The following demonstrates retrieving an uploaded LLM and storing it in the variable model_version.
import wallaroo
# connect with the Wallaroo client
wl = wallaroo.Client()
llm_model = wl.get_model(name=model_name)
OpenAI Compatibility Configuration
OpenAI compatibility is enabled via the model configuration from the class wallaroo.openai_config.OpenaiConfig; by default, OpenAI compatibility is enabled for the vLLM framework with the default parameters. Note that if OpenAI compatibility is disabled, all other parameters are ignored.
| Parameter | Type | Description |
|---|---|---|
| enabled | Bool (Default: False; Default: True for wallaroo.framework.Framework.VLLM) | If True, OpenAI compatibility is enabled. If False, all other parameters are ignored. Note: Wallaroo enables this automatically for vLLM models at upload regardless of the OpenaiConfig class default. |
| completion_config | Dict | Default parameters for /v1/completions requests. Accepts any OpenAI completion parameter except stream, which is set per inference request. Example: {"temperature": 0.7, "max_tokens": 512}. |
| chat_completion_config | Dict | Default parameters for /v1/chat/completions requests. Accepts any OpenAI chat completion parameter except stream, which is set per inference request. Example: {"temperature": 0.3, "max_tokens": 200}. |
With the OpenaiConfig object defined, it is applied to the LLM’s model configuration through the openai_config parameter.
from wallaroo.openai_config import OpenaiConfig
openai_config = OpenaiConfig(chat_completion_config={"temperature": .3, "max_tokens": 200})
llm_model = llm_model.configure(openai_config=openai_config)
Continuous Batching Configuration
Continuous batching allows vLLM to process inference requests asynchronously, improving throughput under concurrent load. It is configured post-upload via the wallaroo.continuous_batching_config.ContinuousBatchingConfig class and contains the following parameters.
| Parameter | Type | Description |
|---|---|---|
max_concurrent_batch_size | Integer (Default: 1) | The maximum number of requests processed concurrently in a single batch. Must be greater than 0. |
The following is an example of modifying the ContinuousBatchingConfig to set the max_concurrent_batch_size to 5.
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
llm_model = llm_model.configure(
continuous_batching_config=ContinuousBatchingConfig(max_concurrent_batch_size=5)
)
Create the Wallaroo Pipeline and Add Model Step
LLMs are deployed via Wallaroo pipelines. Wallaroo pipelines are created in the current user’s workspace with the Wallaroo SDK method wallaroo.client.Client.build_pipeline(pipeline_name: String). This creates a pipeline in the user’s current workspace with the provided pipeline_name and returns a wallaroo.pipeline.Pipeline object, which can be saved to a variable for use in other commands.
Pipeline names are unique within a workspace; using the build_pipeline method within a workspace where another pipeline with the same name exists will connect to the existing pipeline.
Once the pipeline reference is stored to a variable, LLMs are added to the pipeline as a pipeline step with the method wallaroo.pipeline.Pipeline.add_model_step(model_version: wallaroo.model_version.ModelVersion). Retrieving the LLM model version was demonstrated in the Retrieve LLM step above.
This example demonstrates creating a pipeline and adding a model version as a pipeline step. For more details on managing Wallaroo pipelines for model deployment, see the Model Deploy guide.
# create the pipeline
llm_pipeline = wl.build_pipeline('sample-llm-pipeline')
# add the LLM as a pipeline model step
llm_pipeline.add_model_step(llm_model)
Set the Deployment Configuration and Deploy the Model
Before deploying the LLM, a deployment configuration is created. This sets how the cluster’s resources are allocated for the LLM’s exclusive use.
- Pipeline deployment configurations are created through the wallaroo.deployment_config.DeploymentConfigBuilder() class.
- Various options, including the number of CPUs, RAM, and other resources, are set for the Wallaroo Native Runtime and the Wallaroo Containerized Runtime.
  - Typically, LLMs are deployed in the Wallaroo Containerized Runtime, which is configured through the DeploymentConfigBuilder’s sidekick options.
- Once the configuration options are set, the deployment configuration is finalized with the wallaroo.deployment_config.DeploymentConfigBuilder().build() method.
The following options are available for deployment configurations for LLM deployments. For more details on deployment configurations, see Deployment Configuration guide.
| Method | Parameters | Description |
|---|---|---|
| replica_count | (count: int) | The number of replicas to deploy. This allows multiple deployments of the same model to increase inference throughput through parallelization. |
| replica_autoscale_min_max | (maximum: int, minimum: int = 0) | Allows replicas to be scaled from a minimum (default 0) to a maximum number of replicas. This allows deployments to spin up additional replicas as more resources are required, then spin them back down to save on resources and costs. |
| autoscale_cpu_utilization | (cpu_utilization_percentage: int) | Sets the average CPU percentage metric for when to load or unload another replica. |
| cpus | (core_count: float) | Sets the number or fraction of CPUs to use for the deployment, for example: 0.25, 1, 1.5, etc. The units are similar to the Kubernetes CPU definitions. |
| gpus | (core_count: int) | Sets the number of GPUs to allocate for native runtimes. GPUs are only allocated in whole units, not as fractions. Organizations should be aware of the total number of GPUs available to the cluster, and monitor which deployment configurations have GPUs allocated to ensure they do not run out. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If gpus is called, then deployment_label must also be called and match the GPU nodepool for the Kubernetes cluster hosting the Wallaroo instance. |
| memory | (memory_spec: str) | Sets the amount of RAM to allocate to the deployment. The memory_spec string is in the format "{size as number}{unit value}", following Kubernetes memory notation, for example 2Gi. |
| deployment_label | (label: string) | Label used to match the nodepool label used for the deployment. Required if gpus are set, and must match the GPU nodepool label. See Create GPU Nodepools for Kubernetes Clusters for details on setting up GPU nodepools for Wallaroo. |
| sidekick_cpus | (model: wallaroo.model.Model, core_count: float) | Sets the number of CPUs to be used for the model’s sidekick container. Only affects image-based models (e.g. MLFlow models) in a deployment. model is the sidekick model to configure; core_count is the number or fraction of CPUs to allocate. |
| sidekick_gpus | (model: wallaroo.model.Model, core_count: int) | Sets the number of GPUs to allocate for containerized runtimes. GPUs are only allocated in whole units, not as fractions. If there are not enough GPUs to allocate to a deployment configuration, an error message is returned during deployment. If called, then deployment_label must also be called and match the GPU nodepool for the Kubernetes cluster hosting the Wallaroo instance. |
| sidekick_memory | (model: wallaroo.model.Model, memory_spec: str) | Sets the memory available to the model’s sidekick container. model is the sidekick model to configure; memory_spec uses the same format as memory. |
| scale_up_queue_depth | Integer | Sets a queue depth threshold above which replicas scale up. Queue depth is calculated as (requests in queue + requests being processed) / available replicas over the autoscaling window. When set, this overrides the default autoscale_cpu_utilization trigger. Automatically sets scale_down_queue_depth to 1 if not already configured. |
| scale_down_queue_depth | Integer (Default: 1) | Sets the queue depth threshold below which replicas scale down. Only applicable when scale_up_queue_depth is configured. |
| autoscaling_window | Integer (Default: 300) | Sets the time window in seconds over which autoscaling metrics are evaluated. Only applicable when scale_up_queue_depth is configured. |
Once the deployment configuration is set, the LLM is deployed via the wallaroo.pipeline.Pipeline.deploy(deployment_config: Optional[wallaroo.deployment_config.DeploymentConfig]) method. This allocates resources from the cluster for the LLMs deployment based on the DeploymentConfig settings. If the resources set in the deployment configuration are not available at deployment, an error is returned.
The following example shows setting the deployment configuration for an LLM for deployment on x86 architecture, then deploying a pipeline with this deployment configuration.
from wallaroo.deployment_config import DeploymentConfigBuilder
# set the deployment config with the following:
# Wallaroo Native Runtime: 0.5 cpu, 2 Gi RAM
# Wallaroo Containerized Runtime where the LLM is deployed: 32 CPUs, 40 Gi RAM, 2 GPUs
deployment_config = DeploymentConfigBuilder() \
.cpus(0.5).memory('2Gi') \
.sidekick_cpus(llm_model, 32) \
.sidekick_memory(llm_model, '40Gi') \
.sidekick_gpus(llm_model, 2) \
.deployment_label(deployment_label) \
.build()
llm_pipeline.deploy(deployment_config)
Queue-based Autoscaling Example
Queue-based autoscaling is recommended for LLM deployments, since continuous batching makes CPU utilization an unreliable scaling signal. When enabled, queue-depth-based settings override CPU-based autoscaling deployment configurations.
The following example scales up when queue depth exceeds 4, scales down when it drops below 2, and evaluates metrics over a 60-second window.
from wallaroo.deployment_config import DeploymentConfigBuilder
deployment_config = DeploymentConfigBuilder() \
.cpus(0.5).memory('2Gi') \
.sidekick_cpus(llm_model, 32) \
.sidekick_memory(llm_model, '40Gi') \
.sidekick_gpus(llm_model, 2) \
.deployment_label(deployment_label) \
.replica_autoscale_min_max(maximum=4, minimum=1) \
.scale_up_queue_depth(4) \
.scale_down_queue_depth(2) \
.autoscaling_window(60) \
.build()
llm_pipeline.deploy(deployment_config)
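The queue-depth rule behind this configuration can be sketched as follows. This is an illustrative calculation only; the actual autoscaling controller runs inside Wallaroo, and these function names are hypothetical.

```python
# Hypothetical sketch of the queue-depth autoscaling rule described above.
def queue_depth(in_queue: int, in_process: int, replicas: int) -> float:
    """Queue depth = (requests in queue + requests being processed) / available replicas."""
    return (in_queue + in_process) / replicas

def scaling_decision(depth: float, scale_up: int = 4, scale_down: int = 2) -> str:
    """Apply the scale_up_queue_depth / scale_down_queue_depth thresholds."""
    if depth > scale_up:
        return "scale up"
    if depth < scale_down:
        return "scale down"
    return "hold"

# 10 queued + 6 in-flight requests across 2 replicas -> depth 8.0 -> scale up
print(scaling_decision(queue_depth(10, 6, 2)))
```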
Once deployed, inferences are available. For details on inference requests for OpenAI Compatibility enabled models, see Inference via OpenAI Compatibility Deployments.
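As a preview, an OpenAI-style /v1/chat/completions request body for such a deployment might look like the following. This is a sketch only: the model name and parameter values are illustrative, and the actual endpoint URL and authentication are deployment-specific; consult the linked guide for the supported request formats.

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint. Values here are illustrative, not Wallaroo defaults.
payload = {
    "model": "sample_model_name",
    "messages": [
        {"role": "user", "content": "Summarize continuous batching in one sentence."}
    ],
    "max_tokens": 200,
    "temperature": 0.3,
    "stream": False,  # stream is set per inference request
}
body = json.dumps(payload)
print(body)
```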