Deploy LLMs with OpenAI Compatibility

Wallaroo provides OpenAI compatibility for improved interactive user experiences with LLM-based applications while taking advantage of Wallaroo’s ability to maximize throughput and optimize latency. AI developers can seamlessly migrate their applications from OpenAI endpoints to Wallaroo on-prem endpoints, in connected and air-gapped environments, without losing any functionality.

Wallaroo supports deploying LLMs with OpenAI compatibility. This provides developers and data scientists an easy migration path for their existing OpenAI API deployments while leveraging Wallaroo’s resource optimization to improve user experience and reduce latency and costs.

Wallaroo OpenAI compatibility supports the following options:

  • Token Streaming: Wallaroo supports the OpenAI API token streaming methods, available through either the Wallaroo SDK or the Wallaroo OpenAI API inference methods for completion and chat/completion.
  • AI Acceleration: Deploy LLMs with token streaming with NVIDIA CUDA or Qualcomm Cloud AI acceleration.
  • Continuous Batching: Wallaroo Continuous Batching provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

How to Configure LLM Deployment with OpenAI API Compatibility

LLM deployment with OpenAI API compatibility is applied through the following process.

  • Models are uploaded to Wallaroo either in the native vLLM or Custom Model frameworks.
    • Any AI acceleration settings are applied at model upload.
  • OpenAI API compatibility is enabled in the model configuration either during or after model upload.
    • Once enabled, all OpenAI API parameters for completion and chat/completion are available except stream; the stream parameter is only set at inference request time.
  • LLMs are deployed with resource configurations that allocate resources to the LLM’s exclusive use (cpus, memory, gpus, etc).
  • Inference requests with the OpenAI API are made either by:
    • OpenAI API clients through the deployed LLM’s inference endpoints.
    • The Wallaroo SDK through OpenAI specific methods.
  • Inference requests can override the model’s OpenAI configuration to allow fine-tuning of inference request parameters.

Upload LLM to Wallaroo

The following examples demonstrate uploading an LLM to Wallaroo in either the Wallaroo Native vLLM runtime or the Wallaroo Custom Model runtime. Note that at this phase OpenAI API compatibility is not defined - that is done at the Configure OpenAI API Compatibility step.

Upload LLMs Via the Wallaroo SDK

LLMs are uploaded via the Wallaroo SDK through the following method:

  • Define the model upload parameters with the wallaroo.client.Client.upload_model method.
    • (Optional) Set the upload_model parameter framework_config to specify any vLLM options to increase performance. If no options are specified, the default values are applied.

The method wallaroo.client.Client.upload_model takes the following parameters:

Parameter | Type | Description
name | String (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model.
path | String (Required) | The path to the model file being uploaded.
framework | String (Required) | The framework of the model from wallaroo.framework.Framework. For native vLLM, this framework is wallaroo.framework.Framework.VLLM. For custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM.
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]).
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]).
framework_config | wallaroo.framework.VLLMConfig OR wallaroo.framework.CustomConfig (Optional) | Sets the vLLM framework configuration options.
accel | wallaroo.engine_config.Acceleration (Optional) | The optional AI hardware accelerator used. The following options are supported for OpenAI compatibility: wallaroo.engine_config.Acceleration.QAIC (Qualcomm Cloud AI) and wallaroo.engine_config.Acceleration.CUDA (NVIDIA CUDA).
convert_wait | Bool (Optional) | If True, waits in the script for the model conversion to complete. If False, proceeds with the script without waiting for the model conversion process to display complete.

The framework configuration must match the appropriate runtimes:

Runtime | Framework Config
wallaroo.framework.Framework.VLLM | wallaroo.framework.VLLMConfig
wallaroo.framework.Framework.CUSTOM | wallaroo.framework.CustomConfig

wallaroo.framework.VLLMConfig and wallaroo.framework.CustomConfig contain the following parameters. If no modifications are made at model upload, the default values are applied.

Parameter | Type
max_num_seqs | Integer (Default: 256)
max_model_len | Integer (Default: None)
max_seq_len_to_capture | Integer (Default: 8192)
quantization | (Default: None)
kv_cache_dtype | (Default: 'auto')
gpu_memory_utilization | Float (Default: 0.9)
block_size | (Default: None)
device_group | (Default: None) This setting is ignored for CUDA acceleration.

Upload Example for Native vLLM Frameworks via the Wallaroo SDK

The following demonstrates uploading a Native vLLM Runtime with a framework configuration via the Wallaroo SDK.

# (Optional) set the VLLMConfig values
# If no framework configuration value is set, the default values are applied

standard_framework_config = wallaroo.framework.VLLMConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture, 
    quantization=quantization, 
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the vLLM model with the framework configuration values
vllm_model = wl.upload_model(model_name,
                              model_file_name,
                              framework=wallaroo.framework.Framework.VLLM,
                              input_schema=input_schema,
                              output_schema=output_schema,
                              framework_config=standard_framework_config,
                              accel=accel
                            )

Upload Example for Custom Frameworks via the Wallaroo SDK

The following demonstrates uploading a Custom Model Runtime with a framework configuration via the Wallaroo SDK. Typically these models are uploaded to provide additional functionality for the native vLLM runtime model, such as Retrieval-Augmented Generation (RAG), monitoring listeners, etc. The Custom Model Runtime is then deployed in the same Wallaroo pipeline as the native vLLM runtime.

The following example demonstrates uploading a Custom Framework model to Wallaroo through the Wallaroo SDK. Note that, unlike the LLM, no acceleration value is set - in this example, the LLM uses acceleration while the Custom Model does not require it.

# (Optional) set the CustomConfig values
# If no framework configuration value is set, the default values are applied

custom_framework_config = wallaroo.framework.CustomConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture, 
    quantization=quantization, 
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the custom model with the framework configuration values
custom_model = wl.upload_model(model_name,
                              model_file_name,
                              framework=wallaroo.framework.Framework.CUSTOM,
                              input_schema=input_schema,
                              output_schema=output_schema,
                              framework_config=custom_framework_config
                            )

Upload LLMs Via the Wallaroo MLOps API

Models are uploaded via the Wallaroo MLOps API through the following endpoint:

  • /v1/api/models/upload_and_convert

This endpoint accepts the following parameters.

Field | Type | Description
name | String (Required) | The model name.
visibility | String (Required) | Either public or private.
workspace_id | Integer (Required) | The numerical ID of the workspace to upload the model to.
conversion | Dict (Required) | The conversion parameters, which include the following:
  framework | String (Required) | The framework of the model being uploaded. For Native vLLM runtimes, this value is vllm. For Custom vLLM runtimes, this value is custom.
  accel | String OR Dict (Optional) | The AI accelerator used. For continuous batching, supported types are cuda and qaic. If using qaic, this parameter is either a string to use the default parameters, or a Dict to set hardware acceleration parameters. For more details, see LLM Inference with Qualcomm QAIC.
  python_version | String (Required) | The version of Python required for the model. For Native and Custom vLLM frameworks, this value is 3.8.
  requirements | String (Required) | Required libraries. For Native and Custom vLLM frameworks, this value is [].
  framework_config | Dict (Optional) | The framework configuration. See the framework_config parameters below for further details.
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode.
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode.

The framework_config parameter accepts the following parameters.

Field | Type | Description
config | Dict | The framework configuration values. The parameters below are members of the config field.
  max_num_seqs | Integer (Default: 256)
  max_model_len | Integer (Default: None)
  max_seq_len_to_capture | Integer (Default: 8192)
  quantization | (Default: None)
  kv_cache_dtype | (Default: 'auto')
  gpu_memory_utilization | Float (Default: 0.9)
  block_size | (Default: None)
  device_group | (Default: None) This setting is ignored for CUDA acceleration.
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm". For Custom vLLM frameworks, this value is "custom".

Upload Example for Native vLLM Runtime via the MLOps API

The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API.

import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
# the input and output schemas are ignored for OpenAI compatible LLMs, so only an empty set is needed
input_schema = pa.schema([])
output_schema = pa.schema([])

# convert the input and output schemas to base64 for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
# acceleration = CUDA

curl --progress-bar \
    -X POST \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer abc123" \
    -F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "cuda", "framework": "vllm", "framework_config": {"config": {"gpu_memory_utilization": 0.9, "kv_cache_dtype": "auto", "max_model_len": 128, "max_num_seqs": 256, "max_seq_len_to_capture": 8192, "quantization": "none"}, "framework": "vllm"}, "python_version": "3.8", "requirements": []}, "input_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA=", "output_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA="};type=application/json' -F "file=@<file path to vllm>;type=application/octet-stream" \
    https://example.wallaroo.ai/v1/api/models/upload_and_convert
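
For reference, the same upload can be made from Python with the requests library. The following is a minimal sketch assuming the endpoint, bearer token, workspace ID, and metadata payload from the curl example above; encoded_input_schema and encoded_output_schema are the base64-encoded schemas produced in the earlier encoding step, and the host, token, and model path values are placeholders.

import json
import requests

# placeholder values; substitute your Wallaroo hostname, token, and model file path
host = "https://example.wallaroo.ai"
token = "abc123"
model_path = "<file path to vllm>"

metadata = {
    "name": "<your-model-name>",
    "visibility": "private",
    "workspace_id": 6,
    "conversion": {
        "arch": "x86",
        "accel": "cuda",
        "framework": "vllm",
        "framework_config": {
            "config": {"gpu_memory_utilization": 0.9, "max_model_len": 128},
            "framework": "vllm",
        },
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": encoded_input_schema,
    "output_schema": encoded_output_schema,
}

# multipart/form-data request matching the curl example above
with open(model_path, "rb") as model_file:
    response = requests.post(
        f"{host}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": (model_path, model_file, "application/octet-stream"),
        },
    )
print(response.json())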

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.

# Retrieve the model
vllm_model = wl.get_model("your-model-name")

Upload Example for Custom Model Runtime via the MLOps API

The following example demonstrates uploading a Custom vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.

# define the input and output parameters in Apache pyarrow format

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])

# convert the input and output schemas to base64 for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128

curl --progress-bar -X POST \
   -H "Content-Type: multipart/form-data" \
   -H "Authorization: Bearer <your-auth-token-here>" \
   -F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
   -F "file=@<file path to custom vllm runtime>" \
   https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.

# Retrieve the model
custom_model = wl.get_model("your-model-name")

Configure OpenAI API Compatibility

OpenAI API compatibility is applied to either native or custom vLLM runtimes via the Wallaroo SDK, either during or after the LLM is uploaded to Wallaroo.

The model’s configure method accepts the following parameter.

Parameter | Type | Description
openai_config | wallaroo.openai_config.OpenaiConfig (Default: None) | Sets the OpenAI API configuration options. Import with from wallaroo.openai_config import OpenaiConfig.

The class wallaroo.openai_config.OpenaiConfig includes the following main parameters. The essential one is enabled - if OpenAI compatibility is not enabled, all other parameters are ignored.

Parameter | Type | Description
enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False.
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests.
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat/completion parameters are available except stream; the stream parameter is only set at inference requests.

Configure OpenAI API Compatibility Example

The following example demonstrates enabling and applying an OpenAI configuration to an uploaded native vLLM runtime and Custom Model runtime. Note that the OpenAI configuration is the same for either runtime.

# enable OpenAI compatibility and set the `completion_config` and `chat_completion_config` parameters
from wallaroo.openai_config import OpenaiConfig

openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
        {% for message in messages %}
            {% if message['role'] == 'user' %}
                {{ '<|user|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'system' %}
                {{ '<|system|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'assistant' %}
                {{ '<|assistant|>\n'  + message['content'] + eos_token }}
            {% endif %}
            
            {% if loop.last and add_generation_prompt %}
                {{ '<|assistant|>' }}
            {% endif %}
        {% endfor %}"""
    })

# apply the OpenAI config to the native vLLM runtime
vllm_model_openai_configured = vllm_model.configure(openai_config=openai_config)

# apply the OpenAI config to the Custom Model runtime
custom_model_openai_configured = custom_model.configure(openai_config=openai_config)

Deploy LLM with OpenAI Compatibility

Once OpenAI compatibility is enabled for either the native vLLM runtime or the Wallaroo Custom Model runtime, the models are deployed through the following process.

  • Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
  • Create a Wallaroo pipeline and add the model(s) as pipeline steps.
    • Inference inputs are submitted to the first model step, with their output submitted to the next model step, until the final model step output is returned.
  • Deploy the Wallaroo pipeline with the deployment configuration.

Deployment Configuration for LLMs

The deployment configuration sets what resources are allocated for model use. For this example, the native vLLM runtime with OpenAI compatibility enabled is allocated:

  • 1 CPU
  • 8 Gi RAM
  • 1 GPU

The specific GPU type is inherited from the upload_model accel parameter; the deployment_label sets which node to use, ensuring the GPU hardware inherited from the model is available to the model on deployment.

native_deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(vllm_model_openai_configured, 1) \
    .sidekick_memory(vllm_model_openai_configured, '8Gi') \
    .sidekick_gpus(vllm_model_openai_configured, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

The following example shows the deployment configuration for deploying both a Custom Model runtime and native vLLM runtime with OpenAI compatibility enabled. In this example, the deployment configuration allocates the following:

  • Custom Model runtime with OpenAI compatibility enabled:
    • 1 CPU
    • 2 Gi RAM
  • Native vLLM runtime with OpenAI compatibility enabled:
    • 1 CPU
    • 8 Gi RAM
    • 1 GPU

custom_deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(custom_model_openai_configured, 1) \
    .sidekick_memory(custom_model_openai_configured, '2Gi') \
    .sidekick_cpus(vllm_model_openai_configured, 1) \
    .sidekick_memory(vllm_model_openai_configured, '8Gi') \
    .sidekick_gpus(vllm_model_openai_configured, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

Create Wallaroo Pipeline and Deploy

Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM.

The following demonstrates creating a Wallaroo pipeline, and adding the native vLLM with OpenAI compatibility enabled as a pipeline step. Once set, the pipeline is deployed with the defined deployment configuration.

# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-openai-enabled-pipeline')

# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model_openai_configured)

# deploy with the deployment configuration

vllm_pipeline.deploy(deployment_config=native_deployment_config)

The following demonstrates creating a Wallaroo pipeline, and adding first the Custom Model with OpenAI compatibility, then the native vLLM with OpenAI compatibility enabled. In this scenario, the outputs from the Custom Model are the inputs for the LLM, with the final model step outputs returned as the inference output.

# create the pipeline
custom_with_vllm_pipeline = wl.build_pipeline('sample-custom-openai-enabled-pipeline')

# add the custom model and LLM as a pipeline model steps
custom_with_vllm_pipeline.add_model_step(custom_model_openai_configured)
custom_with_vllm_pipeline.add_model_step(vllm_model_openai_configured)

# deploy with the deployment configuration
custom_with_vllm_pipeline.deploy(deployment_config=custom_deployment_config)

Once deployment is complete, inference requests are accepted via either the Wallaroo SDK or the pipeline’s OpenAI API client endpoint.
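
As an illustration, the following is a minimal sketch that uses the standard openai Python client against a deployed pipeline's OpenAI-compatible endpoint with token streaming enabled. The base_url, api_key, model name, and prompt shown here are hypothetical placeholders rather than values issued by Wallaroo; retrieve the actual inference endpoint and authentication token for your deployment, and see Inference via OpenAI Compatibility Deployments for the supported request methods.

from openai import OpenAI

# hypothetical endpoint and token; substitute the deployed pipeline's
# OpenAI-compatible inference endpoint and a valid Wallaroo bearer token
client = OpenAI(
    base_url="https://example.wallaroo.ai/v1/api/pipelines/infer/sample-vllm-openai-enabled-pipeline/openai/v1",
    api_key="abc123",
)

# chat/completion request with token streaming; the stream parameter
# is only set at inference request time
response = client.chat.completions.create(
    model="sample-vllm-openai-enabled-pipeline",
    messages=[{"role": "user", "content": "Summarize what OpenAI compatibility provides."}],
    stream=True,
)

# print tokens as they arrive
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)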

How to Publish for Edge Deployment

Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:

  • The native vLLM runtimes or custom models with OpenAI compatibility enabled.
  • If specified, the deployment configuration.
  • The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.

Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.

For more details, see Edge and Multi-cloud Pipeline Publish.

The following example demonstrates publishing a Wallaroo pipeline with a native vLLM runtime.

pipeline.publish(deployment_config=native_deployment_config)

Inference OpenAI Compatibility Enabled Deployments

For details on how to perform inference requests on OpenAI Compatibility Enabled deployments, see Inference via OpenAI Compatibility Deployments.