Deploy LLMs with OpenAI Compatibility

Wallaroo provides OpenAI compatibility for improved interactive user experiences with LLM-based applications while taking advantage of Wallaroo’s ability to maximize throughput and optimize latency. AI developers can seamlessly migrate their applications from OpenAI endpoints to Wallaroo on-prem endpoints, in connected and air-gapped environments, without losing any functionality.

Wallaroo supports deploying LLMs with OpenAI compatibility. This provides developers and data scientists an easy migration path for their existing OpenAI API deployments while leveraging Wallaroo’s resource optimization to improve user experience and reduce latency and costs.

Wallaroo OpenAI compatibility supports the following options:

  • Token Streaming: Wallaroo supports the OpenAI API token streaming methods, available through either the Wallaroo SDK or the Wallaroo OpenAI API inference methods for completion and chat/completion.
  • AI Acceleration: Deploy LLMs with token streaming with NVIDIA CUDA or Qualcomm Cloud AI acceleration.
  • Continuous Batching: Wallaroo Continuous Batching provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

How to Configure LLM Deployment with OpenAI API Compatibility

LLM deployment with OpenAI API compatibility is applied through the following process.

  • Models are uploaded to Wallaroo either in the native vLLM or Custom Model frameworks.
    • Any AI acceleration settings are applied at model upload.
  • OpenAI API compatibility is enabled in the model configuration either during or after model upload.
    • Once enabled, all OpenAI API parameters for completion and chat/completion are available except stream; the stream parameter is only set at inference request time.
  • LLMs are deployed with resource configurations that allocate resources to the LLM’s exclusive use (cpus, memory, gpus, etc).
  • Inference requests with the OpenAI API are made either by:
    • OpenAI API clients through the deployed LLM’s inference endpoints.
    • The Wallaroo SDK through OpenAI specific methods.
  • Inference requests can override the model’s OpenAI configuration to allow fine-tuning of inference request parameters.

Upload LLM to Wallaroo

The following examples demonstrate uploading an LLM to Wallaroo in either the Wallaroo Native vLLM runtime or the Wallaroo Custom Model runtime. Note that at this phase OpenAI API compatibility is not defined - that is done at the Configure OpenAI API Compatibility step.

Upload LLMs Via the Wallaroo SDK

LLMs are uploaded via the Wallaroo SDK through the following method:

  • Define the model upload parameters with the wallaroo.client.Client.upload_model method.
    • (Optional) Set the upload_model parameter framework_config to specify any vLLM options to increase performance. If no options are specified, the default values are applied.

The method wallaroo.client.Client.upload_model takes the following parameters:

Parameter | Type | Description
name | String (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model.
path | String (Required) | The path to the model file being uploaded.
framework | String (Required) | The framework of the model from wallaroo.framework.Framework. For native vLLM, this framework is wallaroo.framework.Framework.VLLM. For custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM.
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]).
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]).
framework_config | wallaroo.framework.VLLMConfig OR wallaroo.framework.CustomConfig (Optional) | Sets the vLLM framework configuration options.
accel | wallaroo.engine_config.Acceleration (Optional) | The optional AI hardware accelerator used. The following options are supported for OpenAI compatibility: wallaroo.engine_config.Acceleration.QAIC (Qualcomm Cloud AI) and wallaroo.engine_config.Acceleration.CUDA (NVIDIA CUDA).
convert_wait | Bool (Optional) | If True, waits in the script for the model conversion to complete. If False, proceeds with the script without waiting for the model conversion process to display complete.

The framework configuration must match the appropriate runtimes:

Runtime | Framework Config
wallaroo.framework.Framework.VLLM | wallaroo.framework.VLLMConfig
wallaroo.framework.Framework.CUSTOM | wallaroo.framework.CustomConfig

wallaroo.framework.VLLMConfig and wallaroo.framework.CustomConfig contain the following parameters. If no modifications are made at model upload, the default values are applied.

Parameter | Type
max_num_seqs | Integer (Default: 256)
max_model_len | Integer (Default: None)
max_seq_len_to_capture | Integer (Default: 8192)
quantization | (Default: None)
kv_cache_dtype | (Default: 'auto')
gpu_memory_utilization | Float (Default: 0.9)
block_size | (Default: None)
device_group | (Default: None) This setting is ignored for CUDA acceleration.

Upload Example for Native vLLM Frameworks via the Wallaroo SDK

The following demonstrates uploading a Native vLLM Runtime with a framework configuration via the Wallaroo SDK.

# (Optional) set the VLLMConfig values
# If no framework configuration value is set, the default values are applied

standard_framework_config = wallaroo.framework.VLLMConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture, 
    quantization=quantization, 
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the vLLM model with the framework configuration values
vllm_model = wl.upload_model(model_name,
                              model_file_name,
                              framework=wallaroo.framework.Framework.VLLM,
                              input_schema=input_schema,
                              output_schema=output_schema,
                              framework_config=standard_framework_config,
                              accel=accel
                            )

Upload Example for Custom Frameworks via the Wallaroo SDK

The following demonstrates uploading a Custom Model Runtime with a framework configuration via the Wallaroo SDK. Typically these models are uploaded to provide additional functionality for the native vLLM runtime model, such as Retrieval-Augmented Generation (RAG), monitoring listeners, etc. The Custom Model Runtime is then deployed in the same Wallaroo pipeline as the native vLLM runtime.

The following example demonstrates uploading a Custom Framework model to Wallaroo through the Wallaroo SDK. Note that, unlike the LLM, no acceleration value is set - in this example, the LLM uses acceleration while the Custom Model does not require it.

# (Optional) set the CustomConfig values
# If no framework configuration value is set, the default values are applied

custom_framework_config = wallaroo.framework.CustomConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture, 
    quantization=quantization, 
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the custom model with the framework configuration values
custom_model = wl.upload_model(model_name,
                              model_file_name,
                              framework=wallaroo.framework.Framework.CUSTOM,
                              input_schema=input_schema,
                              output_schema=output_schema,
                              framework_config=custom_framework_config
                            )

Upload LLMs Via the Wallaroo MLOps API

Models are uploaded via the Wallaroo MLOps API through the following endpoint:

  • /v1/api/models/upload_and_convert

This endpoint accepts the following parameters.

Field | Type | Description
name | String (Required) | The model name.
visibility | String (Required) | Either public or private.
workspace_id | Integer (Required) | The numerical ID of the workspace to upload the model to.
conversion | Dict (Required) | The conversion parameters, which include the following:
  framework | String (Required) | The framework of the model being uploaded. For Native vLLM runtimes, this value is vllm. For Custom vLLM runtimes, this value is custom.
  accel | String OR Dict (Optional) | The AI accelerator used. For continuous batching, supported types are cuda and qaic. If using qaic, this parameter is either a string to use the default parameters, or a Dict to set hardware acceleration parameters. For more details, see LLM Inference with Qualcomm QAIC.
  python_version | String (Required) | The version of Python required for the model. For Native and Custom vLLM frameworks, this value is 3.8.
  requirements | String (Required) | Required libraries. For Native and Custom vLLM frameworks, this value is [].
  framework_config | Dict (Optional) | The framework configuration. See the framework_config parameters below for further details.
input_schema | String (Optional) | The input schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode.
output_schema | String (Optional) | The output schema from the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode.

The framework_config parameter accepts the following parameters.

Field | Type | Description
config | Dict | The framework configuration values. The parameters below are members of the config field.
  max_num_seqs | Integer (Default: 256)
  max_model_len | Integer (Default: None)
  max_seq_len_to_capture | Integer (Default: 8192)
  quantization | (Default: None)
  kv_cache_dtype | (Default: 'auto')
  gpu_memory_utilization | Float (Default: 0.9)
  block_size | (Default: None)
  device_group | (Default: None) This setting is ignored for CUDA acceleration.
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm". For Custom vLLM frameworks, this value is "custom".

Upload Example for Native vLLM Runtime via the MLOps API

The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API.

import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
# the input and output schemas are ignored for OpenAI compatible LLMs, so only an empty set is needed
input_schema = pa.schema([])
output_schema = pa.schema([])

# convert the input and output schemas to base64 for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
# acceleration = CUDA

curl --progress-bar \
    -X POST \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer abc123" \
    -F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "cuda", "framework": "vllm", "framework_config": {"config": {"gpu_memory_utilization": 0.9, "kv_cache_dtype": "auto", "max_model_len": 128, "max_num_seqs": 256, "max_seq_len_to_capture": 8192, "quantization": "none"}, "framework": "vllm"}, "python_version": "3.8", "requirements": []}, "input_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA=", "output_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA="};type=application/json' -F "file=@<file path to vllm>;type=application/octet-stream" \
    https://example.wallaroo.ai/v1/api/models/upload_and_convert
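
For reference, the same upload can be made from Python with the requests library. The following is a minimal sketch assuming the endpoint, bearer token, workspace ID, and metadata payload from the curl example above; encoded_input_schema and encoded_output_schema are the base64-encoded schemas produced in the earlier encoding step, and the host, token, and model path values are placeholders.

import json
import requests

# placeholder values; substitute your Wallaroo hostname, token, and model file path
host = "https://example.wallaroo.ai"
token = "abc123"
model_path = "<file path to vllm>"

metadata = {
    "name": "<your-model-name>",
    "visibility": "private",
    "workspace_id": 6,
    "conversion": {
        "arch": "x86",
        "accel": "cuda",
        "framework": "vllm",
        "framework_config": {
            "config": {"gpu_memory_utilization": 0.9, "max_model_len": 128},
            "framework": "vllm",
        },
        "python_version": "3.8",
        "requirements": [],
    },
    "input_schema": encoded_input_schema,
    "output_schema": encoded_output_schema,
}

# multipart/form-data request matching the curl example above
with open(model_path, "rb") as model_file:
    response = requests.post(
        f"{host}/v1/api/models/upload_and_convert",
        headers={"Authorization": f"Bearer {token}"},
        files={
            "metadata": (None, json.dumps(metadata), "application/json"),
            "file": (model_path, model_file, "application/octet-stream"),
        },
    )
print(response.json())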

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.

# Retrieve the model
vllm_model = wl.get_model("your-model-name")

Upload Example for Custom Model Runtime via the MLOps API

The following example demonstrates uploading a Custom vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.

# define the input and output parameters in Apache pyarrow format

input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])

# convert the input and output schemas to base64 for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")

encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128

curl --progress-bar -X POST \
   -H "Content-Type: multipart/form-data" \
   -H "Authorization: Bearer <your-auth-token-here>" \
   -F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
   -F "file=@<file path to custom vllm runtime>" \
   https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat

The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.

# Retrieve the model
custom_model = wl.get_model("your-model-name")

Configure OpenAI API Compatibility

OpenAI API compatibility is applied to either native or custom vLLM runtimes via the Wallaroo SDK, either during or after the LLM is uploaded to Wallaroo.

The model’s configure method accepts the following parameter.

Parameter | Type | Description
openai_config | wallaroo.openai_config.OpenaiConfig (Default: None) | Sets the OpenAI API configuration options. Import with from wallaroo.openai_config import OpenaiConfig.

The class wallaroo.openai_config.OpenaiConfig includes the following main parameters. The essential one is enabled - if OpenAI compatibility is not enabled, all other parameters are ignored.

Parameter | Type | Description
enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False.
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests.
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat/completion parameters are available except stream; the stream parameter is only set at inference requests.

Configure OpenAI API Compatibility Example

The following example demonstrates enabling and applying an OpenAI configuration to an uploaded native vLLM runtime and Custom Model runtime. Note that the OpenAI configuration is the same for either runtime.

# enable OpenAI compatibility and set the `completion_config` and `chat_completion_config` parameters
from wallaroo.openai_config import OpenaiConfig

openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": .3,
        "max_tokens": 200
    },
    chat_completion_config={
        "temperature": .3,
        "max_tokens": 200,
        "chat_template": """
        {% for message in messages %}
            {% if message['role'] == 'user' %}
                {{ '<|user|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'system' %}
                {{ '<|system|>\n' + message['content'] + eos_token }}
            {% elif message['role'] == 'assistant' %}
                {{ '<|assistant|>\n'  + message['content'] + eos_token }}
            {% endif %}
            
            {% if loop.last and add_generation_prompt %}
                {{ '<|assistant|>' }}
            {% endif %}
        {% endfor %}"""
    })

# apply the OpenAI config to the native vLLM runtime
vllm_model_openai_configured = vllm_model.configure(openai_config=openai_config)

# apply the OpenAI config to the Custom Model runtime
custom_model_openai_configured = custom_model.configure(openai_config=openai_config)

Deploy LLM with OpenAI Compatibility

Once OpenAI compatibility is enabled for either the native vLLM runtime or the Wallaroo Custom Model runtime, the models are deployed through the following process.

  • Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
  • Create a Wallaroo pipeline and add the model(s) as pipeline steps.
    • Inference inputs are submitted to the first model step, with their output submitted to the next model step, until the final model step output is returned.
  • Deploy the Wallaroo pipeline with the deployment configuration.

Deployment Configuration for LLMs

The deployment configuration sets what resources are allocated for model use. For this example, the native vLLM runtime with OpenAI compatibility enabled is allocated:

  • 1 CPU
  • 8 Gi RAM
  • 1 GPU

The specific GPU type is inherited from the upload_model accel parameter; the deployment_label sets which node to use, ensuring the GPU hardware inherited from the model is available to the model on deployment.

native_deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(vllm_model_openai_configured, 1) \
    .sidekick_memory(vllm_model_openai_configured, '8Gi') \
    .sidekick_gpus(vllm_model_openai_configured, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

The following example shows the deployment configuration for deploying both a Custom Model runtime and native vLLM runtime with OpenAI compatibility enabled. In this example, the deployment configuration allocates the following:

  • Custom Model runtime with OpenAI compatibility enabled:
    • 1 CPU
    • 2 Gi RAM
  • Native vLLM runtime with OpenAI compatibility enabled:
    • 1 CPU
    • 8 Gi RAM
    • 1 GPU

custom_deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(custom_model_openai_configured, 1) \
    .sidekick_memory(custom_model_openai_configured, '2Gi') \
    .sidekick_cpus(vllm_model_openai_configured, 1) \
    .sidekick_memory(vllm_model_openai_configured, '8Gi') \
    .sidekick_gpus(vllm_model_openai_configured, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

Create Wallaroo Pipeline and Deploy

Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM.

The following demonstrates creating a Wallaroo pipeline, and adding the native vLLM with OpenAI compatibility enabled as a pipeline step. Once set, the pipeline is deployed with the defined deployment configuration.

# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-openai-enabled-pipeline')

# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model_openai_configured)

# deploy with the deployment configuration

vllm_pipeline.deploy(deployment_config=native_deployment_config)

The following demonstrates creating a Wallaroo pipeline, and adding first the Custom Model with OpenAI compatibility, then the native vLLM with OpenAI compatibility enabled. In this scenario, the outputs from the Custom Model are the inputs for the LLM, with the final model step outputs returned as the inference output.

# create the pipeline
custom_with_vllm_pipeline = wl.build_pipeline('sample-custom-openai-enabled-pipeline')

# add the custom model and LLM as a pipeline model steps
custom_with_vllm_pipeline.add_model_step(custom_model_openai_configured)
custom_with_vllm_pipeline.add_model_step(vllm_model_openai_configured)

# deploy with the deployment configuration
custom_with_vllm_pipeline.deploy(deployment_config=custom_deployment_config)

Once deployment is complete, inference requests are accepted via either the Wallaroo SDK or the pipeline’s OpenAI API client endpoint.
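
As an illustration, the following is a minimal sketch that uses the standard openai Python client against a deployed pipeline's OpenAI-compatible endpoint with token streaming enabled. The base_url, api_key, model name, and prompt shown here are hypothetical placeholders rather than values issued by Wallaroo; retrieve the actual inference endpoint and authentication token for your deployment, and see Inference via OpenAI Compatibility Deployments for the supported request methods.

from openai import OpenAI

# hypothetical endpoint and token; substitute the deployed pipeline's
# OpenAI-compatible inference endpoint and a valid Wallaroo bearer token
client = OpenAI(
    base_url="https://example.wallaroo.ai/v1/api/pipelines/infer/sample-vllm-openai-enabled-pipeline/openai/v1",
    api_key="abc123",
)

# chat/completion request with token streaming; the stream parameter
# is only set at inference request time
response = client.chat.completions.create(
    model="sample-vllm-openai-enabled-pipeline",
    messages=[{"role": "user", "content": "Summarize what OpenAI compatibility provides."}],
    stream=True,
)

# print tokens as they arrive
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)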

How to Publish for Edge Deployment

Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:

  • The native vLLM runtimes or custom models with OpenAI compatibility enabled.
  • If specified, the deployment configuration.
  • The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.

Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.

For more details, see Edge and Multi-cloud Pipeline Publish.

The following example demonstrates publishing a Wallaroo pipeline with a native vLLM runtime.

pipeline.publish(deployment_config=native_deployment_config)

Inference OpenAI Compatibility Enabled Deployments

For details on how to perform inference requests on OpenAI Compatibility Enabled deployments, see Inference via OpenAI Compatibility Deployments.