Deploy LLMs with OpenAI Compatibility
Wallaroo supports deploying LLMs with OpenAI compatibility. This provides developers and data scientists an easy migration path for their existing OpenAI API deployments while leveraging Wallaroo’s resource optimization to improve user experience and reduce latency and costs.
The following options are supported with Wallaroo OpenAI compatibility:
- Token Streaming: Wallaroo supports the OpenAI API token streaming methods. This is supported either through the Wallaroo SDK or through the Wallaroo OpenAI API inference methods for completion and chat/completion.
- AI Acceleration: Deploy LLMs with token streaming with NVIDIA CUDA or Qualcomm Cloud AI acceleration.
- Continuous Batching: Wallaroo Continuous Batching provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:
- wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
- wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. These are typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.
How to Configure LLM Deployment with OpenAI API Compatibility
LLM deployment with OpenAI API compatibility is applied through the following process; a condensed end-to-end sketch follows this list.
- Models are uploaded to Wallaroo either in the native vLLM or Custom Model frameworks.
- Any AI acceleration settings are applied at model upload.
- OpenAI API compatibility is enabled in the model configuration either during or after model upload.
  - Once enabled, all OpenAI API parameters for completion and chat/completion are available except stream; the stream parameter is only set at inference request time.
- LLMs are deployed with resource configurations that allocate resources to the LLM’s exclusive use (CPUs, memory, GPUs, etc.).
- Inference requests with the OpenAI API are made either by:
- OpenAI API clients through the deployed LLM’s inference endpoints.
- The Wallaroo SDK through OpenAI specific methods.
- Inference requests override the model’s OpenAI configuration to allow fine-tuning of inference request parameters.
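As a quick orientation, the following condensed sketch strings these steps together with the Wallaroo SDK; the model name, file path, and pipeline name are placeholders, and each step is covered in detail in the sections below.
import wallaroo
import pyarrow as pa
from wallaroo.openai_config import OpenaiConfig

wl = wallaroo.Client()

# 1. Upload the LLM in the native vLLM framework (framework_config and accel are optional)
llm = wl.upload_model(
    "sample-llm",                                  # placeholder model name
    "./sample-llm.zip",                            # placeholder model file path
    framework=wallaroo.framework.Framework.VLLM,
    input_schema=pa.schema([]),                    # ignored for OpenAI compatible LLMs
    output_schema=pa.schema([])
)

# 2. Enable OpenAI API compatibility on the model configuration
llm = llm.configure(openai_config=OpenaiConfig(enabled=True))

# 3. Deploy the LLM in a pipeline with resources dedicated to the model
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .sidekick_cpus(llm, 1) \
    .sidekick_memory(llm, "8Gi") \
    .sidekick_gpus(llm, 1) \
    .build()

pipeline = wl.build_pipeline("sample-openai-pipeline")
pipeline.add_model_step(llm)
pipeline.deploy(deployment_config=deployment_config)

# 4. Submit an OpenAI style inference request through the Wallaroo SDK
result = pipeline.openai_completion(prompt="Hello!", max_tokens=20)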
Upload LLM to Wallaroo
The following examples demonstrate uploading an LLM to Wallaroo in either the Wallaroo Native vLLM runtime or the Wallaroo Custom Model runtime. Note that at this phase OpenAI API compatibility is not defined; that is done at the Configure OpenAI API Compatibility step.
Upload LLMs Via the Wallaroo SDK
LLMs with OpenAI compatibility are uploaded via the Wallaroo SDK through the following steps:
- Define the model upload parameters with the wallaroo.client.Client.upload_model method.
- (Optional) Set the upload_model parameter framework_config to specify any vLLM options to increase performance. If no options are specified, the default values are applied.
The method wallaroo.client.Client.upload_model takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework. For native vLLM, this framework is wallaroo.framework.Framework.VLLM. For custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]). |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]). |
framework_config | wallaroo.framework.VLLMConfig OR wallaroo.framework.CustomConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration (Optional) | The optional AI hardware accelerator used. The options supported for OpenAI compatibility are NVIDIA CUDA (cuda) and Qualcomm Cloud AI (qaic). |
convert_wait | bool (Optional) | Whether to wait for the model packaging and conversion process to complete before returning. |
The framework configuration must match the appropriate runtimes:
Runtime | Framework Config |
---|---|
wallaroo.framework.Framework.VLLM | wallaroo.framework.VLLMConfig |
wallaroo.framework.Framework.CUSTOM | wallaroo.framework.CustomConfig |
wallaroo.framework.VLLMConfig and wallaroo.framework.CustomConfig contain the following parameters. If no modifications are made at model upload, the default values are applied.
Parameter | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | (Default: None) |
kv_cache_dtype | (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | (Default: None) |
device_group | (Default: None) This setting is ignored for CUDA acceleration. |
Upload Example for Native vLLM Frameworks via the Wallaroo SDK
The following demonstrates uploading a Native vLLM Runtime with a framework configuration via the Wallaroo SDK.
# (Optional) set the VLLMConfig values
# If no framework configuration value is set, the default values are applied
standard_framework_config = wallaroo.framework.VLLMConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture,
    quantization=quantization,
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the vLLM model with the framework configuration values
vllm_model = wl.upload_model(
    model_name,
    model_file_name,
    framework=wallaroo.framework.Framework.VLLM,
    input_schema=input_schema,
    output_schema=output_schema,
    framework_config=standard_framework_config,
    accel=accel
)
Upload Example for Custom Frameworks via the Wallaroo SDK
The following demonstrates uploading a Custom Model Runtime with a framework configuration via the Wallaroo SDK. Typically these models are uploaded to provide additional functionality for the native vLLM runtime model. For example, providing Retrieval-Augmented Generation LLMs (RAG), monitoring listeners, etc. The Custom Model Runtime is then deployed in the same Wallaroo pipeline as the native vLLM runtime.
The following example demonstrates uploading a Custom Framework model to Wallaroo through the Wallaroo SDK. Note that, unlike the LLM, no acceleration value is set; in this example, the LLM uses acceleration while the Custom Model does not require it.
# (Optional) set the CustomConfig values
# If no framework configuration value is set, the default values are applied
custom_framework_config = wallaroo.framework.CustomConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture,
    quantization=quantization,
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the custom model with the framework configuration values
custom_model = wl.upload_model(
    model_name,
    model_file_name,
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
    framework_config=custom_framework_config
)
Upload LLMs Via the Wallaroo MLOps API
Models are uploaded via the Wallaroo MLOps API through the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description |
---|---|---|
name | String (Required) | The model name. |
visibility | String (Required) | Either public or private. |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. |
conversion | String (Required) | The conversion parameters that include the following: |
framework | String (Required) | The framework of the model being uploaded. For Native vLLM runtimes, this value is vllm. For Custom vLLM runtimes, this value is custom. |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. Supported types are cuda and qaic. If using qaic, this parameter is either a string to use the default parameters, or a Dict for hardware acceleration parameters. For more details, see LLM Inference with Qualcomm QAIC. |
python_version | String (Required) | The version of Python required for the model. For Native and Custom vLLM frameworks, this value is 3.8. |
requirements | String (Required) | Required libraries. For Native and Custom vLLM frameworks, this value is []. |
framework_config | Dict (Optional) | The framework configuration. See the framework_config parameters below for further details. |
input_schema | String (Optional) | The input schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. |
output_schema | String (Optional) | The output schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. |
The framework_config parameter accepts the following parameters.
Field | Type | Description |
---|---|---|
config | Dict | The framework configuration values. The following fields are parameters of the config field. |
max_num_seqs | Integer (Default: 256) | |
max_model_len | Integer (Default: None) | |
max_seq_len_to_capture | Integer (Default: 8192) | |
quantization | (Default: None) | |
kv_cache_dtype | (Default: 'auto') | |
gpu_memory_utilization | Float (Default: 0.9) | |
block_size | (Default: None) | |
device_group | (Default: None) | This setting is ignored for CUDA acceleration. |
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm". For Custom vLLM frameworks, this value is "custom". |
Upload Example for Native vLLM Runtime via the MLOps API
The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API.
# import the required libraries
import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
# the input and output schemas are ignored for OpenAI compatible LLMs, so only an empty set is needed
input_schema = pa.schema([])
output_schema = pa.schema([])

# convert the input and output schemas to base64 strings for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
# acceleration = CUDA
curl --progress-bar \
-X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer abc123" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "cuda", "framework": "vllm", "framework_config": {"config": {"gpu_memory_utilization": 0.9, "kv_cache_dtype": "auto", "max_model_len": 128, "max_num_seqs": 256, "max_seq_len_to_capture": 8192, "quantization": "none"}, "framework": "vllm"}, "python_version": "3.8", "requirements": []}, "input_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA=", "output_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA="};type=application/json' \
-F "file=@<file path to vllm>;type=application/octet-stream" \
https://example.wallaroo.ai/v1/api/models/upload_and_convert
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.
# Retrieve the model
vllm_model = wl.get_model("<your-model-name>")
Upload Example for Custom Model Runtime via the MLOps API
The following example demonstrates uploading a Custom vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.
# define the input and output parameters in Apache pyarrow format
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# convert the input and output schemas to base64 strings for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-auth-token-here>" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<file path to custom vllm runtime>" \
https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.
# Retrieve the model
custom_model = wl.get_model("<your-model-name>")
Configure OpenAI API Compatibility
OpenAI API compatibility is applied to either native or custom vLLM runtimes via the Wallaroo SDK, either during or after the LLM is uploaded to Wallaroo.
The model's configure method accepts the following parameter.
Parameter | Type | Description |
---|---|---|
openai_config | wallaroo.openai_config.OpenaiConfig (Default: None) | Sets the OpenAI API configuration options. |
The class wallaroo.openai_config.OpenaiConfig includes the following main parameters. The essential one is enabled; if OpenAI compatibility is not enabled, all other parameters are ignored.
Parameter | Type | Description |
---|---|---|
enabled | Boolean (Default: False) | If True , OpenAI compatibility is enabled. If False , OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False . |
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream ; the stream parameter is only set at inference requests. |
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All completion parameters are available except stream ; the stream parameter is only set at inference requests. |
Configure OpenAI API Compatibility Example
The following example demonstrates enabling and applying an OpenAI configuration to an uploaded native vLLM runtime and Custom Model runtime. Note that the OpenAI configuration is the same for either runtime.
# import the OpenAI configuration class
from wallaroo.openai_config import OpenaiConfig

# enables the OpenAI compatibility and sets `completion_config` and `chat_completion_config` parameters.
openai_config = OpenaiConfig(
enabled=True,
completion_config={
"temperature": .3,
"max_tokens": 200
},
chat_completion_config={
"temperature": .3,
"max_tokens": 200,
"chat_template": """
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>\n' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}"""
})
# enable openai config Native vLLM runtime
vllm_model_openai_configured = vllm_model.configure(openai_config=openai_config)
# enable openai config Custom Model runtime
custom_model_openai_configured = custom_model.configure(openai_config=openai_config)
Deploy LLM with OpenAI Compatibility
Once the OpenAI compatibility for either native vLLM runtimes or Wallaroo Custom Model runtimes is enabled, they are deployed through the following process.
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the model(s) as pipeline steps.
- Inference inputs are submitted to the first model step, with the output submitted to the next model step, until the final model step's output is returned.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration for LLMs
The deployment configuration sets what resources are allocated for model use. For this example, the native vLLM runtime with OpenAI compatibility enabled is allocated:
- 1 CPU
- 8 Gi RAM
- 1 GPU
The specific GPU type is inherited from the upload_model accel parameter; the deployment_label selects a node with that GPU hardware so it is available to the model on deployment.
native_deployment_config = wallaroo.DeploymentConfigBuilder() \
.replica_count(1) \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(vllm_model_openai_configured, 1) \
.sidekick_memory(vllm_model_openai_configured, '8Gi') \
.sidekick_gpus(vllm_model_openai_configured, 1) \
.deployment_label('wallaroo.ai/accelerator:l4') \
.build()
The following example shows the deployment configuration for deploying both a Custom Model runtime and native vLLM runtime with OpenAI compatibility enabled. In this example, the deployment configuration allocates the following:
- Custom Model runtime with OpenAI compatibility enabled:
  - 1 CPU
  - 2 Gi RAM
- Native vLLM runtime with OpenAI compatibility enabled:
  - 1 CPU
  - 8 Gi RAM
  - 1 GPU
custom_deployment_config = wallaroo.DeploymentConfigBuilder() \
.replica_count(1) \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(custom_model_openai_configured, 1) \
.sidekick_memory(custom_model_openai_configured, '2Gi') \
.sidekick_cpus(vllm_model_openai_configured, 1) \
.sidekick_memory(vllm_model_openai_configured, '8Gi') \
.sidekick_gpus(vllm_model_openai_configured, 1) \
.deployment_label('wallaroo.ai/accelerator:l4') \
.build()
Create Wallaroo Pipeline and Deploy
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and adding the native vLLM with OpenAI compatibility enabled as a pipeline step. Once set, the pipeline is deployed with the defined deployment configuration.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-openai-enabled-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model_openai_configured)
# deploy with the deployment configuration
vllm_pipeline.deploy(deployment_config=native_deployment_config)
The following demonstrates creating a Wallaroo pipeline, and adding first the Custom Model with OpenAI compatibility, then the native vLLM with OpenAI compatibility enabled. In this scenario, the outputs from the Custom Model are the inputs for the LLM, with the final model step outputs returned as the inference output.
# create the pipeline
custom_with_vllm_pipeline = wl.build_pipeline('sample-custom-openai-enabled-pipeline')
# add the custom model and LLM as a pipeline model steps
custom_with_vllm_pipeline.add_model_step(custom_model_openai_configured)
custom_with_vllm_pipeline.add_model_step(vllm_model_openai_configured)
# deploy with the deployment configuration
custom_with_vllm_pipeline.deploy(deployment_config=custom_deployment_config)
Once deployment is complete, inference requests are accepted via either the Wallaroo SDK or the pipeline’s OpenAI API client endpoint.
How to Publish for Edge Deployment
Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:
- The native vLLM runtimes or custom models with OpenAI compatibility enabled.
- If specified, the deployment configuration.
- The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.
Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.
For more details, see Edge and Multi-cloud Pipeline Publish.
The following example demonstrates publishing a Wallaroo pipeline with a native vLLM runtime.
pipeline.publish(deployment_config=native_deployment_config)
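The publish results can also be captured in a variable for later review of the published artifact references used for edge deployment; a minimal sketch, with pub as a placeholder variable name:
# capture the publish results to review the published artifact references
pub = pipeline.publish(deployment_config=native_deployment_config)

# display the publish details returned by Wallaroo
print(pub)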
How to Perform Inference Requests with OpenAI Compatible LLMs
Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK or via OpenAI API endpoint requests.
OpenAI API inference requests on models deployed with OpenAI compatible LLMs have the following conditions:
- Parameters for chat/completion and completion override the existing OpenAI configuration options.
- If the stream option is enabled:
  - Outputs are returned as a list of chunks, i.e. as an event stream.
  - The inference request completes when all chunks are returned.
  - The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
Inference Requests to OpenAI-compatible LLMs Deployed in Wallaroo via the Wallaroo SDK
Inference requests to OpenAI compatibility enabled models in Wallaroo via the Wallaroo SDK use the following methods:
- wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
- wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.
The following examples demonstrate performing inference requests via the different methods.
openai_chat_completion
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "good morning"}]).choices[0].message.content
"Of course! Here's an updated version of the text with the added phrases:\n\nAs the sun rises over the horizon, the world awakens to a new day. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin, casting a golden glow over everything in sight. The world awakens to a new day, a new chapter, a new beginning. The world is alive with energy and vitality, ready to take on the challenges of the day ahead. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin"
openai_chat_completion with Token Streaming
# Now with streaming
for chunk in pipeline.openai_chat_completion(messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
Once upon a time, in a small village nestled in the heart of the countryside, there lived a young woman named Lily. Lily was a kind and gentle soul, always looking out for those in need. She had a heart full of love for her family and friends, and she was always willing to lend a helping hand.
One day, Lily met a handsome young man named Jack. Jack was a charming and handsome man, with a
openai_completion
pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200).choices[0].text
Wallaroo is a comprehensive platform for building and tracking predictive models. This tool is really helpful in AI development. Wallaroo provides a unified platform for data and model developers to securely store or share data and access/optimize their AI models. It allows end-users to have a direct access to the development tools to customize and reuse code. Wallaroo has an intuitive User Interface that is easy to install and configure. Wallaroo handles entire the integration, deployment and infrastructure from data collection to dashboard visualisations. Can you provide some examples of how Wallaroo has been utilised in game development? Also, talk about the effectiveness of ML training using Wallaroo.'
openai_completion with Token Streaming
for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=300, stream=True):
print(chunk.choices[0].text, end="", flush=True)
?" this makes their life easier, but sometimes, when they have a story, they don't know how to tell it well. This frustrates them and makes their life even more difficult.
b. Relaxation:
protagonist: take a deep breath and let it out. Why not start with a song? "Eyes full of longing, I need your music to embrace." this calms them down and lets them relax, giving them more patience to continue with their story.
c. Inspirational quotes:
protagonist: this quote from might jeffries helps me reflect on my beliefs and values: "the mind is a powerful thing, it can change your destiny at any time. Fear no fear, only trust your divineline and reclaim your destiny." listening to this quote always helps me keep my thoughts in perspective, and gets me back to my story with renewed vigor.
Inference Requests to OpenAI-Compatible LLMs Deployed in Wallaroo via the OpenAI SDK
Inference requests made through the OpenAI Python SDK require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAPI API extension endpoints.
The endpoint is:
{Deployment inference endpoint}/openai/v1/
OpenAI SDK inferences on Wallaroo deployed pipelines have the following conditions:
- OpenAI inference request params apply only to their inference request, and override the parameters set at model configuration.
- Streamed outputs are returned as a list of chunks, following the default or configured max_tokens value for the associated LLM.
- The inference call completes when all chunks are streamed.
The following examples demonstrate authenticating to the deployed Wallaroo pipeline with the OpenAI SDK client against the deployment inference endpoint https://example.wallaroo.ai/v1/api/pipelines/infer/samplellm-openai-414/samplellm-openai, and the OpenAI endpoint extension /openai/v1/.
token = wl.auth.auth_header()['Authorization'].split()[1]
from openai import OpenAI
client = OpenAI(
base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/samplellm-openai-414/samplellm-openai/openai/v1',
api_key=token
)
The following demonstrates inference requests using completions and chat.completions with and without token streaming enabled.
openai.chat.completions with Token Streaming
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=1000, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
It was a warm summer evening, and the sun was setting over the horizon. A young couple, Alex and Emily, sat on a bench in the park, watching the world go by. Alex had just finished his shift at the local diner, and Emily had just finished her shift at the bookstore. They had been together for a year, and they were in love.
As they sat there, watching the world go by, they started talking about their hopes and dreams for the future. Alex talked about his dream of opening his own restaurant, and Emily talked about her dream of traveling the world. They both knew that their dreams were far from reality, but they were determined to make them come true.
As they talked, they noticed a group of children playing in the park. Alex and Emily walked over to them and asked if they needed any help. The children were excited to see someone new in the park, and they ran over to Alex and Emily, hugging them tightly.
Alex and Emily smiled at the children, feeling a sense of joy and happiness that they had never felt before. They knew that they had found something special in each other, and they were determined to make their love last.
As the night wore on, Alex and Emily found themselves lost in each other's arms. They had never felt so alive, so in love, and they knew that they would never forget this moment.
As the night came to a close, Alex and Emily stood up and walked back to their bench. They looked at each other, feeling a sense of gratitude and joy that they had never felt before. They knew that their love was worth fighting for, and they were determined to make it last.
From that moment on, Alex and Emily knew that their love was worth fighting for. They knew that they had found something special in each other, something that would last a lifetime. They knew that they would always be together, no matter what life threw their way.
And so, they sat on their bench, watching the world go by, knowing that their love was worth fighting for, and that they would always be together, no matter what the future held.
openai.chat.completions
response = client.chat.completions.create(model="", messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Thank you for admiring my writing skills! Here's an example of how to use a greeting in a sentence:
Syntax sentence: "Excuse me, but can I have a moment of your time?"
Meaning: I am a friendly and polite person who is looking for brief conversation with someone else.
The response from the person in question could be: "Sure, let me give it a try."
**Imagery sentences
openai.completions with Token Streaming
for chunk in client.completions.create(model="", prompt="tell me a short story", max_tokens=100, stream=True):
print(chunk.choices[0].text, end="", flush=True)
Authors have written ingenious stories about small country people, small towns, and small lifestyles. Here’s one that is light and entertaining:
Title: The Big Cheese of High School
Episode One: Annabelle, a sophomore in high school, has just missed the kiss-off that seemed like just a hiccup. However, Annabelle is a gentle soul, and her pendulum swings further out of control
openai.completions
client.completions.create(model="", prompt="tell me a short story", max_tokens=100).choices[0].text
" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki's cylinder isn't meaningful anymore; remember that Loki is the lying one!\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki's comedy presence - emphasize the exaggerated quality of Imogen's hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"
Inference Requests to OpenAI-Compatible LLMs Deployed in Wallaroo via Wallaroo Inference OpenAI Endpoints
Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests from OpenAI API clients via the pipeline's deployment inference endpoint with the OpenAI API endpoint extensions.
These requests require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAPI API extension endpoints.
For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:
- {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
- {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.
These endpoints use the OpenAI API parameters with the following conditions:
- Any OpenAI model customizations made are overridden by any parameters in the inference request.
- The stream parameter is available to provide token streaming.
When the stream option is enabled to provide token streaming, the following apply:
- Outputs are returned asynchronously as a list of chunks, i.e. as an event stream.
- The API inference call completes when all chunks are returned.
- The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
chat/completion Inference Token Streaming Example
The following demonstrates using the OpenAI API compatible endpoint chat/completions on a pipeline deployed in Wallaroo with token streaming enabled.
Note that the response when token streaming is enabled is returned asynchronously as a list of chunks.
curl -X POST \
-H "Authorization: Bearer abcdefg" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/chat/completions
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"I"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" am"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" ec"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"lect"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ic"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ils"}}],"usage":null}
...
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":","}}],"usage":null}
data: [DONE]
chat/completion Inference Example
The following example demonstrates performing an inference request using the deployed pipeline’s chat/completion
endpoint without token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}
completions Inference with Token Streaming Example
The following example demonstrates performing an inference request using the deployed pipeline’s completions
endpoint with token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/completions
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" third","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" om","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":"nis","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
...
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[],"usage":{"prompt_tokens":27,"completion_tokens":100,"total_tokens":127,"ttft":0.023214041,"tps":93.92361686654164}}
data: [DONE]
completions Inference Example
The following example demonstrates performing an inference request using the deployed pipeline’s completions
endpoint without token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}
Inference Requests to OpenAI Compatible LLMs Deployed on Edge
Inference requests to edge deployed LLMs with OpenAI compatibility are made to the following endpoints:
- {hostname}/infer/openai/v1/completions: Compatible with the OpenAI API endpoint completions.
- {hostname}/infer/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completions.
These endpoints use the OpenAI API parameters with the following conditions:
- Any OpenAI model customizations made are overridden by any parameters in the inference request.
- The stream parameter is available to provide token streaming.
When the stream option is enabled to provide token streaming, the following apply:
- Outputs are returned asynchronously as a list of chunks, i.e. as an event stream.
- The API inference call completes when all chunks are returned.
- The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
The following example demonstrates performing an inference request on the completions endpoint of an LLM with OpenAI compatibility enabled deployed to an edge device.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
http://edge.sample.wallaroo.ai/infer/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}
The following example demonstrates performing an inference request on the chat/completions endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
http://edge.sample.wallaroo.ai/infer/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}
How to Observe OpenAI API Enabled Inference Results Metrics
Inference results from native vLLM and Wallaroo Custom Model runtimes provide the following metrics:
- Time to first token (ttft)
- Tokens per second (tps)
These results are provided in the Wallaroo Dashboard and the Wallaroo SDK inference logs.
Viewing OpenAI Metrics UI Through the Wallaroo Dashboard
The OpenAI API metrics ttft and tps are provided through the Wallaroo Dashboard Pipeline Inference Metrics and Logs page.
(Dashboard screenshots: TTFT and TPS metric charts.)
To access the Wallaroo Dashboard Pipeline Inference Metrics and Logs page:
- Login to the Wallaroo Dashboard.
- Select the workspace the pipeline is associated with.
- Select View Pipelines.
- From the Workspace Pipeline List page, select the pipeline.
- From the Pipeline Details page, select Metrics.
Viewing OpenAI Metrics through the Wallaroo SDK
The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:
- ttft
- tps
- The OpenAI request parameter values set during the inference request.
The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the most recent inference results log and displaying the out.json field, which includes the tps and ttft fields.
pipeline.logs().iloc[-1]['out.json']
"{"choices":[{"delta":null,"finish_reason":null,"index":0,"message":{"content":"I am not capable of writing short stories, but I can provide you with a sample short story that follows the basic structure of a classic short story.\n\ntitle: the magic carpet\n\nsetting: a desert landscape\n\ncharacters:\n- abdul, a young boy\n- the magician, a wise old man\n- the carpet, a magical carpet made of gold and silver\n\nplot:\nabdul, a young boy, is wandering through the desert when he stumbles upon a magical carpet. The carpet is made of gold and silver, and it seems to have a magic power.\n\nabdul is fascinated by the carpet and decides to follow it. The carpet takes him on a magical journey, and he meets a group of animals who are also on a quest. Together, they encounter a dangerous dragon and a wise old owl who teaches them about the power of friendship and the importance of following one's dreams.\n\nas they journey on, the carpet takes abdul to a magical land filled with wonder and beauty. The land is filled with creatures that are unlike anything he has ever seen before, and he meets a group of magical beings who help him on his quest.\n\nfinally, abdul arrives at the throne of the king of the land, who has been waiting for him. The king is impressed by abdul's bravery and asks him to become his trusted servant.\n\nas abdul becomes the king's trusted servant, he learns the true meaning of friendship and the importance of following one's dreams. He returns home a changed man, with a newfound sense of purpose and a newfound love for the desert and its magic.\n\nconclusion:\nthe magic carpet is a classic short story that captures the imagination of readers with its vivid descriptions, magical elements, and heartwarming storyline. It teaches the importance of following one's dreams and the power of friendship, and its lessons continue to inspire generations of readers.","role":null}}],"created":1751310038,"id":"chatcmpl-a2893a0812e84cb696be1137681dcd85","model":"vllm-openai_tinyllama.zip","object":"chat.completion.chunk","usage":{"completion_tokens":457,"prompt_tokens":60,"total_tokens":517,"tps":94.23930050755882,"ttft":0.025588177}}"
Troubleshooting
OpenAI Inference Request without OpenAI Compatibility Enabled
- When sending an inference request with OpenAI compatible payloads to a Wallaroo inference pipeline endpoint, using the Wallaroo SDK or API in the Wallaroo Ops center or at the edge, AND the underlying model does not have OpenAI configurations enabled,
- Then the following error message is displayed in Wallaroo:
"Inference failed. Please apply the appropriate OpenAI configurations to the models deployed in this pipeline. For additional help contact support@wallaroo.ai or your Wallaroo technical representative."
OpenAI Compatibility Enabled without OpenAI Inference Request
- When sending an inference request to a Wallaroo inference pipeline endpoint, using the Wallaroo SDK or API in the Wallaroo Ops center or at the edge, AND the underlying model does have OpenAI configurations enabled, AND the inference endpoint request is missing the completion extensions,
- Then the following error message is displayed in Wallaroo:
"Inference failed. Please apply the appropriate OpenAI extensions to the inference endpoint. For additional help contact support@wallaroo.ai or your Wallaroo technical representative."