Deploy LLMs with OpenAI Compatibility
Wallaroo supports deploying LLMs with OpenAI compatibility. This provides developers and data scientists an easy migration path for their existing OpenAI API deployments while leveraging Wallaroo’s resource optimization to improve user experience and reduce latency and costs.
The following options are supported with Wallaroo OpenAI compatibility:
- Token Streaming: Wallaroo supports the OpenAI API token streaming methods. This is supported either through the Wallaroo SDK or through the Wallaroo OpenAI API inference methods for completion and chat/completion.
- AI Acceleration: Deploy LLMs with token streaming with NVIDIA CUDA or Qualcomm Cloud AI acceleration.
- Continuous Batching: Wallaroo Continuous Batching provides increased LLM performance on GPUs, leveraging configurable concurrent batch sizes at the Wallaroo inference serving layer.
For access to these sample models and a demonstration on using LLMs with Wallaroo:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:
- wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
- wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. These are typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.
How to Configure LLM Deployment with OpenAI API Compatibility
LLM deployment with OpenAI API compatibility is applied through the following process; a condensed end-to-end sketch follows this list.
- Models are uploaded to Wallaroo either in the native vLLM or Custom Model frameworks.
- Any AI acceleration settings are applied at model upload.
- OpenAI API compatibility is enabled in the model configuration either during or after model upload.
  - Once enabled, all OpenAI API parameters for completion and chat/completion are available except stream; the stream parameter is only set at inference request time.
- LLMs are deployed with resource configurations that allocate resources to the LLM’s exclusive use (CPUs, memory, GPUs, etc.).
- Inference requests with the OpenAI API are made either by:
- OpenAI API clients through the deployed LLM’s inference endpoints.
- The Wallaroo SDK through OpenAI specific methods.
- Inference requests override the model’s OpenAI configuration to allow fine-tuning of inference request parameters.
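As a quick orientation, the following condensed sketch strings these steps together with the Wallaroo SDK; the model name, file path, and pipeline name are placeholders, and each step is covered in detail in the sections below.
import wallaroo
import pyarrow as pa
from wallaroo.openai_config import OpenaiConfig

wl = wallaroo.Client()

# 1. Upload the LLM in the native vLLM framework (framework_config and accel are optional)
llm = wl.upload_model(
    "sample-llm",                                  # placeholder model name
    "./sample-llm.zip",                            # placeholder model file path
    framework=wallaroo.framework.Framework.VLLM,
    input_schema=pa.schema([]),                    # ignored for OpenAI compatible LLMs
    output_schema=pa.schema([])
)

# 2. Enable OpenAI API compatibility on the model configuration
llm = llm.configure(openai_config=OpenaiConfig(enabled=True))

# 3. Deploy the LLM in a pipeline with resources dedicated to the model
deployment_config = wallaroo.DeploymentConfigBuilder() \
    .sidekick_cpus(llm, 1) \
    .sidekick_memory(llm, "8Gi") \
    .sidekick_gpus(llm, 1) \
    .build()

pipeline = wl.build_pipeline("sample-openai-pipeline")
pipeline.add_model_step(llm)
pipeline.deploy(deployment_config=deployment_config)

# 4. Submit an OpenAI style inference request through the Wallaroo SDK
result = pipeline.openai_completion(prompt="Hello!", max_tokens=20)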
Upload LLM to Wallaroo
The following examples demonstrate uploading an LLM to Wallaroo in either the Wallaroo Native vLLM runtime or the Wallaroo Custom Model runtime. Note that at this phase OpenAI API compatibility is not defined; that is done at the Configure OpenAI API Compatibility step.
Upload LLMs Via the Wallaroo SDK
LLMs with OpenAI compatibility are uploaded via the Wallaroo SDK through the following steps:
- Define the model upload parameters with the wallaroo.client.Client.upload_model method.
- (Optional) Set the upload_model parameter framework_config to specify any vLLM options to increase performance. If no options are specified, the default values are applied.
The method wallaroo.client.Client.upload_model takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | string (Required) | The name of the model. Model names are unique per workspace. Models that are uploaded with the same name are assigned as a new version of the model. |
path | string (Required) | The path to the model file being uploaded. |
framework | string (Required) | The framework of the model from wallaroo.framework.Framework. For native vLLM, this framework is wallaroo.framework.Framework.VLLM. For custom vLLM, this framework is wallaroo.framework.Framework.CUSTOM. |
input_schema | pyarrow.lib.Schema (Required) | The input schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]). |
output_schema | pyarrow.lib.Schema (Required) | The output schema in Apache Arrow schema format. For OpenAI compatible LLMs, this field is ignored. Best practice is to provide the empty set pa.schema([]). |
framework_config | wallaroo.framework.VLLMConfig OR wallaroo.framework.CustomConfig (Optional) | Sets the vLLM framework configuration options. |
accel | wallaroo.engine_config.Acceleration (Optional) | The optional AI hardware accelerator used. The options supported for OpenAI compatibility are NVIDIA CUDA (cuda) and Qualcomm Cloud AI (qaic). |
convert_wait | bool (Optional) | Whether to wait for the model packaging and conversion process to complete before returning. |
The framework configuration must match the appropriate runtimes:
Runtime | Framework Config |
---|---|
wallaroo.framework.Framework.VLLM | wallaroo.framework.VLLMConfig |
wallaroo.framework.Framework.CUSTOM | wallaroo.framework.CustomConfig |
wallaroo.framework.VLLMConfig and wallaroo.framework.CustomConfig contain the following parameters. If no modifications are made at model upload, the default values are applied.
Parameter | Type |
---|---|
max_num_seqs | Integer (Default: 256) |
max_model_len | Integer (Default: None) |
max_seq_len_to_capture | Integer (Default: 8192) |
quantization | (Default: None) |
kv_cache_dtype | (Default: 'auto') |
gpu_memory_utilization | Float (Default: 0.9) |
block_size | (Default: None) |
device_group | (Default: None) This setting is ignored for CUDA acceleration. |
Upload Example for Native vLLM Frameworks via the Wallaroo SDK
The following demonstrates uploading a Native vLLM Runtime with a framework configuration via the Wallaroo SDK.
# (Optional) set the VLLMConfig values
# If no framework configuration value is set, the default values are applied
standard_framework_config = wallaroo.framework.VLLMConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture,
    quantization=quantization,
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the vLLM model with the framework configuration values
vllm_model = wl.upload_model(
    model_name,
    model_file_name,
    framework=wallaroo.framework.Framework.VLLM,
    input_schema=input_schema,
    output_schema=output_schema,
    framework_config=standard_framework_config,
    accel=accel
)
Upload Example for Custom Frameworks via the Wallaroo SDK
The following demonstrates uploading a Custom Model Runtime with a framework configuration via the Wallaroo SDK. Typically these models are uploaded to provide additional functionality for the native vLLM runtime model. For example, providing Retrieval-Augmented Generation LLMs (RAG), monitoring listeners, etc. The Custom Model Runtime is then deployed in the same Wallaroo pipeline as the native vLLM runtime.
The following example demonstrates uploading a Custom Framework model to Wallaroo through the Wallaroo SDK. Note that, unlike the LLM, no acceleration value is set; in this example, the LLM uses acceleration while the Custom Model does not require it.
# (Optional) set the CustomConfig values
# If no framework configuration value is set, the default values are applied
custom_framework_config = wallaroo.framework.CustomConfig(
    max_num_seqs=max_num_seqs,
    max_model_len=max_model_len,
    max_seq_len_to_capture=max_seq_len_to_capture,
    quantization=quantization,
    kv_cache_dtype=kv_cache_dtype,
    gpu_memory_utilization=gpu_memory_utilization,
    block_size=block_size,
    device_group=None
)

# upload the custom model with the framework configuration values
custom_model = wl.upload_model(
    model_name,
    model_file_name,
    framework=wallaroo.framework.Framework.CUSTOM,
    input_schema=input_schema,
    output_schema=output_schema,
    framework_config=custom_framework_config
)
Upload LLMs Via the Wallaroo MLOps API
Models are uploaded via the Wallaroo MLOps API through the following endpoint:
/v1/api/models/upload_and_convert
This endpoint accepts the following parameters.
Field | Type | Description |
---|---|---|
name | String (Required) | The model name. |
visibility | String (Required) | Either public or private. |
workspace_id | String (Required) | The numerical ID of the workspace to upload the model to. |
conversion | String (Required) | The conversion parameters that include the following: |
framework | String (Required) | The framework of the model being uploaded. For Native vLLM runtimes, this value is vllm. For Custom vLLM runtimes, this value is custom. |
accel | String (Optional) OR Dict (Optional) | The AI accelerator used. Supported types are cuda and qaic. If using qaic, this parameter is either a string to use the default parameters, or a Dict for hardware acceleration parameters. For more details, see LLM Inference with Qualcomm QAIC. |
python_version | String (Required) | The version of Python required for the model. For Native and Custom vLLM frameworks, this value is 3.8. |
requirements | String (Required) | Required libraries. For Native and Custom vLLM frameworks, this value is []. |
framework_config | Dict (Optional) | The framework configuration. See the framework_config parameters below for further details. |
input_schema | String (Optional) | The input schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. |
output_schema | String (Optional) | The output schema in the Apache Arrow pyarrow.lib.Schema format, encoded with base64.b64encode. |
The framework_config parameter accepts the following parameters.
Field | Type | Description |
---|---|---|
config | Dict | The framework configuration values. The following fields are parameters of the config field. |
max_num_seqs | Integer (Default: 256) | |
max_model_len | Integer (Default: None) | |
max_seq_len_to_capture | Integer (Default: 8192) | |
quantization | (Default: None) | |
kv_cache_dtype | (Default: 'auto') | |
gpu_memory_utilization | Float (Default: 0.9) | |
block_size | (Default: None) | |
device_group | (Default: None) | This setting is ignored for CUDA acceleration. |
framework | String | The framework of the framework_config type. For Native vLLM frameworks, this value is "vllm". For Custom vLLM frameworks, this value is "custom". |
Upload Example for Native vLLM Runtime via the MLOps API
The following example demonstrates uploading a Native vLLM Framework model with the framework configuration via the Wallaroo MLOps API.
# import the required libraries
import base64
import pyarrow as pa

# define the input and output parameters in Apache pyarrow format
# the input and output schemas are ignored for OpenAI compatible LLMs, so only an empty set is needed
input_schema = pa.schema([])
output_schema = pa.schema([])

# convert the input and output schemas to base64 strings for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
# acceleration = CUDA
curl --progress-bar \
-X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer abc123" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": 6, "conversion": {"arch": "x86", "accel": "cuda", "framework": "vllm", "framework_config": {"config": {"gpu_memory_utilization": 0.9, "kv_cache_dtype": "auto", "max_model_len": 128, "max_num_seqs": 256, "max_seq_len_to_capture": 8192, "quantization": "none"}, "framework": "vllm"}, "python_version": "3.8", "requirements": []}, "input_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA=", "output_schema": "/////zAAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAAAAAA="};type=application/json' \
-F "file=@<file path to vllm>;type=application/octet-stream" \
https://example.wallaroo.ai/v1/api/models/upload_and_convert
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.
# Retrieve the model
vllm_model = wl.get_model("<your-model-name>")
Upload Example for Custom Model Runtime via the MLOps API
The following example demonstrates uploading a Custom vLLM Framework model with the framework configuration via the Wallaroo MLOps API, then retrieving the model version from the Wallaroo SDK.
# define the input and output parameters in Apache pyarrow format
input_schema = pa.schema([
pa.field('prompt', pa.string()),
pa.field('max_tokens', pa.int64()),
])
output_schema = pa.schema([
pa.field('generated_text', pa.string()),
pa.field('num_output_tokens', pa.int64())
])
# convert the input and output schemas to base64 strings for the upload request
encoded_input_schema = base64.b64encode(
    bytes(input_schema.serialize())
).decode("utf8")
encoded_output_schema = base64.b64encode(
    bytes(output_schema.serialize())
).decode("utf8")
# upload via the Wallaroo MLOps API endpoint using curl
# framework configuration with gpu_memory_utilization=0.9 and max_model_len=128
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-auth-token-here>" \
-F 'metadata={"name": "<your-model-name>", "visibility": "private", "workspace_id": <your-workspace-id-here>, "conversion": {"framework": "custom", "python_version": "3.8", "requirements": [], "framework_config": {"config": {"gpu_memory_utilization": 0.9, "max_model_len": 128}, "framework": "custom"}}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@<file path to custom vllm runtime>" \
https://<Wallaroo Hostname>/v1/api/models/upload_and_convert | cat
The model is retrieved via the Wallaroo SDK method wallaroo.client.Client.get_model. This is used to apply the OpenAI API compatibility configuration.
# Retrieve the model
custom_model = wl.get_model("<your-model-name>")
Configure OpenAI API Compatibility
OpenAI API compatibility is applied to either native or custom vLLM runtimes via the Wallaroo SDK, either during or after the LLM is uploaded to Wallaroo.
The model's configure method accepts the following parameter.
Parameter | Type | Description |
---|---|---|
openai_config | wallaroo.openai_config.OpenaiConfig (Default: None) | Sets the OpenAI API configuration options. |
The class wallaroo.openai_config.OpenaiConfig includes the following main parameters. The essential one is enabled; if OpenAI compatibility is not enabled, all other parameters are ignored.
Parameter | Type | Description |
---|---|---|
enabled | Boolean (Default: False) | If True , OpenAI compatibility is enabled. If False , OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False . |
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream ; the stream parameter is only set at inference requests. |
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All completion parameters are available except stream ; the stream parameter is only set at inference requests. |
Configure OpenAI API Compatibility Example
The following example demonstrates enabling and applying an OpenAI configuration to an uploaded native vLLM runtime and Custom Model runtime. Note that the OpenAI configuration is the same for either runtime.
# import the OpenAI configuration class
from wallaroo.openai_config import OpenaiConfig

# enables the OpenAI compatibility and sets `completion_config` and `chat_completion_config` parameters.
openai_config = OpenaiConfig(
enabled=True,
completion_config={
"temperature": .3,
"max_tokens": 200
},
chat_completion_config={
"temperature": .3,
"max_tokens": 200,
"chat_template": """
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>\n' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}"""
})
# enable openai config Native vLLM runtime
vllm_model_openai_configured = vllm_model.configure(openai_config=openai_config)
# enable openai config Custom Model runtime
custom_model_openai_configured = custom_model.configure(openai_config=openai_config)
Deploy LLM with OpenAI Compatibility
Once the OpenAI compatibility for either native vLLM runtimes or Wallaroo Custom Model runtimes is enabled, they are deployed through the following process.
- Define the deployment configuration to set the number of CPUs, RAM, and GPUs per replica.
- Create a Wallaroo pipeline and add the model(s) as pipeline steps.
- Inference inputs are submitted to the first model step, with the output submitted to the next model step, until the final model step's output is returned.
- Deploy the Wallaroo pipeline with the deployment configuration.
Deployment Configuration for LLMs
The deployment configuration sets what resources are allocated for model use. For this example, the native vLLM runtime with OpenAI compatibility enabled is allocated:
- 1 CPU
- 8 Gi RAM
- 1 GPU
The specific GPU type is inherited from the upload_model accel parameter; the deployment_label selects a node with that GPU hardware so it is available to the model on deployment.
native_deployment_config = wallaroo.DeploymentConfigBuilder() \
.replica_count(1) \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(vllm_model_openai_configured, 1) \
.sidekick_memory(vllm_model_openai_configured, '8Gi') \
.sidekick_gpus(vllm_model_openai_configured, 1) \
.deployment_label('wallaroo.ai/accelerator:l4') \
.build()
The following example shows the deployment configuration for deploying both a Custom Model runtime and native vLLM runtime with OpenAI compatibility enabled. In this example, the deployment configuration allocates the following:
- Custom Model runtime with OpenAI compatibility enabled:
  - 1 CPU
  - 2 Gi RAM
- Native vLLM runtime with OpenAI compatibility enabled:
  - 1 CPU
  - 8 Gi RAM
  - 1 GPU
custom_deployment_config = wallaroo.DeploymentConfigBuilder() \
.replica_count(1) \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(custom_model_openai_configured, 1) \
.sidekick_memory(custom_model_openai_configured, '2Gi') \
.sidekick_cpus(vllm_model_openai_configured, 1) \
.sidekick_memory(vllm_model_openai_configured, '8Gi') \
.sidekick_gpus(vllm_model_openai_configured, 1) \
.deployment_label('wallaroo.ai/accelerator:l4') \
.build()
Create Wallaroo Pipeline and Deploy
Wallaroo pipelines are created with the wallaroo.client.Client.build_pipeline method. Pipeline steps are used to determine how inference data is provided to the LLM.
The following demonstrates creating a Wallaroo pipeline, and adding the native vLLM with OpenAI compatibility enabled as a pipeline step. Once set, the pipeline is deployed with the defined deployment configuration.
# create the pipeline
vllm_pipeline = wl.build_pipeline('sample-vllm-openai-enabled-pipeline')
# add the LLM as a pipeline model step
vllm_pipeline.add_model_step(vllm_model_openai_configured)
# deploy with the deployment configuration
vllm_pipeline.deploy(deployment_config=native_deployment_config)
The following demonstrates creating a Wallaroo pipeline, and adding first the Custom Model with OpenAI compatibility, then the native vLLM with OpenAI compatibility enabled. In this scenario, the outputs from the Custom Model are the inputs for the LLM, with the final model step outputs returned as the inference output.
# create the pipeline
custom_with_vllm_pipeline = wl.build_pipeline('sample-custom-openai-enabled-pipeline')
# add the custom model and LLM as a pipeline model steps
custom_with_vllm_pipeline.add_model_step(custom_model_openai_configured)
custom_with_vllm_pipeline.add_model_step(vllm_model_openai_configured)
# deploy with the deployment configuration
custom_with_vllm_pipeline.deploy(deployment_config=custom_deployment_config)
Once deployment is complete, inference requests are accepted via either the Wallaroo SDK or the pipeline’s OpenAI API client endpoint.
How to Publish for Edge Deployment
Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:
- The native vLLM runtimes or custom models with OpenAI compatibility enabled.
- If specified, the deployment configuration.
- The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.
Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.
For more details, see Edge and Multi-cloud Pipeline Publish.
The following example demonstrates publishing a Wallaroo pipeline with a native vLLM runtime.
pipeline.publish(deployment_config=native_deployment_config)
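The publish results can also be captured in a variable for later review of the published artifact references used for edge deployment; a minimal sketch, with pub as a placeholder variable name:
# capture the publish results to review the published artifact references
pub = pipeline.publish(deployment_config=native_deployment_config)

# display the publish details returned by Wallaroo
print(pub)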
How to Perform Inference Requests with OpenAI Compatible LLMs
Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK or via OpenAI API endpoint requests.
OpenAI API inference requests on models deployed with OpenAI compatible LLMs have the following conditions:
- Parameters for chat/completion and completion override the existing OpenAI configuration options.
- If the stream option is enabled:
  - Outputs are returned as a list of chunks, i.e. as an event stream.
  - The inference request completes when all chunks are returned.
  - The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
Inference Requests to OpenAI-compatible LLMs Deployed in Wallaroo via the Wallaroo SDK
Inference requests to OpenAI compatibility enabled models in Wallaroo via the Wallaroo SDK use the following methods:
- wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
- wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.
The following examples demonstrate performing inference requests via the different methods.
openai_chat_completion
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "good morning"}]).choices[0].message.content
"Of course! Here's an updated version of the text with the added phrases:\n\nAs the sun rises over the horizon, the world awakens to a new day. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin, casting a golden glow over everything in sight. The world awakens to a new day, a new chapter, a new beginning. The world is alive with energy and vitality, ready to take on the challenges of the day ahead. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin"
openai_chat_completion with Token Streaming
# Now with streaming
for chunk in pipeline.openai_chat_completion(messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
Once upon a time, in a small village nestled in the heart of the countryside, there lived a young woman named Lily. Lily was a kind and gentle soul, always looking out for those in need. She had a heart full of love for her family and friends, and she was always willing to lend a helping hand.
One day, Lily met a handsome young man named Jack. Jack was a charming and handsome man, with a
openai_completion
pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200).choices[0].text
Wallaroo is a comprehensive platform for building and tracking predictive models. This tool is really helpful in AI development. Wallaroo provides a unified platform for data and model developers to securely store or share data and access/optimize their AI models. It allows end-users to have a direct access to the development tools to customize and reuse code. Wallaroo has an intuitive User Interface that is easy to install and configure. Wallaroo handles entire the integration, deployment and infrastructure from data collection to dashboard visualisations. Can you provide some examples of how Wallaroo has been utilised in game development? Also, talk about the effectiveness of ML training using Wallaroo.'
openai_completion with Token Streaming
for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=300, stream=True):
print(chunk.choices[0].text, end="", flush=True)
?" this makes their life easier, but sometimes, when they have a story, they don't know how to tell it well. This frustrates them and makes their life even more difficult.
b. Relaxation:
protagonist: take a deep breath and let it out. Why not start with a song? "Eyes full of longing, I need your music to embrace." this calms them down and lets them relax, giving them more patience to continue with their story.
c. Inspirational quotes:
protagonist: this quote from might jeffries helps me reflect on my beliefs and values: "the mind is a powerful thing, it can change your destiny at any time. Fear no fear, only trust your divineline and reclaim your destiny." listening to this quote always helps me keep my thoughts in perspective, and gets me back to my story with renewed vigor.
Inference Requests to OpenAI-Compatible LLMs Deployed in Wallaroo via the OpenAI SDK
Inference requests made through the OpenAI Python SDK require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAPI API extension endpoints.
The endpoint is:
{Deployment inference endpoint}/openai/v1/
OpenAI SDK inferences on Wallaroo deployed pipelines have the following conditions:
- OpenAI inference request params apply only to their inference request, and override the parameters set at model configuration.
- Streamed outputs are returned as a list of chunks, following the default or configured max_tokens value for the associated LLM.
- The inference call completes when all chunks are streamed.
The following examples demonstrate authenticating to the deployed Wallaroo pipeline with the OpenAI SDK client against the deployment inference endpoint https://example.wallaroo.ai/v1/api/pipelines/infer/samplellm-openai-414/samplellm-openai, and the OpenAI endpoint extension /openai/v1/.
token = wl.auth.auth_header()['Authorization'].split()[1]
from openai import OpenAI
client = OpenAI(
base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/samplellm-openai-414/samplellm-openai/openai/v1',
api_key=token
)
The following demonstrates inference requests using completions and chat.completions with and without token streaming enabled.
openai.chat.completions with Token Streaming
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=1000, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
It was a warm summer evening, and the sun was setting over the horizon. A young couple, Alex and Emily, sat on a bench in the park, watching the world go by. Alex had just finished his shift at the local diner, and Emily had just finished her shift at the bookstore. They had been together for a year, and they were in love.
As they sat there, watching the world go by, they started talking about their hopes and dreams for the future. Alex talked about his dream of opening his own restaurant, and Emily talked about her dream of traveling the world. They both knew that their dreams were far from reality, but they were determined to make them come true.
As they talked, they noticed a group of children playing in the park. Alex and Emily walked over to them and asked if they needed any help. The children were excited to see someone new in the park, and they ran over to Alex and Emily, hugging them tightly.
Alex and Emily smiled at the children, feeling a sense of joy and happiness that they had never felt before. They knew that they had found something special in each other, and they were determined to make their love last.
As the night wore on, Alex and Emily found themselves lost in each other's arms. They had never felt so alive, so in love, and they knew that they would never forget this moment.
As the night came to a close, Alex and Emily stood up and walked back to their bench. They looked at each other, feeling a sense of gratitude and joy that they had never felt before. They knew that their love was worth fighting for, and they were determined to make it last.
From that moment on, Alex and Emily knew that their love was worth fighting for. They knew that they had found something special in each other, something that would last a lifetime. They knew that they would always be together, no matter what life threw their way.
And so, they sat on their bench, watching the world go by, knowing that their love was worth fighting for, and that they would always be together, no matter what the future held.
openai.chat.completions
response = client.chat.completions.create(model="", messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Thank you for admiring my writing skills! Here's an example of how to use a greeting in a sentence:
Syntax sentence: "Excuse me, but can I have a moment of your time?"
Meaning: I am a friendly and polite person who is looking for brief conversation with someone else.
The response from the person in question could be: "Sure, let me give it a try."
**Imagery sentences
openai.completions with Token Streaming
for chunk in client.completions.create(model="", prompt="tell me a short story", max_tokens=100, stream=True):
print(chunk.choices[0].text, end="", flush=True)
Authors have written ingenious stories about small country people, small towns, and small lifestyles. Here’s one that is light and entertaining:
Title: The Big Cheese of High School
Episode One: Annabelle, a sophomore in high school, has just missed the kiss-off that seemed like just a hiccup. However, Annabelle is a gentle soul, and her pendulum swings further out of control
openai.completions
client.completions.create(model="", prompt="tell me a short story", max_tokens=100).choices[0].text
" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki's cylinder isn't meaningful anymore; remember that Loki is the lying one!\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki's comedy presence - emphasize the exaggerated quality of Imogen's hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"
Inference Requests to OpenAI-Compatible LLMs Deployed in Wallaroo via Wallaroo Inference OpenAI Endpoints
Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests from OpenAI API clients via the pipeline's deployment inference endpoint with the OpenAI API endpoint extensions.
These requests require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAPI API extension endpoints.
For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:
- {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
- {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.
These endpoints use the OpenAI API parameters with the following conditions:
- Any OpenAI model customizations made are overridden by any parameters in the inference request.
- The stream parameter is available to provide token streaming.
When the stream option is enabled to provide token streaming, the following apply:
- Outputs are returned asynchronously as a list of chunks, i.e. as an event stream.
- The API inference call completes when all chunks are returned.
- The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
chat/completion Inference Token Streaming Example
The following demonstrates using the OpenAI API compatible endpoint chat/completions on a pipeline deployed in Wallaroo with token streaming enabled.
Note that the response when token streaming is enabled is returned asynchronously as a list of chunks.
curl -X POST \
-H "Authorization: Bearer abcdefg" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/chat/completions
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"I"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" am"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" ec"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"lect"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ic"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ils"}}],"usage":null}
...
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":","}}],"usage":null}
data: [DONE]
chat/completion Inference Example
The following example demonstrates performing an inference request using the deployed pipeline’s chat/completion
endpoint without token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}
completions Inference with Token Streaming Example
The following example demonstrates performing an inference request using the deployed pipeline’s completions
endpoint with token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/completions
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" third","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" om","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":"nis","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
...
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[],"usage":{"prompt_tokens":27,"completion_tokens":100,"total_tokens":127,"ttft":0.023214041,"tps":93.92361686654164}}
data: [DONE]
completions Inference Example
The following example demonstrates performing an inference request using the deployed pipeline’s completions
endpoint without token streaming enabled.
curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
https://api.example.wallaroo.ai/v1/api/pipelines/infer/sampleopenaipipeline-260/sampleopenaipipeline/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}
Inference Requests to OpenAI Compatible LLMs Deployed on Edge
Inference requests to edge deployed LLMs with OpenAI compatibility are made to the following endpoints:
- {hostname}/infer/openai/v1/completions: Compatible with the OpenAI API endpoint completions.
- {hostname}/infer/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completions.
These endpoints use the OpenAI API parameters with the following conditions:
- Any OpenAI model customizations made are overridden by any parameters in the inference request.
- The stream parameter is available to provide token streaming.
When the stream option is enabled to provide token streaming, the following apply:
- Outputs are returned asynchronously as a list of chunks, i.e. as an event stream.
- The API inference call completes when all chunks are returned.
- The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.
The following example demonstrates performing an inference request on the completions endpoint of an LLM with OpenAI compatibility enabled deployed to an edge device.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
http://edge.sample.wallaroo.ai/infer/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}
The following example demonstrates performing an inference request on the chat/completions endpoint.
curl -X POST \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
http://edge.sample.wallaroo.ai/infer/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}
How to Observe OpenAI API Enabled Inference Results Metrics
Inference results from native vLLM and Wallaroo Custom Model runtimes provide the following metrics:
- Time to first token (ttft)
- Tokens per second (tps)
These results are provided in the Wallaroo Dashboard and the Wallaroo SDK inference logs.
Viewing OpenAI Metrics UI Through the Wallaroo Dashboard
The OpenAI API metrics ttft and tps are provided through the Wallaroo Dashboard Pipeline Inference Metrics and Logs page.
(Dashboard screenshots: TTFT and TPS metric charts.)
To access the Wallaroo Dashboard Pipeline Inference Metrics and Logs page:
- Login to the Wallaroo Dashboard.
- Select the workspace the pipeline is associated with.
- Select View Pipelines.
- From the Workspace Pipeline List page, select the pipeline.
- From the Pipeline Details page, select Metrics.
Viewing OpenAI Metrics through the Wallaroo SDK
The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:
- ttft
- tps
- The OpenAI request parameter values set during the inference request.
The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the most recent inference results log and displaying the out.json field, which includes the tps and ttft fields.
pipeline.logs().iloc[-1]['out.json']
"{"choices":[{"delta":null,"finish_reason":null,"index":0,"message":{"content":"I am not capable of writing short stories, but I can provide you with a sample short story that follows the basic structure of a classic short story.\n\ntitle: the magic carpet\n\nsetting: a desert landscape\n\ncharacters:\n- abdul, a young boy\n- the magician, a wise old man\n- the carpet, a magical carpet made of gold and silver\n\nplot:\nabdul, a young boy, is wandering through the desert when he stumbles upon a magical carpet. The carpet is made of gold and silver, and it seems to have a magic power.\n\nabdul is fascinated by the carpet and decides to follow it. The carpet takes him on a magical journey, and he meets a group of animals who are also on a quest. Together, they encounter a dangerous dragon and a wise old owl who teaches them about the power of friendship and the importance of following one's dreams.\n\nas they journey on, the carpet takes abdul to a magical land filled with wonder and beauty. The land is filled with creatures that are unlike anything he has ever seen before, and he meets a group of magical beings who help him on his quest.\n\nfinally, abdul arrives at the throne of the king of the land, who has been waiting for him. The king is impressed by abdul's bravery and asks him to become his trusted servant.\n\nas abdul becomes the king's trusted servant, he learns the true meaning of friendship and the importance of following one's dreams. He returns home a changed man, with a newfound sense of purpose and a newfound love for the desert and its magic.\n\nconclusion:\nthe magic carpet is a classic short story that captures the imagination of readers with its vivid descriptions, magical elements, and heartwarming storyline. It teaches the importance of following one's dreams and the power of friendship, and its lessons continue to inspire generations of readers.","role":null}}],"created":1751310038,"id":"chatcmpl-a2893a0812e84cb696be1137681dcd85","model":"vllm-openai_tinyllama.zip","object":"chat.completion.chunk","usage":{"completion_tokens":457,"prompt_tokens":60,"total_tokens":517,"tps":94.23930050755882,"ttft":0.025588177}}"
Troubleshooting
OpenAI Inference Request without OpenAI Compatibility Enabled
- When sending an inference request with OpenAI compatible payloads to a Wallaroo inference pipeline endpoint, using the Wallaroo SDK or API in the Wallaroo Ops center or at the edge, AND the underlying model does not have OpenAI configurations enabled,
- Then the following error message is displayed in Wallaroo:
"Inference failed. Please apply the appropriate OpenAI configurations to the models deployed in this pipeline. For additional help contact support@wallaroo.ai or your Wallaroo technical representative."
OpenAI Compatibility Enabled without OpenAI Inference Request
- When sending an inference request to a Wallaroo inference pipeline endpoint, using the Wallaroo SDK or API in the Wallaroo Ops center or at the edge, AND the underlying model does have OpenAI configurations enabled, AND the inference endpoint request is missing the completion extensions,
- Then the following error message is displayed in Wallaroo:
"Inference failed. Please apply the appropriate OpenAI extensions to the inference endpoint. For additional help contact support@wallaroo.ai or your Wallaroo technical representative."