Deploy Llama with OpenAI Compatibility


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following tutorial demonstrates deploying a Llama LLM in Wallaroo with OpenAI API compatibility enabled. This allows developers to:

  • Take advantage of Wallaroo’s inference optimization to improve inference response times through more efficient resource allocation.
  • Migrate existing OpenAI client code with a minimum of changes.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

A typical situation is to either deploy the native vLLM runtime as a single model in a Wallaroo pipeline, or both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM’s capabilities.

This example uses one LLM with OpenAI compatibility enabled.

For access to these sample models and for a demonstration, contact your Wallaroo support representative.

Tutorial Outline

This tutorial demonstrates how to:

  • Upload a LLM with the Wallaroo native vLLM framework.
  • Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
  • Set resource configurations and deploy the LLM in Wallaroo.
  • Submit inference requests via:
    • The Wallaroo SDK methods openai_completion and openai_chat_completion.
    • Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.

Tutorial Requirements

This tutorial requires the following:

  • Wallaroo version 2025.1 and above.
  • Tiny Llama model. This is available from Wallaroo representatives upon request.

Tutorial Steps

Import Libraries

The following libraries are used for this tutorial, primarily the Wallaroo SDK.

import wallaroo
from wallaroo.framework import Framework          # model framework selection (e.g. Framework.VLLM)
from wallaroo.engine_config import Acceleration   # hardware acceleration options (e.g. CUDA)
from wallaroo.openai_config import OpenaiConfig   # OpenAI compatibility configuration
import pyarrow as pa                              # input/output schema definitions

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client(request_timeout=600)

Create and Set the Current Workspace

This step creates the workspace. Uploaded LLMs and pipeline deployments are set within this workspace.

workspace = wl.get_workspace(name='vllm-openai-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)
{'name': 'vllm-openai-test', 'id': 1689, 'archived': False, 'created_by': 'sample.user@wallaroo.ai', 'created_at': '2025-05-30T20:30:35.093295+00:00', 'models': [{'name': 'tinyllamaopenai', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 5, 30, 20, 32, 28, 757011, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 5, 30, 20, 32, 28, 757011, tzinfo=tzutc())}, {'name': 'tinyllamaopenaiyns1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 3, 0, 31, 49, 205332, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 3, 0, 31, 49, 205332, tzinfo=tzutc())}, {'name': 'tinyllama', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 3, 0, 34, 0, 798254, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 3, 0, 34, 0, 798254, tzinfo=tzutc())}, {'name': 'ragstep1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 3, 1, 46, 47, 430142, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 3, 1, 46, 47, 430142, tzinfo=tzutc())}, {'name': 'tinyllamaopenaiyns2', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 16, 16, 36, 23, 762501, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 16, 16, 36, 23, 762501, tzinfo=tzutc())}, {'name': 'tinyllamaopenaiyns-error', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 23, 15, 2, 59, 581760, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 23, 15, 2, 59, 581760, tzinfo=tzutc())}, {'name': 'tinyllamaopenaiyns-error1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 24, 18, 39, 42, 466411, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 24, 18, 39, 42, 466411, tzinfo=tzutc())}, {'name': 'tinyllamarag', 'versions': 2, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 27, 17, 43, 44, 446012, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 3, 19, 25, 43, 437726, tzinfo=tzutc())}, {'name': 'ragstep', 'versions': 3, 'owner_id': '""', 'last_update_time': datetime.datetime(2025, 6, 27, 17, 49, 39, 238424, tzinfo=tzutc()), 'created_at': datetime.datetime(2025, 6, 3, 0, 37, 17, 945954, tzinfo=tzutc())}], 'pipelines': [{'name': 'tinyllama-openai', 'create_time': datetime.datetime(2025, 5, 30, 20, 40, 46, 518566, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'tinyllama-openai-error', 'create_time': datetime.datetime(2025, 6, 23, 15, 13, 57, 625524, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'tinyllama-openai-error1', 'create_time': datetime.datetime(2025, 6, 24, 18, 43, 42, 785405, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'tinyllama-openai-rag-cb', 'create_time': datetime.datetime(2025, 6, 4, 18, 15, 37, 345076, tzinfo=tzutc()), 'definition': '[]'}, {'name': 'tinyllama-openai-rag', 'create_time': datetime.datetime(2025, 6, 3, 0, 43, 13, 169150, tzinfo=tzutc()), 'definition': '[]'}]}

Upload the LLM

The model is uploaded with the following parameters:

  • The model name
  • The file path to the model
  • The framework set to Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM
  • The input and output schemas are defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas.
  • Acceleration is set to NVIDIA CUDA.
model_step = wl.upload_model(
    "tinyllamaopenaiyns1",
    "vllm-openai_tinyllama.zip",
    framework=Framework.VLLM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True,
    accel=Acceleration.CUDA
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime...........................
Model is attempting loading to a container runtime...................................
Successful
Ready

Enable OpenAI Compatibility

OpenAI compatibility is enabled via the model configuration using the class wallaroo.openai_config.OpenaiConfig, which includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.

  • enabled (Boolean, default False): If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled and all other parameters are ignored.
  • completion_config (Dict): The OpenAI API completions endpoint parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests.
  • chat_completion_config (Dict): The OpenAI API chat/completions endpoint parameters. All chat completion parameters are available except stream; the stream parameter is only set at inference requests.

With the OpenaiConfig object defined, it is applied to the LLM configuration through the openai_config parameter.

# Configuring as OpenAI

openai_config = OpenaiConfig(enabled=True, chat_completion_config={"temperature": .3, "max_tokens": 200})
model_step = model_step.configure(openai_config=openai_config)
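
The completions endpoint can be configured in the same way. A minimal sketch, assuming the same parameter names are accepted by completion_config as by chat_completion_config:

openai_config = OpenaiConfig(
    enabled=True,
    completion_config={"temperature": .3, "max_tokens": 200},       # applied to completions requests
    chat_completion_config={"temperature": .3, "max_tokens": 200}   # applied to chat/completions requests
)
model_step = model_step.configure(openai_config=openai_config)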

Set the Deployment Configuration and Deploy

The deployment configuration defines what resources are allocated for the LLM’s exclusive use. For this tutorial, the LLM is allocated:

  • 1 CPU
  • 8 Gi RAM
  • 1 GPU. The GPU type is inherited from the model upload step.
  • A deployment label that selects the node pool with the required GPU resources.

Once the deployment configuration is set:

  • The pipeline is created and the LLM added as a pipeline step.
  • The pipeline is deployed with the deployment configuration.

Once the deployment is complete, the LLM is ready to receive inference requests.

# Deploying

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(model_step, 1) \
    .sidekick_memory(model_step, '8Gi') \
    .sidekick_gpus(model_step, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

pipeline = wl.build_pipeline('tinyllama-openai')
pipeline.clear()
pipeline.add_model_step(model_step)
pipeline.deploy(deployment_config = deployment_config)
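
Before submitting inference requests, the deployment can be verified with the SDK’s pipeline.status() method. A minimal sketch, assuming the returned dictionary includes a 'status' field as in other Wallaroo deployments:

import time

# Poll until the deployment reports Running before submitting inference requests.
while pipeline.status()['status'] != 'Running':
    time.sleep(5)
print(pipeline.status()['status'])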

Inference Requests on LLM with OpenAI Compatibility Enabled

Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK or via OpenAI API endpoint requests.

OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:

  • Parameters set on chat/completions and completions requests override the existing OpenAI configuration options (see the sketch after this list).
  • If the stream option is enabled:
    • Outputs are returned as a list of chunks, i.e. as an event stream.
    • The inference request completes when all chunks are returned.
    • The response metadata includes ttft (time to first token), tps (tokens per second), and the user-specified OpenAI request parameters after the last chunk is generated.
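
The following sketch illustrates the override behavior using the SDK method described in the next section. It assumes temperature is accepted as a request-time parameter, taking precedence over the 0.3 value set in the OpenaiConfig above:

# The configured temperature (0.3) is overridden to 0.9 for this request only.
pipeline.openai_chat_completion(
    messages=[{"role": "user", "content": "good morning"}],
    temperature=0.9,
    max_tokens=50
).choices[0].message.content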

OpenAI API Inference Requests via the Wallaroo SDK

Inference requests on OpenAI-compatible models deployed in Wallaroo use the following Wallaroo SDK methods:

  • wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completions endpoint parameters.
  • wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completions endpoint parameters.

The following demonstrates performing an inference request using openai_chat_completion.

# Performing completions

pipeline.openai_chat_completion(messages=[{"role": "user", "content": "good morning"}]).choices[0].message.content
"Of course! Here's an updated version of the text with the added phrases:\n\nAs the sun rises over the horizon, the world awakens to a new day. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin, casting a golden glow over everything in sight. The world awakens to a new day, a new chapter, a new beginning. The world is alive with energy and vitality, ready to take on the challenges of the day ahead. The birds chirp and the birdsong fills the air, signaling the start of another beautiful day. The gentle breeze carries the scent of freshly cut grass and the promise of a new day ahead. The sun's rays warm the skin"

The following demonstrates performing an inference request using openai_completion.

pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200).choices[0].text
', any first-person shooter game. Wallaroo is a comprehensive platform for building and tracking predictive models. This tool is really helpful in AI development. Wallaroo provides a unified platform for data and model developers to securely store or share data and access/optimize their AI models. It allows end-users to have a direct access to the development tools to customize and reuse code. Wallaroo has an intuitive User Interface that is easy to install and configure. Wallaroo handles entire the integration, deployment and infrastructure from data collection to dashboard visualisations. Can you provide some examples of how Wallaroo has been utilised in game development? Also, talk about the effectiveness of ML training using Wallaroo.'

The following demonstrates performing an inference request using openai_chat_completion with token streaming enabled.

# Now with streaming

for chunk in pipeline.openai_chat_completion(messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
Once upon a time, in a small village nestled in the heart of the countryside, there lived a young woman named Lily. Lily was a kind and gentle soul, always looking out for those in need. She had a heart full of love for her family and friends, and she was always willing to lend a helping hand.

One day, Lily met a handsome young man named Jack. Jack was a charming and handsome man, with a

The following demonstrates performing an inference request using openai_completion with token streaming enabled.

# Now with streaming

for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=300, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
?" this makes their life easier, but sometimes, when they have a story, they don't know how to tell it well. This frustrates them and makes their life even more difficult.

b. Relaxation:
protagonist: take a deep breath and let it out. Why not start with a song? "Eyes full of longing, I need your music to embrace." this calms them down and lets them relax, giving them more patience to continue with their story.

c. Inspirational quotes:
protagonist: this quote from might jeffries helps me reflect on my beliefs and values: "the mind is a powerful thing, it can change your destiny at any time. Fear no fear, only trust your divineline and reclaim your destiny." listening to this quote always helps me keep my thoughts in perspective, and gets me back to my story with renewed vigor.

OpenAI API Inference Requests via the OpenAI Client Endpoints

Wallaroo deployed pipelines provide a deployment inference URL for inference requests via API methods. Pipelines deployed with OpenAI API compatibility enabled add OpenAI endpoint extensions to the inference URL for direct inference requests.

The following examples demonstrate performing inference requests through the deployed pipeline’s OpenAI API compatibility endpoint extensions.

Note that the command token = wl.auth.auth_header()['Authorization'].split()[1] retrieves the authentication token used to authenticate to Wallaroo before performing the inference request via API calls.

Connect to the OpenAI API Endpoint

The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.

# Now using the OpenAI client

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai/openai/v1',
    api_key=token
)
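
As a quick connection check, a non-streaming request can be submitted with the standard OpenAI client call; a minimal sketch (the model value appears to be a placeholder for deployed pipelines, as in the streaming examples below):

response = client.chat.completions.create(
    model="dummy",
    messages=[{"role": "user", "content": "good morning"}],
    max_tokens=50
)
print(response.choices[0].message.content)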
OpenAI API Inference Request Examples

The following demonstrates performing an inference request using the chat/completions endpoint with token streaming enabled.

for chunk in client.chat.completions.create(model="dummy", 
                                            messages=[{"role": "user", "content": "this is a short story about love"}], 
                                            max_tokens=1000, 
                                            stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
It was a warm summer evening, and the sun was setting over the city. A young couple, Alex and Emily, had just walked out of a coffee shop, hand in hand. They were laughing and chatting, enjoying the last few moments of their day.

As they walked down the street, Alex turned to Emily and said, "I love you, Emily."

Emily's eyes widened in surprise, and she smiled. "I love you too, Alex."

They walked for a few more blocks, hand in hand, and finally, they arrived at a park. They sat down on a bench, and Alex took Emily's hand.

"I know this is a little sudden," Alex said, "but I feel like we've been together for a while now. I want to spend the rest of my life with you."

Emily looked at him, her eyes filled with tears. "I feel the same way, Alex. I love you more than anything in this world."

They sat there, holding hands, for what felt like hours. They talked about everything and anything, their hearts beating in unison.

As the sun began to set, Alex and Emily stood up, and they walked back to the coffee shop. They hugged each other tightly, tears streaming down their faces.

"I love you, Alex," Emily said, her voice shaking.

Alex smiled, and he said, "I love you too, Emily."

They walked back to their apartment, hand in hand, and spent the rest of the night talking and laughing.

Over the next few weeks, Alex and Emily's love grew stronger. They spent every moment they could together, exploring the city, going on walks, and enjoying each other's company.

One day, they decided to take a walk in a nearby park. As they walked, they talked about everything and anything, their hearts beating in unison.

As they reached the end of the path, Alex turned to Emily and said, "I love you, Emily. I know we've only been together for a short time, but I feel like we've been through so much together. I want to spend the rest of my life with you."

Emily looked at him, her eyes filled with tears. "I feel the same way, Alex. I love you more than anything in this world."

They stood there, holding hands, and Alex said, "I love you, Emily. I want to spend the rest of my life with you."

Emily smiled, tears streaming down her face. "I love you too, Alex. I know this is a little sudden, but I feel like we've been together for a while now. I want to spend the rest of my life with you."

They walked back to their apartment, hand in hand, and spent the rest of the night talking and laughing.

From that day on, Alex and Emily's love grew stronger. They spent every moment they could together, exploring the city, going on walks, and enjoying each other's company.

Years later, they were married, and they had two beautiful children. They knew that their love had been a little sudden, but they knew that it was worth it. They knew that they had found each other, and they knew that they would spend the rest of their lives together, loving and cherishing each other.

The following demonstrates performing an inference request using the completions endpoint with token streaming enabled.

for chunk in client.completions.create(model="dummy", prompt="tell me about wallaroo.AI", max_tokens=1000, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
's robotic fabrication technology: can you provide some examples of products that have been milled using wallaroo’s robots?
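
The same endpoint extensions accept direct HTTP requests without the OpenAI client. A minimal sketch using the requests library, assuming the request body follows the standard OpenAI chat/completions schema at the base URL shown above:

import requests

token = wl.auth.auth_header()['Authorization'].split()[1]
base_url = 'https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-414/tinyllama-openai/openai/v1'

# POST directly to the pipeline's chat/completions endpoint extension.
response = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "dummy",
        "messages": [{"role": "user", "content": "good morning"}],
        "max_tokens": 100
    }
)
print(response.json()["choices"][0]["message"]["content"])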

Publish Pipeline for Edge Deployment

Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:

  • The native vLLM runtimes or custom models with OpenAI compatibility enabled.
  • If specified, the deployment configuration.
  • The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.

Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.

The following demonstrates publishing the Llama pipeline created and tested in the previous steps. Once published, it can be deployed to edge locations with the required resources matching the deployment configuration.

pipeline.publish(deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
................................. Published.
ID: 72
Pipeline Name: tinyllama-openai
Pipeline Version: 56b2cebd-fdc7-4c68-a081-837585df6a61
Status: Published
Workspace Id: 1689
Workspace Name: vllm-openai-test
Edges:
Engine URL: sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-6232
Pipeline URL: sample.registry.example.com/uat/pipelines/tinyllama-openai:56b2cebd-fdc7-4c68-a081-837585df6a61
Helm Chart URL: oci://sample.registry.example.com/uat/charts/tinyllama-openai
Helm Chart Reference: sample.registry.example.com/uat/charts@sha256:74fe1ea0410b4dad90dbda3db10904728c5a5c0c2ea2b60d0d8e889e4617347b
Helm Chart Version: 0.0.1-56b2cebd-fdc7-4c68-a081-837585df6a61
Engine Config: {'engine': {'resources': {'limits': {'cpu': 0.5, 'memory': '1Gi'}, 'requests': {'cpu': 0.5, 'memory': '1Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'tinyllamaopenaiyns1-766': {'resources': {'limits': {'cpu': 1.0, 'memory': '8Gi'}, 'requests': {'cpu': 1.0, 'memory': '8Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': True}}}}}
User Images: []
Created By: john.hummel@wallaroo.ai
Created At: 2025-07-01 17:27:42.112064+00:00
Updated At: 2025-07-01 17:27:42.112064+00:00
Replaces:
Docker Run Command:
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=sample.registry.example.com/uat/pipelines/tinyllama-openai:56b2cebd-fdc7-4c68-a081-837585df6a61 \
    -e CONFIG_CPUS=1.0 --gpus all --cpus=1.5 --memory=9g \
    sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini-cuda:v2025.1.0-6232

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://sample.registry.example.com/uat/charts/tinyllama-openai \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-56b2cebd-fdc7-4c68-a081-837585df6a61 \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

Undeploy

With the tutorial complete, the pipeline is undeployed, returning the resources allocated for the LLM’s exclusive use.

pipeline.undeploy()
Waiting for undeployment - this will take up to 600s .................................... ok
name: tinyllama-openai
created: 2025-05-30 20:40:46.518566+00:00
last_updated: 2025-05-30 21:15:17.806262+00:00
deployed: False
workspace_id: 1689
workspace_name: vllm-openai-test
arch: x86
accel: cuda
tags:
versions: c594f433-eaa7-45d9-903a-270314c1e3aa, 0017f356-8104-4708-ad73-70b5f93201d1, 239cf2e0-7e2c-4fe1-95fd-39aeacc559e8, e0337b32-0ff2-43e7-86b6-ecce344e326c
steps: tinyllamaopenai
published: False
