Deploy RAG LLM with OpenAI Compatibility


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following tutorial demonstrates deploying a Llama LLM with Retrieval-Augmented Generation (RAG) in Wallaroo with OpenAI API compatibility enabled. This allows developers to:

  • Take advantage of Wallaroo’s inference optimization to improve inference response times with more efficient resource allocation.
  • Migrate existing OpenAI client code with minimal changes.
  • Extend their LLMs’ capabilities with the Wallaroo Custom Model framework to add RAG functionality to an existing LLM.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

A typical situation is to either deploy the native vLLM runtime as a single model in a Wallaroo pipeline, or to deploy both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM’s capabilities. In this tutorial, RAG is added to improve the context of inference requests, providing better responses and preventing AI hallucinations.

This example uses one model for RAG, and one LLM with OpenAI compatibility enabled.

For access to these sample models and for a demonstration, contact your Wallaroo representative.

Tutorial Outline

This tutorial demonstrates how to:

  • Upload an LLM with the Wallaroo native vLLM framework and a Wallaroo Custom Model with the Custom Model framework.
  • Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
  • Set resource configurations for allocating CPUs, memory, etc.
  • Set the Custom Model runtime and native vLLM runtime as pipeline steps and deploy in Wallaroo.
  • Submit inference requests via:
    • The Wallaroo SDK methods openai_completion and openai_chat_completion.
    • Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.

Tutorial Requirements

This tutorial requires the following:

  • Wallaroo version 2025.1 and above.
  • Tiny Llama model and the Wallaroo RAG Custom Model. These are available from Wallaroo representatives upon request.

Tutorial Steps

Import Libraries

The following libraries are used for this tutorial, primarily the Wallaroo SDK.

import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.openai_config import OpenaiConfig
import pyarrow as pa

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client(request_timeout=600)

Create and Set the Current Workspace

This step creates the workspace. Uploaded LLMs and pipeline deployments are set within this workspace.

workspace = wl.get_workspace(name='vllm-openai-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)

Upload the LLM and Custom Model

The model is uploaded with the following parameters:

  • The model name
  • The file path to the model
  • The framework set to Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM
  • The input and output schemas are defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas.
  • Acceleration is set to NVIDIA CUDA for the LLM.
# Uploading the model

model_step = wl.upload_model(
    "tinyllamarag",
    "vllm-openai_tinyllama.zip",
    framework=Framework.VLLM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True,
    accel=Acceleration.CUDA
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime.................................
Model is attempting loading to a container runtime..........................
Successful
Ready
model_step = wl.get_model("tinyllamarag")
model_step
Name: tinyllamarag
Version: a7400b8e-bd7f-4982-8eb8-ab3b477b0ab7
File Name: vllm-openai_tinyllama.zip
SHA: db68af9c290cdc8d047b7ac70f5acbd446435d2767ac4dfd51509b750a78bdd0
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174
Architecture: x86
Acceleration: cuda
Updated At: 2025-03-Jun 19:30:37
Workspace id: 1689
Workspace name: vllm-openai-test
# Configuring as OpenAI

openai_config = OpenaiConfig(enabled=True)
model_step = model_step.configure(openai_config=openai_config)
# Uploading the model

rag_step = wl.upload_model(
    "ragstep",
    "openai_step.zip",
    framework=Framework.CUSTOM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime.......................
Successful
Ready
rag_step = wl.get_model("ragstep")
rag_step
Name: ragstep
Version: 043a4831-22ed-4b51-95f0-ed8cc5511a59
File Name: openai_step.zip
SHA: 6f5c95e524da0a28e813dc70e81c46454f7b4594d35a23405a7c6438d7c01a29
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174
Architecture: x86
Acceleration: none
Updated At: 2025-03-Jun 19:32:45
Workspace id: 1689
Workspace name: vllm-openai-test

Enable OpenAI Compatibility

OpenAI compatibility is enabled via the model configuration using the class wallaroo.openai_config.OpenaiConfig, which includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.

Parameter | Type | Description
enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled and all other parameters are ignored.
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests.
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat/completion parameters are available except stream; the stream parameter is only set at inference requests.
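
For example, the following minimal sketch pre-sets default completion and chat/completion parameters; the specific values shown (temperature, max_tokens) are illustrative assumptions rather than required settings.

openai_config = OpenaiConfig(
    enabled=True,
    # Default OpenAI completion parameters applied to completion requests (values are illustrative)
    completion_config={"temperature": 0.7, "max_tokens": 256},
    # Default OpenAI chat/completion parameters applied to chat/completion requests (values are illustrative)
    chat_completion_config={"temperature": 0.7, "max_tokens": 256}
)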

With the OpenaiConfig object defined, it is applied to the model configuration through the openai_config parameter.

openai_config = OpenaiConfig(enabled=True)
rag_step = rag_step.configure(openai_config=openai_config)

Set the Deployment Configuration and Deploy

The deployment configuration defines what resources are allocated to the LLM’s exclusive use. For this tutorial, the LLM is allocated:

  • Llama LLM:
    • 1 cpu
    • 8 Gi RAM
    • 1 GPU. The GPU type is inherited from the model upload step. For QAIC, each deployment configuration gpus value is the number of System-on-Chips (SoCs) to use.
  • RAG Model:
    • 1 cpu
    • 2 Gi RAM

Once the deployment configuration is set:

  • The pipeline is created.
  • The RAG model and the LLM are added as pipeline steps.
  • The pipeline is deployed with the deployment configuration.

Once the deployment is complete, the LLM is ready to receive inference requests.

# Deploying

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(rag_step, 1) \
    .sidekick_memory(rag_step, '2Gi') \
    .sidekick_cpus(model_step, 1) \
    .sidekick_memory(model_step, '8Gi') \
    .sidekick_gpus(model_step, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

pipeline = wl.build_pipeline('tinyllama-openai-rag')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(rag_step)
pipeline.add_model_step(model_step)
pipeline.deploy(deployment_config = deployment_config)
Waiting for undeployment - this will take up to 600s ................................... ok
Waiting for deployment - this will take up to 600s .................................................................................................................................. ok
name: tinyllama-openai-rag
created: 2025-06-03 00:43:13.169150+00:00
last_updated: 2025-06-04 00:25:24.221145+00:00
deployed: True
workspace_id: 1689
workspace_name: vllm-openai-test
arch: x86
accel: none
tags:
versions: ca7ac20b-22a8-44ea-8897-31588cd4f4a1, 761bff73-d14c-40a6-9002-2ccef283412a, 18366faf-552e-46de-aa7e-b246c8d030a9, 83cd5a81-ba4f-431d-98f2-b6027a48aa29, a670f88f-e0cc-49a1-a7b5-d0d693cd372a, d6494410-d99e-438e-a6af-9fd381595ec6, 836c73d2-dcb5-4733-995b-bfb3cd3b5511, 1cb296d0-9e4c-45e3-a72a-6692ab279766, 082fc1b8-bedb-49ed-9ad9-dde8ed9549bd, b088f896-f29a-4ad1-a96c-c2920ab2b817, f21ed918-5681-45cd-a4fd-60a7b124c6d5, 23a26672-2ffb-410e-aa65-b4c949320700, d5517bcd-60a1-4964-ad58-87c038e9267b, 2e689953-1883-4300-8c11-7b854fe23c43, 7092df94-4ecd-4a82-8f2c-1c2577cd1641, 73040edb-ba7b-4b19-aedf-91d47f59c5cf
steps: ragstep
published: True

Inference Requests on LLM with OpenAI Compatibility Enabled

Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK, or via OpenAI API endpoint requests.

OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:

  • Parameters for chat/completion and completion override the existing OpenAI configuration options.
  • If the stream option is enabled:
    • Outputs are returned as a list of chunks, i.e. as an event stream.
    • The request inference call completes when all chunks are returned.
    • The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.

OpenAI API Inference Requests via the Wallaroo SDK and Inference Result Logs

Inference requests on models with OpenAI compatibility enabled in Wallaroo via the Wallaroo SDK use the following methods:

  • wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
  • wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.

The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:

  • ttft
  • tps
  • The OpenAI request parameter values set during the inference request.

The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the inference results log and displaying the out.json field, which includes the tps and ttft fields.
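
Below is a minimal sketch of that retrieval, assuming an inference request was made within the last five minutes; the same pattern is shown with live output later in this tutorial.

from datetime import datetime, timedelta

# Retrieve recent inference logs and display the OpenAI metrics field from the most recent entry
logs_df = pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=5),
                        end_datetime=datetime.now())
print(logs_df.iloc[-1]['out.json'])  # includes ttft, tps, and the OpenAI request parameters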

OpenAI API Inference Requests via Pipeline Deployment URLs with OpenAI Extensions

Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests from the OpenAI API client through the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:

  • {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
  • {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.

These requests require the following:

  • A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
  • Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
  • Access to the deployed pipeline’s OpenAI API endpoints.
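
As a minimal sketch of such a request made directly over HTTP rather than through the OpenAI client, the completions extension can be called with a bearer token; the base URL below is the example deployment endpoint used elsewhere in this tutorial and should be replaced with your own pipeline’s inference endpoint.

import requests

# Bearer token for Wallaroo MLOps API authentication, retrieved via the Wallaroo SDK
token = wl.auth.auth_header()['Authorization'].split()[1]

# Example deployment inference endpoint with the OpenAI completions extension (replace with your own)
url = "https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions"

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100},
)
print(response.json()["choices"][0]["text"])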

Inference and Inference Results Logs Examples

The following demonstrates performing an inference request using openai_chat_completion with token streaming enabled.

# Now with streaming. "Howdy" should appear in most responses
for _ in range(3):  # repeat a few requests rather than looping indefinitely
    print('\n----\n', flush=True)
    for chunk in pipeline.openai_chat_completion(messages=[
        {"role": "user", "content": "you are a good short story teller"}
    ], max_tokens=1000, stream=True):
        print(chunk.choices[0].delta.content, end="", flush=True)

----

I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how "howdy!" will be the opening line of a meaningful conversation.

1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let's name her sarah, speaks up.

"Hi, I'm sarah. How's your trip going?"

The locals respond, "Exceptional. Thanks for joining us!"

2. Excitement: Sarah follows up with a "Howdy!" to the locals. This echoes the foundational principle of starting every conversation with "howdy." The locals exchange a short greeting, "Howdy," in response.

3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.

4. Comfort and Connection: Sarah is fascinated with the locals' ability to strike up long-lasting conversations. She empathizes with the locals' tendency to connect with strangers.

5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.

6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can't seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.

7. Change in Viewpoint: Sarah realizes that she's not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.

8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.

9. Closure: Sarah shares the origin of this concept. She's from a country where the locals often say "howdy" to visitors. She explains how exciting this simple shorthand has been for her.

10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals' connection-building.

succinctly, sarah explains that "howdy," is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.

The following demonstrates retrieving the inference result logs for the recent openai_chat_completion with token streaming enabled request.

from datetime import datetime, timedelta
pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=5), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"delta":null,"finish_reason":null,"index":0,"message":{"content":"I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how \\"howdy!\\" will be the opening line of a meaningful conversation.\\n\\n1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let\'s name her sarah, speaks up.\\n\\n\\"Hi, I\'m sarah. How\'s your trip going?\\"\\n\\nThe locals respond, \\"Exceptional. Thanks for joining us!\\"\\n\\n2. Excitement: Sarah follows up with a \\"Howdy!\\" to the locals. This echoes the foundational principle of starting every conversation with \\"howdy.\\" The locals exchange a short greeting, \\"Howdy,\\" in response.\\n\\n3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.\\n\\n4. Comfort and Connection: Sarah is fascinated with the locals\' ability to strike up long-lasting conversations. She empathizes with the locals\' tendency to connect with strangers.\\n\\n5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.\\n\\n6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can\'t seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.\\n\\n7. Change in Viewpoint: Sarah realizes that she\'s not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.\\n\\n8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.\\n\\n9. Closure: Sarah shares the origin of this concept. She\'s from a country where the locals often say \\"howdy\\" to visitors. She explains how exciting this simple shorthand has been for her.\\n\\n10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals\' connection-building.\\n\\nsuccinctly, sarah explains that \\"howdy,\\" is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.","role":null}}],"created":1748996917,"id":"chatcmpl-199eff3baf834668a29f4ad777d58178","model":"vllm-openai_tinyllama.zip","object":"chat.completion.chunk","usage":{"completion_tokens":639,"prompt_tokens":51,"total_tokens":690,"tps":94.19393440980764,"ttft":0.020341099}}'

The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.

# Now using the OpenAI client

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
    api_key=token
)

The following demonstrates performing an inference request using the OpenAI API completions endpoint.

client.completions.create(model="", prompt="tell me a short story", max_tokens=100).choices[0].text
" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki's cylinder isn't meaningful anymore; remember that Loki is the lying one!\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki's comedy presence - emphasize the exaggerated quality of Imogen's hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"

The following demonstrates retrieving the inference result logs for the recent OpenAI API completions endpoint.

pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=3), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki\'s cylinder isn\'t meaningful anymore; remember that Loki is the lying one!\\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki\'s comedy presence - emphasize the exaggerated quality of Imogen\'s hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"}],"created":1748997757,"id":"cmpl-6eaeb190a246424a80b30256ce5716f2","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}'

The following demonstrates performing an inference request using the Wallaroo SDK openai_completion method with token streaming enabled.

# Streaming: Completion
for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=1000, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
 of a time when there was an emergency that required me to rush to help someone in a foreign country. Make sure the story is engaging, appropriate for a tourist's ability to comprehend, and includes exaurding sensory details of your experience. Include conversion rates for the emergency response/sMS service you use, if applicable. Use proper grammar and appropriate writing style throughout.

The following demonstrates performing an inference request using the Wallaroo SDK openai_chat_completion with token streaming enabled.

# Streaming: Chat completion
for chunk in pipeline.openai_chat_completion(messages=[
        {"role": "user", "content": "you are a story teller"}
    ],
    max_tokens=100,
    stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I appreciate that you find my stories intriguing, but I am not the ones who are telling stories. Instead, I am just a software that processes and interprets patterns of language based on spoken remarks. Stories are mostly the result of imaginative and creative thinking, and anyone can create excellent stories. I am just an intermediate bridge between an original works and the ones enjoyed by millions of people around the world.

that being said, let me present you with an original story

The following performs an inference request using the Wallaroo SDK method openai_completion.

# Non-streaming: Completion
response = pipeline.openai_completion(prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
 to wrap up the meeting

Investigation: Sometimes quite fascinated by the series of accidents in the factory, the safety inspectors decided to investigate.

Context: END SENTENCE WITH "I investigated... Time."

the investigation was a protracted tribulation I was involved with

Investigation: Did the investigators really think that my time-consuming investigation would lead to a fluctuation in the factory's production figures

The following performs an inference request using the Wallaroo SDK method openai_chat_completion.

# Non-streaming: Chat Completion
response = pipeline.openai_chat_completion(messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Of course, I'm a storyteller! But not all stories are meant to be seriolosely presented. I appreciate your compliment, and it's always exciting when I can twist a well-known story in a new and unexpected way. Here's a fictional tale inspired by Buffy the Vampire Slayer:

Title: "The Lost Buffy Book"

Introduction: 
"The Loners, a killer

The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.

######## OpenAI Client #########

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
    api_key=token
)

The following demonstrates performing an inference request using the completions endpoint with token streaming enabled.

# Streaming: Completion
for chunk in client.completions.create(model="", prompt="tell me a short story", max_tokens=100, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
 about a person who discovered a hidden talent

Voiceover: "Here's a true story about a guy named Jack."

[Intro Film Clip: Cut to an intimate shot of a character sitting in a cozy living room, reading a book]

Narrator: "Jack was sitting across from his beloved book, deep in thought. You see, Jack had always been content on his day-to-day life as an account

The following demonstrates performing an inference request using the chat/completions endpoint with token streaming enabled.

# Streaming: Chat completion
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I am not capable of true storytelling like humans. However, I can definitely help you compose a compelling story. When it comes to using "howdy!" as the starting sentence in your story, here are some tips that might help:

1. Make it first-person: Instead of starting your story with a third-person narrator, start with 'I was going', focusing on your own experience of starting the day off on the right foot. This will make your

The following demonstrates performing an inference request using the completions endpoint.

# Non-streaming: Completion
response = client.completions.create(model="whatever", prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
 Authors have written ingenious stories about small country people, small towns, and small lifestyles. Here’s one that is light and entertaining:

Title: The Big Cheese of High School

Episode One: Annabelle, a sophomore in high school, has just missed the kiss-off that seemed like just a hiccup. However, Annabelle is a gentle soul, and her pendulum swings further out of control

The following demonstrates performing an inference request using the chat/completions endpoint.

# Non-streaming: Chat completion
response = client.chat.completions.create(model="",messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Thank you for admiring my writing skills! Here's an example of how to use a greeting in a sentence:

Syntax sentence: "Excuse me, but can I have a moment of your time?"

Meaning: I am a friendly and polite person who is looking for brief conversation with someone else.

The response from the person in question could be: "Sure, let me give it a try."

**Imagery sentences

The following command demonstrates using the Wallaroo SDK to retrieve the authentication bearer token. This is used to authenticate for making Wallaroo API calls. For more details, see Wallaroo API Connection Guide.

token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI completions endpoint extension with token streaming enabled.

# Streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" third","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

...

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" reader","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" to","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[],"usage":{"prompt_tokens":27,"completion_tokens":100,"total_tokens":127,"ttft":0.023214041,"tps":93.92361686654164}}

data: [DONE]

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI chat/completions endpoint extension with token streaming enabled.

# Streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"I"}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" am"}}],"usage":null}


...

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ils"}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":","}}],"usage":null}

data: [DONE]

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI completions endpoint extension.

# Non-streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI chat/completions endpoint extension.

# Non-streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}

Publish Pipeline for Edge Deployment

Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:

  • The native vLLM runtimes or custom models with OpenAI compatibility enabled.
  • If specified, the deployment configuration.
  • The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.

Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.

The following demonstrates publishing the RAG Llama pipeline created and tested in the previous steps. Once published, it can be deployed to edge locations with the required resources matching the deployment configuration.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
.......................... Published.
ID: 64
Pipeline Name: tinyllama-openai-rag
Pipeline Version: f87d169d-e436-4383-b19f-d863b032b24b
Status: Published
Workspace Id: 1689
Workspace Name: vllm-openai-test
Edges:
Engine URL: sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:v2025.1.0-6175
Pipeline URL: sample.registry.example.com/uat/pipelines/tinyllama-openai-rag:f87d169d-e436-4383-b19f-d863b032b24b
Helm Chart URL: oci://sample.registry.example.com/uat/charts/tinyllama-openai-rag
Helm Chart Reference: sample.registry.example.com/uat/charts@sha256:0170f78e853a9f0c8741dea808a1cbd2eec6750c0ac9d2e90936e20a260aca88
Helm Chart Version: 0.0.1-f87d169d-e436-4383-b19f-d863b032b24b
Engine Config: {'engine': {'resources': {'limits': {'cpu': 0.5, 'memory': '1Gi'}, 'requests': {'cpu': 0.5, 'memory': '1Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'ragstep-771': {'resources': {'limits': {'cpu': 1.0, 'memory': '2Gi'}, 'requests': {'cpu': 1.0, 'memory': '2Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'tinyllamarag-770': {'resources': {'limits': {'cpu': 1.0, 'memory': '8Gi'}, 'requests': {'cpu': 1.0, 'memory': '8Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': True}}}}}
User Images: []
Created By: sample.user@wallaroo.ai
Created At: 2025-06-04 00:48:09.573663+00:00
Updated At: 2025-06-04 00:48:09.573663+00:00
Replaces:
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=sample.registry.example.com/uat/pipelines/tinyllama-openai-rag:f87d169d-e436-4383-b19f-d863b032b24b \
    -e CONFIG_CPUS=1.0 --gpus all --cpus=2.5 --memory=11g \
    sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:v2025.1.0-6175

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://sample.registry.example.com/uat/charts/tinyllama-openai-rag \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-f87d169d-e436-4383-b19f-d863b032b24b \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

For access to these sample models and for a demonstration, contact your Wallaroo representative.