Deploy RAG LLM with OpenAI Compatibility


This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.

The following tutorial demonstrates deploying a Llama LLM with Retrieval-Augmented Generation (RAG) in Wallaroo with OpenAI API compatibility enabled. This allows developers to:

  • Take advantage of Wallaroo’s inference optimization to improve inference response times with more efficient resource allocation.
  • Migrate existing OpenAI client code with minimal changes.
  • Extend their LLMs’ capabilities with the Wallaroo Custom Model framework to add RAG functionality to an existing LLM.

Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:

  • wallaroo.framework.Framework.VLLM: Native async vLLM implementations.
  • wallaroo.framework.Framework.CUSTOM: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.

A typical situation is to either deploy the native vLLM runtime as a single model in a Wallaroo pipeline, or to deploy both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM’s capabilities. In this tutorial, RAG is added to improve the context of inference requests, providing better responses and preventing AI hallucinations.

This example uses one model for RAG, and one LLM with OpenAI compatibility enabled.

For access to these sample models and for a demonstration, contact your Wallaroo representative.

Tutorial Outline

This tutorial demonstrates how to:

  • Upload an LLM with the Wallaroo native vLLM framework and a Wallaroo Custom Model with the Custom Model framework.
  • Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
  • Set resource configurations for allocating CPUs, memory, etc.
  • Set the Custom Model runtime and native vLLM runtime as pipeline steps and deploy in Wallaroo.
  • Submit inference requests via:
    • The Wallaroo SDK methods openai_completion and openai_chat_completion.
    • Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.

Tutorial Requirements

This tutorial requires the following:

  • Wallaroo version 2025.1 and above.
  • Tiny Llama model and the Wallaroo RAG Custom Model. These are available from Wallaroo representatives upon request.

Tutorial Steps

Import Libraries

The following libraries are used for this tutorial, primarily the Wallaroo SDK.

import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.openai_config import OpenaiConfig
import pyarrow as pa

Connect to the Wallaroo Instance

A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.

This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.

If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.

wl = wallaroo.Client(request_timeout=600)

Create and Set the Current Workspace

This step creates the workspace. Uploaded LLMs and pipeline deployments are set within this workspace.

workspace = wl.get_workspace(name='vllm-openai-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)

Upload the LLM and Custom Model

The model is uploaded with the following parameters:

  • The model name
  • The file path to the model
  • The framework set to Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM
  • The input and output schemas are defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas.
  • Acceleration is set to NVIDIA CUDA for the LLM.
# Uploading the model

model_step = wl.upload_model(
    "tinyllamarag",
    "vllm-openai_tinyllama.zip",
    framework=Framework.VLLM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True,
    accel=Acceleration.CUDA
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime.................................
Model is attempting loading to a container runtime..........................
Successful
Ready
model_step = wl.get_model("tinyllamarag")
model_step
Name: tinyllamarag
Version: a7400b8e-bd7f-4982-8eb8-ab3b477b0ab7
File Name: vllm-openai_tinyllama.zip
SHA: db68af9c290cdc8d047b7ac70f5acbd446435d2767ac4dfd51509b750a78bdd0
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174
Architecture: x86
Acceleration: cuda
Updated At: 2025-03-Jun 19:30:37
Workspace id: 1689
Workspace name: vllm-openai-test
# Configuring as OpenAI

openai_config = OpenaiConfig(enabled=True)
model_step = model_step.configure(openai_config=openai_config)
# Uploading the model

rag_step = wl.upload_model(
    "ragstep",
    "openai_step.zip",
    framework=Framework.CUSTOM,
    input_schema=pa.schema([]),
    output_schema=pa.schema([]),
    convert_wait=True
)
Waiting for model loading - this will take up to 10min.

Model is pending loading to a container runtime..
Model is attempting loading to a container runtime.......................
Successful
Ready
rag_step = wl.get_model("ragstep")
rag_step
Name: ragstep
Version: 043a4831-22ed-4b51-95f0-ed8cc5511a59
File Name: openai_step.zip
SHA: 6f5c95e524da0a28e813dc70e81c46454f7b4594d35a23405a7c6438d7c01a29
Status: ready
Image Path: proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174
Architecture: x86
Acceleration: none
Updated At: 2025-03-Jun 19:32:45
Workspace id: 1689
Workspace name: vllm-openai-test

Enable OpenAI Compatibility

OpenAI compatibility is enabled via the model configuration using the class wallaroo.openai_config.OpenaiConfig, which includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.

Parameter | Type | Description
enabled | Boolean (Default: False) | If True, OpenAI compatibility is enabled. If False, OpenAI compatibility is not enabled and all other parameters are ignored.
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream; the stream parameter is only set at inference requests.
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat/completion parameters are available except stream; the stream parameter is only set at inference requests.
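
For example, the following minimal sketch pre-sets default completion and chat/completion parameters; the specific values shown (temperature, max_tokens) are illustrative assumptions rather than required settings.

openai_config = OpenaiConfig(
    enabled=True,
    # Default OpenAI completion parameters applied to completion requests (values are illustrative)
    completion_config={"temperature": 0.7, "max_tokens": 256},
    # Default OpenAI chat/completion parameters applied to chat/completion requests (values are illustrative)
    chat_completion_config={"temperature": 0.7, "max_tokens": 256}
)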

With the OpenaiConfig object defined, it is applied to the model configuration through the openai_config parameter.

openai_config = OpenaiConfig(enabled=True)
rag_step = rag_step.configure(openai_config=openai_config)

Set the Deployment Configuration and Deploy

The deployment configuration defines what resources are allocated to the LLM’s exclusive use. For this tutorial, the LLM is allocated:

  • Llama LLM:
    • 1 cpu
    • 8 Gi RAM
    • 1 GPU. The GPU type is inherited from the model upload step. For QAIC, each deployment configuration gpus value is the number of System-on-Chips (SoCs) to use.
  • RAG Model:
    • 1 cpu
    • 2 Gi RAM

Once the deployment configuration is set:

  • The pipeline is created.
  • The RAG model and the LLM are added as pipeline steps.
  • The pipeline is deployed with the deployment configuration.

Once the deployment is complete, the LLM is ready to receive inference requests.

# Deploying

deployment_config = wallaroo.DeploymentConfigBuilder() \
    .replica_count(1) \
    .cpus(.5) \
    .memory("1Gi") \
    .sidekick_cpus(rag_step, 1) \
    .sidekick_memory(rag_step, '2Gi') \
    .sidekick_cpus(model_step, 1) \
    .sidekick_memory(model_step, '8Gi') \
    .sidekick_gpus(model_step, 1) \
    .deployment_label('wallaroo.ai/accelerator:l4') \
    .build()

pipeline = wl.build_pipeline('tinyllama-openai-rag')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(rag_step)
pipeline.add_model_step(model_step)
pipeline.deploy(deployment_config = deployment_config)
Waiting for undeployment - this will take up to 600s ................................... ok
Waiting for deployment - this will take up to 600s .................................................................................................................................. ok
name: tinyllama-openai-rag
created: 2025-06-03 00:43:13.169150+00:00
last_updated: 2025-06-04 00:25:24.221145+00:00
deployed: True
workspace_id: 1689
workspace_name: vllm-openai-test
arch: x86
accel: none
tags:
versions: ca7ac20b-22a8-44ea-8897-31588cd4f4a1, 761bff73-d14c-40a6-9002-2ccef283412a, 18366faf-552e-46de-aa7e-b246c8d030a9, 83cd5a81-ba4f-431d-98f2-b6027a48aa29, a670f88f-e0cc-49a1-a7b5-d0d693cd372a, d6494410-d99e-438e-a6af-9fd381595ec6, 836c73d2-dcb5-4733-995b-bfb3cd3b5511, 1cb296d0-9e4c-45e3-a72a-6692ab279766, 082fc1b8-bedb-49ed-9ad9-dde8ed9549bd, b088f896-f29a-4ad1-a96c-c2920ab2b817, f21ed918-5681-45cd-a4fd-60a7b124c6d5, 23a26672-2ffb-410e-aa65-b4c949320700, d5517bcd-60a1-4964-ad58-87c038e9267b, 2e689953-1883-4300-8c11-7b854fe23c43, 7092df94-4ecd-4a82-8f2c-1c2577cd1641, 73040edb-ba7b-4b19-aedf-91d47f59c5cf
steps: ragstep
published: True

Inference Requests on LLM with OpenAI Compatibility Enabled

Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK, or via OpenAI API endpoint requests.

OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:

  • Parameters for chat/completion and completion override the existing OpenAI configuration options.
  • If the stream option is enabled:
    • Outputs are returned as a list of chunks, i.e. as an event stream.
    • The request inference call completes when all chunks are returned.
    • The response metadata includes ttft, tps and user-specified OpenAI request params after the last chunk is generated.

OpenAI API Inference Requests via the Wallaroo SDK and Inference Result Logs

Inference requests on models with OpenAI compatibility enabled in Wallaroo via the Wallaroo SDK use the following methods:

  • wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
  • wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.

The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:

  • ttft
  • tps
  • The OpenAI request parameter values set during the inference request.

The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the inference results log and displaying the out.json field, which includes the tps and ttft fields.
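
Below is a minimal sketch of that retrieval, assuming an inference request was made within the last five minutes; the same pattern is shown with live output later in this tutorial.

from datetime import datetime, timedelta

# Retrieve recent inference logs and display the OpenAI metrics field from the most recent entry
logs_df = pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=5),
                        end_datetime=datetime.now())
print(logs_df.iloc[-1]['out.json'])  # includes ttft, tps, and the OpenAI request parameters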

OpenAI API Inference Requests via Pipeline Deployment URLs with OpenAI Extensions

Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests from the OpenAI API client through the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:

  • {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API endpoint completion.
  • {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API endpoint chat/completion.

These requests require the following:

  • A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
  • Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
  • Access to the deployed pipeline’s OpenAI API endpoints.
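
As a minimal sketch of such a request made directly over HTTP rather than through the OpenAI client, the completions extension can be called with a bearer token; the base URL below is the example deployment endpoint used elsewhere in this tutorial and should be replaced with your own pipeline’s inference endpoint.

import requests

# Bearer token for Wallaroo MLOps API authentication, retrieved via the Wallaroo SDK
token = wl.auth.auth_header()['Authorization'].split()[1]

# Example deployment inference endpoint with the OpenAI completions extension (replace with your own)
url = "https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions"

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100},
)
print(response.json()["choices"][0]["text"])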

Inference and Inference Results Logs Examples

The following demonstrates performing an inference request using openai_chat_completion with token streaming enabled.

# Now with streaming. "Howdy" should appear in most responses
for _ in range(3):  # repeat a few requests rather than looping indefinitely
    print('\n----\n', flush=True)
    for chunk in pipeline.openai_chat_completion(messages=[
        {"role": "user", "content": "you are a good short story teller"}
    ], max_tokens=1000, stream=True):
        print(chunk.choices[0].delta.content, end="", flush=True)

----

I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how "howdy!" will be the opening line of a meaningful conversation.

1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let's name her sarah, speaks up.

"Hi, I'm sarah. How's your trip going?"

The locals respond, "Exceptional. Thanks for joining us!"

2. Excitement: Sarah follows up with a "Howdy!" to the locals. This echoes the foundational principle of starting every conversation with "howdy." The locals exchange a short greeting, "Howdy," in response.

3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.

4. Comfort and Connection: Sarah is fascinated with the locals' ability to strike up long-lasting conversations. She empathizes with the locals' tendency to connect with strangers.

5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.

6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can't seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.

7. Change in Viewpoint: Sarah realizes that she's not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.

8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.

9. Closure: Sarah shares the origin of this concept. She's from a country where the locals often say "howdy" to visitors. She explains how exciting this simple shorthand has been for her.

10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals' connection-building.

succinctly, sarah explains that "howdy," is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.

The following demonstrates retrieving the inference result logs for the recent openai_chat_completion with token streaming enabled request.

from datetime import datetime, timedelta
pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=5), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"delta":null,"finish_reason":null,"index":0,"message":{"content":"I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how \\"howdy!\\" will be the opening line of a meaningful conversation.\\n\\n1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let\'s name her sarah, speaks up.\\n\\n\\"Hi, I\'m sarah. How\'s your trip going?\\"\\n\\nThe locals respond, \\"Exceptional. Thanks for joining us!\\"\\n\\n2. Excitement: Sarah follows up with a \\"Howdy!\\" to the locals. This echoes the foundational principle of starting every conversation with \\"howdy.\\" The locals exchange a short greeting, \\"Howdy,\\" in response.\\n\\n3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.\\n\\n4. Comfort and Connection: Sarah is fascinated with the locals\' ability to strike up long-lasting conversations. She empathizes with the locals\' tendency to connect with strangers.\\n\\n5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.\\n\\n6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can\'t seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.\\n\\n7. Change in Viewpoint: Sarah realizes that she\'s not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.\\n\\n8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.\\n\\n9. Closure: Sarah shares the origin of this concept. She\'s from a country where the locals often say \\"howdy\\" to visitors. She explains how exciting this simple shorthand has been for her.\\n\\n10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals\' connection-building.\\n\\nsuccinctly, sarah explains that \\"howdy,\\" is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.","role":null}}],"created":1748996917,"id":"chatcmpl-199eff3baf834668a29f4ad777d58178","model":"vllm-openai_tinyllama.zip","object":"chat.completion.chunk","usage":{"completion_tokens":639,"prompt_tokens":51,"total_tokens":690,"tps":94.19393440980764,"ttft":0.020341099}}'

The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.

# Now using the OpenAI client

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
    api_key=token
)

The following demonstrates performing an inference request using the OpenAI API completions endpoint.

client.completions.create(model="", prompt="tell me a short story", max_tokens=100).choices[0].text
" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki's cylinder isn't meaningful anymore; remember that Loki is the lying one!\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki's comedy presence - emphasize the exaggerated quality of Imogen's hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"

The following demonstrates retrieving the inference result logs for the recent OpenAI API completions endpoint.

pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=3), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki\'s cylinder isn\'t meaningful anymore; remember that Loki is the lying one!\\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki\'s comedy presence - emphasize the exaggerated quality of Imogen\'s hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"}],"created":1748997757,"id":"cmpl-6eaeb190a246424a80b30256ce5716f2","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}'

The following demonstrates performing an inference request using the Wallaroo SDK openai_completion method with token streaming enabled.

# Streaming: Completion
for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=1000, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
 of a time when there was an emergency that required me to rush to help someone in a foreign country. Make sure the story is engaging, appropriate for a tourist's ability to comprehend, and includes exaurding sensory details of your experience. Include conversion rates for the emergency response/sMS service you use, if applicable. Use proper grammar and appropriate writing style throughout.

The following demonstrates performing an inference request using the Wallaroo SDK openai_chat_completion with token streaming enabled.

# Streaming: Chat completion
for chunk in pipeline.openai_chat_completion(messages=[
        {"role": "user", "content": "you are a story teller"}
    ],
    max_tokens=100,
    stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I appreciate that you find my stories intriguing, but I am not the ones who are telling stories. Instead, I am just a software that processes and interprets patterns of language based on spoken remarks. Stories are mostly the result of imaginative and creative thinking, and anyone can create excellent stories. I am just an intermediate bridge between an original works and the ones enjoyed by millions of people around the world.

that being said, let me present you with an original story

The following performs an inference request using the Wallaroo SDK method openai_completion.

# Non-streaming: Completion
response = pipeline.openai_completion(prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
 to wrap up the meeting

Investigation: Sometimes quite fascinated by the series of accidents in the factory, the safety inspectors decided to investigate.

Context: END SENTENCE WITH "I investigated... Time."

the investigation was a protracted tribulation I was involved with

Investigation: Did the investigators really think that my time-consuming investigation would lead to a fluctuation in the factory's production figures

The following performs an inference request using the Wallaroo SDK method openai_chat_completion.

# Non-streaming: Chat Completion
response = pipeline.openai_chat_completion(messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Of course, I'm a storyteller! But not all stories are meant to be seriolosely presented. I appreciate your compliment, and it's always exciting when I can twist a well-known story in a new and unexpected way. Here's a fictional tale inspired by Buffy the Vampire Slayer:

Title: "The Lost Buffy Book"

Introduction: 
"The Loners, a killer

The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.

######## OpenAI Client #########

token = wl.auth.auth_header()['Authorization'].split()[1]

from openai import OpenAI
client = OpenAI(
    base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
    api_key=token
)

The following demonstrates performing an inference request using the completions endpoint with token streaming enabled.

# Streaming: Completion
for chunk in client.completions.create(model="", prompt="tell me a short story", max_tokens=100, stream=True):
    print(chunk.choices[0].text, end="", flush=True)
 about a person who discovered a hidden talent

Voiceover: "Here's a true story about a guy named Jack."

[Intro Film Clip: Cut to an intimate shot of a character sitting in a cozy living room, reading a book]

Narrator: "Jack was sitting across from his beloved book, deep in thought. You see, Jack had always been content on his day-to-day life as an account

The following demonstrates performing an inference request using the chat/completions endpoint with token streaming enabled.

# Streaming: Chat completion
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100, stream=True):
    print(chunk.choices[0].delta.content, end="", flush=True)
I am not capable of true storytelling like humans. However, I can definitely help you compose a compelling story. When it comes to using "howdy!" as the starting sentence in your story, here are some tips that might help:

1. Make it first-person: Instead of starting your story with a third-person narrator, start with 'I was going', focusing on your own experience of starting the day off on the right foot. This will make your

The following demonstrates performing an inference request using the completions endpoint.

# Non-streaming: Completion
response = client.completions.create(model="whatever", prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
 Authors have written ingenious stories about small country people, small towns, and small lifestyles. Here’s one that is light and entertaining:

Title: The Big Cheese of High School

Episode One: Annabelle, a sophomore in high school, has just missed the kiss-off that seemed like just a hiccup. However, Annabelle is a gentle soul, and her pendulum swings further out of control

The following demonstrates performing an inference request using the chat/completions endpoint.

# Non-streaming: Chat completion
response = client.chat.completions.create(model="",messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Thank you for admiring my writing skills! Here's an example of how to use a greeting in a sentence:

Syntax sentence: "Excuse me, but can I have a moment of your time?"

Meaning: I am a friendly and polite person who is looking for brief conversation with someone else.

The response from the person in question could be: "Sure, let me give it a try."

**Imagery sentences

The following command demonstrates using the Wallaroo SDK to retrieve the authentication bearer token. This is used to authenticate for making Wallaroo API calls. For more details, see Wallaroo API Connection Guide.

token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI completions endpoint extension with token streaming enabled.

# Streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" third","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

...

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" reader","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" to","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}

data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[],"usage":{"prompt_tokens":27,"completion_tokens":100,"total_tokens":127,"ttft":0.023214041,"tps":93.92361686654164}}

data: [DONE]

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI chat/completions endpoint extension with token streaming enabled.

# Streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"I"}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" am"}}],"usage":null}


...

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ils"}}],"usage":null}

data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":","}}],"usage":null}

data: [DONE]

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI completions endpoint extension.

# Non-streaming: Completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}

The following demonstrates performing an inference request using the deployed pipeline’s OpenAI chat/completions endpoint extension.

# Non-streaming: Chat completion
!curl -X POST \
  -H "Authorization: Bearer abc123"  \
  -H "Content-Type: application/json" \
  -d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
  https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}

Publish Pipeline for Edge Deployment

Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config) command. This uploads the following artifacts to the OCI registry:

  • The native vLLM runtimes or custom models with OpenAI compatibility enabled.
  • If specified, the deployment configuration.
  • The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.

Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.

The following demonstrates publishing the RAG Llama pipeline created and tested in the previous steps. Once published, it can be deployed to edge locations with the required resources matching the deployment configuration.

pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
.......................... Published.
ID: 64
Pipeline Name: tinyllama-openai-rag
Pipeline Version: f87d169d-e436-4383-b19f-d863b032b24b
Status: Published
Workspace Id: 1689
Workspace Name: vllm-openai-test
Edges:
Engine URL: sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:v2025.1.0-6175
Pipeline URL: sample.registry.example.com/uat/pipelines/tinyllama-openai-rag:f87d169d-e436-4383-b19f-d863b032b24b
Helm Chart URL: oci://sample.registry.example.com/uat/charts/tinyllama-openai-rag
Helm Chart Reference: sample.registry.example.com/uat/charts@sha256:0170f78e853a9f0c8741dea808a1cbd2eec6750c0ac9d2e90936e20a260aca88
Helm Chart Version: 0.0.1-f87d169d-e436-4383-b19f-d863b032b24b
Engine Config: {'engine': {'resources': {'limits': {'cpu': 0.5, 'memory': '1Gi'}, 'requests': {'cpu': 0.5, 'memory': '1Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'ragstep-771': {'resources': {'limits': {'cpu': 1.0, 'memory': '2Gi'}, 'requests': {'cpu': 1.0, 'memory': '2Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'tinyllamarag-770': {'resources': {'limits': {'cpu': 1.0, 'memory': '8Gi'}, 'requests': {'cpu': 1.0, 'memory': '8Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': True}}}}}
User Images: []
Created By: sample.user@wallaroo.ai
Created At: 2025-06-04 00:48:09.573663+00:00
Updated At: 2025-06-04 00:48:09.573663+00:00
Replaces:
Docker Run Command
docker run \
    -p $EDGE_PORT:8080 \
    -e OCI_USERNAME=$OCI_USERNAME \
    -e OCI_PASSWORD=$OCI_PASSWORD \
    -e PIPELINE_URL=sample.registry.example.com/uat/pipelines/tinyllama-openai-rag:f87d169d-e436-4383-b19f-d863b032b24b \
    -e CONFIG_CPUS=1.0 --gpus all --cpus=2.5 --memory=11g \
    sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:v2025.1.0-6175

Note: Please set the EDGE_PORT, OCI_USERNAME, and OCI_PASSWORD environment variables.
Helm Install Command
helm install --atomic $HELM_INSTALL_NAME \
    oci://sample.registry.example.com/uat/charts/tinyllama-openai-rag \
    --namespace $HELM_INSTALL_NAMESPACE \
    --version 0.0.1-f87d169d-e436-4383-b19f-d863b032b24b \
    --set ociRegistry.username=$OCI_USERNAME \
    --set ociRegistry.password=$OCI_PASSWORD

Note: Please set the HELM_INSTALL_NAME, HELM_INSTALL_NAMESPACE, OCI_USERNAME, and OCI_PASSWORD environment variables.

For access to these sample models and for a demonstration, contact your Wallaroo representative.