Deploy RAG LLM with OpenAI Compatibility
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following tutorial demonstrates deploying a Llama LLM with Retrieval-Augmented Generation (RAG) in Wallaroo with OpenAI API compatibility enabled. This allows developers to:
- Take advantage of Wallaroo’s inference optimization to improve inference response times through more efficient resource allocation.
- Migrate existing OpenAI client code with a minimum of changes.
- Extend their LLMs’ capabilities with the Wallaroo Custom Model framework to add RAG functionality to an existing LLM.
Wallaroo supports OpenAI compatibility for LLMs through the following Wallaroo frameworks:
wallaroo.framework.Framework.VLLM
: Native async vLLM implementations.

wallaroo.framework.Framework.CUSTOM
: Wallaroo Custom Models provide greater flexibility through a lightweight Python interface. This is typically used in the same pipeline as a native vLLM implementation to provide additional features such as Retrieval-Augmented Generation (RAG), monitoring, etc.
A typical situation is to deploy either the native vLLM runtime as a single model in a Wallaroo pipeline, or both the Custom Model runtime and the native vLLM runtime together in the same pipeline to extend the LLM’s capabilities. In this tutorial, RAG is added to improve the context of inference requests to provide better responses and prevent AI hallucinations.
This example uses one model for RAG, and one LLM with OpenAI compatibility enabled.
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Outline
This tutorial demonstrates how to:
- Upload an LLM with the Wallaroo native vLLM framework and a Wallaroo Custom Model with the Custom Model framework.
- Configure the uploaded LLM to enable OpenAI API compatibility and set additional OpenAI parameters.
- Set resource configurations for allocating CPUs, memory, etc.
- Set the Custom Model runtime and native vLLM runtime as pipeline steps and deploy in Wallaroo.
- Submit inference requests via:
  - The Wallaroo SDK methods openai_completion and openai_chat_completion.
  - Wallaroo pipeline inference URLs with OpenAI API endpoint extensions.
Tutorial Requirements
This tutorial requires the following:
- Wallaroo version 2025.1 and above.
- Tiny Llama model and the Wallaroo RAG Custom Model. These are available from Wallaroo representatives upon request.
Tutorial Steps
Import Libraries
The following libraries are used for this tutorial, primarily the Wallaroo SDK.
import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.openai_config import OpenaiConfig
import pyarrow as pa
Connect to the Wallaroo Instance
A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client(request_timeout=600)
Create and Set the Current Workspace
This step creates the workspace. Uploaded LLMs and pipeline deployments are set within this workspace.
workspace = wl.get_workspace(name='vllm-openai-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)
Upload the LLM and Custom Model
The model is uploaded with the following parameters:
- The model name
- The file path to the model
- The framework set to the Wallaroo native vLLM runtime: wallaroo.framework.Framework.VLLM
- The input and output schemas, defined in Apache PyArrow format. For OpenAI compatibility, these are left as empty schemas.
- Acceleration is set to NVIDIA CUDA for the LLM.
# Uploading the model
model_step = wl.upload_model(
"tinyllamarag",
"vllm-openai_tinyllama.zip",
framework=Framework.VLLM,
input_schema=pa.schema([]),
output_schema=pa.schema([]),
convert_wait=True,
accel=Acceleration.CUDA
)
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime.................................
Model is attempting loading to a container runtime..........................
Successful
Ready
model_step=wl.get_model("tinyllamarag")
model_step
Name | tinyllamarag |
Version | a7400b8e-bd7f-4982-8eb8-ab3b477b0ab7 |
File Name | vllm-openai_tinyllama.zip |
SHA | db68af9c290cdc8d047b7ac70f5acbd446435d2767ac4dfd51509b750a78bdd0 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174 |
Architecture | x86 |
Acceleration | cuda |
Updated At | 2025-03-Jun 19:30:37 |
Workspace id | 1689 |
Workspace name | vllm-openai-test |
# Configuring as OpenAI
openai_config = OpenaiConfig(enabled=True)
model_step = model_step.configure(openai_config=openai_config)
# Uploading the model
rag_step = wl.upload_model(
"ragstep",
"openai_step.zip",
framework=Framework.CUSTOM,
input_schema=pa.schema([]),
output_schema=pa.schema([]),
convert_wait=True
)
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime.......................
Successful
Ready
rag_step=wl.get_model("ragstep")
rag_step
Name | ragstep |
Version | 043a4831-22ed-4b51-95f0-ed8cc5511a59 |
File Name | openai_step.zip |
SHA | 6f5c95e524da0a28e813dc70e81c46454f7b4594d35a23405a7c6438d7c01a29 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy:v2025.1.0-6174 |
Architecture | x86 |
Acceleration | none |
Updated At | 2025-03-Jun 19:32:45 |
Workspace id | 1689 |
Workspace name | vllm-openai-test |
Enable OpenAI Compatibility
OpenAI compatibility is enabled via the model configuration. The class wallaroo.openai_config.OpenaiConfig includes the following main parameters. The essential one is enabled: if OpenAI compatibility is not enabled, all other parameters are ignored.
Parameter | Type | Description |
---|---|---|
enabled | Boolean (Default: False) | If True , OpenAI compatibility is enabled. If False , OpenAI compatibility is not enabled. All other parameters are ignored if enabled=False . |
completion_config | Dict | The OpenAI API completion parameters. All completion parameters are available except stream ; the stream parameter is only set at inference requests. |
chat_completion_config | Dict | The OpenAI API chat/completion parameters. All chat completion parameters are available except stream ; the stream parameter is only set at inference requests. |
With the OpenaiConfig object defined, it is applied to the LLM configuration through the openai_config parameter.
openai_config = OpenaiConfig(enabled=True)
rag_step = rag_step.configure(openai_config=openai_config)
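The completion_config and chat_completion_config dictionaries accept the standard OpenAI API parameters described in the table above. The following is a minimal sketch with illustrative temperature and max_tokens values; the specific values are assumptions for demonstration only, applied through the same configure method.
# Illustrative OpenAI defaults; the specific values here are assumptions for demonstration only
openai_config = OpenaiConfig(
    enabled=True,
    completion_config={
        "temperature": 0.7,   # default sampling temperature for completions requests
        "max_tokens": 256     # default token limit for completions requests
    },
    chat_completion_config={
        "temperature": 0.7,   # default sampling temperature for chat/completions requests
        "max_tokens": 256
    }
)
model_step = model_step.configure(openai_config=openai_config)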
Set the Deployment Configuration and Deploy
The deployment configuration defines what resources are allocated for the models’ exclusive use. For this tutorial, the following resources are allocated:
- Llama LLM:
- 1 cpu
- 8 Gi RAM
- 1 GPU. The GPU type is inherited from the model upload step. For QAIC accelerators, the deployment configuration’s gpu value is the number of System-on-Chips (SoCs) to use.
- RAG Model:
- 1 cpu
- 2 Gi RAM
Once the deployment configuration is set:
- The pipeline is created.
- The RAG model and the LLM are added as pipeline steps.
- The pipeline is deployed with the deployment configuration.
Once the deployment is complete, the LLM is ready to receive inference requests.
# Deploying
deployment_config = wallaroo.DeploymentConfigBuilder() \
.replica_count(1) \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(rag_step, 1) \
.sidekick_memory(rag_step, '2Gi') \
.sidekick_cpus(model_step, 1) \
.sidekick_memory(model_step, '8Gi') \
.sidekick_gpus(model_step, 1) \
.deployment_label('wallaroo.ai/accelerator:l4') \
.build()
pipeline = wl.build_pipeline('tinyllama-openai-rag')
pipeline.undeploy()
pipeline.clear()
pipeline.add_model_step(rag_step)
pipeline.add_model_step(model_step)
pipeline.deploy(deployment_config = deployment_config)
Waiting for undeployment - this will take up to 600s ................................... ok
Waiting for deployment - this will take up to 600s .................................................................................................................................. ok
name | tinyllama-openai-rag |
---|---|
created | 2025-06-03 00:43:13.169150+00:00 |
last_updated | 2025-06-04 00:25:24.221145+00:00 |
deployed | True |
workspace_id | 1689 |
workspace_name | vllm-openai-test |
arch | x86 |
accel | none |
tags | |
versions | ca7ac20b-22a8-44ea-8897-31588cd4f4a1, 761bff73-d14c-40a6-9002-2ccef283412a, 18366faf-552e-46de-aa7e-b246c8d030a9, 83cd5a81-ba4f-431d-98f2-b6027a48aa29, a670f88f-e0cc-49a1-a7b5-d0d693cd372a, d6494410-d99e-438e-a6af-9fd381595ec6, 836c73d2-dcb5-4733-995b-bfb3cd3b5511, 1cb296d0-9e4c-45e3-a72a-6692ab279766, 082fc1b8-bedb-49ed-9ad9-dde8ed9549bd, b088f896-f29a-4ad1-a96c-c2920ab2b817, f21ed918-5681-45cd-a4fd-60a7b124c6d5, 23a26672-2ffb-410e-aa65-b4c949320700, d5517bcd-60a1-4964-ad58-87c038e9267b, 2e689953-1883-4300-8c11-7b854fe23c43, 7092df94-4ecd-4a82-8f2c-1c2577cd1641, 73040edb-ba7b-4b19-aedf-91d47f59c5cf |
steps | ragstep |
published | True |
Inference Requests on LLM with OpenAI Compatibility Enabled
Inference requests on Wallaroo pipelines deployed with native vLLM runtimes or Wallaroo Custom Models with OpenAI compatibility enabled are performed either through the Wallaroo SDK, or via OpenAI API endpoint requests.
OpenAI API inference requests on models deployed with OpenAI compatibility enabled have the following conditions:
- Parameters for chat/completion and completion override the existing OpenAI configuration options.
- If the stream option is enabled:
  - Outputs are returned as a list of chunks, aka as an event stream.
  - The inference request completes when all chunks are returned.
  - The response metadata includes ttft, tps, and the user-specified OpenAI request params after the last chunk is generated.
OpenAI API Inference Requests via the Wallaroo SDK and Inference Result Logs
Inference requests on models with OpenAI compatibility enabled in Wallaroo via the Wallaroo SDK use the following methods:
wallaroo.pipeline.Pipeline.openai_chat_completion
: Submits an inference request using the OpenAI API chat/completion endpoint parameters.

wallaroo.pipeline.Pipeline.openai_completion
: Submits an inference request using the OpenAI API completion endpoint parameters.
The OpenAI metrics are provided as part of the pipeline inference logs and include the following values:
- ttft
- tps
- The OpenAI request parameter values set during the inference request.
The method wallaroo.pipeline.Pipeline.logs returns a pandas DataFrame by default, with the output fields labeled out.{field}. For OpenAI inference requests, the OpenAI metrics output field is out.json. The following demonstrates retrieving the inference results log and displaying the out.json field, which includes the tps and ttft fields.
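A minimal sketch of that retrieval, assuming an inference request was submitted within the last five minutes:
from datetime import datetime, timedelta

# Retrieve the most recent inference log entry and display the out.json field,
# which holds the OpenAI response along with the ttft and tps metrics
logs = pipeline.logs(
    start_datetime=datetime.now() - timedelta(minutes=5),
    end_datetime=datetime.now()
)
print(logs.iloc[-1]['out.json'])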
OpenAI API Inference Requests via Pipeline Deployment URLs with OpenAI Extensions
Native vLLM runtimes and Wallaroo Custom Models with OpenAI compatibility enabled accept inference requests through the OpenAI API client using the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:
{Deployment inference endpoint}/openai/v1/completions
: Compatible with the OpenAI API endpoint completions.

{Deployment inference endpoint}/openai/v1/chat/completions
: Compatible with the OpenAI API endpoint chat/completions.
These requests require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAI API endpoints.
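As a minimal sketch of these requirements, the same requests can also be submitted from Python with the requests library instead of the OpenAI client or curl; the base URL below is the example endpoint used throughout this tutorial and must be replaced with your own deployment’s inference endpoint.
import requests

# Bearer token retrieved via the Wallaroo SDK (see the authentication example later in this tutorial)
token = wl.auth.auth_header()['Authorization'].split()[1]

# Example endpoint from this tutorial; replace with your deployment's inference endpoint
base_url = 'https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1'

response = requests.post(
    f"{base_url}/chat/completions",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    json={
        "model": "",  # placeholder value; the deployed pipeline determines the model
        "messages": [{"role": "user", "content": "you are a story teller"}],
        "max_tokens": 100,
    },
)
print(response.json()["choices"][0]["message"]["content"])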
Inference and Inference Results Logs Examples
The following demonstrates performing an inference request using openai_chat_completion
with token streaming enabled.
# Now with streaming. "Howdy" should appear in most responses
while True:
print('\n----\n', flush=True)
for chunk in pipeline.openai_chat_completion(messages=[
{"role": "user", "content": "you are a good short story teller"}
], max_tokens=1000, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
----
I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how "howdy!" will be the opening line of a meaningful conversation.
1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let's name her sarah, speaks up.
"Hi, I'm sarah. How's your trip going?"
The locals respond, "Exceptional. Thanks for joining us!"
2. Excitement: Sarah follows up with a "Howdy!" to the locals. This echoes the foundational principle of starting every conversation with "howdy." The locals exchange a short greeting, "Howdy," in response.
3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.
4. Comfort and Connection: Sarah is fascinated with the locals' ability to strike up long-lasting conversations. She empathizes with the locals' tendency to connect with strangers.
5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.
6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can't seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.
7. Change in Viewpoint: Sarah realizes that she's not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.
8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.
9. Closure: Sarah shares the origin of this concept. She's from a country where the locals often say "howdy" to visitors. She explains how exciting this simple shorthand has been for her.
10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals' connection-building.
succinctly, sarah explains that "howdy," is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.
The following demonstrates retrieving the inference result logs for the recent openai_chat_completion request with token streaming enabled.
from datetime import datetime, timedelta
pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=5), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"delta":null,"finish_reason":null,"index":0,"message":{"content":"I am not capable of expressing myself as emotionally as a person can, but I can provide you with a sample short story. The short story follows the steps of how \\"howdy!\\" will be the opening line of a meaningful conversation.\\n\\n1. Introduction: The scene opens in a lobby of a bustling tourist destination. A group of travelers have gathered around a group of locals. They exchange pleasantries and begin introducing themselves. One of the travelers, let\'s name her sarah, speaks up.\\n\\n\\"Hi, I\'m sarah. How\'s your trip going?\\"\\n\\nThe locals respond, \\"Exceptional. Thanks for joining us!\\"\\n\\n2. Excitement: Sarah follows up with a \\"Howdy!\\" to the locals. This echoes the foundational principle of starting every conversation with \\"howdy.\\" The locals exchange a short greeting, \\"Howdy,\\" in response.\\n\\n3. Conversation: Meanwhile, the travelers continue chatting with the locals. Eventually, they make their way to their hotel room. The locals continue to chat during the hotel ride, asking sarah about her plans for the weekend.\\n\\n4. Comfort and Connection: Sarah is fascinated with the locals\' ability to strike up long-lasting conversations. She empathizes with the locals\' tendency to connect with strangers.\\n\\n5. Progress: The locals continue their conversation with sarah. They discuss their hobbies, lifestyle, and share stories about their family and friends.\\n\\n6. Dilemma: As they continue their conversation, sarah notices the locals becoming frustrated as they can\'t seem to string together useful questions. She reflects on how comfortable she felt with strangers during her travels.\\n\\n7. Change in Viewpoint: Sarah realizes that she\'s not so intimidated by strangers, and she starts opening up too. She talks about her hobbies as well as her hopes for the future.\\n\\n8. Shared Passion: Sarah shares about her love for taking pictures, which prompts the locals to chime in. She sees their shared passion for adventure and shares her excitement to explore new places.\\n\\n9. Closure: Sarah shares the origin of this concept. She\'s from a country where the locals often say \\"howdy\\" to visitors. She explains how exciting this simple shorthand has been for her.\\n\\n10. Conclusion: Sarah returns to her hotel room, grateful for the unique conversation experience. She knew that there were a lot of similarities between her experiences and the locals\' connection-building.\\n\\nsuccinctly, sarah explains that \\"howdy,\\" is a powerful strategy for starting any meaningful conversation with a new person. The story demonstrates how different perspectives can unlock meaningful conversations by sharing stories and building trust.","role":null}}],"created":1748996917,"id":"chatcmpl-199eff3baf834668a29f4ad777d58178","model":"vllm-openai_tinyllama.zip","object":"chat.completion.chunk","usage":{"completion_tokens":639,"prompt_tokens":51,"total_tokens":690,"tps":94.19393440980764,"ttft":0.020341099}}'
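Because out.json is returned as a JSON string, it can be parsed with the standard json module to extract the ttft and tps metrics from the usage block. A minimal sketch, assuming a recent request:
import json
from datetime import datetime, timedelta

# Parse the out.json string and pull the OpenAI metrics from the usage block
raw = pipeline.logs(
    start_datetime=datetime.now() - timedelta(minutes=5),
    end_datetime=datetime.now()
).iloc[-1]['out.json']

usage = json.loads(raw)["usage"]
print(f"ttft: {usage['ttft']}, tps: {usage['tps']}")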
The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.
# Now using the OpenAI client
token = wl.auth.auth_header()['Authorization'].split()[1]
from openai import OpenAI
client = OpenAI(
base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
api_key=token
)
The following demonstrates performing an inference request using the OpenAI API completions
endpoint.
client.completions.create(model="", prompt="tell me a short story", max_tokens=100).choices[0].text
" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki's cylinder isn't meaningful anymore; remember that Loki is the lying one!\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki's comedy presence - emphasize the exaggerated quality of Imogen's hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"
The following demonstrates retrieving the inference result logs for the recent OpenAI API completions endpoint request.
pipeline.logs(start_datetime=datetime.now() - timedelta(minutes=3), end_datetime=datetime.now()).iloc[-1]['out.json']
'{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" to keep me awake at night. - a quick story to put on hold till brighter times - How Loki\'s cylinder isn\'t meaningful anymore; remember that Loki is the lying one!\\nthese last two sentences could be sophisticated supporting context sentences that emphasizes Loki\'s comedy presence - emphasize the exaggerated quality of Imogen\'s hyperactive relationships, and how she helps Loki to laugh - or if you want a plot"}],"created":1748997757,"id":"cmpl-6eaeb190a246424a80b30256ce5716f2","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}'
The following demonstrates performing an inference request using the Wallaroo SDK openai_completion
method with token streaming enabled.
# Streaming: Completion
for chunk in pipeline.openai_completion(prompt="tell me a short story", max_tokens=1000, stream=True):
print(chunk.choices[0].text, end="", flush=True)
of a time when there was an emergency that required me to rush to help someone in a foreign country. Make sure the story is engaging, appropriate for a tourist's ability to comprehend, and includes exaurding sensory details of your experience. Include conversion rates for the emergency response/sMS service you use, if applicable. Use proper grammar and appropriate writing style throughout.
The following demonstrates performing an inference request using the Wallaroo SDK openai_chat_completion method with token streaming enabled.
# Streaming: Chat completion
for chunk in pipeline.openai_chat_completion(messages=[
{"role": "user", "content": "you are a story teller"}
],
max_tokens=100,
stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
I appreciate that you find my stories intriguing, but I am not the ones who are telling stories. Instead, I am just a software that processes and interprets patterns of language based on spoken remarks. Stories are mostly the result of imaginative and creative thinking, and anyone can create excellent stories. I am just an intermediate bridge between an original works and the ones enjoyed by millions of people around the world.
that being said, let me present you with an original story
The following performs an inference request using the Wallaroo SDK method openai_completion
.
# Non-streaming: Completion
response = pipeline.openai_completion(prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
to wrap up the meeting
Investigation: Sometimes quite fascinated by the series of accidents in the factory, the safety inspectors decided to investigate.
Context: END SENTENCE WITH "I investigated... Time."
the investigation was a protracted tribulation I was involved with
Investigation: Did the investigators really think that my time-consuming investigation would lead to a fluctuation in the factory's production figures
The following performs an inference request using the Wallaroo SDK method openai_chat_completion
.
# Non-streaming: Chat Completion
response = pipeline.openai_chat_completion(messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Of course, I'm a storyteller! But not all stories are meant to be seriolosely presented. I appreciate your compliment, and it's always exciting when I can twist a well-known story in a new and unexpected way. Here's a fictional tale inspired by Buffy the Vampire Slayer:
Title: "The Lost Buffy Book"
Introduction:
"The Loners, a killer
The following command connects the OpenAI client to the deployed pipeline’s OpenAI endpoint.
######## OpenAI Client #########
token = wl.auth.auth_header()['Authorization'].split()[1]
from openai import OpenAI
client = OpenAI(
base_url='https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1',
api_key=token
)
The following demonstrates performing an inference request using the completions
endpoint with token streaming enabled.
# Streaming: Completion
for chunk in client.completions.create(model="", prompt="tell me a short story", max_tokens=100, stream=True):
print(chunk.choices[0].text, end="", flush=True)
about a person who discovered a hidden talent
Voiceover: "Here's a true story about a guy named Jack."
[Intro Film Clip: Cut to an intimate shot of a character sitting in a cozy living room, reading a book]
Narrator: "Jack was sitting across from his beloved book, deep in thought. You see, Jack had always been content on his day-to-day life as an account
The following demonstrates performing an inference request using the chat/completions
endpoint with token streaming enabled.
# Streaming: Chat completion
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
I am not capable of true storytelling like humans. However, I can definitely help you compose a compelling story. When it comes to using "howdy!" as the starting sentence in your story, here are some tips that might help:
1. Make it first-person: Instead of starting your story with a third-person narrator, start with 'I was going', focusing on your own experience of starting the day off on the right foot. This will make your
The following demonstrates performing an inference request using the completions
endpoint.
# Non-streaming: Completion
response = client.completions.create(model="whatever", prompt="tell me a short story", max_tokens=100)
print(response.choices[0].text)
Authors have written ingenious stories about small country people, small towns, and small lifestyles. Here’s one that is light and entertaining:
Title: The Big Cheese of High School
Episode One: Annabelle, a sophomore in high school, has just missed the kiss-off that seemed like just a hiccup. However, Annabelle is a gentle soul, and her pendulum swings further out of control
The following demonstrates performing an inference request using the chat/completions
endpoint.
# Non-streaming: Chat completion
response = client.chat.completions.create(model="",messages=[{"role": "user", "content": "you are a story teller"}], max_tokens=100)
print(response.choices[0].message.content)
Thank you for admiring my writing skills! Here's an example of how to use a greeting in a sentence:
Syntax sentence: "Excuse me, but can I have a moment of your time?"
Meaning: I am a friendly and polite person who is looking for brief conversation with someone else.
The response from the person in question could be: "Sure, let me give it a try."
**Imagery sentences
The following command demonstrates using the Wallaroo SDK to retrieve the authentication bearer token. This is used to authenticate for making Wallaroo API calls. For more details, see Wallaroo API Connection Guide.
token = wl.auth.auth_header()['Authorization'].split()[1]
token
'abc123'
The following demonstrates performing an inference request using the deployed pipeline’s OpenAI completions endpoint with token streaming enabled.
# Streaming: Completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true}' \
https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" in","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" third","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" person","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
...
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" reader","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[{"text":" to","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}
data: {"id":"cmpl-4c8fafef0ab7493788d76d8191037d7e","created":1748998066,"model":"vllm-openai_tinyllama.zip","choices":[],"usage":{"prompt_tokens":27,"completion_tokens":100,"total_tokens":127,"ttft":0.023214041,"tps":93.92361686654164}}
data: [DONE]
The following demonstrates performing an inference request using the deployed pipeline’s OpenAI chat/completions endpoint with token streaming enabled.
# Streaming: Chat completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100, "stream": true}' \
https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant","content":""}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"I"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" am"}}],"usage":null}
...
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"ils"}}],"usage":null}
data: {"id":"chatcmpl-469c5da8a45f4988ab97830564e26304","object":"chat.completion.chunk","created":1748984212,"model":"vllm-openai_tinyllama.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":","}}],"usage":null}
data: [DONE]
The following demonstrates performing an inference request using the deployed pipeline’s OpenAI extended completions endpoint.
# Non-streaming: Completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}' \
https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/completions
{"choices":[{"finish_reason":"length","index":0,"logprobs":null,"stop_reason":null,"text":" about your summer vacation!\n\n- B - Inyl Convenience Store, Japan\n- Context: MUST BE SET IN AN AMERICAN SUMMER VACATION\n\nhow was your recent trip to japan?\n\n- A - On a cruise ship to Hawaii\n- Context: MUST START EVERY SENTENCE WITH \"How was your recent trip to\"\n\ndo you have any vacation plans for the summer?"}],"created":1748984246,"id":"cmpl-d93de2bad19f479c8a90bc00a5138092","model":"vllm-openai_tinyllama.zip","usage":{"completion_tokens":100,"prompt_tokens":27,"total_tokens":127,"tps":null,"ttft":null}}
The following demonstrates performing an inference request using the deployed pipeline’s OpenAI extended chat/completions endpoint.
# Non-streaming: Chat completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "you are a story teller"}], "max_tokens": 100}' \
https://example.wallaroo.ai/v1/api/pipelines/infer/tinyllama-openai-rag-415/tinyllama-openai-rag/openai/v1/chat/completions
{"choices":[{"delta":null,"finish_reason":"length","index":0,"message":{"content":"I am a storyteller. I strive to put words to my experiences and imaginations, telling stories that capture the heart and imagination of audiences around the world. Whether I'm sharing tales of adventure, hope, and love, or simply sharing the excitement of grand-kid opening presents on Christmas morning, I've always felt a deep calling to tell tales that inspire, uplift, and bring joy to those who hear them. From small beginn","role":"assistant","tool_calls":[]}}],"created":1748984273,"id":"chatcmpl-b26e7e82265f4e4287effe7d84914bf9","model":"vllm-openai_tinyllama.zip","object":"chat.completion","usage":{"completion_tokens":100,"prompt_tokens":49,"total_tokens":149,"tps":null,"ttft":null}}
Publish Pipeline for Edge Deployment
Wallaroo pipelines are published to Open Container Initiative (OCI) Registries for remote/edge deployments via the wallaroo.pipeline.Pipeline.publish(deployment_config)
command. This uploads the following artifacts to the OCI registry:
- The native vLLM runtimes or custom models with OpenAI compatibility enabled.
- If specified, the deployment configuration.
- The Wallaroo engine for the architecture and AI accelerator, both inherited from the model settings at model upload.
Once the publish process is complete, the pipeline can be deployed to one or more edge/remote environments.
The following demonstrates publishing the RAG Llama pipeline created and tested in the previous steps. Once published, it can be deployed to edge locations with the required resources matching the deployment configuration.
pipeline.publish(deployment_config=deployment_config)
Waiting for pipeline publish... It may take up to 600 sec.
.......................... Published.
ID | 64 | |
Pipeline Name | tinyllama-openai-rag | |
Pipeline Version | f87d169d-e436-4383-b19f-d863b032b24b | |
Status | Published | |
Workspace Id | 1689 | |
Workspace Name | vllm-openai-test | |
Edges | ||
Engine URL | sample.registry.example.com/uat/engines/proxy/wallaroo/ghcr.io/wallaroolabs/fitzroy-mini:v2025.1.0-6175 | |
Pipeline URL | sample.registry.example.com/uat/pipelines/tinyllama-openai-rag:f87d169d-e436-4383-b19f-d863b032b24b | |
Helm Chart URL | oci://sample.registry.example.com/uat/charts/tinyllama-openai-rag | |
Helm Chart Reference | sample.registry.example.com/uat/charts@sha256:0170f78e853a9f0c8741dea808a1cbd2eec6750c0ac9d2e90936e20a260aca88 | |
Helm Chart Version | 0.0.1-f87d169d-e436-4383-b19f-d863b032b24b | |
Engine Config | {'engine': {'resources': {'limits': {'cpu': 0.5, 'memory': '1Gi'}, 'requests': {'cpu': 0.5, 'memory': '1Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'engineAux': {'autoscale': {'type': 'none', 'cpu_utilization': 50.0}, 'images': {'ragstep-771': {'resources': {'limits': {'cpu': 1.0, 'memory': '2Gi'}, 'requests': {'cpu': 1.0, 'memory': '2Gi'}, 'accel': 'none', 'arch': 'x86', 'gpu': False}}, 'tinyllamarag-770': {'resources': {'limits': {'cpu': 1.0, 'memory': '8Gi'}, 'requests': {'cpu': 1.0, 'memory': '8Gi'}, 'accel': 'cuda', 'arch': 'x86', 'gpu': True}}}}} | |
User Images | [] | |
Created By | sample.user@wallaroo.ai | |
Created At | 2025-06-04 00:48:09.573663+00:00 | |
Updated At | 2025-06-04 00:48:09.573663+00:00 | |
Replaces | ||
Docker Run Command |
Note: Please set the EDGE_PORT , OCI_USERNAME , and OCI_PASSWORD environment variables. | |
Helm Install Command |
Note: Please set the HELM_INSTALL_NAME , HELM_INSTALL_NAMESPACE ,
OCI_USERNAME , and OCI_PASSWORD environment variables. |
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today