Deploy Llama with Continuous Batching Using the Native vLLM Framework, QAIC AI Acceleration, and OpenAI Inference
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Deploy Llama with Continuous Batching Using Native vLLM Framework and QAIC AI Acceleration using OpenAI Compatibility
The following tutorial demonstrates deploying the Llama LLM with the following enhancements:
- The Wallaroo Native vLLM Framework: Provides performance optimizations with framework configuration options.
- Continuous Batching: Configurable batch sizes balance latency and throughput.
- QAIC AI Acceleration: x86-compatible architecture at low power with AI acceleration.
- OpenAI API compatibility: The LLM accepts inference requests through the OpenAI completion and chat/completion endpoints, compatible with OpenAI API clients.
For access to these sample models and for a demonstration of how to use a LLM deployment with QAIC acceleration, OpenAI API compatibility, continuous batching, and other features:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Tutorial Goals
This tutorial demonstrates the following procedure:
- Upload a Llama LLM with:
- The Wallaroo Native vLLM runtime
- QAIC AI Acceleration enabled
- Framework configuration options to enhance performance
- After upload, set the LLM configuration options:
- Configure continuous batching settings.
- Enable OpenAI API compatibility and set inference options.
- Set a deployment configuration to allocate hardware resources and deploy the LLM.
- Perform sample inferences via OpenAI API inference methods with and without token streaming.
Prerequisites
- Wallaroo 2025.1 and above.
- A cluster with Qualcomm Cloud AI hardware.
Tutorial Steps
Import libraries
The first step is to import the Python libraries required, mainly the Wallaroo SDK.
import base64
import wallaroo
from wallaroo.deployment_config import DeploymentConfigBuilder
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.object import EntityNotFoundError
from wallaroo.engine_config import QaicConfig
from wallaroo.framework import VLLMConfig
import pyarrow as pa
import pandas as pd
from wallaroo.openai_config import OpenaiConfig
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
Connect to the Wallaroo Instance
Next connect to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
LLM Upload
Uploading the LLM takes the following steps:
- Define Schemas: The input and output schemas are defined in Apache PyArrow format. For this tutorial, they are converted to base64 strings used for uploading through the Wallaroo MLOps API.
- Upload the model via either the Wallaroo SDK or the Wallaroo MLOps API.
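For reference, the following is a minimal sketch of converting PyArrow schemas into the base64 strings expected by the MLOps API upload endpoint; the prompt, max_tokens, generated_text, and num_output_tokens fields mirror the encoded schemas in the curl example later in this tutorial. For OpenAI compatibility the schemas are ignored, so empty schemas (pa.schema([])) also work.
import base64
import pyarrow as pa

# Example input and output schemas; with OpenAI compatibility these are
# ignored, so empty schemas are also acceptable.
input_schema = pa.schema([
    pa.field('prompt', pa.string()),
    pa.field('max_tokens', pa.int64())
])
output_schema = pa.schema([
    pa.field('generated_text', pa.string()),
    pa.field('num_output_tokens', pa.int64())
])

# Serialize each schema and encode it as a base64 string for the
# v1/api/models/upload_and_convert endpoint.
encoded_input_schema = base64.b64encode(bytes(input_schema.serialize())).decode('utf8')
encoded_output_schema = base64.b64encode(bytes(output_schema.serialize())).decode('utf8')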
Upload LLM
LLM uploads to Wallaroo are either via the Wallaroo SDK or the Wallaroo MLOps API.
The following demonstrates uploading the LLM via the SDK. In this example the QAIC acceleration configuration is defined. This is an optional step that fine tunes the QAIC AI Acceleration hardware performance to best fit the LLM.
qaic_config = QaicConfig(
    num_devices=4,           # number of Qualcomm Cloud AI SoCs used for the model
    full_batch_size=16,      # maximum batch size for continuous batching on the device
    ctx_len=256,             # maximum context length in tokens
    prefill_seq_len=128,     # sequence length used during prompt prefill
    mxfp6_matmul=True,       # use MXFP6 quantization for matrix multiplications
    mxint8_kv_cache=True     # use MXINT8 quantization for the KV cache
)
LLMs are uploaded with the Wallaroo SDK method wallaroo.client.Client.upload_model. In this step, the following options are configured:
- The model name and file path.
- The framework, in this case the native vLLM runtime.
- The optional framework configuration, which sets specific options for the LLM’s performance.
- The input and output schemas. For OpenAI compatibility these are ignored, so they are set as empty schemas.
- The hardware acceleration, set to wallaroo.engine_config.Acceleration.QAIC.with_config; with_config accepts the hardware configuration options.
llm = wl.upload_model(
"llama-qaic-openai",
"llama-31-8b.zip",
framework=Framework.VLLM,
framework_config=VLLMConfig(
max_num_seqs=16,
max_model_len=256,
max_seq_len_to_capture=128,
quantization="mxfp6",
kv_cache_dtype="mxint8",
gpu_memory_utilization=1
),
input_schema=pa.schema([]),
output_schema=pa.schema([]),
accel=Acceleration.QAIC.with_config(qaic_config)
)
Waiting for model loading - this will take up to 10min.
Model is pending loading to a container runtime..
Model is attempting loading to a container runtime...................................................................................................................................................................................................................................
Successful
Ready
The other upload option is the Wallaroo MLOps API endpoint v1/api/models/upload_and_convert. For this option, the base64 converted input and output schemas are used, and the framework_config and accel options are specified in dict format. Otherwise, the same parameters are set:
- The model name and file path.
- The conversion parameter, which defines:
  - The framework as native vLLM.
  - The optional framework configuration, which sets specific options for the LLM’s performance.
- The input and output schemas, set as base64 strings.
- The accel parameter, which specifies the AI accelerator as qaic with the additional hardware configuration options.
curl --progress-bar -X POST \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer <your-token-here>" \
-F 'metadata={"name": "vllm-llama-31-8b-qaic-new-v1", "visibility": "private", "workspace_id": 6, "conversion": {"framework": "vllm", "framework_config": {"framework": "vllm", "config":{"max_num_seqs": 16, "max_model_len": 256, "max_seq_len_to_capture": 128, "quantization": "mxfp6", "kv_cache_dtype": "mxint8", "gpu_memory_utilization": 1}}, "accel": {"qaic":{"num_devices":4,"full_batch_size": 16, "ctx_len": 256, "prefill_seq_len": 128, "mxfp6_matmul":true,"mxint8_kv_cache":true}}, "python_version": "3.8", "requirements": []}, "input_schema": "/////7AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABUAAAABAAAAMT///8AAAECEAAAACQAAAAEAAAAAAAAAAoAAABtYXhfdG9rZW5zAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAGAAAAcHJvbXB0AAAEAAQABAAAAA==", "output_schema": "/////8AAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABcAAAABAAAALz///8AAAECEAAAACwAAAAEAAAAAAAAABEAAABudW1fb3V0cHV0X3Rva2VucwAAAAgADAAIAAcACAAAAAAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFEAAAACQAAAAEAAAAAAAAAA4AAABnZW5lcmF0ZWRfdGV4dAAABAAEAAQAAAA="};type=application/json' \
-F "file=@llama-31-8b.zip;type=application/octet-stream" \
https://qaic-poc.example.wallaroo.ai/v1/api/models/upload_and_convert | cat
When the LLM is uploaded, we retrieve it via wallaroo.client.Client.get_model for use in later steps.
llm = wl.get_model("llama-qaic-openai")
llm
Name | llama-qaic-openai |
Version | a9fb8483-f600-4537-8278-a4316f518c2d |
File Name | llama-31-8b.zip |
SHA | 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6203 |
Architecture | x86 |
Acceleration | {'qaic': {'ctx_len': 256, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}} |
Updated At | 2025-16-Jun 20:52:44 |
Workspace id | 9 |
Workspace name | younes@wallaroo.ai - Default Workspace |
Configure Continuous Batching
Continuous batching options are applied to the model configuration with the model.Model.configure method. This method accepts the input and output schemas along with the wallaroo.continuous_batching_config.ContinuousBatchingConfig settings.
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
cbc = ContinuousBatchingConfig(max_concurrent_batch_size = 100)
Configure OpenAI Compatibility
OpenAI compatibility options are set through the wallaroo.openai_config.OpenaiConfig object, with the most important being:
- enabled: Enables OpenAI compatibility.
- completion_config: Sets the OpenAI completion endpoint options except stream; the stream option is only provided at inference.
- chat_completion_config: Sets the OpenAI chat/completion endpoint options except stream; the stream option is only provided at inference.
openai_config = OpenaiConfig(
enabled=True,
completion_config={
"temperature": .3,
"max_tokens": 200
},
chat_completion_config={
"temperature": .3,
"max_tokens": 200,
"chat_template": """
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ '<|user|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'system' %}
{{ '<|system|>\n' + message['content'] + eos_token }}
{% elif message['role'] == 'assistant' %}
{{ '<|assistant|>\n' + message['content'] + eos_token }}
{% endif %}
{% if loop.last and add_generation_prompt %}
{{ '<|assistant|>' }}
{% endif %}
{% endfor %}"""
})
Set LLM Configuration
Both the continuous batching and OpenAI API compatibility options are set through the LLM’s configure method.
llm = llm.configure(openai_config=openai_config, continuous_batching_config = cbc)
llm
Name | llama-qaic-openai |
Version | a9fb8483-f600-4537-8278-a4316f518c2d |
File Name | llama-31-8b.zip |
SHA | 62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838 |
Status | ready |
Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-qaic-vllm:v2025.1.0-6203 |
Architecture | x86 |
Acceleration | {'qaic': {'ctx_len': 256, 'num_cores': 16, 'num_devices': 4, 'mxfp6_matmul': True, 'full_batch_size': 16, 'mxint8_kv_cache': True, 'prefill_seq_len': 128, 'aic_enable_depth_first': False}} |
Updated At | 2025-16-Jun 20:52:44 |
Workspace id | 9 |
Workspace name | younes@wallaroo.ai - Default Workspace |
Deploy the LLM
Deploying the LLM takes the following steps:
- Set the deployment configuration.
- Deploy the LLM with the deployment configuration.
Set the Deployment Configuration
The deployment configuration determines what hardware resources are allocated for the LLM’s exclusive use. The LLM options are set via the sidekick options.
For this example, the deployment hardware includes a Qualcomm AI 100 and allocates the following resources:
- Replicas: 1 minimum, maximum 2. This provides scalability with additional replicas scaled up or down automatically based on resource usage.
- CPUs: 4
- RAM: 12 Gi
- GPUs: 4. For Wallaroo deployment configurations for QAIC, the gpus parameter specifies the number of System-on-Chips (SoCs) allocated.
- Deployment label: Specifies the node with the GPUs.
# sidekick_gpus is the number of Qualcomm AI 100 SoCs allocated to the LLM
deployment_config = DeploymentConfigBuilder() \
.cpus(1).memory('1Gi') \
.sidekick_cpus(llm, 4) \
.sidekick_memory(llm, '12Gi') \
.sidekick_gpus(llm, 4) \
.deployment_label("kubernetes.io/os:linux") \
.build()
The LLM is added to a Wallaroo pipeline as a pipeline step. Once set, the pipeline is deployed with the deployment configuration. When the deployment is complete, the LLM is ready for inference requests.
pipeline = wl.build_pipeline("llamaqaic-openai")
pipeline.undeploy()
pipeline.add_model_step(llm)
pipeline.deploy(deployment_config=deployment_config)
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.244.69.140',
'name': 'engine-5686f7fb48-n5kg7',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llamaqaic-openai',
'status': 'Running',
'version': '06e205c9-0769-4681-bf0b-929de6b58613'}]},
'model_statuses': {'models': [{'model_version_id': 83,
'name': 'llama-qaic-openai',
'sha': '62c338e77c031d7c071fe25e1d202fcd1ded052377a007ebd18cb63eadddf838',
'status': 'Running',
'version': 'a9fb8483-f600-4537-8278-a4316f518c2d'}]}}],
'engine_lbs': [{'ip': '10.244.69.160',
'name': 'engine-lb-864866fb86-99w6k',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.244.69.165',
'name': 'engine-sidekick-llama-qaic-openai-83-775844444c-4qkgl',
'status': 'Running',
'reason': None,
'details': [],
'statuses': '\n'}]}
Inference Examples
LLMs deployed in Wallaroo accept pandas DataFrames as inference inputs; with OpenAI API compatibility enabled, they also accept OpenAI API requests. These examples perform OpenAI API inference via the Wallaroo SDK and via OpenAI API clients.
Inference requests made through the OpenAI compatible methods can override the OpenAI configuration applied at the LLM level. This provides additional optimization and flexibility as needed.
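For example, the following minimal sketch (using the pipeline.openai_completion method described in the next section) passes temperature and max_tokens with the request; these request-level values take precedence over the values set in the LLM’s OpenaiConfig.
# A hedged sketch: request-level parameters override the OpenaiConfig values
# (temperature 0.3, max_tokens 200) configured earlier in this tutorial.
result = pipeline.openai_completion(
    prompt="tell me about wallaroo.AI",
    temperature=0.8,
    max_tokens=50
)
print(result.choices[0].text)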
OpenAI Inference via the Wallaroo SDK
Inference requests for models deployed in Wallaroo with OpenAI compatibility enabled use the following Wallaroo SDK methods:
- wallaroo.pipeline.Pipeline.openai_chat_completion: Submits an inference request using the OpenAI API chat/completion endpoint parameters.
- wallaroo.pipeline.Pipeline.openai_completion: Submits an inference request using the OpenAI API completion endpoint parameters.
Each example demonstrates using these methods with and without token streaming.
The following demonstrates performing an inference with openai_chat_completion. Note that the parameters passed match those used for the OpenAI chat/completion endpoint.
# Perform a chat completion request
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "good morning"}]).choices[0].message.content
'Good morning. Is there something I can help you with or would you like to chat?'
This example uses the openai_completion
method.
pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200).choices[0].text
"\nWallaroo is a cloud-based AI platform that uses machine learning to help businesses automate and optimize their operations. It provides a suite of tools and features that enable companies to build, deploy, and manage AI models, as well as integrate them with existing systems and workflows.\nHere are some key features and benefits of Wallaroo:\n1. **Automated AI model deployment**: Wallaroo allows businesses to deploy AI models in a matter of minutes, without requiring extensive technical expertise.\n2. **Real-time data integration**: Wallaroo can integrate with various data sources, including databases, APIs, and IoT devices, to provide real-time insights and automate decision-making.\n3. **Predictive analytics**: Wallaroo's machine learning algorithms can analyze large datasets to identify patterns, predict outcomes, and provide actionable recommendations.\n4. **Customizable workflows**: Wallaroo enables businesses to create custom workflows that automate tasks, trigger actions, and integrate with existing systems.\n5. **Scalability and security**: Wallaroo is built"
The following examples show the same methods, with token streaming enabled.
# Now with streaming
for chunk in pipeline.openai_chat_completion(messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
I'd love to hear it. Please go ahead and share the short story about love.
# Now with streaming
for chunk in pipeline.openai_completion(prompt="tell me about wallaroo.AI", max_tokens=200, stream=True):
print(chunk.choices[0].text, end="", flush=True)
Wallaroo is a cloud-based, AI-powered platform that enables developers to build, deploy, and manage AI and machine learning (ML) models in a scalable and secure manner. The platform provides a range of features and tools to simplify the development and deployment of AI and ML models, including:
Model development: Wallaroo provides a visual interface for building and training AI and ML models, making it easier for developers to create and deploy models without requiring extensive expertise in machine learning.
Model deployment: Wallaroo allows developers to deploy models to a variety of environments, including cloud, on-premises, and edge devices, making it easier to integrate AI and ML capabilities into existing applications.
Model management: Wallaroo provides a centralized platform for managing AI and ML models, including model versioning, model monitoring, and model security.
Scalability: Wallaroo is designed to scale with the needs of the organization, allowing developers to build and deploy models that can handle large volumes of data and traffic.
Security:
OpenAI Inference via OpenAI API Requests
Inference requests via an OpenAI API client use the pipeline’s deployment inference endpoint with the OpenAI API endpoint extensions. For deployments with OpenAI compatibility enabled, the following additional endpoints are provided:
- {Deployment inference endpoint}/openai/v1/completions: Compatible with the OpenAI API completion endpoint.
- {Deployment inference endpoint}/openai/v1/chat/completions: Compatible with the OpenAI API chat/completion endpoint.
These requests require the following:
- A Wallaroo pipeline deployed with Wallaroo native vLLM runtime or Wallaroo Custom Models with OpenAI compatibility enabled.
- Authentication to the Wallaroo MLOps API. For more details, see the Wallaroo API Connection Guide.
- Access to the deployed pipeline’s OpenAI API endpoints.
The first example shows retrieving the authentication token to the Wallaroo instance.
token = wl.auth.auth_header()['Authorization'].split()[1]
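As a minimal sketch, the token is then passed in the Authorization header of each request to the OpenAI compatible endpoints. The endpoint URL below mirrors the curl examples that follow; substitute your own pipeline’s deployment inference endpoint.
import requests

# Non-streaming completion request using the token retrieved above.  The
# endpoint URL is the one shown in the curl examples below; replace it with
# your own deployment inference endpoint.
endpoint = "https://qaic-poc.example.wallaroo.ai/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/completions"

response = requests.post(
    endpoint,
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    },
    json={"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100}
)
print(response.json()["choices"][0]["text"])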
This example performs an inference request with token streaming enabled on the completions
endpoint.
# Streaming: Completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "prompt": "tell me a short story", "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
https://qaic-poc.example.wallaroo.ai/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/completions
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" about","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":1,"total_tokens":7,"ttft":0.091563446,"tps":10.921388869527693}}
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" a","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":2,"total_tokens":8,"ttft":0.091563446,"tps":16.38253480899562}}
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" character","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":3,"total_tokens":9,"ttft":0.091563446,"tps":16.485729285058817}}
...
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" mere","index":0,"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":99,"total_tokens":105,"ttft":0.091563446,"tps":10.019649983937533}}
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[{"text":" touch","index":0,"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":6,"completion_tokens":100,"total_tokens":106,"ttft":0.091563446,"tps":10.020860629999026}}
data: {"id":"cmpl-5a1adc32e65849f2aee5edf2e37fdb7a","created":1750125678,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":6,"completion_tokens":100,"total_tokens":106,"ttft":0.091563446,"tps":10.020269063389119}}
data: [DONE]
This example performs an inference request with token streaming enabled on the chat/completions
endpoint.
# Streaming: Chat completion
!curl -X POST \
-H "Authorization: Bearer abc123" \
-H "Content-Type: application/json" \
-d '{"model": "whatever", "messages": [{"role": "user", "content": "tell me a story"}], "max_tokens": 100, "stream": true, "stream_options": {"include_usage": true}}' \
https://qaic-poc.example.wallaroo.ai/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1/chat/completions
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":"assistant"}}],"usage":{"prompt_tokens":39,"completion_tokens":0,"total_tokens":39,"ttft":0.093523807,"tps":0.0}}
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":"Once"}}],"usage":{"prompt_tokens":39,"completion_tokens":1,"total_tokens":40,"ttft":0.093523807,"tps":10.679028938385025}}
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":" upon"}}],"usage":{"prompt_tokens":39,"completion_tokens":2,"total_tokens":41,"ttft":0.093523807,"tps":15.893348258670727}}
...
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":null,"message":null,"delta":{"role":null,"content":","}}],"usage":{"prompt_tokens":39,"completion_tokens":99,"total_tokens":138,"ttft":0.093523807,"tps":10.516952064534435}}
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[{"index":0,"finish_reason":"length","message":null,"delta":{"role":null,"content":" pointing"}}],"usage":{"prompt_tokens":39,"completion_tokens":100,"total_tokens":139,"ttft":0.093523807,"tps":10.511513285045494}}
data: {"id":"chat-02fbfe3ae2b54133a28b2deffd3aaab6","object":"chat.completion.chunk","created":1750125657,"model":"llama-31-8b.zip","choices":[],"usage":{"prompt_tokens":39,"completion_tokens":100,"total_tokens":139,"ttft":0.093523807,"tps":10.511361686804662}}
data: [DONE]
The following uses the OpenAI Python library to perform the inferences, using the same OpenAI endpoints.
from openai import OpenAI
client = OpenAI(
base_url='https://qaic-poc.example.wallaroo.ai/v1/api/pipelines/infer/llamaqaic-openai-32/llamaqaic-openai/openai/v1',
api_key=token
)
for chunk in client.chat.completions.create(model="dummy", messages=[{"role": "user", "content": "this is a short story about love"}], max_tokens=100, stream=True):
print(chunk.choices[0].delta.content, end="", flush=True)
I'd love to hear it. Please go ahead and share the short story about love. I'll be happy to listen and respond.
for chunk in client.completions.create(model="dummy", prompt="tell me about wallaroo.AI", max_tokens=100, stream=True):
print(chunk.choices[0].text, end="", flush=True)
Introducing wallaroo.AI
Wallايي буду Towards a Optimization Approach
For AI-Driven Sports Optimization and Trading
Background: One-click strategy models, platform agnostic, similarity testing
Theory: Extreme Value Theory ( EVT ) , GARCH , Kalman Filter algorithm
Impact: speeding through complex sett FL startup focusing implementation wallaroo.Readingmy Business model
wallaroo.ai is an AI-driven platform that aims to revolutionize the way we approach sports optimization and trading. The platform leverages
For access to these sample models and for a demonstration of how to use a LLM deployment with QAIC acceleration, OpenAI API compatibility, continuous batching, and other features:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today