Llama 8B in SGLang Framework with ROCm AI Acceleration Example
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following example demonstrates deploying an LLM using the Wallaroo SGLang framework with ROCm AI Acceleration enabled.
The tutorial demonstrates:
- Retrieving an LLM in the Wallaroo SGLang framework configured with ROCm acceleration.
- Setting Wallaroo continuous batching and OpenAI compatibility options.
- Deploying the LLM with deployment configuration options.
- Performing a sample inference.
For access to these sample models, and for a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today.
References
- How to Upload and Deploy SGLang Framework Models
- Continuous Batching for LLMs
- Deploy LLMs with OpenAI Compatibility
Tutorial Steps
Import Python Libraries
The first step is to import the libraries used for this tutorial, primarily the Wallaroo SDK.
import pyarrow as pa
import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.openai_config import OpenaiConfig
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.framework import SGLangConfig
Connect to the Wallaroo Instance
This step sets a connection to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Set Workspace
The following creates or connects to an existing workspace, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.
workspace = wl.get_workspace(name='amd-llama-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)
{'name': 'amd-llama-test', 'id': 15, 'archived': False, 'created_by': 'jason.mccampbell@wallaroo.ai', 'created_at': '2026-01-26T21:10:10.214594+00:00', 'models': [{'name': 'llama-31-8b-sglang-config-v1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2026, 2, 19, 18, 39, 39, 859744, tzinfo=tzutc()), 'created_at': datetime.datetime(2026, 2, 19, 18, 39, 39, 859744, tzinfo=tzutc())}], 'pipelines': []}
Retrieve Model
For this example, the model was previously uploaded as a Llama 8B in the SGLang framework. This model was set with AMD ROCm as the AI hardware accelerator.
model = wl.get_model("llama-31-8b-sglang-config-v1")
model
| Name | llama-31-8b-sglang-config-v1 |
|---|---|
| Version | 3dfe7ec3-a76f-4ef6-8aa4-f6a8d1806f73 |
| File Name | llama-31-8b.zip |
| SHA | c737ece29898860ff157c548bfb727778c06a35963f06152bca4b96171691cf7 |
| Status | ready |
| Error Summary | None |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-sglang-rocm:v2026.1.0-main-6569 |
| Architecture | x86 |
| Acceleration | rocm |
| Updated At | 19-Feb-2026 18:42:26 |
| Workspace id | 15 |
| Workspace name | amd-llama-test |
Set OpenAI and Continuous Batching Configuration
OpenAI Compatibility is enabled for the Wallaroo SGLang framework. Updates to the OpenAI compatibility configuration and continuous batching configuration are set post model upload by updating the model configuration.
# Configuring as OpenAI
continuous_batch_config = ContinuousBatchingConfig(max_concurrent_batch_size=1024)
openai_config = OpenaiConfig(chat_completion_config={"temperature": .3, "max_tokens": 1024})
model = model.configure(openai_config=openai_config, continuous_batching_config=continuous_batch_config)
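The `chat_completion_config` keys mirror standard OpenAI chat-completion parameters, with per-request values overriding these server-side defaults. As a sketch of that merge behavior (the `effective_params` helper below is illustrative only, not a Wallaroo SDK API):

```python
# Illustrative only: merge server-side defaults (from OpenaiConfig) with
# per-request parameters; request values win. Not a Wallaroo SDK function.
defaults = {"temperature": 0.3, "max_tokens": 1024}

def effective_params(request_params, defaults=defaults):
    # Start from the configured defaults, then apply per-request overrides.
    merged = dict(defaults)
    merged.update(request_params)
    return merged
```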
Deploy the Pipeline
With the model configuration options set, the model is deployed via a pipeline through these steps:
- Set the Deployment Configuration. This sets the resources allocated from the cluster for the pipeline for the model’s exclusive use.
- In this example, the engine is allocated 0.5 CPU and 1 Gi RAM, while the model is allocated 1 CPU, 12 Gi RAM, and 1 GPU.
- Create the pipeline and set the model as a pipeline step.
- Deploy the pipeline with the deployment configuration.
Once deployment is complete, the pipeline is available for inference requests.
# Deploying
deployment_config = wallaroo.DeploymentConfigBuilder() \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(model, 1) \
.sidekick_memory(model, '12Gi') \
.sidekick_gpus(model, 1) \
.deployment_label('amd.com/gpu.product-name:AMD_Instinct_MI325X_VF') \
.build()
pipeline = wl.build_pipeline('llama-31-8b-pipe-sglang-config-v2')
pipeline.clear()
pipeline.add_model_step(model)
pipeline.deploy(deployment_config = deployment_config)
Waiting for deployment - this will take up to 600s ....................................................................................... ok
| name | llama-31-8b-pipe-sglang-config-v2 |
|---|---|
| created | 2026-02-22 20:49:38.860522+00:00 |
| last_updated | 2026-03-12 19:02:53.735956+00:00 |
| deployed | True |
| workspace_id | 15 |
| workspace_name | amd-llama-test |
| arch | x86 |
| accel | rocm |
| tags | |
| versions | 53d41331-2f6a-46dd-b950-315601ab225c, 6e774def-6481-4e5c-8337-06933fcd5bc4, 432507d7-cd93-4869-aefd-110fcd421276, 22219f64-3b8c-499e-9c92-39bdcd288720, 9d38d0e7-6574-4f57-8e10-308b54d6e9f7, e2106119-05ed-4546-a203-0d44f191d8e7, 5ca994cb-00a2-4466-a2e0-aa8993df045e, 515a23c7-513a-46a3-bf51-322ef9f57c1d, b4d677d8-8723-4e03-b98a-5c9aad07d01c, 32b70780-680c-4f24-9c11-cb8910382098 |
| steps | llama-31-8b-sglang-config-v1 |
| published | False |
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.244.57.17',
'name': 'engine-0',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llama-31-8b-pipe-sglang-config-v2',
'status': 'Running',
'version': '53d41331-2f6a-46dd-b950-315601ab225c'}]},
'model_statuses': {'models': [{'model_version_id': 28,
'name': 'llama-31-8b-sglang-config-v1',
'sha': 'c737ece29898860ff157c548bfb727778c06a35963f06152bca4b96171691cf7',
'status': 'Running',
'version': '3dfe7ec3-a76f-4ef6-8aa4-f6a8d1806f73'}]}}],
'engine_lbs': [{'ip': '10.244.57.18',
'name': 'engine-lb-777c5f4844-r9jhr',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.244.57.1',
'name': 'engine-sidekick-llama-31-8b-sglang-config-v1-28-0',
'status': 'Running',
'reason': None,
'details': [],
'statuses': ''}]}
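`pipeline.deploy` blocks until the deployment is ready, but the status dictionary above can also be polled directly, for example from a deployment script. A minimal sketch (`wait_until_running` is a hypothetical helper, not part of the Wallaroo SDK; the timeout and interval values are arbitrary choices):

```python
import time

def wait_until_running(pipeline, timeout=600, interval=10):
    # Poll pipeline.status() until the top-level status is 'Running',
    # returning True on success or False once the timeout elapses.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if pipeline.status().get('status') == 'Running':
            return True
        time.sleep(interval)
    return False
```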
Inference Requests with OpenAI API Compatibility
Inference requests are submitted via the Wallaroo SDK wallaroo.pipeline.Pipeline.openai_chat_completion method. This accepts standard OpenAI API requests and returns the results. For more details, see Inference via OpenAI Compatibility Deployments.
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "are you running in a test environment?"}], max_tokens=50).choices[0].message.content
"I'm running in a production environment. I'm a cloud-based AI model, and I don't have a traditional test environment like a software development project would. However, my training data is constantly being updated and improved by my developers to ensure I provide the most accurate and helpful responses possible."
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "what is wallaroo.ai?"}], max_tokens=50).choices[0].message.content
'Wallaroo.ai is a cloud-based platform that enables developers to build, deploy, and manage real-time data pipelines and event-driven applications. It provides a scalable, fault-tolerant, and highly available infrastructure for processing large volumes of data in real-time.\n\nWallaroo.ai is designed to handle complex event processing, data streaming, and IoT workloads, making it suitable for a wide range of use cases, including:\n\n1. IoT data processing\n2. Real-time analytics\n3. Event-driven applications\n4. Streaming data processing\n5. Machine learning model serving\n\nThe platform offers several key features, including:\n\n1. **Stream Processing**: Wallaroo.ai provides a stream processing engine that can handle high-volume, high-velocity data streams.\n2. **Event-Driven Architecture**: The platform supports event-driven architecture, enabling developers to build applications that respond to real-time events.\n3. **Scalability**: Wallaroo.ai is designed to scale horizontally, allowing developers to add or remove nodes as needed to handle changing workloads.\n4. **Fault Tolerance**: The platform provides built-in fault tolerance, ensuring that applications remain available even in the event of node failures.\n5. **Real-Time Data Pipelines**: Wallaroo.ai enables developers to build real-time data pipelines that can handle large volumes of data.\n\nOverall, Wallaroo.ai is a powerful platform for building and deploying real-time data pipelines and event-driven applications, making it a popular choice for developers and organizations working with large volumes of data.'
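The continuous batching configuration set earlier pays off when many requests are in flight at once. A sketch of issuing concurrent requests from the client side, assuming openai_chat_completion accepts standard OpenAI parameters such as max_tokens as keyword arguments (`ask` and `ask_many` are illustrative helpers, not SDK methods):

```python
from concurrent.futures import ThreadPoolExecutor

def ask(pipeline, prompt, max_tokens=256):
    # Submit one chat-completion request and return the reply text.
    resp = pipeline.openai_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

def ask_many(pipeline, prompts, workers=8):
    # Issue the requests concurrently; with continuous batching enabled,
    # the engine can batch the in-flight requests server-side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ask(pipeline, p), prompts))
```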
Undeploy the Pipeline
With the tutorial complete, the pipeline is undeployed; this returns the resources back to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 600s .................................... ok
| name | llama-31-8b-pipe-sglang-config-v2 |
|---|---|
| created | 2026-02-22 20:49:38.860522+00:00 |
| last_updated | 2026-03-12 17:16:44.899347+00:00 |
| deployed | False |
| workspace_id | 15 |
| workspace_name | amd-llama-test |
| arch | x86 |
| accel | rocm |
| tags | |
| versions | 432507d7-cd93-4869-aefd-110fcd421276, 22219f64-3b8c-499e-9c92-39bdcd288720, 9d38d0e7-6574-4f57-8e10-308b54d6e9f7, e2106119-05ed-4546-a203-0d44f191d8e7, 5ca994cb-00a2-4466-a2e0-aa8993df045e, 515a23c7-513a-46a3-bf51-322ef9f57c1d, b4d677d8-8723-4e03-b98a-5c9aad07d01c, 32b70780-680c-4f24-9c11-cb8910382098 |
| steps | llama-31-8b-sglang-config-v1 |
| published | False |