Llama 8B in SGLang Framework with ROCm AI Acceleration Example
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
The following example demonstrates deploying an LLM using the Wallaroo SGLang framework with ROCm AI Acceleration enabled.
The tutorial demonstrates:
- Retrieving an LLM in the Wallaroo SGLang framework configured with ROCm acceleration.
- Setting Wallaroo continuous batching and OpenAI compatibility options.
- Deploying the LLM with deployment configuration options.
- Performing a sample inference.
For access to these sample models, and for a demonstration of how to use an LLM Validation Listener:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today.
References
- How to Upload and Deploy SGLang Framework Models
- Continuous Batching for LLMs
- Deploy LLMs with OpenAI Compatibility
Tutorial Steps
Import Python Libraries
The first step is to import the libraries used for this tutorial, primarily the Wallaroo SDK.
import pyarrow as pa
import wallaroo
from wallaroo.framework import Framework
from wallaroo.engine_config import Acceleration
from wallaroo.openai_config import OpenaiConfig
from wallaroo.continuous_batching_config import ContinuousBatchingConfig
from wallaroo.framework import SGLangConfig
Connect to the Wallaroo Instance
This step sets a connection to Wallaroo through the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Set Workspace
The following creates or connects to an existing workspace, and sets it as the current workspace. For more details on Wallaroo workspaces, see Wallaroo Workspace Management Guide.
workspace = wl.get_workspace(name='amd-llama-test', create_if_not_exist=True)
wl.set_current_workspace(workspace)
{'name': 'amd-llama-test', 'id': 15, 'archived': False, 'created_by': 'jason.mccampbell@wallaroo.ai', 'created_at': '2026-01-26T21:10:10.214594+00:00', 'models': [{'name': 'llama-31-8b-sglang-config-v1', 'versions': 1, 'owner_id': '""', 'last_update_time': datetime.datetime(2026, 2, 19, 18, 39, 39, 859744, tzinfo=tzutc()), 'created_at': datetime.datetime(2026, 2, 19, 18, 39, 39, 859744, tzinfo=tzutc())}], 'pipelines': []}
Retrieve Model
For this example, the model was previously uploaded as a Llama 8B in the SGLang framework. This model was set with AMD ROCm as the AI hardware accelerator.
model = wl.get_model("llama-31-8b-sglang-config-v1")
model
| Name | llama-31-8b-sglang-config-v1 |
|---|---|
| Version | 3dfe7ec3-a76f-4ef6-8aa4-f6a8d1806f73 |
| File Name | llama-31-8b.zip |
| SHA | c737ece29898860ff157c548bfb727778c06a35963f06152bca4b96171691cf7 |
| Status | ready |
| Error Summary | None |
| Image Path | proxy.replicated.com/proxy/wallaroo/ghcr.io/wallaroolabs/mac-deploy-sglang-rocm:v2026.1.0-main-6569 |
| Architecture | x86 |
| Acceleration | rocm |
| Updated At | 19-Feb-2026 18:42:26 |
| Workspace id | 15 |
| Workspace name | amd-llama-test |
Set OpenAI and Continuous Batching Configuration
OpenAI Compatibility is enabled for the Wallaroo SGLang framework. Updates to the OpenAI compatibility configuration and continuous batching configuration are set post model upload by updating the model configuration.
# Configuring as OpenAI
continuous_batch_config = ContinuousBatchingConfig(max_concurrent_batch_size=1024)
openai_config = OpenaiConfig(chat_completion_config={"temperature": .3, "max_tokens": 1024})
model = model.configure(openai_config=openai_config, continuous_batching_config=continuous_batch_config)
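The `chat_completion_config` keys mirror standard OpenAI chat-completion parameters, with per-request values overriding these server-side defaults. As a sketch of that merge behavior (the `effective_params` helper below is illustrative only, not a Wallaroo SDK API):

```python
# Illustrative only: merge server-side defaults (from OpenaiConfig) with
# per-request parameters; request values win. Not a Wallaroo SDK function.
defaults = {"temperature": 0.3, "max_tokens": 1024}

def effective_params(request_params, defaults=defaults):
    # Start from the configured defaults, then apply per-request overrides.
    merged = dict(defaults)
    merged.update(request_params)
    return merged
```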
Deploy the Pipeline
With the model configuration options set, the model is deployed via a pipeline through these steps:
- Set the Deployment Configuration. This sets the resources allocated from the cluster for the pipeline for the model’s exclusive use.
- In this example, the engine is allocated 0.5 CPU and 1 Gi RAM, while the model is allocated 1 CPU, 12 Gi RAM, and 1 GPU.
- Create the pipeline and set the model as a pipeline step.
- Deploy the pipeline with the deployment configuration.
Once deployment is complete, the pipeline is available for inference requests.
# Deploying
deployment_config = wallaroo.DeploymentConfigBuilder() \
.cpus(.5) \
.memory("1Gi") \
.sidekick_cpus(model, 1) \
.sidekick_memory(model, '12Gi') \
.sidekick_gpus(model, 1) \
.deployment_label('amd.com/gpu.product-name:AMD_Instinct_MI325X_VF') \
.build()
pipeline = wl.build_pipeline('llama-31-8b-pipe-sglang-config-v2')
pipeline.clear()
pipeline.add_model_step(model)
pipeline.deploy(deployment_config = deployment_config)
Waiting for deployment - this will take up to 600s ....................................................................................... ok
| name | llama-31-8b-pipe-sglang-config-v2 |
|---|---|
| created | 2026-02-22 20:49:38.860522+00:00 |
| last_updated | 2026-03-12 19:02:53.735956+00:00 |
| deployed | True |
| workspace_id | 15 |
| workspace_name | amd-llama-test |
| arch | x86 |
| accel | rocm |
| tags | |
| versions | 53d41331-2f6a-46dd-b950-315601ab225c, 6e774def-6481-4e5c-8337-06933fcd5bc4, 432507d7-cd93-4869-aefd-110fcd421276, 22219f64-3b8c-499e-9c92-39bdcd288720, 9d38d0e7-6574-4f57-8e10-308b54d6e9f7, e2106119-05ed-4546-a203-0d44f191d8e7, 5ca994cb-00a2-4466-a2e0-aa8993df045e, 515a23c7-513a-46a3-bf51-322ef9f57c1d, b4d677d8-8723-4e03-b98a-5c9aad07d01c, 32b70780-680c-4f24-9c11-cb8910382098 |
| steps | llama-31-8b-sglang-config-v1 |
| published | False |
pipeline.status()
{'status': 'Running',
'details': [],
'engines': [{'ip': '10.244.57.17',
'name': 'engine-0',
'status': 'Running',
'reason': None,
'details': [],
'pipeline_statuses': {'pipelines': [{'id': 'llama-31-8b-pipe-sglang-config-v2',
'status': 'Running',
'version': '53d41331-2f6a-46dd-b950-315601ab225c'}]},
'model_statuses': {'models': [{'model_version_id': 28,
'name': 'llama-31-8b-sglang-config-v1',
'sha': 'c737ece29898860ff157c548bfb727778c06a35963f06152bca4b96171691cf7',
'status': 'Running',
'version': '3dfe7ec3-a76f-4ef6-8aa4-f6a8d1806f73'}]}}],
'engine_lbs': [{'ip': '10.244.57.18',
'name': 'engine-lb-777c5f4844-r9jhr',
'status': 'Running',
'reason': None,
'details': []}],
'sidekicks': [{'ip': '10.244.57.1',
'name': 'engine-sidekick-llama-31-8b-sglang-config-v1-28-0',
'status': 'Running',
'reason': None,
'details': [],
'statuses': ''}]}
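`pipeline.deploy` blocks until the deployment is ready, but the status dictionary above can also be polled directly, for example from a deployment script. A minimal sketch (`wait_until_running` is a hypothetical helper, not part of the Wallaroo SDK; the timeout and interval values are arbitrary choices):

```python
import time

def wait_until_running(pipeline, timeout=600, interval=10):
    # Poll pipeline.status() until the top-level status is 'Running',
    # returning True on success or False once the timeout elapses.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if pipeline.status().get('status') == 'Running':
            return True
        time.sleep(interval)
    return False
```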
Inference Requests with OpenAI API Compatibility
Inference requests are submitted via the Wallaroo SDK wallaroo.pipeline.Pipeline.openai_chat_completion method. This accepts standard OpenAI API requests and returns the results. For more details, see Inference via OpenAI Compatibility Deployments.
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "are you running in a test environment?"}], max_tokens=50).choices[0].message.content
"I'm running in a production environment. I'm a cloud-based AI model, and I don't have a traditional test environment like a software development project would. However, my training data is constantly being updated and improved by my developers to ensure I provide the most accurate and helpful responses possible."
pipeline.openai_chat_completion(messages=[{"role": "user", "content": "what is wallaroo.ai?"}], max_tokens=50).choices[0].message.content
'Wallaroo.ai is a cloud-based platform that enables developers to build, deploy, and manage real-time data pipelines and event-driven applications. It provides a scalable, fault-tolerant, and highly available infrastructure for processing large volumes of data in real-time.\n\nWallaroo.ai is designed to handle complex event processing, data streaming, and IoT workloads, making it suitable for a wide range of use cases, including:\n\n1. IoT data processing\n2. Real-time analytics\n3. Event-driven applications\n4. Streaming data processing\n5. Machine learning model serving\n\nThe platform offers several key features, including:\n\n1. **Stream Processing**: Wallaroo.ai provides a stream processing engine that can handle high-volume, high-velocity data streams.\n2. **Event-Driven Architecture**: The platform supports event-driven architecture, enabling developers to build applications that respond to real-time events.\n3. **Scalability**: Wallaroo.ai is designed to scale horizontally, allowing developers to add or remove nodes as needed to handle changing workloads.\n4. **Fault Tolerance**: The platform provides built-in fault tolerance, ensuring that applications remain available even in the event of node failures.\n5. **Real-Time Data Pipelines**: Wallaroo.ai enables developers to build real-time data pipelines that can handle large volumes of data.\n\nOverall, Wallaroo.ai is a powerful platform for building and deploying real-time data pipelines and event-driven applications, making it a popular choice for developers and organizations working with large volumes of data.'
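The continuous batching configuration set earlier pays off when many requests are in flight at once. A sketch of issuing concurrent requests from the client side, assuming openai_chat_completion accepts standard OpenAI parameters such as max_tokens as keyword arguments (`ask` and `ask_many` are illustrative helpers, not SDK methods):

```python
from concurrent.futures import ThreadPoolExecutor

def ask(pipeline, prompt, max_tokens=256):
    # Submit one chat-completion request and return the reply text.
    resp = pipeline.openai_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content

def ask_many(pipeline, prompts, workers=8):
    # Issue the requests concurrently; with continuous batching enabled,
    # the engine can batch the in-flight requests server-side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ask(pipeline, p), prompts))
```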
Undeploy the Pipeline
With the tutorial complete, the pipeline is undeployed; this returns the resources back to the cluster.
pipeline.undeploy()
Waiting for undeployment - this will take up to 600s .................................... ok
| name | llama-31-8b-pipe-sglang-config-v2 |
|---|---|
| created | 2026-02-22 20:49:38.860522+00:00 |
| last_updated | 2026-03-12 17:16:44.899347+00:00 |
| deployed | False |
| workspace_id | 15 |
| workspace_name | amd-llama-test |
| arch | x86 |
| accel | rocm |
| tags | |
| versions | 432507d7-cd93-4869-aefd-110fcd421276, 22219f64-3b8c-499e-9c92-39bdcd288720, 9d38d0e7-6574-4f57-8e10-308b54d6e9f7, e2106119-05ed-4546-a203-0d44f191d8e7, 5ca994cb-00a2-4466-a2e0-aa8993df045e, 515a23c7-513a-46a3-bf51-322ef9f57c1d, b4d677d8-8723-4e03-b98a-5c9aad07d01c, 32b70780-680c-4f24-9c11-cb8910382098 |
| steps | llama-31-8b-sglang-config-v1 |
| published | False |