The Wallaroo MLOps API allows for metrics retrieval for deployed pipelines. These metrics endpoints are compliant with the Prometheus HTTP API.
pipelineID
: The pipeline's name, retrieved from the Wallaroo SDK with wallaroo.pipeline.Pipeline.name(). For example:
pipeline.name()
sample-pipeline-name
deployment_id
: The Kubernetes namespace for the deployment.
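As a minimal sketch with hypothetical values, the deployment namespace can be assembled from the pipeline name and the deployment id, following the `{pipeline name}-{deployment id}` pattern used in the examples later in this guide:

```python
# Hypothetical pipeline name and deployment id; the namespace pattern
# "{pipeline_name}-{deploy_id}" matches the examples later in this guide.
pipeline_name = "sample-pipeline-name"
deploy_id = 210

# Kubernetes namespace for the deployment
deployment_namespace = f"{pipeline_name}-{deploy_id}"
print(deployment_namespace)  # sample-pipeline-name-210
```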
English Name | Parameterized Query | Example Query | Description |
---|---|---|---|
Requests per second | sum by (pipeline_name) (rate(latency_histogram_ns_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(latency_histogram_ns_count{deploy_id="deployment_id"}[10s])) | Number of processed requests per second to a pipeline. |
Cluster inference rate | sum by (pipeline_name) (rate(tensor_throughput_batch_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(tensor_throughput_batch_count{deploy_id="deployment_id"}[10s])) | Number of inferences processed per second. This notably differs from requests per second when batch inference requests are made. |
P50 inference latency | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P50 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
P95 inference latency | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P95 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
P99 inference latency | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P99 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
Engine replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container="engine"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container="engine"}) or vector(0) | Number of engine replicas currently running in a pipeline |
Sidekick replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container=~"engine-sidekick-.*"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container=~"engine-sidekick-.*"}) or vector(0) | Number of sidekick replicas currently running in a pipeline |
Output tokens per second (TPS) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="deployment_id"}[10s])) | LLM output tokens per second: this is the number of tokens generated per second for a LLM deployed in Wallaroo with vLLM |
P99 Time to first token (TTFT) | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P99 time to first token: P99 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P95 Time to first token (TTFT) | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P95 time to first token: P95 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P50 Time to first token (TTFT) | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P50 time to first token: P50 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
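The parameterized queries above are plain Prometheus expressions, so the placeholders can be filled in with ordinary string formatting. A minimal sketch, assuming a hypothetical pipeline name and a 10 second step:

```python
# Hypothetical values for the {pipelineID} and {step} placeholders
pipeline_id = "sample-pipeline-name"
step = 10  # seconds

# Requests per second query from the table above
query_rps = (
    f'sum by (pipeline_name) '
    f'(rate(latency_histogram_ns_count{{pipeline_name="{pipeline_id}"}}[{step}s]))'
)
print(query_rps)
```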
/v1/api/metrics/query (GET)
/v1/api/metrics/query (POST)
For full details, see the Wallaroo MLOps API Reference Guide.
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
time | String | The evaluation timestamp in either RFC3339 format or Unix timestamp. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
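Putting the parameters above together, an instant query can be sketched as below. The helper names are illustrative; `api_endpoint` and `headers` would come from the Wallaroo SDK (`wl.api_endpoint` and `wl.auth.auth_header()`, as shown in the Query Range example later in this guide), and the `/v1/metrics/api/v1/query` path mirrors the `query_range` URL used there.

```python
import requests

def build_query_params(query, time=None, timeout=None):
    """Assemble the parameters described above, skipping unset optional ones."""
    params = {'query': query}
    if time is not None:
        params['time'] = time
    if timeout is not None:
        params['timeout'] = timeout
    return params

def metrics_query(api_endpoint, headers, query, time=None, timeout=None):
    """Run a Prometheus instant query against the Wallaroo metrics endpoint."""
    response = requests.get(
        f"{api_endpoint}/v1/metrics/api/v1/query",
        headers=headers,
        params=build_query_params(query, time, timeout),
    )
    response.raise_for_status()
    return response.json()
```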
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. |
data.result | Array | The query results. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
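A small sketch of handling this response envelope, using the fields above (the function name is illustrative):

```python
def unpack_metrics_response(body):
    """Return the query result on success, or raise with the error details."""
    if body.get('status') == 'success':
        return body['data']['result']
    raise RuntimeError(f"{body.get('errorType')}: {body.get('error')}")

# Example with a success envelope shaped like the Prometheus responses
# shown later in this guide
body = {'status': 'success',
        'data': {'resultType': 'vector', 'result': []}}
print(unpack_metrics_response(body))  # []
```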
/v1/api/metrics/query_range (GET)
/v1/api/metrics/query_range (POST)
For full details, see the Wallaroo MLOps API Reference Guide.
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
start | String | The starting timestamp in either RFC3339 format or Unix timestamp, inclusive. |
end | String | The ending timestamp in either RFC3339 format or Unix timestamp. |
step | String | Query resolution step width in either duration format or as a float number of seconds. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. For query range, always matrix. |
data.result | Array | The query results. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
The following request shows an example of a Query Range request for requests per second. For this example, the following Wallaroo SDK methods are used:

wl.api_endpoint
: Retrieves the API endpoint for the Wallaroo Ops server.

wl.auth.auth_header()
: Retrieves the authentication bearer tokens.

import datetime
import pytz
import requests

# set prometheus requirements
pipeline_id = pipeline_name # the name of the pipeline
step = "1m" # the step of the calculation
# this will also format the timezone in the parsing section
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times
data_start = selected_timezone.localize(datetime.datetime(2025, 8, 4, 9, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 8, 6, 9, 59, 59))
# this is the URL to get prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
query_rps = f'sum by (pipeline_name) (rate(latency_histogram_ns_count{{pipeline_name="{pipeline_id}"}}[{step}]))'
#request parameters
params_rps = {
    'query': query_rps,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_rps = requests.get(query_url, headers=headers, params=params_rps)
if response_rps.status_code == 200:
    print("Requests Per Second Data:")
    display(response_rps.json())
else:
    print("Failed to fetch RPS data:", response_rps.status_code, response_rps.text)
Requests Per Second Data:
{'status': 'success',
'data': {'resultType': 'matrix',
'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
'values': [[1754419440, '0.61195'],
[1754419500, '0.43636363636363634'],
...]}]}}
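The matrix values above are `[unix timestamp, string value]` pairs. A minimal sketch of converting them to local-time rows, using two of the sample values shown:

```python
import datetime
import pytz

# Sample values copied from the output above
result = [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
           'values': [[1754419440, '0.61195'],
                      [1754419500, '0.43636363636363634']]}]

selected_timezone = pytz.timezone("US/Central")

# Convert each [timestamp, value] pair to (local datetime, float)
rows = [
    (datetime.datetime.fromtimestamp(ts, tz=pytz.UTC).astimezone(selected_timezone),
     float(value))
    for series in result
    for ts, value in series['values']
]
for when, rps in rows:
    print(when.isoformat(), rps)
```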
The following shows the cluster inference rate query.
query_inference_rate = f'sum by (pipeline_name) (rate(tensor_throughput_batch_count{{pipeline_name="{pipeline_id}"}}[{step}]))'
# inference rate request parameters
params_inference_rate = {
    'query': query_inference_rate,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_inference_rate = requests.get(query_url, headers=headers, params=params_inference_rate)
if response_inference_rate.status_code == 200:
    print("Cluster Inference Rate Data:")
    display(response_inference_rate.json())
else:
    print("Failed to fetch Inference Rate data:", response_inference_rate.status_code, response_inference_rate.text)
Cluster Inference Rate Data:
{'status': 'success',
'data': {'resultType': 'matrix',
'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
'values': [[1754419440, '6274.9353'],
[1754419500, '4474.472727272727'],
...]}]}}
The following example uses the P99 Time to first token (TTFT) query for an LLM deployed with OpenAI Compatibility in Wallaroo.
# this will also format the timezone in the parsing section
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times of 10:00 to 10:15
data_start = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 15, 00))
# this is the URL to get prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
pipeline_name = "llama-3-1-8b-pipeline" # the name of the pipeline
deploy_id = 210 # the deployment id
step = "5m" # the step of the calculation
query_ttft = f'histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{{kubernetes_namespace="{pipeline_name}-{deploy_id}"}}[{step}])) by (le)) * 1000'
print(query_ttft)
#request parameters
params_ttft = {
    'query': query_ttft,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_ttft = requests.get(query_url, headers=headers, params=params_ttft)
if response_ttft.status_code == 200:
    result = response_ttft.json()
    print(result)
else:
    print("Failed to fetch TTFT data:", response_ttft.status_code, response_ttft.text)
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="llama-3-1-8b-pipeline-210"}[5m])) by (le)) * 1000
{'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {}, 'values': [[1752505500, '48.45656000000012'], [1752505800, '39.800000000000004'], [1752506100, 'NaN']]}]}}
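Note the `'NaN'` entry in the last window above: Prometheus returns the string `'NaN'` when no samples fall within a step. A minimal sketch of filtering these out before charting, using the sample values shown:

```python
import math

# Sample [timestamp, value] pairs from the TTFT output above
values = [[1752505500, '48.45656000000012'],
          [1752505800, '39.800000000000004'],
          [1752506100, 'NaN']]

# Drop windows with no samples; float('NaN') parses but is not plottable
clean = [(ts, float(v)) for ts, v in values if not math.isnan(float(v))]
print(clean)
```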
The following tutorials demonstrate creating metrics data and retrieving it using the Wallaroo MLOps API.