The Wallaroo MLOps API allows for metrics retrieval for deployed pipelines. These metrics endpoints are compliant with the Prometheus HTTP API.
pipelineID
: The pipeline's name, retrieved from the Wallaroo SDK with wallaroo.pipeline.Pipeline.name(). For example:
pipeline.name()
sample-pipeline-name
deployment_id
: The Kubernetes namespace for the deployment.
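As a minimal sketch with hypothetical values, the deployment namespace can be assembled from the pipeline name and the deployment id, following the `{pipeline name}-{deployment id}` pattern used in the examples later in this guide:

```python
# Hypothetical pipeline name and deployment id; the namespace pattern
# "{pipeline_name}-{deploy_id}" matches the examples later in this guide.
pipeline_name = "sample-pipeline-name"
deploy_id = 210

# Kubernetes namespace for the deployment
deployment_namespace = f"{pipeline_name}-{deploy_id}"
print(deployment_namespace)  # sample-pipeline-name-210
```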
English Name | Parameterized Query | Example Query | Description |
---|---|---|---|
Requests per second | sum by (pipeline_name) (rate(latency_histogram_ns_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(latency_histogram_ns_count{deploy_id="deployment_id"}[10s])) | Number of processed requests per second to a pipeline. |
Cluster inference rate | sum by (pipeline_name) (rate(tensor_throughput_batch_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(tensor_throughput_batch_count{deploy_id="deployment_id"}[10s])) | Number of inferences processed per second. This notably differs from requests per second when batch inference requests are made. |
P50 inference latency | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P50 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
P95 inference latency | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P95 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
P99 inference latency | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P99 total inference time spent per message in an engine; includes transport to and from the sidekick when there is one. |
Engine replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container="engine"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container="engine"}) or vector(0) | Number of engine replicas currently running in a pipeline |
Sidekick replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container=~"engine-sidekick-.*"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container=~"engine-sidekick-.*"}) or vector(0) | Number of sidekick replicas currently running in a pipeline |
Output tokens per second (TPS) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="deployment_id"}[10s])) | LLM output tokens per second: this is the number of tokens generated per second for a LLM deployed in Wallaroo with vLLM |
P99 Time to first token (TTFT) | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P99 time to first token: P99 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P95 Time to first token (TTFT) | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P95 time to first token: P95 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P50 Time to first token (TTFT) | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P50 time to first token: P50 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
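The parameterized queries above are plain Prometheus expressions, so the placeholders can be filled in with ordinary string formatting. A minimal sketch, assuming a hypothetical pipeline name and a 10 second step:

```python
# Hypothetical values for the {pipelineID} and {step} placeholders
pipeline_id = "sample-pipeline-name"
step = 10  # seconds

# Requests per second query from the table above
query_rps = (
    f'sum by (pipeline_name) '
    f'(rate(latency_histogram_ns_count{{pipeline_name="{pipeline_id}"}}[{step}s]))'
)
print(query_rps)
```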
/v1/api/metrics/query (GET)
/v1/api/metrics/query (POST)
For full details, see the Wallaroo MLOps API Reference Guide.
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
time | String | The evaluation timestamp in either RFC3339 format or Unix timestamp. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
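Putting the parameters above together, an instant query can be sketched as below. The helper names are illustrative; `api_endpoint` and `headers` would come from the Wallaroo SDK (`wl.api_endpoint` and `wl.auth.auth_header()`, as shown in the Query Range example later in this guide), and the `/v1/metrics/api/v1/query` path mirrors the `query_range` URL used there.

```python
import requests

def build_query_params(query, time=None, timeout=None):
    """Assemble the parameters described above, skipping unset optional ones."""
    params = {'query': query}
    if time is not None:
        params['time'] = time
    if timeout is not None:
        params['timeout'] = timeout
    return params

def metrics_query(api_endpoint, headers, query, time=None, timeout=None):
    """Run a Prometheus instant query against the Wallaroo metrics endpoint."""
    response = requests.get(
        f"{api_endpoint}/v1/metrics/api/v1/query",
        headers=headers,
        params=build_query_params(query, time, timeout),
    )
    response.raise_for_status()
    return response.json()
```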
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. |
data.result | Array | The query results. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
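A small sketch of handling this response envelope, using the fields above (the function name is illustrative):

```python
def unpack_metrics_response(body):
    """Return the query result on success, or raise with the error details."""
    if body.get('status') == 'success':
        return body['data']['result']
    raise RuntimeError(f"{body.get('errorType')}: {body.get('error')}")

# Example with a success envelope shaped like the Prometheus responses
# shown later in this guide
body = {'status': 'success',
        'data': {'resultType': 'vector', 'result': []}}
print(unpack_metrics_response(body))  # []
```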
/v1/api/metrics/query_range (GET)
/v1/api/metrics/query_range (POST)
For full details, see the Wallaroo MLOps API Reference Guide.
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
start | String | The starting timestamp in either RFC3339 format or Unix timestamp, inclusive. |
end | String | The ending timestamp in either RFC3339 format or Unix timestamp. |
step | String | Query resolution step width in either duration format or as a float number of seconds. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. For query range, always matrix. |
data.result | Array | The query results. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
The following request shows an example of a Query Range request for requests per second. For this example, the following Wallaroo SDK methods are used:

wl.api_endpoint
: Retrieves the API endpoint for the Wallaroo Ops server.

wl.auth.auth_header()
: Retrieves the authentication bearer tokens.

import datetime
import pytz
import requests

# set prometheus requirements
pipeline_id = pipeline_name # the name of the pipeline
step = "1m" # the step of the calculation
# this will also format the timezone in the parsing section
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times
data_start = selected_timezone.localize(datetime.datetime(2025, 8, 4, 9, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 8, 6, 9, 59, 59))
# this is the URL to get prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
query_rps = f'sum by (pipeline_name) (rate(latency_histogram_ns_count{{pipeline_name="{pipeline_id}"}}[{step}]))'
#request parameters
params_rps = {
    'query': query_rps,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_rps = requests.get(query_url, headers=headers, params=params_rps)
if response_rps.status_code == 200:
    print("Requests Per Second Data:")
    display(response_rps.json())
else:
    print("Failed to fetch RPS data:", response_rps.status_code, response_rps.text)
Requests Per Second Data:
{'status': 'success',
'data': {'resultType': 'matrix',
'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
'values': [[1754419440, '0.61195'],
[1754419500, '0.43636363636363634'],
...]}]}}
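The matrix values above are `[unix timestamp, string value]` pairs. A minimal sketch of converting them to local-time rows, using two of the sample values shown:

```python
import datetime
import pytz

# Sample values copied from the output above
result = [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
           'values': [[1754419440, '0.61195'],
                      [1754419500, '0.43636363636363634']]}]

selected_timezone = pytz.timezone("US/Central")

# Convert each [timestamp, value] pair to (local datetime, float)
rows = [
    (datetime.datetime.fromtimestamp(ts, tz=pytz.UTC).astimezone(selected_timezone),
     float(value))
    for series in result
    for ts, value in series['values']
]
for when, rps in rows:
    print(when.isoformat(), rps)
```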
The following shows the cluster inference rate query.
query_inference_rate = f'sum by (pipeline_name) (rate(tensor_throughput_batch_count{{pipeline_name="{pipeline_id}"}}[{step}]))'
# inference rate request parameters
params_inference_rate = {
    'query': query_inference_rate,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_inference_rate = requests.get(query_url, headers=headers, params=params_inference_rate)
if response_inference_rate.status_code == 200:
    print("Cluster Inference Rate Data:")
    display(response_inference_rate.json())
else:
    print("Failed to fetch Inference Rate data:", response_inference_rate.status_code, response_inference_rate.text)
Cluster Inference Rate Data:
{'status': 'success',
'data': {'resultType': 'matrix',
'result': [{'metric': {'pipeline_name': 'metrics-retrieval-tutorial-pipeline'},
'values': [[1754419440, '6274.9353'],
[1754419500, '4474.472727272727'],
...]}]}}
The following example uses the P99 Time to first token (TTFT) query for an LLM deployed with OpenAI Compatibility in Wallaroo.
# this will also format the timezone in the parsing section
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times of 10:00 to 10:15
data_start = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 15, 00))
# this is the URL to get prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
pipeline_name = "llama-3-1-8b-pipeline" # the name of the pipeline
deploy_id = 210 # the deployment id
step = "5m" # the step of the calculation
query_ttft = f'histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{{kubernetes_namespace="{pipeline_name}-{deploy_id}"}}[{step}])) by (le)) * 1000'
print(query_ttft)
#request parameters
params_ttft = {
    'query': query_ttft,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_ttft = requests.get(query_url, headers=headers, params=params_ttft)
if response_ttft.status_code == 200:
    result = response_ttft.json()
    print(result)
else:
    print("Failed to fetch TTFT data:", response_ttft.status_code, response_ttft.text)
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="llama-3-1-8b-pipeline-210"}[5m])) by (le)) * 1000
{'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {}, 'values': [[1752505500, '48.45656000000012'], [1752505800, '39.800000000000004'], [1752506100, 'NaN']]}]}}
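Note the `'NaN'` entry in the last window above: Prometheus returns the string `'NaN'` when no samples fall within a step. A minimal sketch of filtering these out before charting, using the sample values shown:

```python
import math

# Sample [timestamp, value] pairs from the TTFT output above
values = [[1752505500, '48.45656000000012'],
          [1752505800, '39.800000000000004'],
          [1752506100, 'NaN']]

# Drop windows with no samples; float('NaN') parses but is not plottable
clean = [(ts, float(v)) for ts, v in values if not math.isnan(float(v))]
print(clean)
```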
The following tutorials demonstrate creating metrics data and retrieving it using the Wallaroo MLOps API.