LLM Metrics Retrieval Example
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
Metrics Retrieval for Llama 3 8B Instruct Tutorial
The following tutorial demonstrates using the Wallaroo MLOps API to retrieve Wallaroo metrics data for a Llama v3 8b model. These requests are compliant with Prometheus API endpoints.
This tutorial demonstrates pulling metrics information for an LLM previously deployed with OpenAI Compatibility in Wallaroo.
For access to these sample models and for a demonstration:
- Contact your Wallaroo Support Representative OR
- Schedule Your Wallaroo.AI Demo Today
Prerequisites
This tutorial assumes the following:
- A Wallaroo Ops environment is installed.
- The Wallaroo SDK is installed. These examples use the Wallaroo SDK to generate the initial inference information for the metrics requests.
Tutorial Steps
This part of the tutorial generates the inference results used for the rest of the tutorial.
Import libraries
The first step is to import the libraries required. This includes the Wallaroo SDK.
import json
import numpy as np
import pandas as pd
import pytz
import datetime
import requests
from requests.auth import HTTPBasicAuth
import wallaroo
Connect to the Wallaroo Instance
A connection to Wallaroo is established via the Wallaroo client. The Python library is included in the Wallaroo install and available through the Jupyter Hub interface provided with your Wallaroo environment.
This is accomplished using the wallaroo.Client() command, which provides a URL to grant the SDK permission to your specific Wallaroo environment. When displayed, enter the URL into a browser and confirm permissions. Store the connection into a variable that can be referenced later.
If logging into the Wallaroo instance through the internal JupyterHub service, use wl = wallaroo.Client(). For more information on Wallaroo Client settings, see the Client Connection guide.
wl = wallaroo.Client()
Wallaroo Dashboard Metrics Retrieval via the Wallaroo MLOps API
The Wallaroo MLOps API allows for metrics retrieval. These are used to track:
- Inference result performance.
- Deployed replicas.
- Inference Latency.
These metrics endpoints are compliant with the Prometheus API endpoints.
Supported Queries
The following queries are supported through the Metrics endpoints. The following references are used here:
- pipelineID: The pipeline's name, retrieved from the Wallaroo SDK with wallaroo.pipeline.Pipeline.name(). For example, pipeline.name() returns sample-pipeline-name.
- deployment_id: The Kubernetes namespace for the deployment.
English Name | Parameterized Query | Example Query | Description |
---|---|---|---|
Requests per second | sum by (pipeline_name) (rate(latency_histogram_ns_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(latency_histogram_ns_count{deploy_id="deployment_id"}[10s])) | Number of processed requests per second to a pipeline. |
Cluster inference rate | sum by (pipeline_name) (rate(tensor_throughput_batch_count{pipeline_name="{pipelineID}"}[{step}s])) | sum by (deploy_id) (rate(tensor_throughput_batch_count{deploy_id="deployment_id"}[10s])) | Number of inferences processed per second. This notably differs from requests per second when batch inference requests are made. |
P50 inference latency | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.50, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P50 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
P95 inference latency | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.95, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P95 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
P99 inference latency | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="{deploy_id}"}[{step_interval}])) by (le)) / 1e6 | histogram_quantile(0.99, sum(rate(latency_histogram_ns_bucket{deploy_id="deployment_id"}[10s])) by (le)) / 1e6 | Histogram for P99 total inference time spent per message in an engine, includes transport to and from the sidekick in the case there is one. |
Engine replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container="engine"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container="engine"}) or vector(0) | Number of engine replicas currently running in a pipeline |
Sidekick replica count | count(container_memory_usage_bytes{namespace="{pipeline_namespace}", container=~"engine-sidekick-.*"}) or vector(0) | count(container_memory_usage_bytes{namespace="deployment_id", container=~"engine-sidekick-.*"}) or vector(0) | Number of sidekick replicas currently running in a pipeline |
Output tokens per second (TPS) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) | sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="deployment_id"}[10s])) | LLM output tokens per second: the number of tokens generated per second for an LLM deployed in Wallaroo with vLLM |
P99 Time to first token (TTFT) | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P99 time to first token: P99 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P95 Time to first token (TTFT) | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P95 time to first token: P95 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
P50 Time to first token (TTFT) | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="{pipeline_namespace}"}[{step_interval}])) by (le)) * 1000 | histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="deployment_id"}[10s])) by (le)) * 1000 | P50 time to first token: P50 for time to generate the first token for LLMs deployed in Wallaroo with vLLM |
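The parameterized queries in the table are plain strings with placeholders, so they can be assembled with small helper functions. The following is a minimal sketch, not part of the Wallaroo SDK: the function names are hypothetical, and the namespace format (pipeline name, a hyphen, then the deployment id) is assumed from the request examples later in this tutorial.

```python
def deployment_namespace(pipeline_name: str, deploy_id: int) -> str:
    """Assumed namespace format: '<pipeline name>-<deployment id>'."""
    return f"{pipeline_name}-{deploy_id}"


def ttft_quantile_query(namespace: str, quantile: float, window: str) -> str:
    """PromQL for a time-to-first-token quantile, converted to milliseconds."""
    return (
        f"histogram_quantile({quantile:.2f}, "
        "sum(rate(vllm:time_to_first_token_seconds_bucket"
        f'{{kubernetes_namespace="{namespace}"}}[{window}])) by (le)) * 1000'
    )


def output_tps_query(namespace: str, window: str) -> str:
    """PromQL for output tokens generated per second."""
    return (
        "sum by (kubernetes_namespace) "
        "(rate(vllm:generation_tokens_total"
        f'{{kubernetes_namespace="{namespace}"}}[{window}]))'
    )


namespace = deployment_namespace("llama-3-1-8b-pipeline", 210)
print(ttft_quantile_query(namespace, 0.99, "5m"))
print(output_tps_query(namespace, "5m"))
```

Keeping the quantile and rate window as arguments lets the same helper produce the P50, P95, and P99 rows of the table.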
Query Metric Request Endpoints
- Endpoints:
  - /v1/metrics/api/v1/query (GET)
  - /v1/metrics/api/v1/query (POST)
For full details, see the Wallaroo MLOps API Reference Guide.
Query Metric Request Parameters
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
time | String | The evaluation timestamp in either RFC3339 format or Unix timestamp. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
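As an illustration of how these parameters fit together, the sketch below builds and sends an instant query. It rests on several assumptions: the helper names are hypothetical, the path mirrors the `/v1/metrics/api/v1/query_range` URL used by the request examples later in this tutorial and should be confirmed against the Wallaroo MLOps API Reference Guide, and the HTTP client is passed in as an argument so that anything with a `requests`-style `get` method can be used.

```python
from typing import Any, Dict, Mapping, Optional, Union

# Assumed path, by analogy with the query_range URL used in this tutorial.
QUERY_PATH = "/v1/metrics/api/v1/query"


def build_instant_query_params(
    query: str,
    at_time: Optional[Union[int, float, str]] = None,
    timeout: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the request parameters, omitting the optional ones when unset."""
    params: Dict[str, Any] = {"query": query}
    if at_time is not None:
        params["time"] = at_time  # RFC3339 string or Unix timestamp
    if timeout is not None:
        params["timeout"] = timeout  # duration format, for example "30s"
    return params


def run_instant_query(
    http: Any,
    api_endpoint: str,
    headers: Mapping[str, str],
    query: str,
    at_time: Optional[Union[int, float, str]] = None,
    timeout: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a GET request through `http` and return the decoded JSON body."""
    url = f"{api_endpoint.rstrip('/')}{QUERY_PATH}"
    params = build_instant_query_params(query, at_time, timeout)
    response = http.get(url, headers=headers, params=params)
    response.raise_for_status()  # surface HTTP-level failures early
    return response.json()
```

With the objects from this tutorial, a call could look like `run_instant_query(requests, wl.api_endpoint, wl.auth.auth_header(), query)`, since the `requests` module itself exposes a compatible `get` function.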
Query Metric Request Returns
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. |
data.result | Array | The query result data. The structure depends on data.resultType. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
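A response body can be checked against these fields before the data is used. The sketch below is one possible approach, assuming the Prometheus HTTP API convention in which a failed request carries `errorType` and `error` fields; the exception class and function name are hypothetical.

```python
from typing import Any, Dict


class MetricsQueryError(RuntimeError):
    """Raised when the metrics endpoint reports status 'error'."""


def unwrap_metrics_response(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return the 'data' section of a successful response, or raise."""
    if body.get("status") != "success":
        error_type = body.get("errorType", "unknown")
        message = body.get("error", "no message provided")
        raise MetricsQueryError(f"{error_type}: {message}")
    return body["data"]


ok = {"status": "success", "data": {"resultType": "matrix", "result": []}}
data = unwrap_metrics_response(ok)
print(data["resultType"])
```

Any entries in `body.get("warnings", [])` do not indicate failure and can be logged separately.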
Query Range Metric Endpoints
- Endpoints:
  - /v1/metrics/api/v1/query_range (GET)
  - /v1/metrics/api/v1/query_range (POST)
Evaluates a Prometheus expression query over a range of time. For full details, see the Wallaroo MLOps API Reference Guide.
Query Range Metric Request Parameters
Parameter | Type | Description |
---|---|---|
query | String | The Prometheus expression query string. |
start | String | The starting timestamp in either RFC3339 format or Unix timestamp, inclusive. |
end | String | The ending timestamp in either RFC3339 format or Unix timestamp, inclusive. |
step | String | Query resolution step width in either duration format or as a float number of seconds. |
timeout | String | The evaluation timeout in duration format (5m for 5 minutes, etc). |
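The start and end parameters accept either representation of the same instant. The sketch below shows both forms for a 10:00 to 10:15 window, along with the largest number of samples a given step can return per series. It uses a fixed UTC-5 offset to stand in for US Central Daylight Time, which is an assumption made to keep the example free of timezone database dependencies; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset standing in for US Central Daylight Time (assumption).
CDT = timezone(timedelta(hours=-5))

start = datetime(2025, 7, 14, 10, 0, tzinfo=CDT)
end = datetime(2025, 7, 14, 10, 15, tzinfo=CDT)


def to_unix(moment: datetime) -> int:
    """Unix timestamp in whole seconds."""
    return int(moment.timestamp())


def to_rfc3339(moment: datetime) -> str:
    """RFC3339 text form; requires a timezone-aware datetime."""
    return moment.isoformat()


def max_samples(start: datetime, end: datetime, step_seconds: float) -> int:
    """Upper bound on samples per series when both bounds are inclusive."""
    return int((end - start).total_seconds() // step_seconds) + 1


print(to_unix(start), to_unix(end))
print(to_rfc3339(start), to_rfc3339(end))
print(max_samples(start, end, 300))
```

A series can return fewer samples than this bound when no data exists at some of the evaluation times.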
Query Range Metric Request Returns
Field | Type | Description |
---|---|---|
status | String | The status of the request: either success or error. |
data | Dict | The response data. |
data.resultType | String | The type of query result. For query range, always matrix. |
data.result | Array | The query result data: one entry per series, each with its metric labels and a list of timestamp and value pairs. |
errorType | String | The error type if status is error. |
error | String | The error message if status is error. |
warnings | Array[String] | An array of warning messages. |
TTFT Metrics Example
The following request shows an example of a Query Range request for time to first token (TTFT). For this example, the following Wallaroo SDK methods are used:
- wl.api_endpoint: Retrieves the API endpoint for the Wallaroo Ops server.
- wl.auth.auth_header(): Retrieves the authentication bearer tokens.
TTFT Query Example
The following example uses the P99 Time to first token (TTFT) query.
For this example, we set the following:
- Data start and data end periods
- Steps of the calculation
- The name and deployment of the Wallaroo pipeline the LLM is deployed in.
# The timezone set here is also used when parsing the results
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times of 10:00 to 10:15
data_start = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 15, 0))
# This is the URL used to retrieve Prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
pipeline_name = "llama-3-1-8b-pipeline" # the name of the pipeline
deploy_id = 210 # the deployment id
step = "5m" # the step of the calculation
query_ttft = f'histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{{kubernetes_namespace="{pipeline_name}-{deploy_id}"}}[{step}])) by (le)) * 1000'
print(query_ttft)
# Request parameters
params_ttft = {
    'query': query_ttft,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_ttft = requests.get(query_url, headers=headers, params=params_ttft)
if response_ttft.status_code == 200:
    result = response_ttft.json()
    print(result)
else:
    print("Failed to fetch TTFT data:", response_ttft.status_code, response_ttft.text)
histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket{kubernetes_namespace="llama-3-1-8b-pipeline-210"}[5m])) by (le)) * 1000
{'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {}, 'values': [[1752505500, '48.45656000000012'], [1752505800, '39.800000000000004'], [1752506100, 'NaN']]}]}}
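The matrix result above pairs Unix timestamps with values encoded as strings, including 'NaN' where no quantile could be computed. The following sketch flattens such a response into rows that are easier to inspect; the function name is hypothetical and the sample is copied from the output above.

```python
import math
from datetime import datetime, timezone
from typing import Any, Dict, List


def matrix_to_rows(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a matrix result into one row per (series, timestamp)."""
    rows = []
    for series in response["data"]["result"]:
        labels = series.get("metric", {})
        for unix_ts, raw_value in series["values"]:
            value = float(raw_value)  # values arrive as strings
            rows.append({
                "labels": labels,
                "time": datetime.fromtimestamp(unix_ts, tz=timezone.utc),
                # Represent missing measurements as None instead of NaN.
                "value": None if math.isnan(value) else value,
            })
    return rows


sample = {'status': 'success', 'data': {'resultType': 'matrix', 'result': [
    {'metric': {}, 'values': [[1752505500, '48.45656000000012'],
                              [1752505800, '39.800000000000004'],
                              [1752506100, 'NaN']]}]}}

for row in matrix_to_rows(sample):
    print(row["time"].isoformat(), row["value"])
```

The timestamps are converted to UTC here; they can be shifted to the timezone selected earlier with `astimezone` if local times are preferred.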
Output tokens per second (TPS)
TPS Query Example
The following example uses the Output tokens per second (TPS) query.
For this example, we set the following:
- Data start and data end periods
- Steps of the calculation
- The name and deployment of the Wallaroo pipeline the LLM is deployed in.
# The timezone set here is also used when parsing the results
timezone = "US/Central"
selected_timezone = pytz.timezone(timezone)
# Define the start and end times of 10:00 to 10:15
data_start = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 0, 0))
data_end = selected_timezone.localize(datetime.datetime(2025, 7, 14, 10, 15, 0))
# This is the URL used to retrieve Prometheus metrics
query_url = f"{wl.api_endpoint}/v1/metrics/api/v1/query_range"
# Retrieve the token
headers = wl.auth.auth_header()
# Convert to UTC and get the Unix timestamps
start_timestamp = int(data_start.astimezone(pytz.UTC).timestamp())
end_timestamp = int(data_end.astimezone(pytz.UTC).timestamp())
pipeline_name = "llama-3-1-8b-pipeline" # the name of the pipeline
deploy_id = 210 # the deployment id
step = "5m" # the step of the calculation
query_tps = f'sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{{kubernetes_namespace="{pipeline_name}-{deploy_id}"}}[{step}]))'
print(query_tps)
# Request parameters
params_tps = {
    'query': query_tps,
    'start': start_timestamp,
    'end': end_timestamp,
    'step': step
}
response_tps = requests.get(query_url, headers=headers, params=params_tps)
if response_tps.status_code == 200:
    result = response_tps.json()
    print(result)
else:
    print("Failed to fetch TPS data:", response_tps.status_code, response_tps.text)
sum by (kubernetes_namespace) (rate(vllm:generation_tokens_total{kubernetes_namespace="llama-3-1-8b-pipeline-210"}[5m]))
{'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric': {'kubernetes_namespace': 'llama-3-1-8b-pipeline-210'}, 'values': [[1752505200, '0'], [1752505500, '0.6707186440677967'], [1752505800, '0.6779661016949152'], [1752506100, '0']]}]}}