Wallaroo SDK Essentials Guide: Pipeline Log Management

How to create and manage Wallaroo Pipelines through the Wallaroo SDK

Pipelines have their own set of log files that are retrieved and analyzed as needed, either through:

  • The Pipeline logs method (returns either a pandas DataFrame or an Apache Arrow table).
  • The Pipeline export_logs method (saves either a DataFrame file in JSON format or an Apache Arrow file).

Get Pipeline Logs

Pipeline logs are retrieved through the Pipeline logs method. By default, logs are returned as a DataFrame in reverse chronological order of insertion, with the most recent records displayed first.

Pipeline logs are segmented by pipeline version. For example, adding a new model step to a pipeline, swapping a model out of a pipeline step, etc., generates a new pipeline version. logs method requests return the log records that match the pipeline version for the given parameters. To request the logs of a specific pipeline version, set the start_datetime and end_datetime parameters to the time range of that pipeline version's logs.

This command takes the following parameters.

  • limit (Int, Optional, default: 100): Limits how many log records to display. If there are more pipeline logs than are being displayed, the warning message Pipeline log record limit exceeded is displayed. For example, if 100 log records were requested and there are a total of 1,000, the warning message is displayed.
  • start_datetime and end_datetime (DateTime, Optional): Limits logs to all logs between the start_datetime and end_datetime DateTime parameters. These comply with the Python datetime library for formats such as:
    • datetime.datetime.now()
    • datetime.datetime(2023, 3, 28, 14, 25, 51, 660058, tzinfo=tzutc()) (March 28, 2023 14:25:51.660058 UTC time zone)
    Both parameters must be provided; submitting a logs() request with only start_datetime or end_datetime generates an exception. If start_datetime and end_datetime are provided, even in combination with any other parameter, the records are returned in chronological order, with the oldest record displayed first.
  • dataset (List[String], Optional): The datasets to be returned. The datasets available are:
    • *: Default. This translates to ["time", "in", "out", "anomaly"].
    • time: The DateTime of the inference request.
    • in: All inputs listed as in_{variable_name}.
    • out: All outputs listed as out_{variable_name}.
    • anomaly: Flags whether an anomaly was detected. Anomalies are detected from each pipeline validation that returns True. For full details, see Wallaroo SDK Essentials Guide: Anomaly Detection. The following fields are included in this dataset:
      • count: The number of anomalies detected as an integer. Each pipeline validation that returns True adds to the number of anomalies detected.
      • {validation}: Each pipeline validation added to the pipeline is returned as the field anomaly.{validation}. Validations that return True indicate an anomaly detected based on the validation expression, while False indicates no anomaly found for the validation.
    • metadata: Returns metadata. IMPORTANT NOTE: See Metadata Requests Restrictions for specifications on how this dataset can be used with other datasets.
      • Returns in the metadata.elapsed field a list of times in nanoseconds for:
        • The time to serialize the input.
        • How long each step took.
      • Returns in the metadata.last_model field a dict with:
        • model_name: The name of the model in the pipeline step.
        • model_sha: The sha hash of the model in the pipeline step.
      • Returns in the metadata.partition field:
      • Returns in the metadata.pipeline_version field the pipeline version as a UUID value.
    • metadata.elapsed: IMPORTANT NOTE: See Metadata Requests Restrictions for specifications on how this dataset can be used with other datasets.
      • Returns in the metadata.elapsed field a list of times in nanoseconds for:
        • The time to serialize the input.
        • How long each step took.
  • dataset_exclude (List[String], Optional): Exclude the specified datasets from the returned logs.
  • dataset_separator (Sequence[[String], string], Optional): If set to ".", the returned dataset is flattened.
  • arrow (Boolean, Optional, default: False): If arrow=True, the logs are returned as an Apache Arrow table. If arrow=False, the logs are returned as a pandas DataFrame.

All of the parameters can be used together, but start_datetime and end_datetime must be combined: if one is used, then so must the other. If start_datetime and end_datetime are used with any other parameter, the log results are returned in chronological order of record insertion.

Log requests are limited to around 100K in size. For requests greater than 100K, use the Pipeline export_logs() method.
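
As a minimal sketch, the following shows common logs() parameter combinations, assuming a deployed pipeline object named mainpipeline; the dataset_exclude usage is an assumption based on the parameter description above.

# Most recent 50 log records as a pandas DataFrame (reverse chronological order).
logs = mainpipeline.logs(limit=50)

# Only the time and output fields, returned as an Apache Arrow table.
arrow_logs = mainpipeline.logs(dataset=["time", "out"], arrow=True)

# Everything except the input fields (dataset_exclude usage assumed from the
# parameter description above).
trimmed_logs = mainpipeline.logs(dataset_exclude=["in"])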

Logs include the following standard datasets:

  • time (DateTime): The DateTime the inference request was made.
  • in.{variable}: The input(s) for the inference request. Each input is listed as in.{variable_name}. For example, in.text_input, in.square_foot, in.number_of_rooms, etc.
  • out.{variable}: The output(s) for the inference request, based on the ML model's outputs. Each output is listed as out.{variable_name}. For example, out.maximum_offer_price, out.minimum_asking_price, out.trade_in_value, etc.
  • anomaly.count (Int): How many validation checks were triggered by the inference. For more information, see Wallaroo SDK Essentials Guide: Anomaly Detection.
  • out_{model_name}.{variable}: Only returned when using Pipeline Shadow Deployments. For each model in the shadow deploy step, its output is listed in the format out_{model_name}.{variable}. For example, out_shadow_model_xgb.maximum_offer_price, out_shadow_model_xgb.minimum_asking_price, out_shadow_model_xgb.trade_in_value, etc.
  • out._model_split: Only returned when using A/B Testing; displays the model_name, model_version, and model_sha of the model used for the inference.
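
As a short illustration of working with these standard datasets, the following sketch filters retrieved logs for inferences that triggered at least one validation; the column names are assumed from the examples in this guide.

logs = mainpipeline.logs()

# Keep only inferences where at least one validation returned True.
flagged = logs[logs["anomaly.count"] > 0]
display(flagged[["time", "out.variable", "anomaly.count"]])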

In this example, the logs from the pipeline mainpipeline between two sample dates are retrieved. In this case, all of the time column fields are the same since the inference request was sent as a single batch.

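Here date_start and date_end are assumed to be Python datetime values bracketing the batch, for example:

import datetime
from dateutil.tz import tzutc

# Sample values for illustration; adjust to the time range of the pipeline
# version whose logs are requested.
date_start = datetime.datetime(2023, 4, 24, 18, 0, tzinfo=tzutc())
date_end = datetime.datetime(2023, 4, 24, 19, 0, tzinfo=tzutc())
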
logs = mainpipeline.logs(start_datetime=date_start, end_datetime=date_end)

display(len(logs))
display(logs)

538
    | time | in.tensor | out.variable | anomaly.count
0   | 2023-04-24 18:09:33.970 | [4.0, 2.5, 2900.0, 5505.0, 2.0, 0.0, 0.0, 3.0, 8.0, 2900.0, 0.0, 47.6063, -122.02, 2970.0, 5251.0, 12.0, 0.0, 0.0] | [718013.75] | 0
1   | 2023-04-24 18:09:33.970 | [2.0, 2.5, 2170.0, 6361.0, 1.0, 0.0, 2.0, 3.0, 8.0, 2170.0, 0.0, 47.7109, -122.017, 2310.0, 7419.0, 6.0, 0.0, 0.0] | [615094.56] | 0
2   | 2023-04-24 18:09:33.970 | [3.0, 2.5, 1300.0, 812.0, 2.0, 0.0, 0.0, 3.0, 8.0, 880.0, 420.0, 47.5893, -122.317, 1300.0, 824.0, 6.0, 0.0, 0.0] | [448627.72] | 0
3   | 2023-04-24 18:09:33.970 | [4.0, 2.5, 2500.0, 8540.0, 2.0, 0.0, 0.0, 3.0, 9.0, 2500.0, 0.0, 47.5759, -121.994, 2560.0, 8475.0, 24.0, 0.0, 0.0] | [758714.2] | 0
4   | 2023-04-24 18:09:33.970 | [3.0, 1.75, 2200.0, 11520.0, 1.0, 0.0, 0.0, 4.0, 7.0, 2200.0, 0.0, 47.7659, -122.341, 1690.0, 8038.0, 62.0, 0.0, 0.0] | [513264.7] | 0
... | ... | ... | ... | ...
533 | 2023-04-24 18:09:33.970 | [3.0, 2.5, 1750.0, 7208.0, 2.0, 0.0, 0.0, 3.0, 8.0, 1750.0, 0.0, 47.4315, -122.192, 2050.0, 7524.0, 20.0, 0.0, 0.0] | [311909.6] | 0
534 | 2023-04-24 18:09:33.970 | [5.0, 1.75, 2330.0, 6450.0, 1.0, 0.0, 1.0, 3.0, 8.0, 1330.0, 1000.0, 47.4959, -122.367, 2330.0, 8258.0, 57.0, 0.0, 0.0] | [448720.28] | 0
535 | 2023-04-24 18:09:33.970 | [4.0, 3.5, 4460.0, 16271.0, 2.0, 0.0, 2.0, 3.0, 11.0, 4460.0, 0.0, 47.5862, -121.97, 4540.0, 17122.0, 13.0, 0.0, 0.0] | [1208638.0] | 0
536 | 2023-04-24 18:09:33.970 | [3.0, 2.75, 3010.0, 1842.0, 2.0, 0.0, 0.0, 3.0, 9.0, 3010.0, 0.0, 47.5836, -121.994, 2950.0, 4200.0, 3.0, 0.0, 0.0] | [795841.06] | 0
537 | 2023-04-24 18:09:33.970 | [2.0, 1.5, 1780.0, 4750.0, 1.0, 0.0, 0.0, 4.0, 7.0, 1080.0, 700.0, 47.6859, -122.395, 1690.0, 5962.0, 67.0, 0.0, 0.0] | [558463.3] | 0

538 rows × 4 columns

Metadata Requests Restrictions

The following restrictions are in place when requesting the following datasets:

  • metadata
  • metadata.elapsed
  • metadata.last_model
  • metadata.pipeline_version

Standard Pipeline Steps Log Requests

Affected pipeline steps:

  • add_model_step
  • replace_with_model_step

For log file requests, the following metadata dataset requests for standard pipeline steps are available:

  • metadata

These must be paired with specific columns. * is not available when paired with metadata.

  • in: All input fields.
  • out: All output fields.
  • time: The DateTime the inference request was made.
  • in.{input_fields}: Any input fields (tensor, etc.)
  • out.{output_fields}: Any output fields (out.house_price, out.variable, etc.)
  • anomaly.count: Any anomalies detected from validations.
  • anomaly.{validation}: The validation that triggered the anomaly detection and whether it is True (indicating an anomaly was detected) or False. For more details, see Wallaroo SDK Essentials Guide: Anomaly Detection

For example, the following requests the metadata plus any output fields.

metadatalogs = mainpipeline.logs(dataset=["out","metadata"])
display(metadatalogs.loc[:, ['out.variable', 'metadata.last_model']])
  | out.variable | metadata.last_model
0 | [581003.0] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
1 | [706823.56] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
2 | [1060847.5] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
3 | [441960.38] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
4 | [827411.0] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
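
The dataset_exclude parameter can trim individual fields from such a request. A sketch, under the assumption that metadata subfields can be excluded this way:

# Request metadata but drop the elapsed timings (dataset_exclude usage assumed
# from the parameter description above).
metadatalogs = mainpipeline.logs(dataset=["time", "out", "metadata"],
                                 dataset_exclude=["metadata.elapsed"])
display(metadatalogs.loc[:, ['out.variable', 'metadata.last_model']])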

Shadow Deploy Testing Pipeline Steps

Affected pipeline steps:

  • add_shadow_deploy
  • replace_with_shadow_deploy

For log file requests, the following metadata dataset requests for shadow deploy testing pipeline steps are available:

  • metadata

These must be paired with specific columns. * is not available when paired with metadata, and time must be included in the dataset request.

  • in: All input fields.
  • out: All output fields.
  • time: The DateTime the inference request was made.
  • in.{input_fields}: Any input fields (tensor, etc.).
  • out.{output_fields}: Any output fields matching the specific output_field (out.house_price, out.variable, etc.).
  • out_{model_name}.{output_fields}: The output fields for each shadow deployed challenger step matching the specific output field (for example, out_logcontrolchallenger01.variable, out_logcontrolchallenger02.variable, etc.).
  • anomaly.count: Any anomalies detected from validations.
  • anomaly.{validation}: The validation that triggered the anomaly detection and whether it is True (indicating an anomaly was detected) or False. For more details, see Wallaroo SDK Essentials Guide: Anomaly Detection

The following example retrieves the logs from a pipeline with shadow deployed models, and displays the specific shadow deployed model outputs and the metadata.elapsed field.

# Display metadata

metadatalogs = mainpipeline.logs(dataset=["out_logcontrolchallenger01.variable", 
                                          "out_logcontrolchallenger02.variable", 
                                          "metadata"
                                          ]
                                )

display(metadatalogs.loc[:, ['out_logcontrolchallenger01.variable', 
                             'out_logcontrolchallenger02.variable', 
                             'metadata.elapsed'
                             ]
                        ])
  | out_logcontrolchallenger01.variable | out_logcontrolchallenger02.variable | metadata.elapsed
0 | [573391.1] | [596933.5] | [302804, 26900]
1 | [663008.75] | [594914.2] | [302804, 26900]
2 | [1520770.0] | [1491293.8] | [302804, 26900]
3 | [381577.16] | [411258.3] | [302804, 26900]
4 | [743487.94] | [787589.25] | [302804, 26900]
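
Since the metadata.elapsed values are reported in nanoseconds, a brief sketch converting them to milliseconds for easier reading:

# Convert each [input serialization, per-step] nanosecond timing to milliseconds.
elapsed_ms = metadatalogs["metadata.elapsed"].apply(
    lambda times: [t / 1e6 for t in times]
)
display(elapsed_ms.head())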

A/B Deploy Testing Pipeline Steps

Affected pipeline steps:

  • add_random_split
  • replace_with_random_split

For log file requests, the following metadata dataset requests for A/B testing pipeline steps are available:

  • metadata

These must be paired with specific columns. * is not available when paired with metadata, and time must be included in the dataset request.

  • in: All input fields.
  • out: All output fields.
  • time: The DateTime the inference request was made. Must be requested in all dataset requests.
  • in.{input_fields}: Any input fields (tensor, etc.).
  • out.{output_fields}: Any output fields matching the specific output_field (out.house_price, out.variable, etc.).
  • anomaly.count: Any anomalies detected from validations.
  • anomaly.{validation}: The validation that triggered the anomaly detection and whether it is True (indicating an anomaly was detected) or False. For more details, see Wallaroo SDK Essentials Guide: Anomaly Detection

The following example retrieves the logs from a pipeline with A/B deployed models, and displays the output and the specific metadata.last_model field.

metadatalogs = mainpipeline.logs(dataset=["time",
                                          "out", 
                                          "metadata"
                                          ]
                                )

display(metadatalogs.loc[:, ['out.variable', 'metadata.last_model']])
  | out.variable | metadata.last_model
0 | [581003.0] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
1 | [706823.56] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
2 | [1060847.5] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
3 | [441960.38] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
4 | [827411.0] | {"model_name":"logcontrol","model_sha":"e22a0831aafd9917f3cc87a15ed267797f80e2afa12ad7d8810ca58f173b8cc6"}
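
For A/B tests it is often useful to tally which model served each inference. Assuming metadata.last_model is returned as the JSON string shown above, a minimal sketch:

import json

# Parse the metadata.last_model JSON string and count inferences per model.
model_names = metadatalogs["metadata.last_model"].apply(
    lambda entry: json.loads(entry)["model_name"]
)
display(model_names.value_counts())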

Export Pipeline Logs as File

The Pipeline export_logs method saves the pipeline log records to file, by default as pandas records in Newline Delimited JSON (NDJSON) format, or as an Apache Arrow table file.

By default, the output files are stored in the directory ./logs under the current working directory, with the default file name prefix {pipeline name}-1, {pipeline name}-2, etc.

By default, the file suffix is json for pandas records in Newline Delimited JSON (NDJSON) format. Logs are segmented by pipeline version across the limit, data_size_limit, or start_datetime and end_datetime parameters.

By default, logs are exported as pandas records in NDJSON format in reverse chronological order of insertion, with the most recent log insertions displayed first.

Pipeline logs are segmented by pipeline version. For example, adding a new model step to a pipeline, swapping a model out of a pipeline step, etc., generates a new pipeline version.

This command takes the following parameters.

  • directory (String, Optional, default: logs): The directory, relative to the current working directory, that the log files are exported to.
  • file_prefix (String, Optional, default: the name of the pipeline): The name of the exported files. By default, this will be the name of the pipeline, and is segmented by pipeline version between the limits or the start and end period. For example: logpipeline-1.json, etc.
  • data_size_limit (String, Optional, default: 100MB): The maximum size for the exported data in bytes. Note that file size is approximate to the request; a request of 10MiB may return 10.3MB of data. The fields are in the format "{size as number} {unit value}", and can include a space, so "10 MiB" and "10MiB" are the same. The accepted unit values are:
    • KiB (for kibibytes)
    • MiB (for mebibytes)
    • GiB (for gibibytes)
    • TiB (for tebibytes)
  • limit (Int, Optional, default: 100): Limits how many log records to display. If there are more pipeline logs than are being displayed, the warning message Pipeline log record limit exceeded is displayed. For example, if 100 log records were requested and there are a total of 1,000, the warning message is displayed.
  • start_datetime and end_datetime (DateTime, Optional): Limits logs to all logs between the start_datetime and end_datetime DateTime parameters. These comply with the Python datetime library for formats such as:
    • datetime.datetime.now()
    • datetime.datetime(2023, 3, 28, 14, 25, 51, 660058, tzinfo=tzutc()) (March 28, 2023 14:25:51.660058 UTC time zone)
    Both parameters must be provided; submitting a request with only start_datetime or end_datetime generates an exception. If start_datetime and end_datetime are provided, even in combination with any other parameter, the records are returned in chronological order, with the oldest record displayed first.
  • filename (String, Required): The file name to save the log file to. The requesting user must have write permission to the file location, and the target directory for the file must already exist. For example: if the file is set to /var/wallaroo/logs/pipeline.json, then the directory /var/wallaroo/logs must already exist. Otherwise, file names are only limited by standard file naming rules for the target environment.
  • dataset (List[String], Optional): The datasets to be returned. The datasets available are:
    • *: Default. This translates to ["time", "in", "out", "anomaly"].
    • time: The DateTime of the inference request.
    • in: All inputs listed as in_{variable_name}.
    • out: All outputs listed as out_{variable_name}.
    • anomaly: Flags whether an anomaly was detected. Anomalies are detected from each pipeline validation that returns True. For full details, see Wallaroo SDK Essentials Guide: Anomaly Detection. The following fields are included in this dataset:
      • count: The number of anomalies detected as an integer. Each pipeline validation that returns True adds to the number of anomalies detected.
      • {validation}: Each pipeline validation added to the pipeline is returned as the field anomaly.{validation}. Validations that return True indicate an anomaly detected based on the validation expression, while False indicates no anomaly found for the validation.
    • metadata: Returns metadata. IMPORTANT NOTE: See Metadata Requests Restrictions for specifications on how this dataset can be used with other datasets.
      • Returns in the metadata.elapsed field a list of times in nanoseconds for:
        • The time to serialize the input.
        • How long each step took.
      • Returns in the metadata.last_model field a dict with:
        • model_name: The name of the model in the pipeline step.
        • model_sha: The sha hash of the model in the pipeline step.
      • Returns in the metadata.partition field:
      • Returns in the metadata.pipeline_version field the pipeline version as a UUID value.
    • metadata.elapsed: IMPORTANT NOTE: See Metadata Requests Restrictions for specifications on how this dataset can be used with other datasets.
      • Returns in the metadata.elapsed field a list of times in nanoseconds for:
        • The time to serialize the input.
        • How long each step took.
  • dataset_exclude (List[String], Optional): Exclude the specified datasets from the returned logs.
  • dataset_separator (Sequence[[String], string], Optional): If set to ".", the returned dataset is flattened.
  • arrow (Boolean, Optional, default: False): If arrow=True, the logs are returned as an Apache Arrow table. If arrow=False, the logs are returned as pandas records in NDJSON format that can be imported into a pandas DataFrame.

All of the parameters can be used together, but start_datetime and end_datetime must be combined: if one is used, then so must the other. If start_datetime and end_datetime are used with any other parameter, the log results are returned in chronological order of record insertion.

File sizes are limited to around 10 MB. If the requested log file is greater than 10 MB, a warning is displayed indicating the end date of the downloaded log file so the request can be adjusted to capture the remaining log records.
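
A minimal sketch combining several export_logs parameters; the directory and prefix values are illustrative only, and date_start and date_end are assumed to be datetime values as in the logs() example above.

# Export up to 1,000 records between two dates as Apache Arrow files,
# capping each exported file at roughly 50 MiB.
mainpipeline.export_logs(directory="./logs",
                         file_prefix="mainpipeline-logs",
                         data_size_limit="50 MiB",
                         limit=1000,
                         start_datetime=date_start,
                         end_datetime=date_end,
                         arrow=True)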

In this example, the log files are saved as both pandas records in NDJSON format and as an Apache Arrow table.

# Save the DataFrame version of the log file

mainpipeline.export_logs()
display(os.listdir('./logs'))

mainpipeline.export_logs(arrow=True)
display(os.listdir('./logs'))

    Warning: There are more logs available. Please set a larger limit to export more data.
    

    ['pipeline-logs-1.json']

    Warning: There are more logs available. Please set a larger limit to export more data.
    

    ['pipeline-logs-1.arrow', 'pipeline-logs-1.json']
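
The exported files can then be read back for analysis. A sketch assuming the default file names shown above:

import pandas as pd
import pyarrow as pa

# NDJSON export: one JSON record per line.
df = pd.read_json('./logs/pipeline-logs-1.json', lines=True)

# Apache Arrow export.
reader = pa.ipc.open_file('./logs/pipeline-logs-1.arrow')
arrow_table = reader.read_all()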

Pipeline Log Storage

Pipeline logs have a set allocation of storage space and data requirements.

Pipeline Log Storage Warnings

To prevent storage and performance issues, inference result data may be dropped from pipeline logs according to the following standards:

  • Columns are progressively removed from the row starting with the largest input data size and working to the smallest, then the same for outputs.

For example, Computer Vision ML Models typically have large inputs and output values - a single pandas DataFrame inference request may be over 13 MB in size, and the inference results nearly as large. To prevent pipeline log storage issues, the input may be dropped from the pipeline logs, and if additional space is needed, the inference outputs would follow. The time column is preserved.

If a pipeline has dropped columns to save space, the following warning is displayed when a log request is made, with {columns} replaced by the dropped columns.

The inference log is above the allowable limit and the following columns may have been suppressed for various rows in the logs: {columns}. To review the dropped columns for an individual inferences suppressed data, include dataset=["metadata"] in the log request.

Review Dropped Columns

To review what columns are dropped from pipeline logs for storage reasons, include the dataset metadata in the request to view the column metadata.dropped. This metadata field displays a List of any columns dropped from the pipeline logs.

For example:

metadatalogs = mainpipeline.logs(dataset=["time", "metadata"])
    | time | metadata.dropped
0   | 2023-07-06 15:47:03.673 | 
1   | 2023-07-06 15:47:03.673 | 
2   | 2023-07-06 15:47:03.673 | 
3   | 2023-07-06 15:47:03.673 | 
4   | 2023-07-06 15:47:03.673 | 
... | ... | ...
95  | 2023-07-06 15:47:03.673 | 
96  | 2023-07-06 15:47:03.673 | 
97  | 2023-07-06 15:47:03.673 | 
98  | 2023-07-06 15:47:03.673 | 
99  | 2023-07-06 15:47:03.673 | 
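
In the example above, the metadata.dropped values are empty, indicating no columns were dropped. To isolate only the inferences that did have columns dropped, a minimal sketch (assuming metadata.dropped is returned as a list per row):

# Keep only rows where at least one column was dropped from the log.
dropped = metadatalogs[metadatalogs["metadata.dropped"].apply(lambda d: len(d) > 0)]
display(dropped)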

Suppressed Data Elements

Data elements that do not fit the supported data types, such as None or Null values, are not supported in pipeline logs. When present, undefined data is written in place of the null value, typically zeroes. Null list values are presented as empty lists.