Statsmodel Forecast with Wallaroo Features: Model Creation
This tutorial and the assets can be downloaded as part of the Wallaroo Tutorials repository.
This tutorial series demonstrates how to use Wallaroo to create a Statsmodel forecasting model based on bike rentals. The series is broken down into the following notebooks:
- Create and Train the Model: This first notebook shows how the model is trained from existing data.
- Deploy and Sample Inference: With the model developed, we will deploy it into Wallaroo and perform a sample inference.
- Parallel Infer: A sample of multiple weeks of data will be retrieved and submitted as an asynchronous parallel inference. The results will be collected and uploaded to a sample database.
- External Connection: A sample data connection to Google BigQuery to retrieve input data and store the results in a table.
- ML Workload Orchestration: Take all of the previous steps and automate the request into a single Wallaroo ML Workload Orchestration.
Prerequisites
- A Wallaroo instance version 2023.2.1 or greater.
References
- Wallaroo SDK Essentials Guide: Model Uploads and Registrations: Python Models
- Wallaroo SDK Essentials Guide: Pipeline Management
- Wallaroo SDK Essentials: Inference Guide: Parallel Inferences
import pandas as pd
import datetime
import os
from statsmodels.tsa.arima.model import ARIMA
from resources import simdb as simdb
Train the Model
Training starts with the local file day.csv. This data is loaded and prepared for use in training the model.
For this example, the simulated database is controlled by the resources module simdb.
def mk_dt_range_query(*, tablename: str, seed_day: str) -> str:
    assert isinstance(tablename, str)
    assert isinstance(seed_day, str)
    query = f"select cnt from {tablename} where date > DATE(DATE('{seed_day}'), '-1 month') AND date <= DATE('{seed_day}')"
    return query
conn = simdb.get_db_connection()
# create the query
query = mk_dt_range_query(tablename=simdb.tablename, seed_day='2011-03-01')
print(query)
# read in the data
training_frame = pd.read_sql_query(query, conn)
training_frame
select cnt from bikerentals where date > DATE(DATE('2011-03-01'), '-1 month') AND date <= DATE('2011-03-01')
 | cnt |
---|---|
0 | 1526 |
1 | 1550 |
2 | 1708 |
3 | 1005 |
4 | 1623 |
5 | 1712 |
6 | 1530 |
7 | 1605 |
8 | 1538 |
9 | 1746 |
10 | 1472 |
11 | 1589 |
12 | 1913 |
13 | 1815 |
14 | 2115 |
15 | 2475 |
16 | 2927 |
17 | 1635 |
18 | 1812 |
19 | 1107 |
20 | 1450 |
21 | 1917 |
22 | 1807 |
23 | 1461 |
24 | 1969 |
25 | 2402 |
26 | 1446 |
27 | 1851 |
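The date window in the query above can be checked in isolation. The following is a minimal sketch, assuming the simulated database is SQLite-compatible (the `DATE(..., '-1 month')` modifier syntax suggests this, but it is an assumption) and that dates are stored as ISO `YYYY-MM-DD` text. The in-memory table and its made-up counts are hypothetical stand-ins for `simdb`, not the tutorial's data.

```python
import sqlite3
from datetime import date, timedelta


def mk_dt_range_query(*, tablename: str, seed_day: str) -> str:
    # Same query text as the tutorial's helper.
    return (
        f"select cnt from {tablename} "
        f"where date > DATE(DATE('{seed_day}'), '-1 month') "
        f"AND date <= DATE('{seed_day}')"
    )


# Hypothetical stand-in for the simulated database:
# one row per day from 2011-01-01 to 2011-03-31 with a made-up count.
conn = sqlite3.connect(":memory:")
conn.execute("create table bikerentals (date text, cnt integer)")
start = date(2011, 1, 1)
rows = [((start + timedelta(days=i)).isoformat(), 1000 + i) for i in range(90)]
conn.executemany("insert into bikerentals values (?, ?)", rows)

query = mk_dt_range_query(tablename="bikerentals", seed_day="2011-03-01")
window = conn.execute(query).fetchall()

# The window is exclusive of 2011-02-01 and inclusive of 2011-03-01.
print(len(window))
```

With a seed day of 2011-03-01 the window spans 2011-02-02 through 2011-03-01, which is consistent with the 28 rows returned in the training frame above.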
Test the Forecast
The training frame is then converted to JSON and passed to our forecast model as a test.
# test
import forecast
import json
# create the appropriate json
jsonstr = json.dumps(training_frame.to_dict(orient='list'))
print(jsonstr)
forecast.wallaroo_json(jsonstr)
{"cnt": [1526, 1550, 1708, 1005, 1623, 1712, 1530, 1605, 1538, 1746, 1472, 1589, 1913, 1815, 2115, 2475, 2927, 1635, 1812, 1107, 1450, 1917, 1807, 1461, 1969, 2402, 1446, 1851]}
{'forecast': [1764, 1749, 1743, 1741, 1740, 1740, 1740]}
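The input above is a column-oriented JSON object: one key per column, each mapped to a list of values. The sketch below shows one way a receiving function might decode and sanity-check such a payload before forecasting. The helper name `parse_cnt_payload` is hypothetical and is not part of the tutorial's `forecast` module, whose internals are not shown here.

```python
import json


def parse_cnt_payload(jsonstr: str) -> list:
    """Hypothetical helper: decode a column-oriented payload and return the cnt series."""
    payload = json.loads(jsonstr)
    if not isinstance(payload, dict) or "cnt" not in payload:
        raise ValueError("payload must be a JSON object with a 'cnt' column")
    counts = payload["cnt"]
    # Reject empty series and non-numeric entries (bool is excluded explicitly,
    # since it is a subclass of int in Python).
    if not counts or not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in counts
    ):
        raise ValueError("'cnt' must be a non-empty list of numbers")
    return counts


# First three values of the training frame, in the same shape as the input above.
jsonstr = json.dumps({"cnt": [1526, 1550, 1708]})
counts = parse_cnt_payload(jsonstr)
print(counts)
```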
Reload New Model
The forecast model is reloaded in preparation for creating the evaluation data.
import importlib
importlib.reload(forecast)
<module 'forecast' from '/home/jovyan/pipeline_multiple_replicas_forecast_tutorial/forecast.py'>
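The reload step matters because Python caches imported modules: importing a module a second time returns the cached copy, even if the file on disk has changed. The sketch below illustrates this with a throwaway module written to a temporary directory. The module name `demo_forecast_stub` and its contents are hypothetical and unrelated to the tutorial's forecast.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a throwaway module to a temporary directory and make it importable.
workdir = Path(tempfile.mkdtemp())
module_path = workdir / "demo_forecast_stub.py"
module_path.write_text("def describe():\n    return 'v1'\n")
sys.path.insert(0, str(workdir))
importlib.invalidate_caches()

demo = importlib.import_module("demo_forecast_stub")
first = demo.describe()

# Edit the file on disk, as one would while iterating on forecast.py.
module_path.write_text("def describe():\n    return 'version 2'\n")

# Importing again returns the cached module, so the edit is not visible yet.
stale = importlib.import_module("demo_forecast_stub").describe()

# reload() re-executes the source file in the existing module object.
importlib.reload(demo)
fresh = demo.describe()

print(first, stale, fresh)
```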
Prepare evaluation data
For ease of inference, we save the evaluation data to a separate JSON file.
# save off the evaluation frame json, too
import json
with open("./data/testdata_dict.json", "w") as f:
    json.dump(training_frame.to_dict(orient='list'), f)
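Writing the file assumes the `./data` directory already exists; `open` raises `FileNotFoundError` otherwise. The sketch below shows one way to create the directory first and confirm that the saved file round-trips. It writes to a temporary directory rather than `./data`, and the three sample values stand in for the full evaluation frame.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for training_frame.to_dict(orient='list'), truncated to three values.
evaluation_data = {"cnt": [1526, 1550, 1708]}

# A temporary location is used here so the sketch does not touch ./data.
data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing

out_path = data_dir / "testdata_dict.json"
with open(out_path, "w") as f:
    json.dump(evaluation_data, f)

# Read the file back to confirm it contains what was written.
with open(out_path) as f:
    restored = json.load(f)

print(restored == evaluation_data)
```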