Now that we have decided on the type and structure of the model in Stage 1: Data Exploration And Model Selection, this notebook modularizes the steps of that process into a structure that is compatible with production and with Wallaroo.
We have pulled the preprocessing and postprocessing steps out of the training notebook into individual scripts that can also be used when the model is deployed.
Assuming no changes are made to the structure of the model, this notebook, or a script based on this notebook, can then be scheduled to run on a regular basis, to refresh the model with more recent training data. We’d expect to run this notebook in conjunction with the Stage 3 notebook, 03_deploy_model.ipynb. For clarity in this demo, we have split the training/upload task into two notebooks, 02_automated_training_process.ipynb and 03_deploy_model.ipynb.
Resources
The following resources are used as part of this tutorial:
data
data/seattle_housing_col_description.txt: Describes the columns used as part of the data analysis.
data/seattle_housing.csv: Sample data of the Seattle, Washington housing market between 2014 and 2015.
code
postprocess.py: Formats the model's output after inference is complete.
preprocess.py: Formats the incoming data for the model. (An illustrative sketch of both interfaces follows this list.)
simdb.py: A simulated database to demonstrate sending and receiving queries.
wallaroo_client.py: Additional methods used with the Wallaroo instance to create workspaces, etc.
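The exact contents of the preprocessing and postprocessing scripts are not reproduced here; the sketch below is an illustrative guess at their interfaces, based only on how they are used later in this notebook (`preprocess._vars`, `create_features`, and `postprocess`). The real scripts may differ.

```python
# preprocess.py (illustrative sketch -- the real script may differ)
import numpy as np
import pandas as pd

# _vars: the feature columns the model expects, in order (example subset only)
_vars = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Select (and derive, if needed) the model's input columns."""
    return df.loc[:, _vars]

# postprocess.py (illustrative sketch -- the real script may differ)
def postprocess(predictions: np.ndarray) -> np.ndarray:
    """Convert the model's log10(price) predictions back to dollars."""
    return np.power(10, predictions)
```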
Note that the simdb connection is simulated to demonstrate how data would be retrieved from an existing data store. For training, we will use the data on all houses sold in this market within the last two years.
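A minimal sketch of what such a simulated connection might look like, assuming an in-memory SQLite database loaded from the sample CSV (the actual simdb.py may be implemented differently):

```python
# simdb.py (illustrative sketch -- the real module may differ)
import sqlite3
import pandas as pd

tablename = "house_listings"

def simulate_db_connection():
    """Load the sample CSV into an in-memory SQLite database and return the connection."""
    conn = sqlite3.connect(":memory:")
    df = pd.read_csv("data/seattle_housing.csv")
    df.to_sql(tablename, conn, index=False, if_exists="replace")
    return conn
```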
```python
import numpy as np
import pandas as pd
import sklearn
import xgboost as xgb
import seaborn
import matplotlib
import matplotlib.pyplot as plt
import pickle

import simdb  # module for the purpose of this demo to simulate pulling data from a database

from preprocess import create_features  # our custom preprocessing
from postprocess import postprocess     # our custom postprocessing

matplotlib.rcParams["figure.figsize"] = (12, 6)
```
```python
# ignoring warnings for demonstration
import warnings
warnings.filterwarnings('ignore')
```
```python
conn = simdb.simulate_db_connection()
tablename = simdb.tablename

query = f"select * from {tablename} where date > DATE(DATE(), '-24 month') AND sale_price is not NULL"
print(query)

# read in the data
housing_data = pd.read_sql_query(query, conn)
conn.close()

housing_data.loc[:, ["id", "date", "list_price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot"]]
```
select * from house_listings where date > DATE(DATE(), '-24 month') AND sale_price is not NULL
|       | id         | date       | list_price | bedrooms | bathrooms | sqft_living | sqft_lot |
|-------|------------|------------|------------|----------|-----------|-------------|----------|
| 0     | 7129300520 | 2022-10-05 | 221900.0   | 3        | 1.00      | 1180        | 5650     |
| 1     | 6414100192 | 2022-12-01 | 538000.0   | 3        | 2.25      | 2570        | 7242     |
| 2     | 5631500400 | 2023-02-17 | 180000.0   | 2        | 1.00      | 770         | 10000    |
| 3     | 2487200875 | 2022-12-01 | 604000.0   | 4        | 3.00      | 1960        | 5000     |
| 4     | 1954400510 | 2023-02-10 | 510000.0   | 3        | 2.00      | 1680        | 8080     |
| ...   | ...        | ...        | ...        | ...      | ...       | ...         | ...      |
| 20518 | 263000018  | 2022-05-13 | 360000.0   | 3        | 2.50      | 1530        | 1131     |
| 20519 | 6600060120 | 2023-02-15 | 400000.0   | 4        | 2.50      | 2310        | 5813     |
| 20520 | 1523300141 | 2022-06-15 | 402101.0   | 2        | 0.75      | 1020        | 1350     |
| 20521 | 291310100  | 2023-01-08 | 400000.0   | 3        | 2.50      | 1600        | 2388     |
| 20522 | 1523300157 | 2022-10-07 | 325000.0   | 2        | 0.75      | 1020        | 1076     |

20523 rows × 7 columns
Data transformations
To improve relative error performance, we will predict on the log10 of the sale price rather than the raw price.
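The cell that derives this outcome column is not reproduced above; a minimal sketch, assuming the outcome is stored in a logprice column (the name used by the split below):

```python
# derive the log10 outcome used for training (sketch; column name matches the split below)
housing_data['logprice'] = np.log10(housing_data['sale_price'])
```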
```python
# split data into training and test
outcome = 'logprice'
runif = np.random.default_rng(2206222).uniform(0, 1, housing_data.shape[0])
gp = np.where(runif < 0.2, 'test', 'training')

hd_train = housing_data.loc[gp == 'training', :].reset_index(drop=True, inplace=False)
hd_test = housing_data.loc[gp == 'test', :].reset_index(drop=True, inplace=False)

# split the training set into training and validation for xgboost
runif = np.random.default_rng(123).uniform(0, 1, hd_train.shape[0])
xgb_gp = np.where(runif < 0.2, 'val', 'train')
```
Based on the experimentation and testing performed in Stage 1: Data Exploration And Model Selection, XGBoost was selected as the ML model, along with the variables to use for training. The model is generated below and tested against sample data.
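The original training cell is not reproduced here; the following is a minimal sketch of how the xgb_model used in the conversion step might be fit, assuming create_features returns the model's feature matrix and using illustrative (not tuned) hyperparameters.

```python
# Illustrative training sketch -- hyperparameters are placeholders, not the
# values selected in Stage 1. Assumes create_features returns the feature matrix.
train_mask = (xgb_gp == 'train')
val_mask = (xgb_gp == 'val')

train_features = create_features(hd_train.loc[train_mask, :])
train_labels = hd_train.loc[train_mask, outcome]
val_features = create_features(hd_train.loc[val_mask, :])
val_labels = hd_train.loc[val_mask, outcome]

xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
)

# monitor the validation split during training
xgb_model.fit(
    train_features, train_labels,
    eval_set=[(val_features, val_labels)],
    verbose=False,
)

# check the held-out test set (error in log10 space)
test_pred = xgb_model.predict(create_features(hd_test))
rmse = np.sqrt(np.mean((hd_test[outcome] - test_pred) ** 2))
print(f"test RMSE (log10 dollars): {rmse:.4f}")
```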
This step converts the model to ONNX for easy import into Wallaroo.
```python
import onnx
from onnxmltools.convert import convert_xgboost
from skl2onnx.common.data_types import FloatTensorType, DoubleTensorType

import preprocess

# set the number of columns
ncols = len(preprocess._vars)

# derive the opset value
from onnx.defs import onnx_opset_version
from onnxconverter_common.onnx_ex import DEFAULT_OPSET_NUMBER
TARGET_OPSET = min(DEFAULT_OPSET_NUMBER, onnx_opset_version())

# Convert the model to onnx
onnx_model_converted = convert_xgboost(xgb_model, 'tree-based classifier',
                                       [('input', FloatTensorType([None, ncols]))],
                                       target_opset=TARGET_OPSET)

# Save the model
onnx.save_model(onnx_model_converted, "housing_model_xgb.onnx")
```
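As an optional sanity check (an addition for this guide, not part of the original pipeline), the saved ONNX model can be loaded with onnxruntime and run against a few rows to confirm that the converted model produces sensible prices after postprocessing:

```python
# Optional sanity check -- assumes the onnxruntime package is installed.
import onnxruntime as ort

session = ort.InferenceSession("housing_model_xgb.onnx", providers=["CPUExecutionProvider"])

# run a handful of test rows through the converted model;
# "input" matches the tensor name used during conversion above
sample = np.asarray(create_features(hd_test))[:5].astype(np.float32)
onnx_pred = session.run(None, {"input": sample})[0]

# postprocess converts the log10 predictions back to dollar amounts
print(postprocess(onnx_pred))
```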
With the model trained and ready, we can now go to Stage 3: Deploy the Model in Wallaroo.