When starting a project, the data scientist focuses on exploration and experimentation, rather than turning the process into an immediate production system. This notebook presents a simplified view of this stage.
Resources
The following resources are used as part of this tutorial:
data
data/seattle_housing_col_description.txt: Describes the columns used as part of the data analysis.
data/seattle_housing.csv: Sample data of the Seattle, Washington housing market between 2014 and 2015.
code
postprocess.py: Formats the data after model inference is complete.
preprocess.py: Formats the incoming data for the model.
simdb.py: A simulated database to demonstrate sending and receiving queries.
wallaroo_client.py: Additional methods used with the Wallaroo instance to create workspaces, etc.
Model Testing: Evaluate different models and determine which is best suited for the problem.
Import Libraries
First we’ll import the libraries we’ll be using to evaluate the data and test different models.
import numpy as np
import pandas as pd
import sklearn
import sklearn.ensemble
import xgboost as xgb
import seaborn
import matplotlib
import matplotlib.pyplot as plt

import simdb  # module for the purpose of this demo to simulate pulling data from a database

matplotlib.rcParams["figure.figsize"] = (12, 6)
# ignoring warnings for demonstration
import warnings
warnings.filterwarnings('ignore')
Retrieve Training Data
For training, we will use the data on all houses sold in this market within the last two years. As a reminder, this data is pulled from a simulated database as an example of how to pull from an existing data store.
Only a few columns will be shown for display purposes.
conn = simdb.simulate_db_connection()
tablename = simdb.tablename
query =f"select * from {tablename} where date > DATE(DATE(), '-24 month') AND sale_price is not NULL"print(query)
# read in the datahousing_data = pd.read_sql_query(query, conn)
conn.close()
housing_data.loc[:, ["id", "date", "list_price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot"]]
select * from house_listings where date > DATE(DATE(), '-24 month') AND sale_price is not NULL
|       | id         | date       | list_price | bedrooms | bathrooms | sqft_living | sqft_lot |
|-------|------------|------------|------------|----------|-----------|-------------|----------|
| 0     | 7129300520 | 2022-10-05 | 221900.0   | 3        | 1.00      | 1180        | 5650     |
| 1     | 6414100192 | 2022-12-01 | 538000.0   | 3        | 2.25      | 2570        | 7242     |
| 2     | 5631500400 | 2023-02-17 | 180000.0   | 2        | 1.00      | 770         | 10000    |
| 3     | 2487200875 | 2022-12-01 | 604000.0   | 4        | 3.00      | 1960        | 5000     |
| 4     | 1954400510 | 2023-02-10 | 510000.0   | 3        | 2.00      | 1680        | 8080     |
| ...   | ...        | ...        | ...        | ...      | ...       | ...         | ...      |
| 20518 | 263000018  | 2022-05-13 | 360000.0   | 3        | 2.50      | 1530        | 1131     |
| 20519 | 6600060120 | 2023-02-15 | 400000.0   | 4        | 2.50      | 2310        | 5813     |
| 20520 | 1523300141 | 2022-06-15 | 402101.0   | 2        | 0.75      | 1020        | 1350     |
| 20521 | 291310100  | 2023-01-08 | 400000.0   | 3        | 2.50      | 1600        | 2388     |
| 20522 | 1523300157 | 2022-10-07 | 325000.0   | 2        | 0.75      | 1020        | 1076     |

20523 rows × 7 columns
Data Transformations
To improve relative error performance, we will predict on the log10 of the sale price.
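A minimal sketch of this transformation, where the raw column name sale_price is taken from the query above and the target name logprice from the split below:

# predict on log10 price to try to improve relative error performance
housing_data['logprice'] = np.log10(housing_data['sale_price'])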
Now we pick variables and split training data into training and holdout (test).
vars = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
        'condition', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'long', 'sqft_living15',
        'sqft_lot15', 'house_age', 'renovated', 'yrs_since_reno']

outcome = 'logprice'

runif = np.random.default_rng(2206222).uniform(0, 1, housing_data.shape[0])
gp = np.where(runif < 0.2, 'test', 'training')

hd_train = housing_data.loc[gp == 'training', :].reset_index(drop=True, inplace=False)
hd_test = housing_data.loc[gp == 'test', :].reset_index(drop=True, inplace=False)

# split the training data into training and val for xgboost
runif = np.random.default_rng(123).uniform(0, 1, hd_train.shape[0])
xgb_gp = np.where(runif < 0.2, 'val', 'train')

# for xgboost, further split into train and val
train_features = np.array(hd_train.loc[xgb_gp == 'train', vars])
train_labels = np.array(hd_train.loc[xgb_gp == 'train', outcome])
val_features = np.array(hd_train.loc[xgb_gp == 'val', vars])
val_labels = np.array(hd_train.loc[xgb_gp == 'val', outcome])
Postprocessing
Since we are fitting a model to predict log10 price, we need to convert predictions back into price units. We also want to round to the nearest dollar.
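The bundled postprocess.py handles this conversion after inference. As a minimal sketch of the idea (not necessarily the file's exact contents):

# undo the log10 transform, then round to the nearest dollar
def postprocess(log10_price):
    return np.rint(np.power(10, log10_price))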
For the purposes of this demo, let’s say that we require a mean absolute percent error (MAPE) of 15% or less, and we want to try a few models to decide which one to use.
One could also hyperparameter tune at this stage; for brevity, we’ll omit that in this demo.
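To make the selection criterion concrete, here is an illustrative sketch of fitting two candidates and scoring them against the 15% threshold; the mape helper and the hyperparameters shown are assumptions for this example, not prescribed by the tutorial:

def mape(actual, predicted):
    # mean absolute percent error, as a percentage
    return 100.0 * np.mean(np.abs(actual - predicted) / actual)

# candidate 1: a random forest fit on the full training set
rf_model = sklearn.ensemble.RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(hd_train[vars], hd_train[outcome])

# candidate 2: xgboost, using the train/val split prepared above
xgb_model = xgb.XGBRegressor(n_estimators=200)
xgb_model.fit(train_features, train_labels, eval_set=[(val_features, val_labels)], verbose=False)

# score each candidate on the holdout set in price units
for name, model in [('random forest', rf_model), ('xgboost', xgb_model)]:
    preds = postprocess(model.predict(np.array(hd_test[vars])))
    print(name, mape(hd_test['sale_price'], preds))

Whichever candidate meets the MAPE requirement on the holdout data would move forward toward deployment.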