Stage 1: Data Exploration And Model Selection

When starting a project, the data scientist focuses on exploration and experimentation rather than on building an immediate production system. This notebook presents a simplified view of this stage.

Resources

The following resources are used as part of this tutorial:

  • data
    • data/seattle_housing_col_description.txt: Describes the columns used as part of the data analysis.
    • data/seattle_housing.csv: Sample data of the Seattle, Washington housing market between 2014 and 2015.
  • code
    • postprocess.py: Formats the data after model inference is complete.
    • preprocess.py: Formats the incoming data for the model.
    • simdb.py: A simulated database to demonstrate sending and receiving queries.
    • wallaroo_client.py: Additional methods used with the Wallaroo instance to create workspaces, etc.

Steps

The following steps are part of this process:

Import Libraries

First we’ll import the libraries we’ll be using to evaluate the data and test different models.

import numpy as np
import pandas as pd

import sklearn
import sklearn.ensemble

import xgboost as xgb

import seaborn
import matplotlib
import matplotlib.pyplot as plt

import simdb # module for the purpose of this demo to simulate pulling data from a database

matplotlib.rcParams["figure.figsize"] = (12,6)

Retrieve Training Data

For training, we will use the data on all houses sold in this market within the last two years. As a reminder, this data is pulled from a simulated database as an example of how to pull from an existing data store.

conn = simdb.simulate_db_connection()
tablename = simdb.tablename

query = f"select * from {tablename} where date > DATE(DATE(), '-24 month') AND sale_price is not NULL"
print(query)
# read in the data
housing_data = pd.read_sql_query(query, conn)

conn.close()
housing_data
select * from house_listings where date > DATE(DATE(), '-24 month') AND sale_price is not NULL
id date list_price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 sale_price
0 7129300520 2022-03-06 221900.0 3 1.00 1180 5650 1.0 0 0 ... 1180 0 1955 0 98178 47.5112 -122.257 1340 5650 221900.0
1 6414100192 2022-05-02 538000.0 3 2.25 2570 7242 2.0 0 0 ... 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639 538000.0
2 5631500400 2022-07-19 180000.0 2 1.00 770 10000 1.0 0 0 ... 770 0 1933 0 98028 47.7379 -122.233 2720 8062 180000.0
3 2487200875 2022-05-02 604000.0 4 3.00 1960 5000 1.0 0 0 ... 1050 910 1965 0 98136 47.5208 -122.393 1360 5000 604000.0
4 1954400510 2022-07-12 510000.0 3 2.00 1680 8080 1.0 0 0 ... 1680 0 1987 0 98074 47.6168 -122.045 1800 7503 510000.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20518 263000018 2021-10-12 360000.0 3 2.50 1530 1131 3.0 0 0 ... 1530 0 2009 0 98103 47.6993 -122.346 1530 1509 360000.0
20519 6600060120 2022-07-17 400000.0 4 2.50 2310 5813 2.0 0 0 ... 2310 0 2014 0 98146 47.5107 -122.362 1830 7200 400000.0
20520 1523300141 2021-11-14 402101.0 2 0.75 1020 1350 2.0 0 0 ... 1020 0 2009 0 98144 47.5944 -122.299 1020 2007 402101.0
20521 291310100 2022-06-09 400000.0 3 2.50 1600 2388 2.0 0 0 ... 1600 0 2004 0 98027 47.5345 -122.069 1410 1287 400000.0
20522 1523300157 2022-03-08 325000.0 2 0.75 1020 1076 2.0 0 0 ... 1020 0 2008 0 98144 47.5941 -122.299 1020 1357 325000.0

20523 rows × 22 columns
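
The date filter in the query above uses SQLite-style date functions. Assuming the simulated database is SQLite-compatible (as simdb suggests), a quick sketch of what that expression evaluates to:

import sqlite3

# assumption: an SQLite-compatible backend. DATE(DATE(), '-24 month') evaluates
# to the date 24 months before today, so the query keeps two years of sales.
con = sqlite3.connect(":memory:")
cutoff = con.execute("select DATE(DATE(), '-24 month')").fetchone()[0]
con.close()
print(cutoff)  # a date string two years back, e.g. '2020-07-15' when run on 2022-07-15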

Data transformations

To improve relative error performance, we will predict on log10 of the sale price.

# predict on log10 price to try to improve relative error performance
housing_data['logprice'] = np.log10(housing_data.sale_price)
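
To see why the log transform targets relative error: a fixed offset in log10 space corresponds to the same percentage error at any price scale. A quick illustration (the 0.05 offset is just an example value):

# a fixed error in log10 space is the same relative error at any price scale
for price in [200_000, 500_000, 1_000_000]:
    recovered = np.power(10, np.log10(price) + 0.05)
    print(f"price={price}: +0.05 in log10 -> {recovered / price:.3f}x the true price")
# each line prints ~1.122x, i.e. roughly a 12% overestimate regardless of price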

From the data, we will create the following features to evaluate:

  • house_age: How old the house is.
  • renovated: Whether the house has been renovated or not.
  • yrs_since_reno: If the house has been renovated, the number of years between its construction and the renovation.

import datetime

thisyear = datetime.datetime.now().year

housing_data['house_age'] = thisyear - housing_data['yr_built']
housing_data['renovated'] =  np.where((housing_data['yr_renovated'] > 0), 1, 0) 
housing_data['yrs_since_reno'] =  np.where(housing_data['renovated'], housing_data['yr_renovated'] - housing_data['yr_built'], 0)

housing_data.loc[:, ['yr_built', 'yr_renovated', 'house_age', 'renovated', 'yrs_since_reno']]
yr_built yr_renovated house_age renovated yrs_since_reno
0 1955 0 67 0 0
1 1951 1991 71 1 40
2 1933 0 89 0 0
3 1965 0 57 0 0
4 1987 0 35 0 0
... ... ... ... ... ...
20518 2009 0 13 0 0
20519 2014 0 8 0 0
20520 2009 0 13 0 0
20521 2004 0 18 0 0
20522 2008 0 14 0 0

20523 rows × 5 columns

Now we pick the model variables and split the data into training and holdout (test) sets.

vars = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
'condition', 'grade', 'sqft_above', 'sqft_basement', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'house_age', 'renovated', 'yrs_since_reno']

outcome = 'logprice'

runif = np.random.default_rng(2206222).uniform(0, 1, housing_data.shape[0])
gp = np.where(runif < 0.2, 'test', 'training')

hd_train = housing_data.loc[gp=='training', :].reset_index(drop=True, inplace=False)
hd_test = housing_data.loc[gp=='test', :].reset_index(drop=True, inplace=False)

# for xgboost, further split the training data into train and val
runif = np.random.default_rng(123).uniform(0, 1, hd_train.shape[0])
xgb_gp = np.where(runif < 0.2, 'val', 'train')

train_features = np.array(hd_train.loc[xgb_gp=='train', vars])
train_labels = np.array(hd_train.loc[xgb_gp=='train', outcome])

val_features = np.array(hd_train.loc[xgb_gp=='val', vars])
val_labels = np.array(hd_train.loc[xgb_gp=='val', outcome])
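
A quick sanity check on the resulting split sizes; the uniform draws fall below 0.2 about 20% of the time, so both splits should be roughly 80/20:

# verify the approximate 80/20 proportions of both splits
print(f"train: {hd_train.shape[0]} rows, holdout: {hd_test.shape[0]} rows")
print(f"xgboost train: {train_features.shape[0]} rows, val: {val_features.shape[0]} rows")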

Postprocessing

Since we are fitting a model to predict log10 price, we need to convert predictions back into price units. We also want to round to the nearest dollar.

def postprocess(log10price):
    return np.rint(np.power(10, log10price))
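
A quick round-trip check that the conversion undoes the log transform (up to rounding):

# inverting the log10 transform recovers the original price
assert postprocess(np.log10(450000.0)) == 450000.0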

Model testing

For the purposes of this demo, let’s say that we require a mean absolute percent error (MAPE) of 15% or less, and that we want to try a few models to decide which one to use.

One could also hyperparameter tune at this stage; for brevity, we’ll omit that in this demo.
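
As a minimal sketch of what tuning could look like here, using sklearn's GridSearchCV (the parameter grid below is illustrative, not tuned for this data):

from sklearn.model_selection import GridSearchCV

# illustrative grid only; these values are assumptions, not recommendations
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [50, 100, 200]}
search = GridSearchCV(
    xgb.XGBRegressor(objective='reg:squarederror'),
    param_grid,
    scoring='neg_mean_absolute_percentage_error',
    cv=3
)
# search.fit(train_features, train_labels)  # then inspect search.best_params_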

XGBoost

First we will test out an XGBoost model.

xgb_model = xgb.XGBRegressor(
    objective = 'reg:squarederror', 
    max_depth=5, 
    base_score = np.mean(hd_train[outcome])
    )

xgb_model.fit( 
    train_features,
    train_labels,
    eval_set=[(train_features, train_labels), (val_features, val_labels)],
    verbose=False,
    early_stopping_rounds=35
)
print(xgb_model.best_score)
print(xgb_model.best_iteration)
print(xgb_model.best_ntree_limit)
0.07793614689092423
99
100
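
Note that on newer XGBoost releases (1.6 and later), passing early_stopping_rounds to fit() is deprecated; it is set on the estimator instead. An equivalent sketch:

# equivalent setup on xgboost >= 1.6, where early stopping is an estimator parameter
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=5,
    base_score=np.mean(hd_train[outcome]),
    early_stopping_rounds=35
)
xgb_model.fit(
    train_features,
    train_labels,
    eval_set=[(train_features, train_labels), (val_features, val_labels)],
    verbose=False
)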

XGBoost Evaluate on holdout

With the sample model created, we will test it against the holdout data. Note that we are calling the postprocess function on the data.

test_features = np.array(hd_test.loc[:, vars])
test_labels = np.array(hd_test.loc[:, outcome])

pframe = pd.DataFrame({
    'pred' : postprocess(xgb_model.predict(test_features)),
    'actual' : postprocess(test_labels)
})

ax = seaborn.scatterplot(
    data=pframe,
    x='pred',
    y='actual',
    alpha=0.2
)
plt.plot(pframe.pred, pframe.pred, color='DarkGreen')  # y = x reference line
plt.title("test")
plt.show()

(Figure: scatter plot of predicted vs. actual sale price on the holdout data, with the y = x reference line in dark green.)

pframe['se'] = (pframe.pred - pframe.actual)**2

pframe['pct_err'] = 100*np.abs(pframe.pred - pframe.actual)/pframe.actual
pframe.describe()
pred actual se pct_err
count 4.094000e+03 4.094000e+03 4.094000e+03 4094.000000
mean 5.340824e+05 5.396937e+05 1.657722e+10 12.857674
std 3.413714e+05 3.761666e+05 1.276017e+11 13.512028
min 1.216140e+05 8.200000e+04 1.000000e+00 0.000500
25% 3.167628e+05 3.200000e+05 3.245312e+08 4.252492
50% 4.568700e+05 4.500000e+05 1.602001e+09 9.101485
75% 6.310372e+05 6.355250e+05 6.575385e+09 17.041227
max 5.126706e+06 7.700000e+06 6.637466e+12 252.097895
rmse = np.sqrt(np.mean(pframe.se))
mape = np.mean(pframe.pct_err)

print(f'rmse = {rmse}, mape = {mape}')
rmse = 128752.54982046234, mape = 12.857674005250548
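
Against the 15% MAPE requirement stated above, this model passes:

# compare against the acceptance threshold of 15% MAPE
print(f"meets 15% MAPE requirement: {mape <= 15.0}")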

Random Forest

The next model to test is Random Forest.

model_rf = sklearn.ensemble.RandomForestRegressor(n_estimators=100, max_depth=5, n_jobs=2, max_samples=0.8)

train_features = np.array(hd_train.loc[:, vars])
train_labels = np.array(hd_train.loc[:, outcome])

model_rf.fit(train_features, train_labels)
RandomForestRegressor(max_depth=5, max_samples=0.8, n_jobs=2)
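
The random forest can then be scored against the same holdout set; a sketch mirroring the XGBoost evaluation above (same postprocess step):

# evaluate the random forest on the holdout set, mirroring the XGBoost evaluation
rf_pframe = pd.DataFrame({
    'pred': postprocess(model_rf.predict(test_features)),
    'actual': postprocess(test_labels)
})
rf_pframe['pct_err'] = 100 * np.abs(rf_pframe.pred - rf_pframe.actual) / rf_pframe.actual
print(f"random forest mape = {np.mean(rf_pframe.pct_err)}")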