Francisca Dias
For this analysis I will use the proportion of response variance accounted for as a criterion for evaluating regression performance in the King County housing study. This criterion is also known as the coefficient of determination or R-squared.
We compute its value in the test set by squaring the correlation of observed and predicted response values.
The dataset represents 21,613 King County Houses. We later split the data into 14,491 training observations and 7,122 test observations.
Relationships between pairs of variables in this problem are shown in the correlation heatmap, which orders explanatory variables (features) by their correlation with the response variable, the log Price, in US Dollars.
From the correlation heatmap we can see the importance of sqft_living and grade in explaining housing values.
When a response variable is positively skewed (as is the median value of homes), regressing the log of the response on the linear predictor often provides a better fitting model than regressing the original response on the linear predictor. That is the reason why our response variable, price, will be converted to log.
A more closer look at the correlation heatmap also reveals strong correlations among predictors.
Correlations among predictors is called multicollinearity and we should pay attention in regression contexts.
I used a model, called "my_model" where I included a polynomial regression. Therefore I can introduce the concept of multicollinearity by using the square and cube of a sqft_living.
This model, which will serve as a baseline for comparing other regression models in this problem, explains 52 percent of response variance in the training set and the same 52 percent in the test set. The response in this problem is the natural logarithm of home prices.
I will analyse various regression models.
I begin with tree-structured regression using the some of the features that I consider relevant. This model fits the training and test set well, explaining 55 percent of response variance in the training set and 53 percent on the test set.
If I include the set of variables from the original model plus a few log variables, to tree-structured regression, the model performs better, by explaining 57 percent of response variance in the training set, and 55 percent in the test set.
Finally I approach the Random Forest that is an ensemble learning method for classification and regression (our case). Fitting the data from the original model, it performs very well on the training set, explaining 91 percent of response variance, but explains only 50 percent on the test set. This big gap suggest over-fitting.
If I include the set of variables from the original model plus a few log variables into the Random Forest it makes a big difference on the test set, where now we get 93 percent on the training set and 59 percent on the test set.
This dataset can be found here.
import warnings
warnings.filterwarnings("ignore")
from __future__ import division, print_function
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import numpy as np # arrays and math functions
from scipy.stats import uniform # for training-and-test split
import statsmodels.api as sm # statistical models (including regression)
import statsmodels.formula.api as smf # R-like model specification
from sklearn.tree import DecisionTreeRegressor # machine learning tree
from sklearn.ensemble import RandomForestRegressor # ensemble method
pd.set_option('display.float_format', lambda x: '%.3f' % x)
houses_df = pd.read_csv("kc_house_data.csv")
houses_df['idx'] = range(len(houses_df)) # for use as index
houses = houses_df.set_index(['idx'])
houses.columns
del houses["id"]
del houses["date"]
houses.columns
total = houses.isnull().sum().sort_values(ascending=False)
percent = (houses.isnull().sum()/houses.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
There are no missing values in this dataset.
Let´s have a first look of the main statistics of the variable we are trying to predict: house prices.
houses['price'].describe()
There is some informartion we can interpret here:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
mu = houses['price'].mean()
sigma = houses['price'].std()
plt.xlabel('House Price')
plt.ylabel('Probability Density')
plt.title(r'$\mathrm{Distribution\ of\ Price:}\ \mu=%.3f,\ \sigma=%.3f$'%(mu,sigma))
sns.distplot(houses['price']);
houses.grade.value_counts()
Instead of describing all statistical attributes of features, it would be easier to plot them in boxplots. They will give us the median, lower and upper quartile and outliers for each values that these features take.
Let´s see the distributions between the following features: floors, waterfront, view, condition and grade.
# Suplots of categorical features v price
sns.set_style('white')
f, axes = plt.subplots(2,2, figsize = (15,15))
# Plot [0,0]
sns.boxplot(data = houses, x = 'floors', y = 'price', ax = axes[0,0])
axes[0,0].set_xlabel('floors')
axes[0,0].set_ylabel('Price')
axes[0,0].set_title('Floors v Price')
# Plot [0,1]
sns.boxplot(x = 'waterfront', y = 'price', data = houses, ax = axes[0,1])
axes[0,1].set_xlabel('waterfront')
axes[0,1].set_ylabel('waterfront')
axes[0,1].set_title('Waterfront v Price')
# Plot [1,0]
sns.boxplot(x = 'view', y = 'price', data = houses, ax = axes[1,0])
axes[1,0].set_xlabel('view')
axes[1,0].set_ylabel('Price')
axes[1,0].set_title('view v Price')
# Plot [1,1]
sns.boxplot(x = 'condition', y = 'price', data = houses, ax = axes[1,1])
axes[1,1].set_xlabel('condition')
axes[1,1].set_ylabel('Price')
axes[1,1].set_title('condition v Price');
The feature Grade takes on more values, therefore I will plot this variable in a another graph.
sns.set(style="white")
f, ax = plt.subplots(figsize=(10, 8))
fig = sns.boxplot(x='grade', y="price", data=houses, color="c")
fig.axis(ymin=0, ymax=7000000);
### floors, waterfront, view, condition, grade not log
### more correlated :
# computed variables for linear model used by my Model
houses['log_value'] = np.log(houses['price'])
houses['sqft_living_squared'] = np.power(houses['sqft_living'], 2)
houses['sqft_living_cubed'] = np.power(houses['sqft_living'], 3)
houses['log_sqft_lot'] = np.log(houses['sqft_lot'])
houses['log_sqft_above'] = np.log(houses['sqft_above'])
houses['log_sqft_living15'] = np.log(houses['sqft_living15'])
houses['log_sqft_lot15'] = np.log(houses['sqft_lot15'])
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="white")
# Compute the correlation matrix
corr = houses.corr()
f, ax = plt.subplots(figsize=(19, 14))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, annot=True, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
# structure of my model for baseline for comparisons
my_model = 'log_value ~ sqft_living + sqft_living_squared + \
sqft_living_cubed + log_sqft_lot + log_sqft_above + log_sqft_living15 + \
log_sqft_lot15 '
# for comparison lets look at a simple model with the original variables
simple_model = 'log_value ~ bedrooms + bathrooms + grade + sqft_above + \
sqft_living15 '
# original variables plus variables that add value for trees
# that is... variables that are not simple monotonic transformations
# of the original explanatory variables
full_model = 'log_value ~ bedrooms + bathrooms + grade + sqft_above + sqft_living15 + \
log_sqft_lot + log_sqft_above + log_sqft_living15 + log_sqft_lot15'
I will split the data into training and test by creating a new column "runiform" that will take values, randomly, from 0 to 1. Then I am imposing a rule in which values above 0.33 wil belong to my train dataset, and values below will be to test set.
Each dataset will still have 21 features, but train set will gave 14491 observations, whereas test will have only 7122 observations.
# employ training-and-test regimen for model validation
np.random.seed(4444)
houses['runiform'] = uniform.rvs(loc = 0, scale = 1, size = len(houses))
houses_train = houses[houses['runiform'] >= 0.33]
houses_test = houses[houses['runiform'] < 0.33]
# check training data frame
print('\nhouses_selected_train data frame (rows, columns): ',houses_train.shape)
print(houses_train.head())
# check test data frame
print('\nhouses_selected_test data frame (rows, columns): ',houses_test.shape)
print(houses_test.head())
# examine the correlations across the variables before we begin modeling
houses_train_df_vars = houses_train.loc[ : ,['log_value', 'bedrooms','condition', 'grade', 'log_sqft_lot', 'log_sqft_living15']]
print(houses_train_df_vars.corr())
houses.columns
#scatterplot
sns.set()
sns.pairplot(houses_train_df_vars, size = 2.5)
plt.show();
# --------------------------------------------
# Linear regression a my model
# --------------------------------------------
# fit the model to the training set
my_model_train_fit = smf.ols(my_model, data = houses_train).fit()
# summary of model fit to the training set
print(my_model_train_fit.summary())
# training set predictions from the model fit to the training set
houses_train['predict_log_value'] = my_model_train_fit.fittedvalues
# test set predictions from the model fit to the training set
houses_test['predict_log_value'] = my_model_train_fit.predict(houses_test)
# compute the proportion of response variance for training data
my_modely_train_result = round(np.power(houses_train['log_value'].corr(houses_train['predict_log_value']),2),3)
print('\nMy model Proportion of Training Set Variance Accounted for: ',my_modely_train_result)
# compute the proportion of response variance
# accounted for when predicting out-of-sample
my_model_test_result = \
round(np.power(houses_test['log_value']\
.corr(houses_test['predict_log_value']),2),3)
print('\nMy model Proportion of Test Set Variance Accounted for: ',\
my_model_test_result)
# --------------------------------------
# Tree-structured regression (simple)
# --------------------------------------
# try tree-structured regression on the original explantory variables
# note that one of the advantages of trees is no need for transformations
# of the explanatory variables... sklearn DecisionTreeRegressor
tree_model_maker = DecisionTreeRegressor(random_state = 9999, max_depth = 5)
y_train = houses_train.loc[:, ['log_value']]
# simple model has five predictors
X_train_simple = houses_train.loc[:, \
['bedrooms', 'bathrooms', 'grade', 'sqft_above', 'sqft_living15']]
X_test_simple = houses_test.loc[:, \
['bedrooms', 'bathrooms', 'grade', 'sqft_above', 'sqft_living15']]
tree_model_fit = tree_model_maker.fit(X_train_simple, y_train)
# compute the proportion of response variance for training data
houses_train['simple_tree_predict_log_value'] =\
tree_model_fit.predict(X_train_simple)
simple_tree_train_result = \
round(np.power(houses_train['log_value']\
.corr(houses_train['simple_tree_predict_log_value']),2),3)
print('\nSimple Tree Proportion of Training Set Variance Accounted for: ',\
simple_tree_train_result)
# compute the proportion of response variance for test data
houses_test['simple_tree_predict_log_value'] =\
tree_model_fit.predict(X_test_simple)
simple_tree_test_result = \
round(np.power(houses_test['log_value']\
.corr(houses_test['simple_tree_predict_log_value']),2),3)
print('\nSimple Tree Proportion of Test Set Variance Accounted for: ',\
simple_tree_test_result)
# --------------------------------------
# Tree-structured regression (full)
# --------------------------------------
# same method as for simple tree
tree_model_maker = DecisionTreeRegressor(random_state = 9999, max_depth = 5)
y_train = houses_train.loc[:, ['log_value']]
# full model has more predictors
X_train_full = houses_train.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above',\
'sqft_living15', 'log_sqft_lot', 'log_sqft_above', 'log_sqft_living15', 'log_sqft_lot15']]
X_test_full = houses_test.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above',\
'sqft_living15', 'log_sqft_lot', 'log_sqft_above', 'log_sqft_living15', 'log_sqft_lot15']]
tree_model_fit = tree_model_maker.fit(X_train_full, y_train)
# compute the proportion of response variance for training data
houses_train['full_tree_predict_log_value'] =tree_model_fit.predict(X_train_full)
full_tree_train_result = round(np.power(houses_train['log_value'].corr(houses_train['full_tree_predict_log_value']),2),3)
print('\nFull Tree Proportion of Training Set Variance Accounted for: ',full_tree_train_result)
# compute the proportion of response variance for test data
houses_test['full_tree_predict_log_value'] = tree_model_fit.predict(X_test_full)
full_tree_test_result = round(np.power(houses_test['log_value'].corr(houses_test['full_tree_predict_log_value']),2),3)
print('\nFull Tree Proportion of Test Set Variance Accounted for: ',full_tree_test_result)
# --------------------------------------
# Random forests (simple)
# --------------------------------------
rf_model_maker = RandomForestRegressor(random_state = 9999)
y_train = houses_train.loc[:, ['log_value']]
# simple model has more predictors
X_train_simple = houses_train.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above', 'sqft_living15']]
X_test_simple = houses_test.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above', 'sqft_living15']]
rf_model_fit = rf_model_maker.fit(X_train_simple, y_train)
# compute the proportion of response variance for training data
houses_train['simple_rf_predict_log_value'] = rf_model_fit.predict(X_train_simple)
simple_rf_train_result = round(np.power(houses_train['log_value'].corr(houses_train['simple_rf_predict_log_value']),2),3)
print('\nSimple Random Forest Prop Training Set Variance Accounted for: ',simple_rf_train_result)
# compute the proportion of response variance for test data
houses_test['simple_rf_predict_log_value'] = rf_model_fit.predict(X_test_simple)
simple_rf_test_result = round(np.power(houses_test['log_value'].corr(houses_test['simple_rf_predict_log_value']),2),3)
print('\nSimple Random Forest Prop of Test Set Variance Accounted for: ',simple_rf_test_result)
# --------------------------------------
# Random forests (full)
# --------------------------------------
rf_model_maker = RandomForestRegressor(random_state = 9999)
y_train = houses_train.loc[:, ['log_value']]
# full model has more predictors
X_train_full = houses_train.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above','sqft_living15', 'log_sqft_lot', 'log_sqft_above', 'log_sqft_living15', 'log_sqft_lot15']]
X_test_full = houses_test.loc[:, ['bedrooms', 'bathrooms', 'grade', 'sqft_above','sqft_living15', 'log_sqft_lot', 'log_sqft_above', 'log_sqft_living15', 'log_sqft_lot15']]
rf_model_fit = rf_model_maker.fit(X_train_full, y_train)
# compute the proportion of response variance for training data
houses_train['full_rf_predict_log_value'] = rf_model_fit.predict(X_train_full)
full_rf_train_result = round(np.power(houses_train['log_value'].corr(houses_train['full_rf_predict_log_value']),2),3)
print('\nFull Random Forest Prop of Training Set Variance Accounted for: ',full_rf_train_result)
# compute the proportion of response variance for test data
houses_test['full_rf_predict_log_value'] = rf_model_fit.predict(X_test_full)
full_rf_test_result = round(np.power(houses_test['log_value'].corr(houses_test['full_rf_predict_log_value']),2),3)
print('\nFull Random Forest Prop of Test Set Variance Accounted for: ',full_rf_test_result)
# --------------------------------------
# Gather results for a single report
# --------------------------------------
# measurement model performance summary
table_data = {'method' : ['Linear regression my Model (1997)',\
'Tree-structured regression (simple model)',\
'Tree-structured regression (full model)',\
'Random forests (simple model)',\
'Random forests (full model)'],\
'Training Set Result' : [my_modely_train_result,\
simple_tree_train_result,\
full_tree_train_result,\
simple_rf_train_result,\
full_rf_train_result
],\
'Test Set Result' : [my_model_test_result,\
simple_tree_test_result,\
full_tree_test_result,\
simple_rf_test_result,\
full_rf_test_result
]}
table_data_frame = pd.DataFrame(table_data,\
columns = ['method', 'Training Set Result', 'Test Set Result'])
print(table_data_frame)