Kaggle House prices advanced regression techniques - Ridge Regression

The objective of this post is to train a linear model with scikit-learn to predict housing prices. The target audience is people who have some experience with data science and want to learn more of the basics.

I also wanted to show how to handle categorical and numeric variables in sklearn in a way that simplifies implementing a model in production.

I personally prefer using the OneHotEncoder class instead of pandas.get_dummies, since it makes handling unknown categories much simpler.
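As a minimal sketch of the difference (the Neighborhood values below are just made up for illustration): an encoder fitted with handle_unknown='ignore' maps a category it never saw to an all-zeros row, whereas pandas.get_dummies builds its columns from whatever data it is given, so new data ends up with columns that have to be realigned manually.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Neighborhood": ["CollgCr", "Veenker"]})
new = pd.DataFrame({"Neighborhood": ["Crawfor"]})  # category never seen during training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
print(enc.transform(new).toarray())  # [[0. 0.]] -- the unknown category becomes all zeros

# get_dummies, in contrast, creates a column set that doesn't match the training data
print(pd.get_dummies(new))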

from helpers import *
df = pd.read_csv('data/train.csv')
df.drop(columns="Id", inplace=True)
df.head()
(df.head() output: the first 5 rows across all 80 columns, from MSSubClass through SalePrice.)
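The helpers module itself isn't shown in the post. Judging from the code that follows, it bundles the usual imports plus a couple of small utilities (setdiff and count_na). A minimal stand-in, with assumed names and behavior, could look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from plotnine import *
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def setdiff(a, b):
    """Elements of `a` that are not in `b`, keeping the original order."""
    return [x for x in a if x not in set(b)]

def count_na(df):
    """Missing-value count and percentage per column, sorted descending."""
    miss = df.isna().sum().sort_values(ascending=False)
    return pd.DataFrame({"Miss": miss, "Percent": 100 * miss / len(df)})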

Problem Definition

The objective of the housing prices competition is to predict the price of houses using historical data. This is a great problem to learn about data science as the data has multiple numeric and categorical features and multiple steps are required to achieve a good result.

Target Variable Distribution

The first step for this problem is to understand the target variable, also known as the outcome variable. Since we are predicting a numeric value (housing prices), this is a supervised regression problem.

  • By plotting a histogram we can easily see that most housing prices lie between 50,000 and 400,000.
  • It's also clear from the plot that there are some extreme values in the data. These could be mistakes in the data generation/collection (someone perhaps just added an extra zero!?), or the values could be correct but simply rare.
  • Whether this is a problem depends on the model: linear models are very sensitive to these extreme values, while tree-based models, for example, are not.
  • It's also apparent from the histogram that most of the values lie on the left of the distribution. This is called positive skewness in statistics.

ggplot(df, aes(x="SalePrice")) + geom_histogram()
(figure: histogram of SalePrice)

  • One typical transformation for numeric variables is the log transform. In some cases it helps alleviate the positive skewness of the target variable.
  • After applying the transformation, the histogram looks closer to a normal distribution, which is usually better for most models, especially linear ones.
  • We don't need to delete any points so far; at least there isn't a justification to do so.
df['log_price'] = np.log(df.SalePrice)
ggplot(df, aes(x="log_price")) + geom_histogram()
(figure: histogram of log(SalePrice))

Q-Q plots

Another way to explore the outcome distribution is with Q-Q plots. These directly compare the theoretical quantiles of a normal distribution (red line) with the empirical quantiles of the available data.

  • In the plot on the left (the raw data), the empirical quantiles (blue dots) are quite far from the red line.
  • The plot on the right shows that after the log transformation the data is much closer to the theoretical normal distribution. There are still some values a bit far from the line in the lower tail of the distribution, which suggests the problematic observations might be the "cheaper" houses rather than the most expensive ones.

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,5)) 
stats.probplot(df['SalePrice'], plot=ax1)
stats.probplot(df['log_price'], plot=ax2)
ax1.title.set_text('Q-Q plot Sale Price')
ax2.title.set_text('Q-Q plot log(Sale Price)')
plt.show()
(figure: Q-Q plots of SalePrice and log(SalePrice))

Define lists of categorical and numeric variables

columns = setdiff(list(df.columns), ["SalePrice", "log_price"])
categorical_vars = list(df.select_dtypes('object').columns)
numeric_vars = setdiff(columns, categorical_vars)

NA values

We can see that some variables have almost 99% of their observations missing; still, I would rather leave them in as predictors. Of course, a variable with 99.5% missing values won't be extremely useful on its own, but perhaps it explains the remaining 0.5% of cases very well. For example, the PoolQC variable might be a great predictor of why houses with a pool cost more. By converting it to dummy variables it's possible to keep it as a predictor without overfitting the data. Since it is a categorical variable, a simple way of doing this is to create a new category called "missing".

na_amount = count_na(df)
na_amount.head(10)
             Miss  Percent
PoolQC       1453   99.521
MiscFeature  1406   96.301
Alley        1369   93.767
Fence        1179   80.753
FireplaceQu   690   47.260
LotFrontage   259   17.740
GarageYrBlt    81    5.548
GarageType     81    5.548
GarageQual     81    5.548
GarageCond     81    5.548
df[categorical_vars] = df[categorical_vars].fillna("missing")

Define a pipeline to train a ridge regression

In this dataset we now have categorical variables with no NA values and numeric variables that still contain NA values. Each type of variable requires different preprocessing:

  • Numeric variables: impute missing values, then standardize (convert to zero mean and unit standard deviation).
  • Categorical variables: represent as dummy variables, also known as one-hot encoding or 1-of-k encoding.

Preprocessing

  • For numeric variables I create a scikit-learn pipeline with the SimpleImputer and StandardScaler classes.
  • For categorical variables I simply use the OneHotEncoder class.
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
pipe_num = make_pipeline(imputer, StandardScaler())

preprocessing = make_column_transformer( 
    (pipe_num, numeric_vars),
    (OneHotEncoder(handle_unknown='ignore'), categorical_vars))

Now we have sorted out the preprocessing of two different data types (numeric and categorical variables).

This is a fairly recent approach to solving this problem in scikit-learn (the ColumnTransformer), and it's much more elegant than what was available before.
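A variation worth mentioning, and not what this post does: since the goal is a production-friendly pipeline, the earlier fillna("missing") step could also live inside the pipeline, so that the same treatment is applied automatically at prediction time. A sketch, reusing the objects defined above (preprocessing_alt is just a name for the alternative transformer):

# Alternative categorical branch: constant-fill NAs, then one-hot encode,
# all inside the pipeline instead of mutating the dataframe beforehand.
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
pipe_cat = make_pipeline(cat_imputer, OneHotEncoder(handle_unknown="ignore"))

preprocessing_alt = make_column_transformer(
    (pipe_num, numeric_vars),
    (pipe_cat, categorical_vars))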

Finally, I define the model pipeline: the preprocessing pipeline becomes the first element of the "model" pipeline, followed by a feature selection step and the Ridge model.

pipe = make_pipeline(preprocessing, SelectPercentile(f_regression), Ridge())

Parameters definition

The make_pipeline function creates a Pipeline object in which each step is named, by default, after its class in lowercase. So the SelectPercentile step of the pipeline is named selectpercentile and Ridge is named ridge.

Then, to use GridSearchCV, the keys of the parameter dict must combine the step name, a double underscore (__), and the parameter name, for example selectpercentile__percentile and ridge__alpha. (A quick way to list the valid names is shown after the parameter grid below.)

alphas = np.logspace(-4, 4, 30)
params = {
    "selectpercentile__percentile": [10, 20, 30, 40, 50, 100],
    "ridge__alpha": alphas
}
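If you are not sure what the step names are, Pipeline.get_params() lists every tunable parameter of the pipeline; a quick check on the pipe defined above:

# List the nested parameter names the pipeline exposes for grid search.
nested_params = [name for name in pipe.get_params() if "__" in name]
print(nested_params)  # includes e.g. 'selectpercentile__percentile' and 'ridge__alpha'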

y = df.log_price.values

Cross-validation

Now we are ready to evaluate the model and perform cross-validation.

In regression problems I find the mean absolute error (MAE) more intuitive for evaluating prediction error.

In most cases, mean squared error and mean absolute error move in the same direction anyway.

grid = GridSearchCV(pipe, params, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
grid.fit(df, y)

GridSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ...), ('selectpercentile', ...), ('ridge', ...)]), param_grid={'ridge__alpha': array([1.00000000e-04, ..., 1.00000000e+04]), 'selectpercentile__percentile': [10, 20, 30, 40, 50, 100]}, scoring='neg_mean_absolute_error', refit=True)

Cross-validation error

As we trained the model to predict the log of the sale price, the MAE we get is in log(price) units. This is fine, although it is of course different from the metric used in the Kaggle competition, which is the root mean squared log error (RMSLE); we will come back to that at the end of this section.

I prefer using MAE on the log scale: it is a similar metric, but one I can easily interpret. In this case we get an MAE of 0.098 on the log scale.

print("MAE: {}".format(grid.best_score_ * -1))

MAE: 0.09882329827572524

This error metric still seems fairly abstract. We are trying to predict house prices, so how much money does a 0.098 MAE represent?

Let's assume we have a property worth 500,000 USD. If I subtract 0.098 on the log scale I get 453,324, a difference of 46,676 USD. That represents quite a bit of money in the end.

dif = np.exp(np.log(500000) - 0.098)
dif

453324.4518769599

500000 - 453324

46676

Let's try one more thing: express the difference as a percent change. It comes out close to the amount we subtracted in log space, because for small x, exp(-x) - 1 ≈ -x.

dif / 500000 - 1

-0.09335109624608029

To conclude, we can think of a 0.098 MAE as a ± error of around 46,000 USD on a 500,000 USD house, or roughly 9-10%.
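As a side note, because the pipeline already predicts log(SalePrice), a score close to the competition's RMSLE can be obtained by taking the RMSE of the cross-validated predictions on the log scale (it only differs in using log instead of log(1 + x)). A sketch, assuming the fitted grid, df and y from above:

from sklearn.model_selection import cross_val_predict

# Cross-validated predictions of log(SalePrice) using the best pipeline found.
log_preds = cross_val_predict(grid.best_estimator_, df, y, cv=5)
rmse_log = np.sqrt(np.mean((log_preds - y) ** 2))
print("RMSE on the log scale (close to RMSLE): {:.4f}".format(rmse_log))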

Reviewing model parameters

  • It's trivial to find the best parameters from the cross-validation procedure; this can be done by accessing the best_params_ attribute of the grid object.
  • However, I believe plotting all the test results helps you understand what worked and what didn't, and that is what gives you hints on how to keep improving the model.
grid.best_params_

{'ridge__alpha': 117.21022975334793, 'selectpercentile__percentile': 100}

The plot below is done with the plotnine library. This is a ggplot2 (an R visualization library) port to Python.

It’s great for exploratory analysis. A tutorial can be found in this post.

results = pd.DataFrame(grid.cv_results_)
results['percentile'] = results['param_selectpercentile__percentile'].astype("category")
results['log_alpha'] = results.param_ridge__alpha.apply(lambda x: np.log10(x))
results['mae'] = results.mean_test_score * -1

(ggplot(results, aes(x='log_alpha', y='mae', color='percentile')) + 
  geom_line() +
  xlab("log(alpha)") + 
  ylab("CV - Mean Absolute Error") +
  ggtitle("Cross-Validation ") + scale_color_brewer(type='qual', palette=2) +
  theme_bw()
)
(figure: cross-validation mean absolute error vs. log(alpha), colored by feature-selection percentile)

  • The plot shows that selecting the best 50% of the features gives the same result as selecting all of them.
  • This means that half of the features didn't add anything to the model's predictive power.
  • Normally in ML we prefer smaller models with fewer variables in order to avoid overfitting the data (see the refit sketch below the table).
results.groupby("percentile")['mae'].mean()

percentile
10     0.127
20     0.117
30     0.114
40     0.107
50     0.106
100    0.106
Name: mae, dtype: float64
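If we want the smaller model, we can refit the pipeline with the 50th percentile and the best alpha found by the grid search. A sketch, reusing the pipe, df and y defined earlier (final_model is just a name for this refit):

# Refit a more parsimonious version: keep only the best 50% of features,
# with the alpha selected by cross-validation.
final_model = pipe.set_params(
    selectpercentile__percentile=50,
    ridge__alpha=grid.best_params_["ridge__alpha"])
final_model.fit(df, y)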

To run the notebook you can download the code from GitHub.