# Kaggle House prices advanced regression techniques - Ridge Regression

The objective of this post is to train a linear model with scikit-learn to predict housing prices. The target audience is people who have some experience with data science and want to learn more of the basics.

I also wanted to show how to handle categorical and numeric variables in sklearn in a way that simplifies implementing a model in production.

I personally prefer using the OneHotEncoder class instead of pandas.get_dummies, as it makes handling unknown categories much simpler.

`from helpers import *`
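
The `helpers` module isn't shown in the post. Below is a minimal sketch of what it plausibly exports, based on the names used later (`setdiff`, `count_na`) and the libraries the snippets rely on; treat it as an assumption, not the original file.

```
# helpers.py: hypothetical reconstruction, not the author's original module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from plotnine import (ggplot, aes, geom_histogram, geom_line, xlab, ylab,
                      ggtitle, scale_color_brewer, theme_bw)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


def setdiff(a, b):
    """Elements of a that are not in b, preserving the order of a."""
    b = set(b)
    return [x for x in a if x not in b]


def count_na(df):
    """Missing-value count and percentage per column, sorted descending."""
    miss = df.isna().sum()
    out = pd.DataFrame({"Miss": miss, "Percent": 100 * miss / len(df)})
    return out.sort_values("Miss", ascending=False)
```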

```
df = pd.read_csv('data/train.csv')
df.drop(columns="Id", inplace=True)
df.head()
```

| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.000 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.000 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.000 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.000 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.000 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.000 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.000 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.000 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.000 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.000 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.000 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.000 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.000 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.000 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
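
Before exploring further, a quick look at the size and column types of the training data (a check that isn't in the original post):

```
# 1460 rows and 80 columns remain after dropping Id
print(df.shape)
print(df.dtypes.value_counts())
```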

## Problem Definition

The objective of the housing prices competition is to predict the price of houses using historical data. This is a great problem to learn about data science as the data has multiple numeric and categorical features and multiple steps are required to achieve a good result.

### Target Variable Distribution

The first step for this problem is to understand the target variable, also known as the outcome variable or class. Since we are predicting a numeric value (housing prices), this is a supervised regression problem.

- By plotting a histogram we can easily see that most housing prices lie in the range of 50,000 to 400,000.
- It's also clear from the plot that there are some extreme values in the data. These could be mistakes in the data generation/collection (someone perhaps just added another zero!?), or the values could be correct but simply rare.
- Whether this is a problem depends on the model: linear models are very sensitive to these extreme values, while tree-based models, for example, shouldn't be affected much.
- It's also apparent from the histogram that most of the values sit to the left of the distribution. This is called positive skewness in statistics.

`ggplot(df, aes(x="SalePrice")) + geom_histogram()`

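To put a number on the skew rather than eyeballing it, pandas exposes a sample skewness estimate (a quick check that isn't in the original post):

```
# Positive values indicate right skew; SalePrice comes out around 1.9
# on this training set, i.e. strongly right-skewed
df.SalePrice.skew()
```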

- One typical transformation used on numeric variables is the log transform. In some cases it helps alleviate the positive skewness of the target variable.
- After applying the transformation, the histogram looks closer to a normal distribution than before, which is normally better for most models, especially linear ones.
- We don't need to delete any points so far; at least there isn't a justification to do so.

```
df['log_price'] = np.log(df.SalePrice)
ggplot(df, aes(x="log_price")) + geom_histogram()
```


### Q-Q plots

Another way to explore the outcome distribution is with Q-Q plots. This is a direct way of comparing the theoretical quantiles of a normal distribution (red line) with the empirical quantiles of the available data.

- In the plot on the left (raw data), the empirical quantiles (blue dots) are quite far from the red line.
- The plot on the right shows that after the log transformation the data is much closer to the theoretical normal distribution. Still, some values on the lower side of the distribution sit a bit far from the line, which suggests the problematic observations might be the "cheaper" houses rather than the most expensive ones.

```
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,5))
stats.probplot(df['SalePrice'], plot=ax1)
stats.probplot(df['log_price'], plot=ax2)
ax1.title.set_text('Q-Q plot Sale Price')
ax2.title.set_text('Q-Q plot log(Sale Price)')
plt.show()
```

### Define a list with categorical and numeric variables

```
columns = setdiff(list(df.columns), ["SalePrice", "log_price"])
categorical_vars = list(df.select_dtypes('object').columns)
numeric_vars = setdiff(columns, categorical_vars)
```
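
A quick count of each group (not in the original post; on this training set it should come out to 43 categorical and 36 numeric feature columns):

```
# 79 feature columns in total, split into categorical and numeric
print(len(categorical_vars), len(numeric_vars))
```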

### NA values

We can see there are some variables with almost 99% of observations missing; still, I would rather leave them in as predictors. Of course a variable that is 99.5% missing won't be extremely useful on its own, but perhaps it explains the remaining 0.5% of cases very well. For example, the PoolQC variable may be a great predictor of why houses with a pool cost more. By converting it to dummy variables it's possible to keep it as a predictor without overfitting the data. Since this is a categorical variable, a simple way of doing so is to create a new category called "missing".

```
na_amount = count_na(df)
na_amount.head(10)
```

Variable | Miss | Percent (%) |
---|---|---|
PoolQC | 1453 | 99.521 |
MiscFeature | 1406 | 96.301 |
Alley | 1369 | 93.767 |
Fence | 1179 | 80.753 |
FireplaceQu | 690 | 47.260 |
LotFrontage | 259 | 17.740 |
GarageYrBlt | 81 | 5.548 |
GarageType | 81 | 5.548 |
GarageQual | 81 | 5.548 |
GarageCond | 81 | 5.548 |

`df[categorical_vars] = df[categorical_vars].fillna("missing")`
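
A quick check (not in the original post) that no NA values remain in the categorical columns:

```
# Should print 0: every categorical NaN is now the explicit category "missing"
print(df[categorical_vars].isna().sum().sum())
```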

## Define a pipeline to train a ridge regression

In this dataset we now have categorical variables with no NA values, and numeric variables that still contain NA values. Each type of variable requires different preprocessing:

- Numeric variables: impute missing values, then standardize (transform to zero mean and unit standard deviation).
- Categorical variables: represent as dummy variables, also known as one-hot encoding or one-of-K encoding.

#### Preprocessing

- For numeric variables I create a scikit-learn `Pipeline` with the `SimpleImputer` and `StandardScaler` classes.
- For categorical variables I simply use the `OneHotEncoder` class.

```
# Median-impute, then standardize the numeric columns; one-hot encode the
# categorical columns, ignoring categories unseen at fit time
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
pipe_num = make_pipeline(imputer, StandardScaler())
preprocessing = make_column_transformer(
    (pipe_num, numeric_vars),
    (OneHotEncoder(handle_unknown='ignore'), categorical_vars))
```
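
As a sanity check (not in the original notebook), you can fit the transformer on its own and look at the width of its output; the one-hot step expands each categorical column into one column per category:

```
# Fit just the preprocessing step and inspect the resulting feature matrix
X_check = preprocessing.fit_transform(df)
print(X_check.shape)  # (n_rows, n_numeric + total one-hot categories)
```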

Now we have sorted out the preprocessing of the two data types (numeric and categorical variables).

This `ColumnTransformer` approach is a fairly recent addition to scikit-learn, and it's much more elegant than what was possible before.

Finally I define the model pipeline: the preprocessing pipeline above becomes the first element of the "model" pipeline, followed by a feature selection step and the `Ridge` model.

`pipe = make_pipeline(preprocessing, SelectPercentile(f_regression), Ridge())`

### Parameters definition

The `make_pipeline` function creates a `Pipeline` object whose steps are named by default after each step's class in lower case. So the `SelectPercentile` step is named `selectpercentile`, and `Ridge` is named `ridge`.

Then, if we want to use `GridSearchCV`, we build a `dict` whose keys join the step name and the parameter name with a double underscore, for example `selectpercentile__percentile` and `ridge__alpha`.
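
If you're ever unsure which names `make_pipeline` generated, you can list them directly (a quick check, not in the original post):

```
# Prints the auto-generated step names:
# ['columntransformer', 'selectpercentile', 'ridge']
print([name for name, _ in pipe.steps])
```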

```
alphas = np.logspace(-4, 4, 30)
params = {
    "selectpercentile__percentile": [10, 20, 30, 40, 50, 100],
    "ridge__alpha": alphas
}
y = df.log_price.values
```

## Cross-validation

Now we are ready to evaluate the model and perform cross-validation.

In regression problems I find the mean absolute error more intuitive for judging prediction error.

In most cases mean squared error and mean absolute error point in the same direction anyway.

```
grid = GridSearchCV(pipe, params, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
grid.fit(df, y)
```


## Cross-validation error

As we trained the model to predict the log of the sale price, the MAE we get is in log(price) units. This is fine, though it is of course different from the metric used in the Kaggle competition, which is root mean squared log error.

I prefer MAE on the log scale: it is a similar metric but one I can easily interpret. In this case we get a 0.098 MAE on the log scale.

`print("MAE: {}".format(grid.best_score_ * -1))`

MAE: 0.09882329827572524

This error metric still seems fairly abstract. We are trying to predict house prices, so how much money does the 0.098 MAE represent!?

Let's assume we have a property worth 500,000 USD. If I subtract 0.098 on the log scale I get 453,324, a difference of 46,676 USD. That represents quite a bit of money in the end…

```
dif = np.exp(np.log(500000) - 0.098)
dif
```

453324.4518769599

`500000 - 453324`

46676

Let's try one more thing: compare the value that remains after applying the -0.098 in log space with the original price, as a percent change. The result is similar to the amount we subtracted in log space.

`dif / 500000 - 1`

-0.09335109624608029
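
This is no coincidence: a log-scale error of ε corresponds to a relative error of exp(ε) - 1, independent of the price level. A quick check of the identity (not in the original post):

```
# exp(-0.098) - 1 ≈ -0.0934: the same relative change at any price level
np.exp(-0.098) - 1
```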

To conclude, we can think of a 0.098 MAE as a ± error of around 46,000 USD on a 500,000 USD house, or roughly a 10% relative error at any price level.

## Reviewing model parameters

- It's trivial to find the best parameters from the cross-validation procedure: just access the `best_params_` attribute of the `grid` object.
- However, I believe plotting all the test results helps you understand what worked and what didn't. This is what will give you a hint on how to keep improving the model.

`grid.best_params_`

{'ridge__alpha': 117.21022975334793, 'selectpercentile__percentile': 100}

The plot below is done with the `plotnine` library, a Python port of `ggplot2` (an R visualization library). It's great for exploratory analysis. A tutorial can be found in this post.

```
results = pd.DataFrame(grid.cv_results_)
results['percentile'] = results['param_selectpercentile__percentile'].astype("category")
results['log_alpha'] = results.param_ridge__alpha.apply(lambda x: np.log10(x))
results['mae'] = results.mean_test_score * -1
(ggplot(results, aes(x='log_alpha', y='mae', color='percentile')) +
    geom_line() +
    xlab("log(alpha)") +
    ylab("CV - Mean Absolute Error") +
    ggtitle("Cross-Validation") +
    scale_color_brewer(type='qual', palette=2) +
    theme_bw()
)
```


- The plot shows that selecting the best 50% of the features gives the same result as selecting all of them.
- This means half the features added nothing useful to the model's predictive power.
- Normally in ML we prefer smaller models with fewer variables in order to avoid overfitting the data.

`results.groupby("percentile")['mae'].mean()`

percentile | mae |
---|---|
10 | 0.127 |
20 | 0.117 |
30 | 0.114 |
40 | 0.107 |
50 | 0.106 |
100 | 0.106 |
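
If you want to inspect which encoded features a given percentile keeps, the refitted best estimator exposes the selector's boolean mask (a quick sketch, not in the original post; with `percentile=100` every feature survives, so it's most informative for smaller percentiles):

```
# Mask over the post-encoding feature columns of the best (refit) pipeline
mask = grid.best_estimator_.named_steps['selectpercentile'].get_support()
print(mask.sum(), "of", mask.size, "features kept")
```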

To run the notebook you can download the code from GitHub.
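
The post ends here, but for completeness, here is a hedged sketch of how one might turn the fitted grid into a Kaggle submission. It assumes the competition's `data/test.csv` sits next to `data/train.csv`; this step is not part of the original notebook.

```
# Hypothetical submission step: predictions are in log(price) units,
# so exponentiate to get back to the price scale
test = pd.read_csv('data/test.csv')
test[categorical_vars] = test[categorical_vars].fillna("missing")
preds = np.exp(grid.predict(test))
pd.DataFrame({'Id': test['Id'], 'SalePrice': preds}).to_csv('submission.csv', index=False)
```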