Kaggle - Restaurant Revenue Prediction - Random Forest

In this post I’ll go over training a random forest for a supervised regression problem.

The objective is to show how to train a simple model and tune the main parameters of a random forest.

We can think of this model as a baseline, as it requires minimal preprocessing.

from helpers import *
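The helpers module isn’t shown in the post; it presumably re-exports the usual imports (pandas as pd, numpy as np, plotnine’s ggplot and the scikit-learn classes used below) along with a couple of small utilities. A minimal sketch of the two helpers used later, fix_names and setdiff, under those assumptions:

import pandas as pd
import numpy as np

def fix_names(df):
    # lower-case the column names and replace spaces with underscores
    return [col.lower().replace(' ', '_') for col in df.columns]

def setdiff(a, b):
    # elements of a that are not in b, preserving the order of a
    return [x for x in a if x not in b]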

Kaggle provides two datasets: one to train our model, called train, and another one called test that we use only to make predictions. In a “real” problem the Kaggle test data would represent the “new” data we will be making predictions on.

  • train.csv: has the data to train our model.
  • test.csv: we will use it to generate predictions and, in the Kaggle competition, to make a submission.

Preprocessing

I define a function to preprocess the train and test data. The comments inside the function explain what each line does.

def preprocess(file_name):
    
    # read the csv file
    df = pd.read_csv(file_name, parse_dates=['Open Date']) 
    
    # convert column names to lower case and replace spaces with an underscore
    df.columns = fix_names(df) 
    
    # Compute the difference between the max open_date and each restaurant open_date
    df['OpenDays'] = df['open_date'].max() - df['open_date']
    
    # convert this days difference to an integer type
    df['OpenDays'] = df['OpenDays'].astype('timedelta64[D]').astype(int)
    
    # if the file_name contains the string "train", create a column with log(revenue)
    if "train" in file_name:
        df['log_revenue'] = np.log(df.revenue)
        
    # remove the id and open_date columns as they can't be used as predictors
    df.drop(columns=["id", "open_date"], inplace=True)
    
    return df
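A quick note on the OpenDays step: open_date.max() - open_date produces a timedelta column, and .astype('timedelta64[D]').astype(int) converts it to whole days (on recent pandas versions the equivalent is .dt.days). A toy example, not from the original post, to illustrate the conversion:

dates = pd.to_datetime(pd.Series(['2010-01-01', '2014-01-01', '2014-02-01']))
# days between each date and the most recent one
days_open = (dates.max() - dates).dt.days
print(days_open.tolist())  # [1492, 31, 0]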

Now I apply the function to each csv file:

df_train = preprocess(file_name='train.csv')
df_test = preprocess(file_name='test.csv')

df_train.head()
city city_group type p1 p2 p3 p4 p5 p6 p7 p31 p32 p33 p34 p35 p36 p37 revenue OpenDays log_revenue
0 İstanbul Big Cities IL 4 5.0 4.0 4.0 2 2 5 3 4 5 5 4 3 4 5653753.0 5306 15.547830
1 Ankara Big Cities FC 4 5.0 4.0 4.0 1 2 5 0 0 0 0 0 0 0 6923131.0 2172 15.750379
2 Diyarbakır Other IL 2 4.0 2.0 5.0 2 3 5 0 0 0 0 0 0 0 2055379.0 322 14.535971
3 Tokat Other IL 6 4.5 6.0 6.0 4 4 10 12 10 6 18 12 12 6 2675511.0 723 14.799651
4 Gaziantep Other IL 3 4.0 3.0 4.0 2 2 5 1 3 2 3 4 3 3 4316715.0 1722 15.278005

5 rows × 43 columns

What did I accomplish by using a function to preprocess the data?

  • It avoids copy-pasting the same code twice, following the DRY programming principle: don’t repeat yourself.
  • It makes the code more readable, which helps avoid errors.
  • It makes the code reusable.

Target variable

The revenue distribution shows clear positive skewness, so a transformation should help bring the target variable closer to a normal distribution.
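One quick way to check this, not shown in the original post, is to compare the sample skewness of revenue before and after the log transform:

# positive skewness should drop sharply after the log transform
print(df_train.revenue.skew(), df_train.log_revenue.skew())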

ggplot(df_train, aes(x="revenue")) + geom_histogram(bins=20) + ggtitle("Revenue in USD")
[Figure: histogram of revenue in USD]

ggplot(df_train, aes(x="log_revenue")) + geom_histogram(bins=20) + ggtitle("log(revenue)")
[Figure: histogram of log(revenue)]

Model training

In this section I define the numeric and categorical predictors and create a list for each. I exclude the response columns (revenue and log_revenue) from the numeric list, and keep only city_group as a categorical predictor, dropping city and type.

numeric_vars = list(df_train.select_dtypes(include=['int64', 'float64']).columns)
numeric_vars = setdiff(numeric_vars, ['revenue',"log_revenue"])
categorical_vars = list(df_train.select_dtypes(include=['object']).columns)
categorical_vars = setdiff(categorical_vars, ['city', 'type'])
predictors = numeric_vars + categorical_vars
response_var = 'log_revenue'

Here I define DataFrames with the predictors for the train and test data and an array with the response values.

X_train = df_train[predictors]
y_train = df_train[response_var].values
X_test = df_test[predictors]

Now I define a dict with the random forest parameters I will tune: n_estimators (the number of trees), max_depth (the maximum tree depth) and max_features (the fraction of features considered at each split). The randomforestregressor__ prefix tells GridSearchCV which pipeline step each parameter belongs to.

param_grid = dict(
    randomforestregressor__n_estimators=[50, 100, 250, 500, 1000], 
    randomforestregressor__max_depth=[3, 5, 8],
    randomforestregressor__max_features=[0.025, 0.05, 0.1]
)

Finally I define the model pipeline with the following elements:

  • preprocess: handles the preprocessing for categorical (one hot encoding) and numeric variables (scaling).
  • rf: the RandomForestRegressor instance.

With these two elements I’m able to create a pipeline (called pipe) and perform a repeated cross-validation. As this dataset is small, the cross-validation error metric isn’t stable, so we repeat the cross-validation in order to get a stable error estimate. There are other ways around this; one would be using the bootstrap.

rf = RandomForestRegressor(min_samples_leaf=3)

preprocess = make_column_transformer(
    (StandardScaler(), numeric_vars),
    (OneHotEncoder( handle_unknown='ignore', categories="auto"), categorical_vars))

pipe = make_pipeline(preprocess, rf)

from sklearn.model_selection import RepeatedKFold

repeated_cv = RepeatedKFold(n_splits=5, n_repeats=10)

grid = GridSearchCV(pipe, param_grid, cv=repeated_cv, verbose=1, 
                    scoring='neg_mean_absolute_error', n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_estimator_[1]  # the fitted random forest step of the best pipeline

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=8, max_features=0.1, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=3, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

This first model gets a 0.335 MAE on the log scale. We can think of this result as a decent baseline and work to improve it.

grid.best_score_*-1

0.33593938102251286
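Since the model is trained on log(revenue), its predictions come back on the log scale. As a minimal sketch, not part of the original post and assuming the competition’s Id / Prediction submission format, they could be turned into a Kaggle submission like this:

# predict on the log scale, then back-transform to revenue
test_ids = pd.read_csv('test.csv')['Id']   # the raw file still has the Id column
log_preds = grid.predict(X_test)
submission = pd.DataFrame({'Id': test_ids, 'Prediction': np.exp(log_preds)})
submission.to_csv('submission.csv', index=False)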

It’s also possible, and sometimes interesting, to examine the grid.cv_results_ object and explore which parameters worked better. To keep the post shorter I’ll mostly skip this, but it’s fairly simple to do once the results are loaded into a DataFrame.

The best model (shown above) uses max_depth=8, max_features=0.1 and n_estimators=250, the largest max_depth and max_features values in the grid, while the worst combination, selected below, uses max_depth=3, max_features=0.025 and n_estimators=100. The fact that the low-complexity settings perform worst suggests the model is underfitting the data, so we need to increase the model complexity or try other models. In this post I’ll continue working with random forest but will probably add other models in future posts.

results = pd.DataFrame(grid.cv_results_)
# the parameter combination with the lowest (worst) mean test score
results[results.mean_test_score.min() == results.mean_test_score]
mean_fit_time std_fit_time mean_score_time std_score_time param_randomforestregressor__max_depth param_randomforestregressor__max_features param_randomforestregressor__n_estimators params split0_test_score split1_test_score split43_test_score split44_test_score split45_test_score split46_test_score split47_test_score split48_test_score split49_test_score mean_test_score std_test_score rank_test_score
1 0.099458 0.020013 0.013387 0.004725 3 0.025 100 {’randomforestregressor__max_depth’: 3, ’rando… -0.364532 -0.283532 -0.274141 -0.381783 -0.381362 -0.386255 -0.334142 -0.39051 -0.281727 -0.350919 0.04934 45

1 rows × 61 columns
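As a quick illustration of the kind of summary one could do on these results (not in the original post), the mean cross-validated score for each max_depth value:

results.groupby('param_randomforestregressor__max_depth')['mean_test_score'].mean()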

  • The following code shows how to compute the feature importances in the case of a 2-step pipeline.
  • The plot below shows how the feature importance metric (an output of the random forest model) decreases across the sorted variables.

model = grid.best_estimator_[1]   # the fitted random forest
ctran = grid.best_estimator_[0]   # the fitted column transformer
# numeric features keep their names; the categorical feature names come from the one-hot encoder
features = numeric_vars + list(ctran.named_transformers_['onehotencoder'].get_feature_names())
imp = pd.DataFrame(dict(variable = features, imp = model.feature_importances_))
imp = imp.sort_values("imp", ascending=False)
imp.plot(x='variable', y='imp')
[Figure: feature importances by variable, sorted in decreasing order]

These are the top-10 features of the baseline model. We can see that OpenDays has the highest importance, followed by the pxx variables.

imp.head(10)
variable imp
17 OpenDays 0.148338
26 p28 0.079462
20 p6 0.053262
27 p1 0.048812
14 p20 0.041729
6 p2 0.041376
34 p21 0.035650
13 p19 0.034005
32 p11 0.030433
3 p23 0.029722

In order to reuse this code in the next section, I wrap it in a function that computes (and optionally plots) the feature importance.

def get_feature_importance(grid, numeric_vars, plot=False):
    model = grid.best_estimator_[1]
    ctran = grid.best_estimator_[0]
    features = numeric_vars + list(ctran.named_transformers_['onehotencoder'].get_feature_names())
    imp = pd.DataFrame(dict(variable = features, imp = model.feature_importances_))
    imp = imp.sort_values("imp", ascending=False)
    
    if plot:
        imp.plot(x='variable', y='imp')
    
    return imp

Now that we have trained a model it’s time to start exploring the data. I normally prefer to train a model without doing much prior analysis and then explore the variables the model finds most important.

These are scatter plots of the most relevant features against log_revenue:

mdf = df_train[['log_revenue','OpenDays', 'p28', 'p6', 'p1']].melt(id_vars='log_revenue')

(ggplot(mdf, aes(x="value", y='log_revenue')) + 
   geom_point() + 
   geom_smooth() + 
   facet_wrap("~ variable", scales='free_x'))
[Figure: scatter plots of log_revenue against OpenDays, p28, p6 and p1, with smoothed trends]

It’s also interesting to take a quick look at how correlated the pxx variables are. The correlation matrix shows there is indeed substantial correlation among several of them.

cor = df_train[numeric_vars].corr()
cor.head()
p17 p29 p25 p23 p9 p4 p2 p13 p33 p22 p5 p24 p34 p15 p11 p30 p21 p10 p18 p36
p17 1.000000 0.224734 0.855407 0.327883 0.333071 0.235605 0.332134 0.288544 0.716934 0.226162 0.038805 0.814258 0.735778 0.828843 0.274182 0.739895 0.394528 0.324461 0.759491 0.812184
p29 0.224734 1.000000 0.269835 0.393031 0.789784 0.452364 0.322511 0.740722 0.249646 -0.148172 0.306904 0.247974 0.490329 0.299711 0.292211 0.356722 0.267466 0.789510 0.420248 0.457660
p25 0.855407 0.269835 1.000000 0.325615 0.379653 0.285547 0.378042 0.324804 0.765023 0.230894 0.042141 0.935303 0.792694 0.876266 0.285710 0.760051 0.408347 0.370928 0.846651 0.870674
p23 0.327883 0.393031 0.325615 1.000000 0.621546 0.559099 0.513936 0.639188 0.316115 0.316940 0.354058 0.315937 0.470396 0.425270 0.649652 0.475536 0.781189 0.697081 0.416959 0.448681
p9 0.333071 0.789784 0.379653 0.621546 1.000000 0.675336 0.481635 0.905105 0.326869 0.055999 0.465662 0.339668 0.582042 0.379923 0.398431 0.502670 0.559405 0.961266 0.488788 0.551953

5 rows × 38 columns
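To make this easier to see, one could rank the most correlated pairs (a quick sketch, not in the original post):

# keep only the upper triangle of the correlation matrix, then rank the pairs
upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))
upper.stack().sort_values(ascending=False).head(10)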

Creating more features

Here I pick four of the pxx features, guided by the importance metric, and create summaries of them by taking the min, max, mean and total. The idea is to summarize the combination of these variables with a single statistic. This makes the model a bit more complex and gives us a hint as to whether this approach is worth exploring further.

#pvars = [var for var in numeric_vars if var.startswith('p')]
pvars = ['p28', 'p6', 'p29', 'p1']

df_train_pvars = df_train[pvars]

X_train['p_min'] = df_train_pvars.min(axis=1)
X_train['p_max'] = df_train_pvars.max(axis=1)
X_train['p_avg'] = df_train_pvars.mean(axis=1)
X_train['p_tot'] = df_train_pvars.sum(axis=1)
numeric_vars_all = numeric_vars + ['p_min', 'p_max', 'p_avg']  # note p_tot is created above but not added here
rf = RandomForestRegressor(min_samples_leaf=3)

preprocess = make_column_transformer(
    (StandardScaler(), numeric_vars_all),
    (OneHotEncoder( handle_unknown='ignore', categories="auto"), categorical_vars))

pipe = make_pipeline(preprocess, rf)

from sklearn.model_selection import RepeatedKFold

repeated_cv = RepeatedKFold(n_splits=5, n_repeats=10)

grid = GridSearchCV(pipe, param_grid, cv=repeated_cv, verbose=1, 
                    scoring='neg_mean_absolute_error', n_jobs=-1)
grid.fit(X_train, y_train)

grid.best_score_*-1

Fitting 50 folds for each of 45 candidates, totalling 2250 fits

0.33390018275039834

The model gets a slightly lower MAE, 0.333 compared to the previous 0.335. It is of course a small improvement, but still better than nothing. The feature importance metric now shows the new variables among the best ones, which is no surprise as they are summaries of features that were already highly ranked.

get_feature_importance(grid, numeric_vars_all, plot=False).head(10)
variable imp
17 OpenDays 0.134915
40 p_avg 0.088815
26 p28 0.063852
39 p_max 0.044828
14 p20 0.039698
20 p6 0.038070
6 p2 0.035791
27 p1 0.033522
38 p_min 0.032761
34 p21 0.031743

More Feature engineering

In this last section I extend the previous idea of creating summaries of multiple features. I take the top 12 variables from the first model’s feature importance, drop OpenDays (leaving 11 pxx variables) and create every 3-way combination among them, computing the average of each. I call the combinations pvar_combinations and the resulting features comb_features.

# top 12 variables from the first model's importance, dropping OpenDays (the first row)
best_pvars = list(imp.head(12).variable.values)[1:]
import itertools

#pvars = [var for var in numeric_vars if var.startswith('p')]
pvar_combinations = list(itertools.combinations(best_pvars, 3))

len(pvar_combinations)

165

comb_features = ["_".join(comb) for comb in list(pvar_combinations)]

Finally I create the mean of each combination as a new column in the X_train DataFrame. We should now have 165 extra features: the 39 original predictors plus the 4 manual summaries plus the 165 combination means gives 208 columns.

for comb in pvar_combinations:
    var_name = "_".join(comb)
    X_train[var_name] = df_train[list(comb)].mean(axis=1)
X_train.shape

(137, 208)

Now I remove the summary features I created manually (p_min, p_max and p_avg) and rebuild the numeric variable list so it includes the combination features.

X_train.drop(columns=setdiff(numeric_vars_all, numeric_vars), inplace=True)
numeric_vars = list(df_train.select_dtypes(include=['int64', 'float64']).columns)
numeric_vars = setdiff(numeric_vars, ['revenue',"log_revenue"])
numeric_vars_combs = numeric_vars + comb_features
# updated parameter grid with larger max_features values
param_grid = dict(
    randomforestregressor__n_estimators=[100, 250, 500, 1000], 
    randomforestregressor__max_depth=[3, 5, 8],
    randomforestregressor__max_features=[0.1, 0.2]
)

rf = RandomForestRegressor(min_samples_leaf=3)

preprocess = make_column_transformer(
    (StandardScaler(), numeric_vars_combs),
    (OneHotEncoder( handle_unknown='ignore', categories="auto"), categorical_vars))

pipe = make_pipeline(preprocess, rf)

from sklearn.model_selection import RepeatedKFold

repeated_cv = RepeatedKFold(n_splits=5, n_repeats=10)

grid = GridSearchCV(pipe, param_grid, cv=repeated_cv, verbose=1, 
                    scoring='neg_mean_absolute_error', n_jobs=-1)
grid.fit(X_train, y_train)

grid.best_score_*-1

0.3302676654693979

The model with the 165 combination features gets a lower mean absolute error: 0.330 compared to the initial 0.335. A small improvement, but still something. Linear models can probably also benefit from this type of feature engineering.
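One thing to keep in mind: the combination features were only added to X_train, so the test set needs the same columns before this final pipeline can score it. A minimal sketch, not in the original post:

# rebuild the test predictors and add the same combination features
X_test = df_test[predictors].copy()
for comb in pvar_combinations:
    X_test["_".join(comb)] = df_test[list(comb)].mean(axis=1)
log_preds = grid.predict(X_test)   # predictions on the log(revenue) scale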

All the code used in this notebook is available on GitHub.