7 min read

Backtesting a machine learning strategy in R

Models with quantmod

The idea of this post is to give a simple example of how to use a machine learning algorithm such as random forest to predict returns of the next day. This would serve as an example to explore how the quantmod model functions work, these are specifyModel, buildModel and tradeModel. This is a very simple example based on the help of ?buildModel but should give you an idea to get started.

Train a random forest model for trading

I start by loading the libraries needed for this example and getting some data. In this case I’ll use the Nasdaq-100 ETF ticker, QQQ.

options("getSymbols.warning4.0"=FALSE)
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.
QQQ = getSymbols('QQQ', src='yahoo', auto.assign=FALSE)
head(QQQ, 3)
##            QQQ.Open QQQ.High QQQ.Low QQQ.Close QQQ.Volume QQQ.Adjusted
## 2007-01-03    43.46    44.06   42.52     43.24  167689500     38.32703
## 2007-01-04    43.30    44.21   43.15     44.06  136853500     39.05387
## 2007-01-05    43.95    43.95   43.48     43.85  138958800     38.86772

In supervised learning a common approach is to define a target variable and a set of features (also called predictors). In this case the intraday return is the response (close/open - 1) and the predictors are based on technical indicators. The idea is to predict what would the return be for the next day given the indicators. Once I trained model I’ll be able to predict the return of the next day. If I expect to have positive returns, then I’ll go long, otherwise I’ll go short at the next day open.

# Define simple return

R = Cl(QQQ)/Op(QQQ) - 1
y = Next(R, k=1)

Here I define the features. This part normally would be more complex but the idea is to give a simple example in this post cover how quantmod works.

# skip 200 rows to avoid having missing values  
skip_rows = 200

# Features
sma50 = SMA(Cl(QQQ), n=50)
sma150 = SMA(Cl(QQQ), n=150)

X = cbind(Lag(OpHi(QQQ), 0:3),
          ROC(SMA(Cl(QQQ), n=2)),
          ROC(SMA(Cl(QQQ), n=15)),
          ROC(sma50, n=1),
          ROC(sma50, n=5),
          ROC(sma150, n=1),
          ROC(sma150, n=5))

# Skip rows with NAs
y = y[skip_rows:nrow(y), ]
X = X[skip_rows:nrow(X), ]

I’ll add some variable names to recognize them later on. X represent the features or predictors.

names(X) = c("OpHi_t0", "OpHi_tm1", "OpHi_tm2", "OpHi_tm3", 
             'sma2_roc', 'sma15_roc', 
             'sma50_roc', 'sma50_roc5',
             'sma150_roc', 'sma150_roc5')
head(X, 2)
##                OpHi_t0    OpHi_tm1    OpHi_tm2    OpHi_tm3    sma2_roc
## 2007-10-17 0.000746009 0.009282023 0.001865286 0.012098279 0.004048779
## 2007-10-18 0.010500600 0.000746009 0.009282023 0.001865286 0.008514671
##              sma15_roc   sma50_roc sma50_roc5  sma150_roc sma150_roc5
## 2007-10-17 0.002830327 0.002111031 0.01043326 0.001490284 0.007197906
## 2007-10-18 0.002784420 0.001986098 0.01048550 0.001528517 0.007358285

I use the formula interface to define a model.

model = specifyModel(y ~ X)
model
## 
## quantmod object:     Build date:   
## 
## Model Specified: 
##      y ~ X 
## 
## Model Target:  y          Product:  y 
## Model Inputs:   
## 
## Fitted Model: 
## 
##  None Fitted

The model is an S4 object therefore I need the slot syntax to access the object elements.

model@model.formula
## y ~ X.OpHi_t0 + X.OpHi_tm1 + X.OpHi_tm2 + X.OpHi_tm3 + X.sma2_roc + 
##     X.sma15_roc + X.sma50_roc + X.sma50_roc5 + X.sma150_roc + 
##     X.sma150_roc5
## <environment: 0x55f8947119a0>

Then I “build” the model. Here I pass the model that has the formula definition and select the method, randomForest. It’s possible to pass parameters of the method, for this example I use mtry=3 and ntree=250. Finally I select the training period, see the training.per argument.

model = buildModel(model, 
                   method='randomForest',  
                   mtry=4, # number of variables for random forest to try
                   ntree=250, # number of trees
                   training.per=c('2018-03-01', '2019-03-31'))
## loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

I can access the fitted.model slot with the @ symbol. In this case we can inspect the feature importance output random forest generates.

model@fitted.model[['importance']]
##               IncNodePurity
## X.OpHi_t0       0.004390306
## X.OpHi_tm1      0.003073909
## X.OpHi_tm2      0.002836356
## X.OpHi_tm3      0.003316077
## X.sma2_roc      0.004224781
## X.sma15_roc     0.005033255
## X.sma50_roc     0.003387254
## X.sma50_roc5    0.003585903
## X.sma150_roc    0.003474479
## X.sma150_roc5   0.002731041

Results of the trading strategy.

trades = tradeModel(model, exclude.training=TRUE, plot.model=TRUE, 
                    trade.dates=c("2019-04-01", "2019-12-01"))
trades
## 
##   Model:  randomForest1592954717.39087 
## 
##   C.A.G.R.:  0.83%   H.P.R.:  0.79% 
## 
##   Returns by period summary:
## 
##             weekly monthly quarterly yearly
##     Max.     4.65%   8.68%     5.80%  1.06%
##     3rd Qu.  0.67%   0.26%     3.23%  1.06%
##     Mean     0.04%   0.19%     0.45%  1.06%
##     Median  -0.00%  -0.38%     0.65%  1.06%
##     2rd Qu. -0.72%  -1.62%    -2.22%  1.06%
##     Min.    -2.70%  -3.09%    -5.10%  1.06%
## 
##   Period to date returns:
## 
##              weekly monthly quarterly yearly
##              -0.39%   0.10%     0.65%  1.06%

With a few lines of code quantmod allowed me to define a model and evaluate how it works with out of sample data. This is a great tool for prototyping models quickly and getting an intuition if they will work or not.

Example with a different model

There are multiple methods supported in quantmod (you can see the complete list in the docs of buildModel). I ll try one more and see how they compare to random forest. To simplify the code, I define a helper function, so far all I change is the method and I use the default parameters of each model.

fit_model <- function(y, X, method, training.per, trade.dates){
  model = specifyModel(y ~ X)
  model = buildModel(model, method, training.per)
  trades = tradeModel(model, exclude.training=TRUE, plot.model=TRUE, 
                      trade.dates=trade.dates)
  trades
}

Here is the fit using a linear model.

traning.per = c('2018-03-01', '2019-03-31')
trade.dates = c("2019-04-01", "2019-12-01")

lm_trades = fit_model(y, X, method='lm', traning.per, trade.dates)
lm_trades
## 
##   Model:  lm1592954717.451 
## 
##   C.A.G.R.:  15.77%  H.P.R.:  15.03% 
## 
##   Returns by period summary:
## 
##             weekly monthly quarterly yearly
##     Max.     4.89%   6.59%     8.68% 14.72%
##     3rd Qu.  1.51%   3.70%     6.52% 14.72%
##     Mean     0.40%   1.77%     4.73% 14.72%
##     Median   0.01%   2.01%     4.36% 14.72%
##     2rd Qu. -0.57%  -0.17%     2.76% 14.72%
##     Min.    -1.63%  -2.72%     1.15% 14.72%
## 
##   Period to date returns:
## 
##              weekly monthly quarterly yearly
##              -0.02%  -2.72%     1.15% 14.72%

Next I use an SVM with a radial kernel. SVM is able to capture non-linear functions. It’s able to capture more complex functions compared to a linear model.

library(e1071)
svm_trades = fit_model(y, X, method='svm', traning.per, trade.dates)
svm_trades
## 
##   Model:  svm1592954717.5202 
## 
##   C.A.G.R.:  12.10%  H.P.R.:  11.54% 
## 
##   Returns by period summary:
## 
##             weekly monthly quarterly yearly
##     Max.     3.62%   3.89%     4.49% 11.24%
##     3rd Qu.  0.97%   1.52%     4.32% 11.24%
##     Mean     0.31%   1.35%     3.62% 11.24%
##     Median   0.40%   1.28%     4.14% 11.24%
##     2rd Qu. -0.38%   1.00%     3.18% 11.24%
##     Min.    -3.36%  -0.59%     2.23% 11.24%
## 
##   Period to date returns:
## 
##              weekly monthly quarterly yearly
##               1.53%   0.73%     2.23% 11.24%

In this simple example the lm and svm models are showing much better results than randomForest. It is to be expected tree based models not to do great for this problem, at least with the features I selected.

Next I’ll plot the cumulative performance of each trading strategy to get an intuition of how the models perform. It’s also intersting to see how this compares to a buy-hold strategy.

library(PerformanceAnalytics)
## 
## Attaching package: 'PerformanceAnalytics'
## The following objects are masked from 'package:e1071':
## 
##     kurtosis, skewness
## The following object is masked from 'package:graphics':
## 
##     legend
all_trades = list(rf=trades, svm=svm_trades, lm=lm_trades)
daily_returns = lapply(all_trades, function(mdl) {
  ret = mdl$return@returnsBy[, "daily"]
})
daily_returns = do.call(cbind, daily_returns)
daily_returns = merge(R, daily_returns)

daily_returns = daily_returns[complete.cases(daily_returns), ]
names(daily_returns) = c("QQQ", names(all_trades))

head(daily_returns)
##                      QQQ            rf           svm            lm
## 2019-04-03  0.0001088594 -0.0001088594  0.0001088594  0.0001088594
## 2019-04-04 -0.0006527498 -0.0006527498 -0.0006527498 -0.0006527498
## 2019-04-05  0.0022796298  0.0022796298  0.0022796298  0.0022796298
## 2019-04-08  0.0041766327  0.0041766327  0.0041766327  0.0041766327
## 2019-04-09 -0.0001084228 -0.0001084228 -0.0001084228 -0.0001084228
## 2019-04-10  0.0040601992 -0.0040601992  0.0040601992  0.0040601992

Each strategy returns:

table.AnnualizedReturns(daily_returns)
##                              QQQ     rf    svm     lm
## Annualized Return         0.0457 0.0160 0.1733 0.2288
## Annualized Std Dev        0.1166 0.1166 0.1161 0.1158
## Annualized Sharpe (Rf=0%) 0.3922 0.1369 1.4924 1.9751

Cumulative performance:

charts.PerformanceSummary(daily_returns, wealth.index=TRUE, geometric=FALSE)