Do Not Fall into These Financial Back-Testing Traps.

Original article was published by Sofien Kaabar on Artificial Intelligence on Medium

Overfitting and Underfitting

After reading the brilliant work of Dr. Marcos Lopez de Prado on biases a few months ago, I have come to understand that there is always something that escapes us, no matter how perfect we think the analysis was.

Overfitting remains unfortunately very common even among professional practitioners. It is a difficult obstacle to overcome, and the only way to avoid it is by respecting a few guidelines that will be outlined later. But we should understand first what overfitting is.

Overfitting is when a model forecasts data using a relationship that is so close to the past values that it fails to account for the general existing relationship. Thus, the results will be good on the training period but disappointing on the testing period. Underfitting is the opposite of overfitting: it is the equivalent of a model that has not done its homework to fully understand the data. We have what we call a bias problem and a variance problem, which are both considered “fit” issues:

  • Bias is also known as underfitting and this is simply when the model encounters a signal and thinks it is noise.
  • Variance is also known as overfitting and this is simply when the model encounters noise and thinks it is a signal.

The problem in real life is that by reducing one, you increase the other, so you are making a trade-off between the two. You want to find the right balance between them. This is called the Bias-Variance Trade-Off.

How can we fix Overfitting?

  • Decreasing the model’s complexity by removing layers from its calculation methodology or by removing some of the explanatory variables used to describe the dependent variable.
  • Decreasing the training period so that the model does not exactly fit the training data.
  • If you are using neural networks, consider a dropout function and an early stopping technique.
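Early stopping, mentioned in the last point, can be sketched in a few lines of Python. This is a minimal illustration, not a particular library's implementation: the function name, the `patience` parameter, and the validation losses are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: stop training
    return best_epoch


# Hypothetical validation losses: they fall, then rise as the model overfits.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping_epoch(losses, patience=3))
```

The idea is that the model is kept at the point where it generalized best on held-out data, rather than at the point where it fit the training data best.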

How can we fix Underfitting?

  • Increasing the model’s complexity by adding layers to its calculation methodology or by adding variables that may explain the nature of the relationship between the explanatory variables and the dependent variable.
  • Increasing the training period so that the model has more data to work with and can make predictions based on sufficient information.

Naturally, when we back-test our model, we want it to use its knowledge on unseen data, so we split our historical data into two sets. The first is the training set, which the model analyzes. The second (more recent) is the testing set, on which the implied relationship is evaluated.

In other words, the model has understood the relationship by looking at the training set and is now expecting it to continue on the test set, so, it will predict on the test set.
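As a minimal sketch of this split, assuming the data is already sorted from oldest to newest, the series is cut at a single point and never shuffled. The function name and the price values are illustrative only.

```python
def chronological_split(series, test_size):
    """Split an ordered series into (train, test) without shuffling.

    The last `test_size` observations (the most recent ones) form the test set.
    """
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    split_at = len(series) - test_size
    return series[:split_at], series[split_at:]


# Hypothetical daily closing prices, oldest first.
prices = [100.0, 101.5, 101.0, 102.3, 103.1, 102.8, 104.0, 105.2, 104.9, 106.1]
train, test = chronological_split(prices, test_size=3)
print(len(train), len(test))
```

Keeping the order intact matters here: a random shuffle would let observations from the future leak into the training set.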

Summary Table. (Image by Author)

If we want to be technically correct, the only out-of-sample period is the one we haven’t seen yet (i.e. the future), but for now, we can consider the test set to be out-of-sample. Overfitting and Underfitting occur during the in-sample period and we see the disappointing results on the out-of-sample period.

Forgetting Transaction Costs

Something you should always be aware of is that back-testing results are mostly false. You are likely to never get a good estimate of future results except perhaps by luck. You cannot accurately estimate the actual fees, spreads, slippage, and any other unexpected events that will occur during live execution and therefore when including a proxy of these costs in your back-tests, it is always helpful to bias them upwards.

For instance, let us assume that the average historical bid-ask spread on the USDCAD pair given by your broker is 0.6 pips. The best thing to do is to suppose that the actual spread is at least the historical average plus a margin for all the unexpected costs. This is upward biasing because we know that, over time, bid-ask spreads are getting more competitive (i.e. smaller), which is a positive thing regarding market efficiency, and yet we still assume a cost above the historical average.

A more detailed example in the following table:

The best argument for biasing the costs upwards is to escape from unpleasant surprises during live trading as well as to test the robustness of the model when encountering a volatile environment.
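A minimal sketch of upward cost biasing, working in pips: the 0.6-pip average spread comes from the example above, while the 0.4-pip safety margin, the trade results, and the function name are assumptions made purely for illustration.

```python
def net_results_in_pips(gross_pips, avg_spread_pips, safety_margin_pips):
    """Subtract an upward-biased round-trip cost from each trade's gross result.

    The assumed cost per trade is the historical average spread plus a
    safety margin for slippage and other unexpected costs.
    """
    assumed_cost = avg_spread_pips + safety_margin_pips
    return [gross - assumed_cost for gross in gross_pips]


# Hypothetical gross results of four trades, in pips.
gross = [5.0, -3.0, 2.0, 0.5]
net = net_results_in_pips(gross, avg_spread_pips=0.6, safety_margin_pips=0.4)
print(sum(gross), sum(net))
```

Even with only four trades, most of the gross profit disappears once one pip per trade is charged, which is why short-term strategies are so sensitive to this assumption.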

The disadvantage of doing so is that many short-term models will get filtered out. For instance, models that run on M5 and M15 time frames are more sensitive to costs than models that run on hourly time frames, and thus cost management is imperative for such models to provide consistent results.

However, if your model depends on maximizing the accuracy of expected transaction costs, then it is helpful to know that they have been proven to be non-linearly correlated with certain variables such as actual volatility. A more simplistic approach would be to run a regression using past variables to explain the historical costs and to assume that the relationship will hold over a short time frame. The back-tests will use just a small number of these performance metrics and the transaction costs will be arbitrary. The conclusion of this point is that you should never run a back-test without incorporating transaction costs. In the articles I publish, I always include bid-ask spreads even when I do not mention them. If you are interested in seeing some of my back-tests, consider this article:

Look-Ahead Bias

This bias is well known in the field of back-testing and research, and although it is considered more of a rookie mistake, it must never be forgotten. Look-ahead bias is when you use information that would not yet have been available at the time of the prediction. Consider a strategy that relies on the daily closing values of the S&P500 to make predictions. When designing the model, you erroneously use today’s close to predict today’s close. In other words, you are using future information to try to predict the future, which is of course unfeasible.

The table below shows quarterly GDP data for the United Kingdom in 2015

Notice that the third quarter GDP, which refers to the period between 01/07/2018 and 30/09/2018, is published on 09/11/2018, hence a lag of more than a month. Any researcher should be careful not to align these figures with other figures that actually do get released on 30/09/2018. Using data on a date when it would not have been available is referred to as look-ahead bias. Failing to make the appropriate adjustments will make the whole analysis erroneous, unrealistic, and impossible to reproduce. Economic data particularly suffers from this problem and adjustments should be made to take into account the lag.
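One way to respect the publication lag is to index each figure by the date it was published rather than by the period it describes. The sketch below assumes a list of (period end, publication date, value) records. Only the third-quarter dates come from the text above; the second-quarter publication date and both values are placeholders, not real GDP figures.

```python
from datetime import date

# (period end, publication date, value). Values are placeholders, and the
# Q2 publication date is hypothetical; only the Q3 dates come from the text.
releases = [
    (date(2018, 6, 30), date(2018, 8, 10), 0.4),
    (date(2018, 9, 30), date(2018, 11, 9), 0.6),
]


def latest_known_value(releases, as_of):
    """Return the most recently published value available on `as_of`, or None."""
    known = [r for r in releases if r[1] <= as_of]
    if not known:
        return None
    return max(known, key=lambda r: r[1])[2]


# On 30/09/2018 the Q3 figure does not exist yet, so Q2 is the latest known.
print(latest_known_value(releases, date(2018, 9, 30)))
print(latest_known_value(releases, date(2018, 11, 9)))
```

A model trained on features built this way only ever sees what a trader could have seen on that day.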

With some governmental data, changes and updates can be made, thus introducing errors into the model. For instance, the above GDP table has had many revisions before finally settling on the above values. Data updates also mean that the correct information might come even later than expected. After correcting for look-ahead bias, we may find ourselves facing a newly revised value that could alter predictions and tamper with the training of the model. A possible and naïve way to correct for this issue is to use only the preliminary releases and disregard any updates. Since the updates will only be known at a later time, it is useless to incorporate them into the model. We deal with what we have now.

Not Accounting for Market Regime Changes

Most of the time, markets are either trending or ranging and while we can develop strategies for both market regimes, it is difficult to find one strategy that is able to capture the change in the regime and adapt itself all while continuing to be profitable. This Strategy-of-Everything is unlikely to exist at the moment as financial time series are highly complex and dynamic.

When performing a back-test, you have to define what type of strategy you have and on which regimes (states) you will be testing it. Let us look at an example. Consider a strategy that goes long (initiates a buy order) each time we have three consecutive lower lows and goes short (initiates a sell order) each time we have three consecutive higher highs. It is clear that this is a contrarian strategy better suited to a ranging market.
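The signal rule of this contrarian strategy can be sketched as follows. This is one possible reading of the rule, not the exact code behind any back-test: the function name is hypothetical, a signal is repeated while the streak continues, and a buy takes precedence if both conditions hold on the same bar.

```python
def contrarian_signals(lows, highs, streak=3):
    """Return a list with +1 (buy), -1 (sell) or 0 (no signal) per bar.

    +1 after `streak` consecutive lower lows, -1 after `streak`
    consecutive higher highs.
    """
    signals = [0] * len(lows)
    lower_lows = 0
    higher_highs = 0
    for i in range(1, len(lows)):
        # Count consecutive lower lows / higher highs; reset when broken.
        lower_lows = lower_lows + 1 if lows[i] < lows[i - 1] else 0
        higher_highs = higher_highs + 1 if highs[i] > highs[i - 1] else 0
        if lower_lows >= streak:
            signals[i] = 1
        elif higher_highs >= streak:
            signals[i] = -1
    return signals


# Hypothetical bars: three lower lows, then three higher highs.
lows = [10.0, 9.0, 8.0, 7.0, 7.5, 8.0, 8.5]
highs = [11.0, 10.0, 9.0, 8.0, 8.5, 9.0, 9.5]
print(contrarian_signals(lows, highs))
```

In a steadily trending market this rule keeps firing against the trend, which is exactly why it is better suited to a range.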

Now, if we apply it on a trending market such as the S&P500, what would be the result? I will give you a hint: Bad.

The S&P500 Index. A Mostly Trending Market.

The answer is clear, trending markets require trend-following strategies. But what about markets that alternate between ranging and trending? In that case, we need other tools to approximate the current state and use a proper strategy. This can get complicated very fast. Here is an example of a trend-following strategy I like to use from time to time:

Non-Stationary Data

Stationarity is often equated with a constant mean over time. A changing mean will cause the model to produce erroneous forecasts. More precisely, a time series is stationary if it has a constant mean and a constant variance (volatility), that is, neither changes much over time.

In other more technical terms, stationarity is when prices diffuse at a slower rate than a geometric random walk. Financial data have too much noise and differencing or taking log returns will make them almost stationary at the cost of losing their memory, but that is the best we have got right now at the very basic level.

Note that we are talking about feeding our machine learning models with inputs to produce a forecast.

A stationary data series with a mean ≈ 0.09. (Image by Author)

As time series (prices per se) exhibit significant autocorrelation in small intervals of time, it is rational to assume that it is quite easy for the model to deliver such good results. When you have a machine learning algorithm that has predictive power, you must use it on stationary data; otherwise, the results will be false.

As time series (Prices) are significantly autocorrelated, the model will always follow the latest value, and therefore, it will likely reproduce the previous value and call it a forecast. When time series are transformed into stationary data either by differencing or by taking returns, this problem more often than not goes away.
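The two transformations mentioned here, differencing and taking returns, are simple to compute. The sketch below uses only the standard library; the function names and the three sample prices are illustrative.

```python
import math


def differences(prices):
    """First differences: price[t] - price[t-1]."""
    return [prices[i] - prices[i - 1] for i in range(1, len(prices))]


def simple_returns(prices):
    """Simple returns: price[t] / price[t-1] - 1."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def log_returns(prices):
    """Log returns: ln(price[t] / price[t-1])."""
    return [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]


# Hypothetical prices.
prices = [100.0, 110.0, 99.0]
print(differences(prices))
print(simple_returns(prices))
print(log_returns(prices))
```

Each transformed series is one observation shorter than the price series, so any explanatory variables need to be aligned accordingly.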

Let us see what happens when we use a normal auto-regression technique to forecast Bitcoin. The first test will use non-stationary (i.e. BTCUSD prices) data and the second test will use stationary data (i.e. BTCUSD returns).

Test #1: Non-Stationary Data

In layman’s terms, we will apply a machine learning model to actual prices (and not returns) of Bitcoin relative to the US dollar (BTCUSD pair) and evaluate our predictions on the out-of-sample dataset. Below are the rules and results of the predictions:


  • Test asset: BTCUSD
  • Model used: Linear regression.
  • Training days: 2221.
  • Testing days: 100.



The R-square means that our model explains 98% of the variations, which corresponds to a 0.99 correlation between the predicted values and the actual values. A utopian model like this does not exist in such a complex world, and whoever is capable of making such a model will be a billionaire in less than a week.

Something must be wrong here, and there is. First, we take a look at the below graph showing the high linear correlation between the predicted and real values using a simple linear regression model that uses past prices as explanatory variables to guide it with finding future prices.

R-square between the predicted and actual values shows a superior modelling power of our algorithm. (Image by author)

However, if we plot the actual and predicted values on a line chart to better visualize the correlation and see how good our model is doing, we see something strange. Indeed, it seems that our model is following the actual values with a lag of one. If today’s values go up, our forecast for tomorrow is also up and vice versa. It appears that we are not really doing anything but repeating yesterday’s news. Not only does this model not have a predictive power but over time the transaction costs will eat any stochastic (random) profits that may come with luck.

Actual values vs predicted values. The model seems to be simply replicating the value of yesterday. (Image by author)
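The same effect can be reproduced without any model at all. The sketch below is an illustration on a simulated random walk, not the BTCUSD data used above: it takes "yesterday's value" as the forecast and measures the R-square on price levels and on price changes. The seed, the series length, and the function name are arbitrary assumptions.

```python
import random


def r_squared(actual, predicted):
    """1 - SSE/SST for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


# Simulated random-walk "prices": each step is pure noise.
random.seed(0)
prices = [1000.0]
for _ in range(2000):
    prices.append(prices[-1] + random.gauss(0.0, 1.0))
changes = [prices[i] - prices[i - 1] for i in range(1, len(prices))]

# Naive forecast: today's value is predicted to equal yesterday's value.
r2_prices = r_squared(prices[1:], prices[:-1])
r2_changes = r_squared(changes[1:], changes[:-1])
print(f"R-square on prices:  {r2_prices:.3f}")
print(f"R-square on changes: {r2_changes:.3f}")
```

On simulated data like this, the first figure is typically close to 1 and the second is typically negative, even though the forecast contains no information about the next move. A high R-square on price levels is therefore not evidence of predictive power.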

Test #2: Stationary Data

The models we use assume that the time series is stationary, which in turn provides a real forecast, that is, the model is actually doing something useful. Evidently, most machine learning models should be used to predict the differences between prices (in the case of asset predictions). Let us now repeat the above experiment with the exact same rules, only this time using returns data (differenced data can also be used). The lagged period is also the same.

The results show that the model now actually does a bit worse than a random walk model, which might suggest that the dataset itself behaves like a random walk. Are the fluctuations of the BTCUSD random? It takes more analysis to answer that question, but for now we can safely say that our algorithm with its current parameters cannot forecast the direction of the asset. This is not surprising, because a simple linear model is unlikely to predict a highly complex market.


  • Test asset: BTCUSD returns.
  • Model used: Linear regression.
  • Training days: 2221.
  • Testing days: 100.


  • R-square: 0.03
  • Accuracy: 49%

The plot of predicted vs actual values shows no correlation between the two whatsoever, giving a stronger conviction of the underperformance of our model. (Image by author)

The returns of BTCUSD were much more volatile than predicted with the linear regression model. The model has been wrong about 51% of the time. (Image by author)

Another measure worth mentioning in the case of a linear model is the R-square. This goodness-of-fit measure is very common in econometrics. It is the percentage of the variation in the dependent variable that is explained by the independent variable(s). Before we introduce the formula (which is very simple), we must mention two calculations: SSE (sum of squared errors) is the part left unexplained by the model, and SST (total sum of squares) is the unexplained plus the explained part. Intuitively, from the formula below we can see that the R-square measures the percentage explained by the model.
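In its standard form, the R-square can be written as below. The notation is an assumption rather than the article's own: y_i are the actual values, \hat{y}_i the predicted values, and \bar{y} the mean of the actual values.

```latex
R^2 = 1 - \frac{SSE}{SST}, \qquad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

For example, the R-square of 0.03 in the second test means that SSE is 97% of SST, so almost all of the variation is left unexplained.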

Focusing too Much on the Hit Ratio

Sure, an 80% hit ratio on your trades is great. But what if you are risking $1 each time to earn $0.20 (20 cents)? Well, then you will lose money and eventually get wiped out because your risk-reward ratio will be 0.2. Suppose you make 100 trades, always with the same position sizing, and you get your 80% hit ratio. That translates to 80 profitable trades, each gaining $0.20.

This gives you a profit of 80 x 0.2 = $16. Alright, not bad but let us see the remaining losing 20 trades which have lost $1 each. This gives you a loss of 20 x 1.0 = $20. Your net profit is therefore -$4.00. Hence, by getting it right 8 out of 10 times, you have managed to lose money. How to fix this?

Risk-reward Trade-off and the Hit Ratio. At a risk-reward Ratio of 1.00, we need 50% to breakeven. (Image by author)

We have to expect at least $1.8 for every $1.0 we are risking. This gives us some wiggle room. With a risk-reward ratio of 1.8, we only need a hit ratio of about 35.71% to break even. Thus, consider evaluating a strategy that had a 40% hit ratio with a risk-reward ratio of 1.82, using the same position sizes. Per trade, for every $1.00 risked:

  • Expected profit = 0.4 x 1.82 = +$0.728
  • Expected loss = 0.6 x 1.00 = -$0.600
  • Net expected profit = 0.728 - 0.600 = +$0.128

To compute the required hit ratio to break even, you can use the following formula:
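Expressed in terms of the risk-reward ratio (reward divided by risk), the breakeven hit ratio is 1 / (1 + risk-reward). The sketch below is a minimal Python version of that relationship together with the expected profit per trade. The function names are illustrative.

```python
def breakeven_hit_ratio(risk_reward):
    """Hit ratio at which the expected profit is zero.

    `risk_reward` is the reward earned on a winner divided by the
    amount lost on a loser.
    """
    return 1.0 / (1.0 + risk_reward)


def expected_profit_per_trade(hit_ratio, risk_reward, risk=1.0):
    """Average profit per trade: winners earn risk * risk_reward, losers cost risk."""
    return hit_ratio * risk_reward * risk - (1.0 - hit_ratio) * risk


for rr in (0.2, 1.0, 1.8):
    print(f"risk-reward {rr}: breakeven hit ratio {breakeven_hit_ratio(rr):.2%}")

# The two cases discussed in this section.
print(expected_profit_per_trade(0.80, 0.20))  # 80% hit ratio, risk-reward 0.2
print(expected_profit_per_trade(0.40, 1.82))  # 40% hit ratio, risk-reward 1.82
```

With a risk-reward ratio of 0.2, the breakeven hit ratio is above 83%, which is why the 80% strategy still loses money.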

Not Taking Into Account Yearly Performances

Look at the below equity curve and tell me what you see. Clearly, it is upward sloping and looks attractive. After all, you started with $1,000 and now have around $4,500. Now, let us zoom in.

An example of a Strategy that has produced positive cumulative returns. (Image by author)

Although it does look good, when we take the years one by one, we find some losing years that are much less attractive and can actually wipe us out if we start trading this strategy at the wrong time. This begs the question, if we were truly trading based on this strategy and had a bad year, would we continue? Unfortunately, we will never know and that is why we need a strategy that wins most of the time (years) and not one that greatly outperforms in a few years but spends most years losing money or being flat.
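Breaking the results down by year takes only a few lines. The sketch below assumes the back-test produces a list of (closing date, profit or loss) pairs. The trades shown are invented for illustration and are not taken from the strategy above.

```python
from collections import defaultdict
from datetime import date


def yearly_pnl(trades):
    """Sum profit and loss per calendar year from (close_date, pnl) pairs."""
    totals = defaultdict(float)
    for close_date, pnl in trades:
        totals[close_date.year] += pnl
    return dict(sorted(totals.items()))


# Hypothetical trades: positive overall, but with one losing year.
trades = [
    (date(2015, 3, 2), 400.0),
    (date(2015, 9, 14), 250.0),
    (date(2016, 5, 20), -300.0),
    (date(2016, 11, 3), -150.0),
    (date(2017, 2, 8), 900.0),
]
by_year = yearly_pnl(trades)
losing_years = [year for year, pnl in by_year.items() if pnl < 0]
print(by_year)
print(losing_years)
```

A cumulative figure hides the losing year entirely, while the yearly view shows what starting the strategy at the wrong time would have felt like.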

We should look at the evolution of the strategy and not just stick to basic performance statistics. The above strategy had a 61.67% Hit ratio but still manages to be somewhat bad.

Not Taking Into Account Risk Management Before Going Live

When you back-test a strategy, you must account for stops and profit orders. In other words, when you apply the strategy in real life, you will place stops and profit orders. You should know that these orders drastically change the performance. Here is the RSI strategy with and without placing fixed stops.

EURUSD Hourly with 14-Hour RSI. (Image by author)

Following a “buy when the RSI(14) touches 20 and sell when the RSI touches 80” strategy, we get the following results for both tests:

Comparison between the two RSI Strategies. (Image by author)

Notice the huge difference between the final results? Even though they are both negative, the one without risk management did much worse. We do not want that to happen when we switch from virtual to real time trading.
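The effect of a fixed stop and profit order on a single long trade can be sketched as follows. This is a simplified illustration, not the code behind the comparison above: the function name, the bar data, and the distances are hypothetical, price gaps through the stop are ignored, and the stop is assumed to be hit first when both levels fall inside the same bar.

```python
def exit_long_trade(entry, bars, stop_distance=None, target_distance=None):
    """Return (exit_price, reason) for a long trade opened at `entry`.

    `bars` is a non-empty list of (high, low, close) tuples following the
    entry. Passing None for a distance disables that order.
    """
    if not bars:
        raise ValueError("bars must contain at least one bar")
    stop = entry - stop_distance if stop_distance is not None else None
    target = entry + target_distance if target_distance is not None else None
    for high, low, close in bars:
        # Conservative assumption: check the stop before the target.
        if stop is not None and low <= stop:
            return stop, "stop"
        if target is not None and high >= target:
            return target, "target"
    return bars[-1][2], "end of data"


# Hypothetical EURUSD bars after a long entry at 1.1000.
bars = [
    (1.1010, 1.0990, 1.1005),
    (1.1015, 1.0975, 1.0980),
    (1.0985, 1.0940, 1.0950),
]
print(exit_long_trade(1.1000, bars, stop_distance=0.0020, target_distance=0.0040))
print(exit_long_trade(1.1000, bars, target_distance=0.0040))
```

In this made-up path the stopped trade exits near 1.0980, while the same trade without a stop is still open at 1.0950, so the two back-tests would report different losses for the same signal.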


There will always be some form of bias in the back-test. Our job as researchers and traders is to minimize these biases so as to maximize the probability that back-tested results are realized. We are all of course familiar with the saying that history does not repeat itself, or that past performance is not a reflection of future profits, but the past is the best we have in our fight against the future. If you manage to incorporate at least some of the above points, then you are likely on the right track. Remember, finding your strategy is not an overnight process. Be patient.
