Predicting the Stock Market with Machine Learning. Findings.

Original article can be found here (source): Artificial Intelligence on Medium

Vhinny Investing

Identifying High Income Growth Stocks with Machine Learning

Image by teerapon

In this article, I present my key findings from predicting the next year’s income of S&P 500 companies between 2014 and 2019. While it is not required, I encourage the reader to go through the earlier articles in this series to see my path towards the observations I share here.

Let’s get started!

Know What the Model is Doing

The key to designing an excellent machine learning model is to know exactly what it is doing. Reaching this understanding may be easy when working with something like Regression and incredibly difficult when working with Deep Learning Neural Networks. Nevertheless, the one in charge should always aim to understand how the model is thinking. Without that understanding, the model’s predictions aren’t worth anything.

What my Model is Doing

In the previous article, I reduced the number of features used in the original model from 100 to 9 without sacrificing model performance. The remaining features were:

  • Return on Capital
  • Earnings per Diluted Share
  • Return on Tangible Assets
  • Dividends to Earnings
  • EPS Growth
  • Cash
  • Operating Margin
  • Free Cash Flow to Earnings
  • Non Operating Expense

I used Random Forest feature importance to determine which features should stay and which should be discarded. While feature importances helped me understand how much weight each variable had in predicting the score, they did not tell me how exactly each feature affected that score. Should the Operating Margin be high or low? Do we look for high Dividends to Earnings or low? How much cash should there be?
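To make the gap between importance and direction concrete, here is a minimal sketch using scikit-learn on synthetic data. It is not the model from this series: the feature names, the data and the label rule are all made up for illustration, and keeping the top two features is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration, not the article's dataset).
rng = np.random.default_rng(0)
n_samples = 2000
feature_names = ["signal_a", "signal_b", "noise_a", "noise_b", "noise_c"]
X = rng.normal(size=(n_samples, len(feature_names)))

# Only the first two columns drive the label; signal_b pushes it DOWN.
y = (X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=n_samples) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)

# Keep the top-ranked features (the cut-off of 2 is hypothetical).
kept = ranked[:2]

for name in ranked:
    print(f"{name:10s} {importances[name]:.3f}")
print("kept:", kept)
```

Both signal features rank at the top, yet nothing in the numbers reveals that signal_b lowers the label while signal_a raises it. That missing sign is exactly what the questions above are asking for.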

Finding Direction

One method that adds direction to feature importances in tree-based algorithms relies on SHAP (SHapley Additive exPlanations) values, which attribute each individual prediction to the features that produced it.

SHAP values for my model are presented below. Please note that I’ve removed Non Operating Expense from the features listed above, as it didn’t add tangible value to model performance, which leaves 8 features.

The X axis on this plot measures the impact each feature has on the prediction score. The Y axis shows all features used to generate a prediction score. The color bar shows whether the value of that feature is high or low. For example, high dividends to earnings encourage the model to make a positive prediction (red color, positive SHAP value). High return on tangible assets, however, encourages the model to make a negative prediction (red color, negative SHAP).
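For readers who want to see where the sign of a SHAP value comes from, below is a small sketch that computes exact Shapley values by brute force. It is not the SHAP library and not my model: the two-feature linear scoring function, its weights and the background rows are hypothetical, chosen only so that the signs match the two examples above. Features that are "absent" from a coalition are filled in from the background rows and averaged, which is one common convention and an assumption here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x (feasible only for a few features)."""
    n = len(x)

    def value(subset):
        # Average prediction with features in `subset` fixed to x
        # and the remaining features taken from each background row.
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy score: feature 0 = dividends to earnings,
# feature 1 = return on tangible assets. Weights are made up.
def predict(f):
    return 0.5 + 0.3 * f[0] - 0.2 * f[1]


background = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x = [2.0, 2.0]  # a sample where both features are high

phi = shapley_values(predict, x, background)
base_value = sum(predict(row) for row in background) / len(background)
prediction = predict(x)

print("SHAP-style contributions:", phi)
print("base + contributions:", base_value + sum(phi), "prediction:", prediction)
```

The high first feature receives a positive contribution and the high second feature a negative one, which is the red-dot-on-the-right versus red-dot-on-the-left pattern described above. The contributions also add up to the gap between the average prediction and this sample's prediction.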

This plot reveals relationships that might not be intuitive to an intelligent investor. It says that companies that double their income in the following year should have low return on capital, low return on tangible assets and low operating margin — all of which are indicators of poor business performance.

One Step Further

SHAP value analysis allows me to look at how each feature impacts predictions at an individual sample level. Let’s take a look at the top 2 correct predictions my model has made.

Top 2 Positive Predictions

Using 2013 data, the model predicted that Starbucks (SBUX) would double its income in the following year. Let’s look at it in detail. All 8 predictors are in the red, positively contributing to the prediction score, pushing it all the way up to 0.96.

As expected from the “Finding Direction” section, we have suspiciously low Return on Capital (0.001), yet suspiciously high Free Cash Flow to Earnings (520) and Dividends to Earnings (76).

These ratios are not within the normal range and encourage further investigation. The company had enough cash to pay the usual dividends yet didn’t have much income. Let’s see how this fits with the global picture by looking at Vhinny’s Historical Ratios & Data.
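A short sketch shows why a ratio like Free Cash Flow to Earnings can leave the normal range so dramatically. The numbers below are hypothetical round figures for an imaginary company, not Starbucks' financials, and the income calculation ignores taxes and everything else except a single one-time charge.

```python
def free_cash_flow_to_earnings(free_cash_flow, net_income):
    """Ratio of free cash flow to net income (undefined when net income is 0)."""
    if net_income == 0:
        raise ValueError("ratio is undefined for zero net income")
    return free_cash_flow / net_income


# Hypothetical figures in $M; taxes and other items are ignored for simplicity.
operating_profit = 1200.0
free_cash_flow = 1000.0  # cash generation is the same in both years

net_income_normal = operating_profit  # no one-time charge
net_income_hit = operating_profit - 1190.0  # large one-time, non-cash charge

ratio_normal = free_cash_flow_to_earnings(free_cash_flow, net_income_normal)
ratio_hit = free_cash_flow_to_earnings(free_cash_flow, net_income_hit)

print(f"normal year: {ratio_normal:.2f}")
print(f"hit year:    {ratio_hit:.2f}")
```

The business produces the same cash in both years, but the one-time charge shrinks the denominator and the ratio jumps by two orders of magnitude. Any ratio with earnings in the denominator, such as Dividends to Earnings, behaves the same way.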

Indeed, net income reported in 2013 was $8.3M, which is a large drop compared to $1.4B and $2.1B in 2012 and 2014 respectively. Going one step further, one would find a $2.8B litigation charge Starbucks recorded that year. While $2.8B was a heavy hit, Starbucks’ earning potential wasn’t hurt. High Free Cash Flow to Earnings showed that the business made money nevertheless and carried on with normal operation. Sure enough, net income returned to normal the following year, proving my model’s prediction to be correct.
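The "doubles its income next year" outcome can be checked against the three net income figures quoted above. The labelling rule in this sketch is my assumption for illustration: a year is labelled positive when the following year's net income is at least twice the current one, and years with non-positive income or no following year are skipped. The function name is hypothetical.

```python
def doubles_next_year(income_by_year):
    """Label each year True if the next year's income is at least double."""
    labels = {}
    for year in sorted(income_by_year):
        current = income_by_year[year]
        following = income_by_year.get(year + 1)
        # Skip years without a following year or with non-positive income,
        # where "doubling" is not well defined (an assumption of this sketch).
        if following is None or current <= 0:
            continue
        labels[year] = following >= 2 * current
    return labels


# Net income figures quoted in the text, in dollars.
sbux_net_income = {2012: 1.4e9, 2013: 8.3e6, 2014: 2.1e9}

labels = doubles_next_year(sbux_net_income)
print(labels)  # {2012: False, 2013: True}
```

Under this rule, 2013 is the positive example: income in 2014 is far more than twice the depressed 2013 figure, even though 2014 is only modestly above 2012.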

What Makes for a Negative Prediction

Having seen how the model makes confident positive predictions, let’s take a look at how it makes confident negative predictions. Below are the SHAP values for the top 2 negative predictions.

Top 2 Negative Predictions

For these two examples, all 8 predictors are in the blue, negatively affecting the predicted score. In fact, the predicted score for both these examples is 0.00 — as low as it can be.

Looking at Home Depot (HD), one can see a good Return on Tangible Assets (0.20) and a good Return on Capital (0.26), as well as Free Cash Flow to Earnings of about 1.41, which is in the normal range. Similar observations can be made for Helen of Troy (HELE).

These three metrics define HD and HELE as profitable companies generating high returns. Investors generally like this type of company due to its strong business performance. Why does my model believe that a good business would not double its income in the following year?

While it is not impossible for good companies to double their annual income, the data suggests that, in general, companies that have strong business characteristics grow gradually and confidently. Practically speaking, it would take twice as many people building houses around the world for Home Depot to double its earnings. With $80B in annual sales, it is just not likely to happen under normal circumstances. For the model to predict a drastic increase in earnings, there needs to be something in the outlier zone in the data, which is what we saw with the confident positive examples.


In this article, I’ve described how the model I’ve built to predict next year’s income makes its decisions. The results suggest that good businesses under normal operating conditions are not likely to double their earnings in the following year. At the same time, businesses that lost their income due to a “one time” event are likely to get back on their feet and continue their operations as normal. The model was able to identify such companies by analyzing how much cash the company made in a given year in the context of its income.

I encourage readers who, like me, are looking to find signal in the data and conquer the stock market to take my results as ideas and conduct their own study. For anyone interested in this particular study, a good way to advance my findings would be to look at how company size affects the probability of a business getting back to normal after it takes a hit.

This article concludes this series on predicting the next year’s income using fundamental financial data. I hope the reader finds the insights I’ve shared here valuable and takes advantage of them in their own stock selection process.

I’m moving on to studying the stock market crash of 2020. See you in the next one!

Let’s Connect!

I’m happy to connect with people who share my path, which is the pursuit of financial independence. If you are also pursuing financial independence, or if you’d like to collaborate, bounce ideas or exchange thoughts, feel free to reach out!