Original article was published on Artificial Intelligence on Medium
How Related Are Data Science Subfields?
As I was examining the trend lines, it occurred to me that they show much less correlated movement than I expected, given how tightly related many of these search terms intuitively are. Some fields took off faster than others, and they did so at different times.
I decided to run a correlation analysis to examine these relationships, or the lack thereof, more formally. The raw search interest data was detrended using first-order differencing, resulting in a stationary series. Initially I used Pearson’s correlation coefficient and noticed a number of coefficients in the 0.6–0.7 range. Upon further examination, these turned out to be mostly due to outliers. As the relationship might also be non-linear, I decided to use Kendall’s tau rank coefficient instead of Pearson’s. Rank coefficients such as Kendall’s tau are more robust to outliers and capture any monotone relationship, not just a linear one (see technical note 2). The trade-off is that Kendall’s tau is somewhat less powerful.
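To make the two steps above concrete, here is a minimal sketch of first-order differencing followed by both coefficients. It assumes NumPy and SciPy; the function names `first_difference` and `compare_coefficients` are made up for illustration and are not the code used for the analysis in this article.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr


def first_difference(series):
    """Detrend a series by first-order differencing: d[t] = x[t] - x[t-1]."""
    return np.diff(np.asarray(series, dtype=float))


def compare_coefficients(x, y):
    """Return (Pearson r, Kendall tau) computed on the differenced series."""
    dx = first_difference(x)
    dy = first_difference(y)
    r, _ = pearsonr(dx, dy)       # measures linear association
    tau, _ = kendalltau(dx, dy)   # measures monotone (rank) association
    return r, tau
```

When the changes in one series are a monotone but non-linear function of the changes in the other, Kendall’s tau reports a perfect relationship while Pearson’s coefficient falls short of 1. The same rank-based logic is why a single extreme week can move Pearson’s coefficient a lot while barely changing tau.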
The correlations were computed using this correlation coefficient calculator I’ve coded, which also provides accompanying statistics. The p-values for many of the correlations were very small and the 95% confidence intervals quite narrow, suggesting genuine relationships.
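For readers who want to reproduce this kind of output, the sketch below shows one way to get a p-value and a 95% interval for Kendall’s tau. It is not the calculator mentioned above, whose method is not described here. It assumes SciPy’s `kendalltau` for the p-value and a simple percentile bootstrap for the interval; the function name `kendall_with_bootstrap_ci` and its parameters are hypothetical. The bootstrap treats observations as roughly independent, which differenced search data only approximates.

```python
import numpy as np
from scipy.stats import kendalltau


def kendall_with_bootstrap_ci(x, y, n_resamples=1000, level=0.95, seed=0):
    """Return (tau, p_value, (lower, upper)) for paired samples x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    tau, p_value = kendalltau(x, y)

    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample (x, y) pairs with replacement, keeping each pair together.
        idx = rng.integers(0, n, size=n)
        boot[i], _ = kendalltau(x[idx], y[idx])

    # Percentile interval; nanpercentile ignores degenerate resamples.
    alpha = (1.0 - level) / 2.0
    lower, upper = np.nanpercentile(boot, [100 * alpha, 100 * (1 - alpha)])
    return tau, p_value, (lower, upper)
```

A narrow interval that excludes zero, together with a small p-value, is the pattern described above as evidence of a genuine relationship.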
The resulting correlation matrix shows that search interest for the different topics is not nearly as related as one might expect. The largest correlation is 0.56 and it is between ‘deep learning’ and ‘machine learning’. A few others are in the 0.4–0.5 range, but most are lower.
It seems ‘machine learning’ correlates the most with all other search topics, whereas ‘business analytics’ correlates the least. ‘data science’ shows positive correlations across the board, but these are slightly weaker compared to those of ‘machine learning’. Search interest in ‘artificial intelligence’ exhibits low correlation with the rest, and so does the interest in ‘reinforcement learning’. The remainder show coefficients mostly in the 0.2–0.4 range.
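A matrix like the one discussed above can be built and summarized in a few lines. The sketch below assumes pandas (which relies on SciPy for the Kendall method) and a DataFrame with one column of search interest per topic. The helper names and the synthetic `demo` data are illustrative only; they are not the Google Trends data behind the numbers in this article.

```python
import numpy as np
import pandas as pd


def kendall_matrix(interest):
    """Kendall's tau between the first-differenced columns of `interest`."""
    changes = interest.diff().dropna()
    return changes.corr(method="kendall")


def off_diagonal(corr):
    """Blank out the diagonal (each topic's correlation with itself)."""
    mask = ~np.eye(len(corr), dtype=bool)
    return corr.where(mask)


def strongest_pair(corr):
    """Return (topic_a, topic_b, tau) for the largest off-diagonal entry."""
    values = off_diagonal(corr).to_numpy()
    i, j = np.unravel_index(np.nanargmax(values), values.shape)
    return corr.index[i], corr.columns[j], values[i, j]


def mean_correlation(corr):
    """Average correlation of each topic with all others, highest first."""
    return off_diagonal(corr).mean(axis=1).sort_values(ascending=False)


# Synthetic stand-in data: topic_a and topic_b share a common driver,
# topic_c moves independently of both.
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
demo = pd.DataFrame({
    "topic_a": np.cumsum(shared + 0.3 * rng.normal(size=200)),
    "topic_b": np.cumsum(shared + 0.5 * rng.normal(size=200)),
    "topic_c": np.cumsum(rng.normal(size=200)),
})
corr = kendall_matrix(demo)
```

Calling `strongest_pair(corr)` picks out the most related pair of topics, and `mean_correlation(corr)` ranks topics by how strongly they move with the rest, which is the kind of reading applied to ‘machine learning’ and ‘business analytics’ above.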
Data science topics exhibit impressive growth in interest over the past decade. ‘data science’ and ‘deep learning’ in particular exhibit astonishing levels of growth between 2010 and 2020. In just a few short years they make the journey from near non-existence to leading topics in terms of search interest.
That said, it seems interest in most data science topics has already peaked. For some the peak came as early as 2015, while for others it came considerably later. Though we might see a recovery in the rest of 2020, the trend of rapid growth for ‘data science’, which was established in 2013, is surely broken.
A correlation analysis found interest in most subfields to be loosely to moderately related, with the highest correlation coefficients being around 0.5. This shows that the data science field is quite diverse, and that each subfield has more or less a life of its own, with its own periods of glory and decline.