Interrogating Industry Big Data Assumptions


It is clear that ANI is only as good as its data set. There is a good reason that the majority of the time spent by computer scientists working in machine learning goes to securing, cleaning, and modifying the research and training data sets an algorithm or network will work through. On one hand, the advances in “big data” have no doubt launched machine learning research into a new era of accuracy and capacity. The authors of Deep Learning accurately claim that “As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.”[4] This is a striking statistic, and it shows the capability big data has given the field of AI.
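
To put that rule of thumb in perspective, here is a quick back-of-the-envelope sketch in Python; the 1,000-category task is an assumed example for illustration, not a figure from the book.

```python
# A rough reading of the quoted rule of thumb (the 1,000-category task
# is an assumed example for illustration, not a figure from the book).
EXAMPLES_PER_CATEGORY = 5_000     # rough threshold for "acceptable performance"
HUMAN_LEVEL_TOTAL = 10_000_000    # rough threshold to "match or exceed human performance"

num_categories = 1_000  # e.g., an ImageNet-scale classification task
acceptable_total = num_categories * EXAMPLES_PER_CATEGORY

print(f"Acceptable performance: ~{acceptable_total:,} labeled examples")
print(f"Human-level performance: ~{HUMAN_LEVEL_TOTAL:,} labeled examples")
```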

However, big data is not as much of a magic bullet for AI development as the authors seem to imply. Beyond the many ethical concerns raised by big data, including bias, privacy, and transparency, it is not at all clear that the gains made by enlarging data sets will escape diminishing returns in terms of AI advancement. Skepticism about the magic bullet of big data stems largely from the fact that correlation does not always equal causation. If the field of deep learning’s strategy is simply to keep increasing the size of its data sets, there is a real chance that all it will accomplish is increasing the amount of “noise” in its computations. There will inevitably be a tipping point where more data actually obscures an algorithm’s goal.

In his article “The Hidden Risk of AI and Big Data,” computer scientist Vegard Flovik examines this problem of diminishing returns, arguing that “With enough data, computing power and statistical algorithms patterns will be found. But are these patterns of any interest? Not all of them will be, as spurious patterns could easily outnumber the meaningful ones.”[5] This is just one technical example of why big data may not be the solution to unlocking AGI that so many in industry have claimed it to be. At the very least, there must be greater intentionality and nuance in the use and curation of data, rather than a blind assumption that a larger data set will automatically usher in a breakthrough in AGI.
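
Flovik’s point is easy to demonstrate. The following minimal sketch (a made-up simulation, not taken from his article) generates thousands of features that are pure noise and an equally random target, then counts how many of those features nonetheless look “strongly” correlated with it by chance alone.

```python
# Minimal simulation of spurious correlations: every feature is pure noise,
# yet some of them still appear meaningfully correlated with the target.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100       # observations
n_features = 10_000   # candidate "signals", all of them random noise

X = rng.normal(size=(n_samples, n_features))  # random features
y = rng.normal(size=n_samples)                # random target, unrelated to X

# Pearson correlation of each feature with the target
X_c = X - X.mean(axis=0)
y_c = y - y.mean()
corr = (X_c * y_c[:, None]).sum(axis=0) / (
    np.sqrt((X_c ** 2).sum(axis=0)) * np.sqrt((y_c ** 2).sum())
)

# Count features that look "strongly" related even though none actually are
spurious = int(np.sum(np.abs(corr) > 0.3))
print(f"{spurious} of {n_features} pure-noise features exceed |r| > 0.3")
```

With these (arbitrary) numbers, on the order of a couple dozen noise features typically clear the threshold, and the more features scanned, the more spurious “patterns” turn up, which is exactly the effect Flovik describes.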

A similar caution should be leveled against the optimism surrounding the continued exponential growth of processing power and the advancement of the hardware on which AI software will run. In 1965, Gordon Moore, co-founder of Intel, postulated that the number of transistors on an integrated circuit would double roughly every two years. Moore’s Law, as it came to be called, has largely held steady. The authors of Deep Learning posit that the “growth [of deep learning capability] is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades.”[6]
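
To see what that doubling rule implies, here is a short back-of-the-envelope sketch; the 1971 baseline (the Intel 4004 at roughly 2,300 transistors) is an assumed reference point, not a figure from the book.

```python
# Back-of-the-envelope projection of a strict "double every two years" rule.
# The 1971 Intel 4004 baseline (~2,300 transistors) is an assumed reference point.
BASE_YEAR = 1971
BASE_TRANSISTORS = 2_300

def projected_transistors(year: int) -> float:
    """Project transistor counts under an idealized two-year doubling rule."""
    doublings = (year - BASE_YEAR) / 2
    return BASE_TRANSISTORS * 2 ** doublings

for year in (1971, 1991, 2011, 2025):
    print(f"{year}: ~{projected_transistors(year):,.0f} transistors")
```

Under the idealized rule, the projection reaches hundreds of billions of transistors per chip by 2025; whether physics and economics actually allow that pace to continue is exactly the question raised below.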

However, this statement about the continuation of the trend rests on the tenuous assumption that Moore’s Law will continue indefinitely. Unfortunately, throughout 2018 and 2019 a growing number of academics and engineers have claimed that Moore’s Law is dead, or that it will end by 2025 at the very latest.[7] The field of machine learning seems to be making the switch to lean more on the processing power that already exists rather than pin its future on processing power that may never exist. Regardless, it is important to be appropriately skeptical of claims that the trend of increased processing power will continue indefinitely, as the conclusion of Deep Learning appears to suggest.