The Roots of Data Science

Original article was published by Favio Vázquez on Artificial Intelligence on Medium

Thirty-five years after Tukey’s publication, Jeff Wu asked this:

Statistics = Data Science?

In that lecture, he proposed that statistics should be renamed “data science” and statisticians should be named “data scientists”. By today’s standards, we know that statistics alone is not all of data science, but why? Because we also need programming, business understanding, machine learning, and more (but more on that soon).

In a conversation, Jeff Wu mentioned that:

“My lecture was entitled Statistics = Data Science?. There I characterized statistics as a trilogy of data collection, data analysis and decision making. I was clearly talking about analytic statistics, rather than descriptive statistics. I suggested a change of our name from “statistics” to “data science” and “statistician” to “data scientist.” I remember I even jokingly said in the lecture that by merely changing the name to data scientist, the salary will go higher. This is true nowadays. It’s interesting”

Something interesting about Wu’s definition of statistics is that data analysis is a part of it. I’m not entirely sure Tukey would agree with Wu, but the idea is clear:

Data science depends on data collection, data analysis, and decision making.

Finally, we start talking about something else: decision making. This is one of the connections between Tukey’s views on data analysis and statistics, and Luhn’s views on business intelligence.

Please check the timeline to remember the dates of the articles and presentations that I’m talking about.

Four years after Wu’s presentation, in 2001, two papers put everything together. In April 2001, Cleveland proposed an action plan to enlarge the technical areas of the field of statistics, and he called it data science. Then, in August of the same year, Breiman argued that algorithmic modeling (a different statistical culture) is better suited to solving problems with data than classical statistical modeling.

The two articles are relevant in different ways. Cleveland’s article aimed to create an academic plan to teach data science (similar to what Tukey did for data analysis), while Breiman’s article discussed the practical implications of data science and its relation to business (close to what Luhn wanted to explain with an application).

Even though Cleveland’s article was directed at universities and educational institutes, he mentioned:

Universities have been chosen as the setting for implementation because they have been our traditional institutions for innovation […]. But a similar plan would apply to government research labs and corporate research organizations

So he recognizes the importance of government and corporate organizations in institutionalizing data science as a serious field.

In the article, Cleveland states that data science depends on four big things (he lists six, but I’m leaving out the parts related to teaching data science):

  • Multidisciplinary Projects. Here he mentions:

The single biggest stimulus of new tools and theories of data science is the analysis of data to solve problems posed in terms of the subject matter under investigation. Creative researchers, faced with problems posed by data, will respond with a wealth of new ideas that often apply much more widely than the particular data sets that gave rise to the ideas.

Important things to highlight here:

1.- Data analysis and data science have the primary goal of solving problems (this will be important when we talk about Breiman’s article).

2.- The practitioner of data science needs to work on different issues and fields to be able to have a bigger picture, to exploit creativity, and to understand different types of data and problems posed by data.

  • Models and Methods. Here he mentions:

The data analyst faces two critical tasks that employ statistical models and methods: (1) specification - the building of a model for the data; (2) estimation and distribution - formal, mathematical probabilistic inferences, conditional on the model, in which quantities of a model are estimated, and uncertainty is characterized by probability distributions.

It is important to notice that he refers to the practitioner of data science as the data analyst, but we will refer to them as data scientists (something to think about).

Here we have to highlight that:

1.- Modeling is at the core of data science. This is the process of understanding “reality”, the world around us, by creating a higher-level prototype that describes the things we are seeing, hearing, and feeling. Still, it’s a representation, not the “actual” or “real” thing. Tukey also talks about this in his articles.

2.- Data science needs a method (and a methodology).

3.- The data scientist creates models for the data and uses statistical techniques and methods to develop these models. As we will see in Breiman’s article, he emphasizes algorithms instead of formal mathematical methods.
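To make Cleveland’s two tasks concrete, here is a minimal sketch in Python. It is my own illustration, not something taken from Cleveland’s article: the straight-line model, the synthetic data, and the helper name `fit_line` are all assumptions chosen for simplicity. The specification step is choosing the model y = a + b*x + noise; the estimation and distribution step is estimating a and b by least squares and describing the uncertainty of the slope with a standard error and an approximate normal interval.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares estimates for the specified model y = a + b*x + noise.

    Hypothetical helper for illustration; returns (intercept, slope, slope_se).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom (two estimated parameters).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    # Standard error of the slope, conditional on the specified model.
    slope_se = math.sqrt(sigma2 / sxx)
    return intercept, slope, slope_se


# Synthetic data (illustrative only): true intercept 1.0, true slope 2.0.
rng = random.Random(0)
xs = [i / 10 for i in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

intercept, slope, slope_se = fit_line(xs, ys)
# Approximate 95% interval for the slope, using the normal quantile 1.96.
low, high = slope - 1.96 * slope_se, slope + 1.96 * slope_se
print(f"slope = {slope:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

Note that the interval is only meaningful if the specified model is a reasonable description of the data, which is exactly why Cleveland says the inference is conditional on the model.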

  • Computing With Data. Here he mentions:

Data analysis projects today rely on databases, computer and network hardware, and computer and network software. […] Along with computational methods, computing with data includes database management systems for data analysis, software systems for data analysis, and hardware systems for data analysis.

He also talks about the gap between statisticians and computer scientists:

[…] One current of work is data mining. But the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited.

And one of his ideas is that a “merger of the knowledge [Statistics and Computer Science] bases would produce a powerful force for innovation”.

Some other things to highlight:

1.- Data scientists need an understanding of databases and computational software. Programming is there as well. He also talks about statistical packages and related software. But we now know that the path into data science depends on understanding some programming languages (mostly Python and R right now).

2.- Data science also depends on technological advances. This was true in 2001 and is true today as well. The methods that data scientists use are shaped by theoretical developments (check the timeline) but also by the fact that today we have powerful computers, cheaper and faster memory, high-speed internet, and also GPUs and TPUs.

3.- We need statisticians to learn computer science and computer scientists to learn statistics. This gap is filled right now by data scientists, but we can’t forget that moving between these fields is becoming more common, and we need experts in statistics to learn computer science and experts in computer science to learn statistics, not only people who are proficient in both.

  • Theory. Here he mentions:

Theory, both mathematical and non-mathematical theory, is vital to data science. Theoretical work needs to have a clearly delineated outcome for the data analyst, albeit indirect in many cases. Tools of data science - models and methods together with computational methods and computing systems - link data and theory. New data create the need for new tools. New tools need new theory to guide their development.

Data science is a practical field, but it needs theory to understand and explain its methods and models. Today we know that if you want to understand machine learning, you will need an understanding of linear algebra, differential calculus, statistics, and probability (to mention some of the most important).

Important things to highlight:

1.- The tools of data science and its models link the data and the theory. We need to understand the theory to create better models, and when we build models, we use all the theoretical tools.

2.- Different datasets need different theoretical backgrounds. This is clear in Tukey’s paper, where he mentions some of the most important pieces of mathematics and statistics to work with different datasets. We saw this when big data exploded, and we had to analyze disparate sources of data.

3.- Theoretical advancements guide the creation of new tools and models. This recalls the history of science, where not only did data and experiments lead to the creation of new theories, but the new theories also guided experiments, models, and tools.

The Algorithmic Modeling Culture in Data Science