Original article was published by Mehmet A. on Artificial Intelligence on Medium
What is Data Science (DS) and how can it be learned?
Data Science (DS) is undoubtedly one of the most popular research and application areas of our day. The number of people who want to learn DS, which by its nature should be regarded as an interdisciplinary field, is increasing day by day. As a Data Scientist and researcher, I decided to create a roadmap and share it with my followers, in order to guide those who want to learn DS, in light of the experience I have gained through the DS training I have received and given.
Note: I will use the following abbreviations below:
- Data Science — DS
- Artificial Intelligence — AI
- Machine Learning — ML
- Big Data — BD
- Deep Learning — DL
In general, data is knowledge of various types acquired through actions such as observation, research and experience, often shaped in a certain form. For example, a written text, numbers, the information and facts we store in our minds, or the files in a memory all represent data.
1. What is Data Science?
Let’s talk about the concept of “Data Science”, which has become very widespread in the world in recent years with the ease of producing, storing and transmitting data.
What is it really, and where did it come from?
We can forget the complex definitions we have read so far and simply say that DS is a method of extracting useful information from data. Friend suggestions in social media applications (Instagram, Twitter, etc.), the customization of ads on Google according to the user, and recommendation systems are a few examples where we see the power of this method.
So how does the process in DS progress?
Information emerges when we apply various data analytics methods to the data sources we have. If we think of a company, for example, the company converting this emerging information into monetary gain through appropriate actions shows that DS has been realized.
We live in a world where data flows in large amounts and diversity, and everyone talks about data as a very important source of information.
So what kind of information do we aim to access from the data?
Let me give a simple answer:
- We want to obtain useful information about the past, the present and the future from the data. In other words, the time period of the information we want to access is as important as whether this information is useful or not.
- When we come across observations that are similar to our data but that we have not encountered before, we want to be able to make sense of them within the framework of our data. In other words, we want to classify new, unknown observations into the categories we already know.
The answer above expresses the goals of DS.
So how does DS take us to these goals, and what does it draw on along the way?
DS is the method that results from combining Mathematics, which covers concepts such as basic mathematics, statistical modeling and ML; Computer Science, which includes programming languages, database applications and cloud systems; and business and sector knowledge, where personal skills such as creativity and innovation are important.
So, before we talk about the work we aim at with DS, let's start by explaining further the important elements in the composition of DS.
2. Components of DS
1. Statistics:
We can say that modern DS is based on statistical modeling of the world. The discipline called Statistical Learning Theory aims to solve the problems we care about by expressing them as statistical models and finding optimum parameters. To make this a little more concrete, let's give a small example:
For example, suppose we want to understand how the dollar rate rises and falls. Statistical Learning Theory offers us some methods for doing this. Let's say we decide to model the dollar rate with a simple linear regression model. Considering two variables that can affect the rate, interest and inflation, we set up an equation:
dollar rate = constant + parameter1 x INTEREST + parameter2 x INFLATION + error
According to Statistical Learning Theory, it is possible to find the optimum values of the constant, parameter1 and parameter2 by using historical data. These values should somehow minimize the error, so that the word "optimum" is justified. I won't go into how this is done right now. But when we find these optimum parameter values, we have a model of the dollar exchange rate.
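To make this concrete, here is a minimal sketch of fitting the equation above with ordinary least squares; the interest, inflation and dollar-rate figures are invented purely for illustration:

```python
import numpy as np

# Hypothetical historical data (made-up numbers, for illustration only).
interest = np.array([10.0, 12.0, 15.0, 17.0, 19.0, 24.0])
inflation = np.array([8.0, 9.5, 11.0, 12.0, 16.0, 20.0])
dollar_rate = np.array([3.5, 3.9, 4.6, 5.3, 5.8, 7.1])

# Design matrix: a column of ones for the constant term, then the regressors.
X = np.column_stack([np.ones_like(interest), interest, inflation])

# Least squares finds the constant, parameter1 and parameter2
# that minimize the squared error of the equation above.
coef, *_ = np.linalg.lstsq(X, dollar_rate, rcond=None)
constant, p1, p2 = coef
print(f"dollar rate = {constant:.2f} + {p1:.2f}*INTEREST + {p2:.2f}*INFLATION")
```

Once `coef` is in hand, predicting the rate for a new interest/inflation pair is just a dot product with `[1, interest, inflation]`.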
Today, with the rapid development of computers and other devices with processing power, training machines indirectly as above, instead of coding them directly, makes Statistical Learning Theory a very important tool. We can say that the field we call ML is, in effect, Statistical Learning Theory focused on programmable devices. To put it briefly:
“ML is the discipline that aims computers to learn to do a task without being explicitly and directly programmed.”
ML does its "learning" job by looking at the data. In other words, the field we call ML consists of all the algorithms that take data as input and can represent a task as a model. By task I mean categorizing a text, recognizing people in a photograph, estimating tomorrow's dollar rate, and so on.
2. Computer Science:
DS, of course, does not only make use of Statistics. Programming is just as important. While defining ML above, I mentioned that computers learn without being directly programmed. Although I may seem to contradict myself here, the situation is actually different: DS uses programming to define, on computers, algorithms that can learn without being explicitly programmed for the task. Of course, we don't use programming languages just for this purpose. The purposes for which we use them can be listed as follows:
- To pull data from repositories, databases or files.
- To manipulate, clean and generate data.
- To visualize data.
- To produce descriptive statistics by performing mathematical operations on data.
- To translate ML methods into code that computers can understand.
- To train our models with data.
- To transfer our models to production systems that serve the outside world, and to keep them alive in the production environment.
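As a small illustration of the first few items, here is a sketch using only the Python standard library; the CSV contents and column names are invented for this example:

```python
import csv
import io
import statistics

# A hypothetical CSV export standing in for a database table or file.
raw = io.StringIO("month,sales\n1,120\n2,\n3,135\n4,150\n")

# 1) Pull the data from a file-like source.
rows = list(csv.DictReader(raw))

# 2) Clean: drop records whose sales value is missing.
sales = [float(r["sales"]) for r in rows if r["sales"]]

# 3) Produce simple descriptive statistics.
print("mean:", statistics.mean(sales), "stdev:", statistics.stdev(sales))
# prints: mean: 135.0 stdev: 15.0
```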
The items above are just a few of the things we do with programming languages, but it is obvious how important each of them is in today's DS.
What about programming language?
Python or R?
What about Julia?
Java, Scala and Go?
A lot has been written about this, and I will not go into it. Let me state just one principle and close the subject:
Programming languages are tools. You don't use a wrench when you need a screwdriver, or pliers when you need a wrench. The same goes for the choice of programming language in DS. However, if you are just starting out, you will have to pick one and start. For my part, in the ongoing debate I have seen so far, I side with Python. I will write a separate short article on the Data Scientist later. For now, let me just share a saying I believe in and move on:
“A Data Scientist is a person who knows more statistics than a programmer and more programming than a statistician.”
On the other hand, programming is not the only thing DS borrows from Computer Science. In fact, many things, from distributed architectures to BD technologies, fall into the toolset DS uses. To give an interesting example, GPUs, which were once developed to boost performance in computer games, have become an indispensable technology for ML.
3. Field knowledge:
If you have done research on what DS is, you have seen that most of the explanations talk about Statistics and programming, but not field knowledge. There are many reasons why I insist on field knowledge.
“If you don’t have specialist knowledge specific to your field, you run the risk of being deceived by data.”
Actually, we do not need to go that far. With simple logic, we can think of it this way: field knowledge guides us on which data is useful, and it sheds light on our path on questions of causality. Expertise in which factors can lead to which results is, to be sure, worth more than any data or method.
So, if we are discussing DS at an introductory level, we are not talking about specialization; if we talk about the basics, it is only to lay the first stones of the road to expertise.
Is DS a science?
Everything that can be falsified is scientific… Karl Popper.
Of course, if DS talks about falsifiable things and produces falsifiable results, it can be considered a science. However, you will hear from some that DS is also an art. Why? Let me illustrate briefly:
At this point, models created with Artificial Neural Networks produce satisfactory outputs in certain tasks. But unfortunately, we cannot explain exactly what these neural networks have learned and how; we can only catch clues. That is why you have read many articles describing Artificial Neural Networks as "black box" models. But we also don't know exactly how our brains work, right? Still, no one doubts that the human brain is doing very special things.
If you develop models with Artificial Neural Networks, you will see that creating something original, and predicting which architectures will work better on the task at hand, is sometimes a matter of intuition. Yes, intuition, grown through your own and others' past experience, is one of a Data Scientist's most helpful friends. Work undertaken without intuition turns into trial and error over a thousand and one possibilities, where time flows like water through your palms.
3. What do we aim to do with DS?
- Making sense of the past with data!
While talking about what DS aims at, I said that we want to obtain useful information about the past from data. Let's open this up a bit and look at how we can extract useful historical information from data:
For example, in your company, you have been given data on the number of past sales and asked to study how exchange rate movements affect sales. Let’s list the path you will follow roughly as follows:
1. You will examine the sales data and try to fix the problems you identify in it.
2. You will gather data that you think is useful in your analysis. For example, while the currency affects sales volume, it does so through several channels:
i) It affects your sales as it affects people’s purchasing power.
ii) It affects the demand for your products as it affects the prices of your products due to imported inputs.
iii) It affects the use of credit, since it affects interest rates, and therefore affects the demand for your products, etc.
In summary, in order to control for the influence channels above in your analysis, you should consider variables such as Gross National Income, consumer loan interest and the exchange-rate pass-through of your products, and prepare the relevant data. Note that taking all these effects into account requires not just statistical knowledge but also knowledge of your field.
3. After the steps above, you applied a classic "data analysis workflow" and reached findings on the effect of exchange rates on your sales volume.
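The final regression step might look like the following sketch; all the numbers, and the choice of control variables, are invented for illustration:

```python
import numpy as np

# Hypothetical quarterly data (all numbers made up for illustration).
sales    = np.array([100., 104., 98., 110., 95., 107., 92., 101.])
fx_rate  = np.array([3.2, 3.4, 4.1, 3.3, 4.6, 3.5, 4.9, 3.9])
income   = np.array([50., 52., 51., 55., 50., 54., 49., 52.])
loan_int = np.array([12., 11., 15., 10., 17., 11., 19., 13.])

# Include the influence-channel variables as controls, so the
# exchange-rate coefficient is estimated with those channels held fixed.
X = np.column_stack([np.ones_like(sales), fx_rate, income, loan_int])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

# coef[1] is the estimated effect of the exchange rate on sales volume.
print("exchange-rate effect on sales:", round(float(coef[1]), 2))
```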
The finding you have reached is invaluable to your company. You have done a job of making sense of the past, but in doing so you have also laid one of the important building blocks for the exchange-rate strategy your company will follow in the future.
The example above is one in which studying the past aims to illuminate the future. In addition, there are studies done purely to understand the past, for example measuring the impact of the development of Atlantic trade in the 15th century on Mediterranean trade.
- Living the Moment with Data
Let's come to the present. You know the saying "live the moment". Expanding the moment we live in, in terms of information, is an important matter, because the most important thing we can control about the future is "now". So how can data help us? Let's start with a parable from Rumi:
One day a man builds a house and makes a deal with its walls: before they come down, the walls will warn the landlord so that he and his family are not harmed. Years pass, and one day the walls suddenly collapse. The man calls out to the crumbling walls in tears and anger:
- We had a deal, didn't we? Why didn't you let me know?
A voice is heard from the walls:
- We tried to tell you. Every time we opened our mouths to say something, you came with mud and covered them.
Let's adapt this story to the present. You have a factory, and in this factory you produce high value-added products with hundreds of machines in countless integrations and automations. You have machines such that if they stop working, your entire production process comes to a halt. If one day these machines fail and your production stops, who will you be angry with?
This is where the now-widespread phenomenon known as the Internet of Things comes in. All these machines actually generate data about themselves and their operation every second. If you can distill information from this data and detect anomalies in your machines, you can buy time to take precautions. Here is roughly what you will do:
- You will set up an infrastructure to receive and store the data streaming live from your machines.
- You will learn from past failures. In other words, you will identify machine failures in your historical data and design statistical models that can capture failure situations.
- In addition to the second item, you will also design statistical models to detect situations that you might call anomalies.
- You will try to understand what your machines say without interrupting them.
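As a sketch of the anomaly-detection idea in the items above, here is a simple rolling z-score detector; the readings are invented, and a real system would of course consume a live stream:

```python
import numpy as np

# Hypothetical sensor readings from a machine; the spike near the end
# plays the role of a developing failure.
readings = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 1.0, 3.5, 1.0])

window = 5       # how many recent readings define "normal"
threshold = 3.0  # flag readings more than 3 sigma from the recent mean

anomalies = []
for i in range(window, len(readings)):
    recent = readings[i - window:i]
    mu, sigma = recent.mean(), recent.std()
    if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
        anomalies.append(i)

print("anomalous indices:", anomalies)  # the reading of 3.5 is flagged
```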
In summary, with data it is possible to live through the present moment better informed.
- Pursuing the Future with Data
Suppose the word "future" means "unknown". My purpose here is to devote this section to the prediction of unknown or unseen things, independent of the concept of time. If you consider that things in the temporal future are also things not yet known or seen, you can ignore this assumption. Still, let's begin by examining an example that really is about future time.
Let's say you have some money to invest. There is a heap of alternative investment instruments: the stock market, currency, gold, deposits, real estate, Bitcoin, Ethereum, Ripple, etc. Which one will you invest in? Over what horizon do you expect a return? You have historical data for all these investment alternatives. If you can model the pricing of each alternative, you can also predict future prices. Take the stock market as an example and create a simple workflow:
- You have pulled the price history of all the stocks listed on BIST (Borsa Istanbul).
- You have also reached the data that you think affects the share prices. For example, you now have data on the balance sheets of some companies as well as macro variables such as interest rates, exchange rates, inflation and Gross Domestic Product.
- You have processed the data you have to use in your models.
- You have determined your alternative models to choose the best among them.
- You have decided to use the sliding windows method to measure the performance of your models. You will design a pseudo-future within your historical data. For example, using the previous 20 periods of data, you will predict the 21st period as if it had not yet arrived. Then you will slide your window one period and predict the 22nd period with the previous 20 periods, continuing like this until the last period.
- By simply averaging your model's prediction errors across these windows, you will know how well your model is doing.
- You have repeated the fifth and sixth items for all your model alternatives and selected the best performing model.
- Using the model you have chosen, you can now predict the real future with the last 20 periods of data in your hand!
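The sliding-window evaluation in the workflow above can be sketched like this; the price series is a toy sequence, and the "model" is a naive stand-in that predicts the window mean:

```python
import numpy as np

prices = np.arange(1.0, 31.0)  # 30 periods of toy price data
window = 20                    # predict each period from the previous 20

errors = []
for start in range(len(prices) - window):
    history = prices[start:start + window]  # the 20 known periods
    actual = prices[start + window]         # the period to "predict"
    predicted = history.mean()              # naive stand-in model
    errors.append(abs(predicted - actual))

# Averaging the errors over all windows summarizes model performance.
print("mean absolute error:", float(np.mean(errors)))
```

Swapping in a real forecasting model only changes the `predicted = ...` line; the evaluation loop stays the same.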
So, do you like it?
When you can do something similar for all the alternative investment instruments, you have a data-based prediction of where you should invest your money.
But the unknown is not just the future!
For example, you want to design a driverless car navigation system! What will you do? Will you travel every road on earth and code rules prescribing how your vehicle should behave on each one? Of course that is impossible. What you are going to do is let your car learn how to move on roads it has never seen, by feeding it what needs to be done on a sufficient number and variety of roads.
Let’s give one more example.
You run an educational institution, a school. You have received 100 new enrollments in the 1st grade of secondary school. You have some information about each student, but you do not know how to group these 100 students, or which students to put in which classes.
What do you do?
At the very least, your data has something to tell you, and you should listen to it first. If, say, you are going to open 5 new classrooms for these 100 students, you can start by dividing them into 5 groups according to the data you have.
What if the data guides you, but the 5 groups it suggests do not sit well in your mind?
Then perhaps you can try opening 6 classrooms, or 4. Maybe you will create more coherent groups. As long as you know how to decipher the language of the data.
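Grouping the 100 students could be sketched with k-means clustering; the two student features here are randomly generated stand-ins for whatever data the school actually holds:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 students with two made-up numeric features
# (say, an entrance-test score and a normalized grade average).
students = rng.normal(size=(100, 2))

def kmeans(data, k, iters=50):
    """Minimal k-means: assign points to the nearest centroid, recompute."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every student to every centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():  # keep old centroid if a group empties
                centroids[j] = data[labels == j].mean(axis=0)
    return labels

groups = kmeans(students, k=5)
print("class sizes:", np.bincount(groups, minlength=5))
```

Trying `k=4` or `k=6` instead is a one-line change, which is exactly the kind of experiment the text suggests.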
4. AI, DS and ML
Notice that AI has not appeared in anything I have written so far.
So where is AI at work?
To answer, we can use DS as a lever to explain the relationship between ML and AI. If you do some research online, you will see that people are a bit confused about the relationship between these three disciplines. Some include ML in DS, while others tend to perceive it as a part of AI. We are not going to come up with a classification that ends all discussion, but let's see what we understand in general.
Let’s start with an observation.
In the past, Physics was seen as a sub-branch of Philosophy. As time passed, Physics advanced and is now seen as a science in its own right. Similarly, Probability Theory was once a research field within Mathematics; it then branched out and gave birth to what we call Statistics. Therefore, whether fields or disciplines are evaluated within one another is something that time decides.
As generally accepted, DS is a discipline that aims to obtain useful information from data. From this point of view, its scope is very wide; consequently, it can be said that it is too early to give it a full definition. AI is a field that aims to enable computers to think like humans. By its nature, DS makes extensive use of ML algorithms. Similarly, AI is one of the disciplines that makes extensive use of ML techniques today.
In particular, DL, one of the ML methodologies, has been the main development area paving the way for AI in the last 10 years. But as you can see, ML is not a discipline that sits entirely inside AI, because ML includes descriptive algorithms as well as predictive ones. On the other hand, AI is one of the fields contributing to DS: many techniques, from genetic algorithms to fuzzy logic, find application in DS. Therefore, it would not be wrong to think of these three fields as three different disciplines that have a lot in common.
That's all I have to share about DS. It would make me happy if this article gave you a new and different perspective!
Do not forget!
“In life, the most real guide is science … MKA”
“and you can explain this with data… MA”