Foundations of Data Science
Nowadays, all of us have heard that “Data is the new oil and Data Science is the combustion engine that drives it”, or that “Data Science is the most sought-after job of the twenty-first century”, or that “Data Science is the future”. In this article, we discuss what exactly Data Science is.
What is Data Science?
Before we answer this question, it’s important to understand why there is so much confusion around something so popular, and why there is no clear understanding of what Data Science is.
One reason is that Data Science is an assortment of several tasks. The Data Science pipeline involves many different tasks, and the importance of each task changes from application to application: in one environment or organization a particular task might matter most, while in another some other task takes priority. This uneven importance of the different tasks is why it’s not clear who can call himself/herself a Data Scientist, and why people wonder: if I’m doing only this one task, or these two tasks, is it still Data Science? So, we try to clear up this confusion, starting with the question: what are the different tasks involved in the Data Science pipeline, i.e., the tasks a Data Scientist should know?
The different tasks involved in the Data Science pipeline are:
- To collect the data.
- To store the data.
- To process the data.
- To describe the data.
- To model the data, i.e., to come up with models for the data.
So, these are the tasks in a typical Data Science pipeline, and a Data Scientist has to know at least one or more of the skills required to perform them efficiently.
It might be that someone works in an environment where no modeling is required: all they care about is describing the data in a certain way, drawing graphs, and reporting the mean, median, mode, etc. That is sufficient for their scenario, because they don’t really need to build a model.
And in some cases, collecting the data may not be required at all, say in a data-rich organization, and the work is more about processing the data, such as cleaning it. This assortment of different tasks, with different importance in different applications, is what creates confusion about what constitutes Data Science.
Let’s look at each of the tasks involved in a typical Data Science pipeline in detail.
We start with the question: what is involved in data collection?
The answer depends on the question the Data Scientist is trying to answer, the kind of problem we’re interested in, and the environment the Data Scientist is working in. By environment we mean the organization in which the person works: is it a data-rich organization, or an NGO where we need to go out and collect the data, and so on?
Let’s start with an example where a Data Scientist works at an e-commerce company, say Amazon or Flipkart. These are data-rich organizations: many customers visit their sites, and a lot of data is collected on a daily basis. One question a Data Scientist at these companies might be interested in is: “Which items do customers buy together?” The obvious reason to answer it is targeted ads: if someone bought a laptop today, they might need an external mouse to go with it; if they bought a phone, they might be looking for additional memory or some other accessory. So, this is an important question.
So, let’s see how a Data Scientist goes about collecting the data for this. Since this person works in a data-rich organization like Amazon, the data already exists within the organization. What the Data Scientist needs to know is a database language, typically how to write SQL queries to access the data, along with some accompanying Python or Java code, because it’s not just about accessing the data: we will have to do something more with it, probably by embedding the SQL queries inside Python or Java. So, to collect structured data in such scenarios, the person must know a bit of programming and an interface to the database.
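As a sketch of what embedding SQL in Python can look like, here is a minimal, self-contained example using Python’s built-in sqlite3 module; the `orders` table and its contents are made up for illustration, and a real deployment would query the organization’s production database instead:

```python
import sqlite3

# Hypothetical orders table: one row per (order, item) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "laptop"), (1, "mouse"),
     (2, "phone"), (2, "memory card"),
     (3, "laptop"), (3, "mouse")],
)

# Self-join on order_id to count how often two items appear in the same order.
pairs = conn.execute("""
    SELECT a.item, b.item, COUNT(*) AS times
    FROM orders a JOIN orders b
      ON a.order_id = b.order_id AND a.item < b.item
    GROUP BY a.item, b.item
    ORDER BY times DESC
""").fetchall()
print(pairs)  # most frequently co-purchased pair first
```

The self-join pairs up items that share an `order_id`, which is one simple way to start answering the “bought together” question before moving to more elaborate models.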
It might also be the case that Amazon is not storing this data in a structured format in a relational database, but in an unstructured or semi-structured format such as JSON or XML. In that case, the person must also know how to interact with these formats, using libraries that allow accessing data from such sources. So, that’s what we need to do to get the data when the data already exists.
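A brief sketch of reading such semi-structured formats with Python’s standard libraries; the field names here are hypothetical:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical JSON record, as it might arrive from a semi-structured store.
raw = '{"customer_id": 42, "purchases": [{"item": "laptop"}, {"item": "mouse"}]}'
record = json.loads(raw)
items = [p["item"] for p in record["purchases"]]
print(items)  # ['laptop', 'mouse']

# The same idea for XML, using the standard ElementTree parser.
doc = ET.fromstring("<order><item>phone</item><item>memory card</item></order>")
xml_items = [e.text for e in doc.findall("item")]
print(xml_items)  # ['phone', 'memory card']
```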
Let’s look at another example: say a Data Scientist works for a political party, say the party of the current government, which has come up with a new policy. As a Data Scientist, we are interested in knowing “What are people saying about the new policy?” or “Which specific aspects of this policy do people like or dislike?” and so on. The answers can then help in modifying this policy or shaping new ones.
So, what happens in this case: does the government already have the data? Does the government already know what people are saying about this?
The answer to the above questions is both yes and no. No, because the situation differs from Amazon, which was itself storing and collecting the data. Here the data already exists: people are talking about the policy on Facebook, Twitter, Reddit, discussion forums, and so on. So the data exists, but it is not owned by you in some sense; it’s not stored in a structured relational database within the political party or the government. Of course, the government may have its own website that encourages people to leave feedback and engage in discussions, but that might be only a part of the data; the main data may be on the public platforms that are readily accessible and that citizens are more used to. So, in this case, to collect the data, a Data Scientist needs a background in programming, some scraping/crawling skills, and knowledge of how to work with APIs and fetch data using them.
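A rough sketch of the API side of this, with heavy caveats: real social-media APIs differ in authentication and response fields, so the endpoint and payload below are entirely made up. The example only shows the general shape of building a query URL and parsing a JSON response:

```python
import json
from urllib.parse import urlencode

# Hypothetical search endpoint and parameters; a real API (Twitter, Reddit,
# etc.) would require authentication and use its own parameter names.
base = "https://api.example.com/search"
params = {"q": "new policy", "count": 100}
url = base + "?" + urlencode(params)
print(url)

# A sample JSON body like one such an API might return; in practice this
# would come from an HTTP request to `url`.
response_body = '{"posts": [{"text": "I like the new policy"}, {"text": "Not a fan of clause 3"}]}'
texts = [p["text"] for p in json.loads(response_body)["posts"]]
print(texts)
```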
Let’s look at another situation, where a Data Scientist works with farmers in an agricultural region. There are different factors that influence the yield. For example, in a particular state there might be certain hybrid seeds that the government or NGOs are promoting because they are known to be effective in other areas. There are certain types of fertilizers people might want to use, either organic ones or more chemical fertilizers, because they lead to certain yields. And there are different irrigation methods: say the land in a particular region is water-starved and we want to rely more on drip irrigation or other conservative ways of irrigating.
So, now we want to understand the effect of these different choices on the final yield, and we can see that these choices interact with each other: one type of seed combined with one type of fertilizer may not behave the same as the same hybrid seed combined with a different type of fertilizer. So, we have these different choices, and now we need data. In contrast to the two scenarios discussed above, here the farmers, NGOs, and government do not really have the data, because we don’t know what people are actually using in all the farms across the entire state. They might have an idea of the total fertilizer supplied to a particular state, but not the quantities in which people use it, or whether farmers are actually using the hybrid seed or relying on some other seed they already have. This is a situation where the data is neither with the organization nor readily available on social media platforms; this is where we need to venture out and collect data, and we need to design experiments to do so. (We will discuss how in another article, but to give an idea: if one particular type of seed does not give a good yield, does that mean the seed was bad, the irrigation method was bad, or the fertilizer was bad?)
In summary, the different skills required for collecting data are: programming and database interfaces for structured data, libraries and APIs (and some scraping/crawling) for semi-structured and external data, and a sound knowledge of Statistics. If we want to design experiments to collect data, Statistics tells us how to collect it so that it is not biased and there are no confounding effects of different variables on the final output variable/measurement we are trying to take.
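To give a flavor of what designing such an experiment can look like, here is a minimal sketch of a full factorial design in Python; the factor levels are illustrative, not a recommendation:

```python
import random
from itertools import product

# Illustrative factor levels for the farming example.
seeds = ["hybrid", "local"]
fertilizers = ["organic", "chemical"]
irrigation = ["drip", "flood"]

# Full factorial design: every combination of every factor level appears,
# so the effect of each factor can be separated from the others.
designs = list(product(seeds, fertilizers, irrigation))  # 2 x 2 x 2 = 8

# Randomly assign the 8 treatment combinations to 8 plots, so that no
# treatment is confounded with, e.g., soil quality varying across plots.
random.seed(0)  # fixed seed only to make the sketch reproducible
plots = designs[:]
random.shuffle(plots)
for i, treatment in enumerate(plots):
    print(f"plot {i}: {treatment}")
```

Crossing every level of every factor, and randomizing which plot gets which combination, is one standard way to keep the factors from being confounded with each other or with plot location.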
Let’s look at the task of storing the data.
1. Transactional and Operational Data
A large amount of data is transactional and operational data, or at least traditionally it was. By transactional and operational data we mean things like sales records, bank transactions, inventory updates, payroll entries, and so on: records of the transactions and the things that happen on a daily basis for running the operations of an organization. Large organizations that have existed for a long time, say since the 1950s, were generating a lot of such transactional and operational data, and they wanted to store it somewhere so that it is preserved for later use and the records can be pulled up whenever required.
So, this kind of data, which was prevalent, has a structured form: it is organized as a table. A lot of such data was created over many years, and the need to store, access, and update it led to the entire development of relational database technology.
And the main purpose of a relational database is to allow different operations on the data: select, insert, update, and delete. These are the typical operations we perform on a relational database, which stores the data in the form of tables.
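A minimal sketch of these four operations, using Python’s built-in sqlite3 module with a made-up table:

```python
import sqlite3

# Hypothetical customers table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Delhi')")            # insert
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE id = 1")            # update
row = conn.execute("SELECT name, city FROM customers WHERE id = 1").fetchone()  # select
print(row)  # ('Asha', 'Mumbai')
conn.execute("DELETE FROM customers WHERE id = 1")                           # delete
```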
2. Data from multiple databases
Then there are other kinds of data as well. As companies grew in the 1990s and so on, there were very big organizations with multiple sub-organizations within them. Take the example of a very big bank: it has customers with bank accounts, but those customers might also have a credit card and an investment account (maybe they are buying mutual funds, stocks, or other kinds of policies through the bank). All this data is stored in multiple databases within the same organization: the unit of the bank that deals with savings accounts is different from the unit that deals with credit cards. The same thing holds in other sectors as well: you could be a customer of Airtel because you have a SIM card from them, an Airtel TV connection, or a broadband connection. These are different sub-organizations/departments within a larger organization, and each of them stores data about customers. So, the SIM and phone data might be stored in one database; the broadband data and its metadata (how many months you have been using the service, the monthly bill, whether you pay on time, and so on) might be in another; and the Airtel TV details might be in yet another database.
This is not an uncommon situation in very big organizations: multiple relational databases storing data about different aspects of, often, the same customer, because you could have all three services mentioned above from Airtel.
Now there is a need to integrate all of this data into one common repository, and we want to do that to support analytics. Say we want to see whether the customers who use our SIM also use our broadband, and if yes, whether they are happy with both services, or happy with the SIM card service but not the broadband service, or the other way around. When data is stored in multiple databases and we don’t link them, we cannot do any effective analysis of the data we have. This led to the development of Data Warehouses.
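As a toy sketch of this integration idea (the customer IDs and fields are made up), linking records about the same customer from two separate stores might look like:

```python
# Two separate "databases", keyed by a shared customer ID.
sim_db = {101: {"plan": "prepaid"}, 102: {"plan": "postpaid"}}
broadband_db = {101: {"speed_mbps": 100}}

# Merge both sources into one record per customer, so cross-service
# questions (SIM *and* broadband?) become answerable.
warehouse = {}
for cid in set(sim_db) | set(broadband_db):
    warehouse[cid] = {**sim_db.get(cid, {}), **broadband_db.get(cid, {})}
print(warehouse[101])  # customer with both services
```

In practice this linking is done by ETL tools over relational databases, but the principle, joining records on a shared key, is the same.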
A Data Warehouse is a common repository where data from multiple structured databases gets accumulated and integrated. The key distinction from a single relational database is that a Data Warehouse is optimized for analytics. In a relational database, we typically have technologies that optimize for select, update, or insert operations, because those are the operations we care most about. A Data Warehouse is instead optimized for analytical operations that aggregate over a large number of rows, in the order of millions: say the rows are the calls made on the Airtel network in a day, and we want to aggregate over or compare entries in this large database. These analytical operations are what Data Warehouses are optimized for; that is the main difference from a relational database. To summarize, a Data Warehouse:
- It’s a collection of many relational databases.
- It supports and is optimized for Analytical operations.
3. Unstructured Data
So, this was in the 1990s, when the need for Data Warehouses was felt. Since around 2003, with the large penetration of mobile phones, better broadband connectivity, and so on, we are generating even larger amounts of data, and a lot of it is unstructured, i.e., not in a tabular format. People are sending images and writing blogs, many newspaper articles are published online, many images are uploaded to social media websites, there are videos on YouTube and other places, and speech data is being generated. All of this is unstructured data, and a large volume of it has been generated over the past decade or a bit more. A few things that characterize this data are as follows:
- High volume — The amount of data getting generated is huge for example 500 hours of video content gets uploaded on YouTube every minute, that’s the order of magnitude at which the data is getting generated.
- High Variety — This data deals with different modalities. Compare this with the situation in the 1950s and 1990s, where we mainly had structured data: now we have text, image, video, and speech data, in addition to the structured transactional and operational data.
- High Velocity — The speed at which the data is getting generated is very high for example as mentioned above, we have 500 hours of content getting uploaded every minute.
These 3Vs (Volume, Variety, Velocity) are what characterize Big Data.
So, this Big Data comes with its own storage challenges: the formats are different, the data is unstructured, and there are multiple varieties. Just as we moved from Relational Databases to Data Warehouses, what we have today is known as Data Lakes.
A Data Lake is a collection of all sorts of data that flows from within or outside an organization (say people are reviewing your product not on your website but on other discussion forums like Reddit). Whatever data relates to the organization, be it structured or unstructured, currently usable or not, we keep dumping into our data lake without worrying whether it is usable right now, hoping that it might be useful for some analytics in the future. This is the idea behind data lakes.
Let’s now look at the task of processing the data.
1. Data Wrangling or Data Munging
Let’s say we are a large organization with our own databases, and we have used a courier service to send some packages to a different location. We then receive some data back from the courier service as JSON, and we want to take this JSON data and put it into our own database (say the package was something a customer bought online from our site and has now been shipped, so we want the same details in our database).
Now, there might be some information in the JSON file that we are not storing in the database, and some information that is not in the right format, which we have to transform before storing. This process is known as “extract, transform and load”, also called “Data Wrangling” or “Data Munging”. One example of a transform: the JSON file has a key named “package_content”, and we store that information under the column named “item” in the database.
Similarly, the delivery date and delivery time are two separate fields/keys in the JSON file, but we combine the two into a timestamp and store that in our database. Likewise, the receiver field contains the full name of the receiver of the package, but we store it under two fields, First Name and Last Name, in the database.
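The transform step described above can be sketched as follows; the field names follow the article’s example, while the date/time formats and sample values are assumptions:

```python
import json
from datetime import datetime

# Hypothetical courier JSON, shaped like the example in the text.
raw = json.loads(
    '{"package_content": "laptop", "delivery_date": "2021-03-01", '
    '"delivery_time": "14:30", "receiver": "Asha Verma"}'
)

# Transform: rename package_content -> item, merge date + time into one
# timestamp, and split the receiver's full name into first/last name.
first, last = raw["receiver"].split(" ", 1)
record = {
    "item": raw["package_content"],
    "delivery_timestamp": datetime.strptime(
        raw["delivery_date"] + " " + raw["delivery_time"], "%Y-%m-%d %H:%M"
    ),
    "first_name": first,
    "last_name": last,
}
print(record)  # ready to load into our own database
```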
So, the role of the Data Scientist in Data Wrangling is to integrate information from multiple sources. In the above example, the courier service could even be a part of the organization itself: some other department/organization within the parent organization, storing data in either a structured or unstructured format.
So, we need to look at all the databases we are trying to integrate, extract the relevant information from each, and load it into the final/main database.
2. Data Cleaning:
After Data Wrangling, the next thing we need to do is “Data Cleaning”. For example, say our company is a fitness app service with a form for users to fill in details like weight and height, and some people don’t provide, say, the height value. We now need to figure out how to deal with these missing values.
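One common (though not the only) way to handle missing values is to fill them in with the mean of the observed values; a minimal sketch with made-up data:

```python
# Heights in cm, with None marking the values users did not provide.
heights = [170, None, 165, 180, None]

# Mean imputation: fill missing entries with the mean of the observed ones.
observed = [h for h in heights if h is not None]
mean_height = sum(observed) / len(observed)
filled = [h if h is not None else mean_height for h in heights]
print(filled)
```

Other strategies, such as dropping the incomplete rows or predicting the missing value from other columns, may be more appropriate depending on the application.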
Standardize Keywords: Say there is an e-commerce site where people can upload images of the goods they would like to sell, and someone, instead of the keyword “Half-sleeve”, mentions just “half”. We need to standardize such keywords, or it will affect the other operations. So, whenever we receive data from noisy sources, we need to clean it up and standardize the keywords and tags.
Then we have to correct spelling errors: if we receive data from an unreliable source, there might be mistakes in it that we need to correct. We also need to identify and remove outliers: if someone enters their weight as 1000 kg while the other weights lie in the range of, say, 40–150 kg, we know it’s an outlier and we need to remove it, to ensure that the subsequent model we build is built on cleaned data.
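A small sketch of two of these cleaning steps, keyword standardization and outlier removal, with made-up data and an assumed plausible weight range:

```python
# Map noisy tag variants to one standard keyword (mapping is illustrative).
tag_map = {"half": "Half-sleeve", "half sleeve": "Half-sleeve",
           "Half-sleeve": "Half-sleeve"}
tags = ["half", "Half-sleeve", "half sleeve"]
clean_tags = [tag_map.get(t, t) for t in tags]
print(clean_tags)

# Drop weights outside an assumed plausible range of 40-150 kg.
weights = [60, 75, 1000, 52]
clean_weights = [w for w in weights if 40 <= w <= 150]
print(clean_weights)
```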
3. Data Scaling, normalizing and standardizing:
Scaling Data: Say we receive some distance data in kilometers, but in the database we want to store it in miles; we scale it appropriately so that it is in the consistent form we are storing.
Normalizing Data: Say we are collecting height data in centimeters, and if we plot it, it looks like a normal distribution centered around 150, i.e., the average height is 150 cm with a spread around it. For various modeling purposes, and to get a better sense of the different columns in our database (say height, weight, age, income, and so on) and of what counts as high or low, we normalize the data: we transform it so that it has zero mean and unit variance. After normalization, whatever was 150 in the original data becomes 0, and the data around it spreads accordingly.
Standardizing Data: Similarly, we standardize the data. For example, suppose one column has the four values -10, 0, 10, 30. To ensure that the different columns in the database are comparable in terms of their ranges, we re-scale and shift these values so that they lie between 0 and 1: we subtract the minimum value from all the entries and then divide by the range (max - min).
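The three transforms can be sketched as follows, using the article’s terminology (note that some texts swap the names “normalize” and “standardize”); the data values are made up:

```python
import statistics

# Scaling: kilometers to miles.
km = [5.0, 10.0]
miles = [d * 0.621371 for d in km]

# Normalizing: zero mean, unit variance (a z-score).
heights = [140.0, 150.0, 160.0]
mu, sigma = statistics.mean(heights), statistics.pstdev(heights)
normalized = [(h - mu) / sigma for h in heights]  # 150 -> 0.0

# Standardizing: min-max rescaling to [0, 1].
values = [-10, 0, 10, 30]
lo, hi = min(values), max(values)
standardized = [(v - lo) / (hi - lo) for v in values]
print(standardized)  # [0.0, 0.25, 0.5, 1.0]
```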
Distributed Processing: If we are dealing with Big Data and want to, say, standardize all the items in a column, we need something known as “Distributed Processing”. This means we divide our data into smaller chunks, distribute them to multiple computers or a cluster of servers, do a partial computation on each server, and then aggregate the results. This is exactly what Hadoop MapReduce allows us to do.
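Here is a single-machine sketch of the map-reduce pattern: chunk the data, compute a partial result per chunk, then aggregate. Hadoop MapReduce distributes this same pattern across a cluster:

```python
from functools import reduce

# Toy dataset: the numbers 1..1000.
data = list(range(1, 1001))

# Split into chunks of 100, as if each chunk went to a different server.
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

partials = [sum(chunk) for chunk in chunks]   # "map": partial sum per chunk
total = reduce(lambda a, b: a + b, partials)  # "reduce": aggregate partials
print(total)  # 500500
```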
The fourth task is describing the data. It has two aspects: the first is visualizing the data and the second is summarizing the data. Let’s talk about each of them, starting with visualizing the data:
1. Visualizing Data
Say ours is an e-commerce platform where many people upload images of their goods, and we have a large collection of shirts on the platform. We want a quick visualization of what this data looks like. Since we have shirts of different colors, it is easy to plot a bar graph with the possible colors and bars representing the corresponding counts. From the plot, we can tell how many shirts of each color we have, or what the most dominant shirt color on our platform is.
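A quick sketch of the counts behind such a bar graph; with a plotting library (e.g., matplotlib) these counts would become the bar heights, but here we simply print text bars:

```python
from collections import Counter

# Made-up shirt-color data from the platform.
colors = ["blue", "red", "blue", "green", "blue", "red"]
counts = Counter(colors)

# Text "bar graph": one '#' per shirt of that color, most common first.
for color, n in counts.most_common():
    print(f"{color:6s} {'#' * n}")
```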
We can also visualize sales data across different products to make some sense out of it. If we don’t visualize the data, the alternative is to look at it as it is stored in a relational database: one row per record, with values for different attributes, i.e., a table, from which we cannot make much sense. A quick summary of the data in the form of a visualization helps us describe the data effectively and quickly, and draw clear insights from it.
We could also use plots to find the relation between two variables. Say we want to know how the amount spent on marketing relates to sales: do sales keep increasing as we put more money into marketing, or do they increase up to a point and then saturate, so that more marketing does not help? A scatter plot of the two attributes can show this.
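One simple numeric companion to such a scatter plot is the Pearson correlation coefficient; a sketch with made-up marketing and sales figures:

```python
import statistics

# Made-up data: marketing spend vs. sales (perfectly linear here).
marketing = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [10.0, 14.0, 18.0, 22.0, 26.0]

# Pearson correlation computed from first principles:
# covariance divided by the product of the standard deviations.
mx, my = statistics.mean(marketing), statistics.mean(sales)
cov = sum((x - mx) * (y - my) for x, y in zip(marketing, sales))
r = cov / (statistics.pstdev(marketing) * statistics.pstdev(sales) * len(marketing))
print(r)  # close to 1.0 for a perfectly linear relationship
```

A value near 1 or -1 suggests a strong linear relation; a saturating relation would show up in the scatter plot but yield a weaker linear correlation.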
2. Summarizing Data
Say a TV showroom has noted down its daily sales for one month, and we are now interested in certain questions that summarize this data, such as the average daily sales, the median, or the most frequent value.
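Such summaries can be computed directly with Python’s statistics module; the sales numbers below are made up:

```python
import statistics

# Made-up daily TV sales for part of a month.
daily_sales = [3, 5, 2, 5, 4, 5, 3, 2, 4, 5]

print("mean:", statistics.mean(daily_sales))      # average daily sales
print("median:", statistics.median(daily_sales))  # middle value
print("mode:", statistics.mode(daily_sales))      # most frequent value
```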
Visualizing and summarizing the data are the subject of Descriptive Statistics. It’s an iterative process: we plot different variables and see which plots make sense. This is also known as Exploratory Data Analysis: we look at the columns of the data, draw some plots, compute summary statistics, see if the plots make sense, and then iterate again to find the important columns in the data.
The fifth task involved in the Data Science pipeline is discussed in another article.
In this article, we discussed what Data Science is and the tasks involved in the Data Science pipeline, and zoomed into four of the five tasks.