Data engineering before data science
I am sure that almost everybody reading this article has heard about deep learning before. For example, you may have read an article about how convolutional neural networks (CNNs) work, or perhaps one about recurrent neural networks (RNNs). If that is your case, you already know the science of deep learning and its theoretical side. If, on the other hand, you have never read anything about it before, don't worry: this is a good place to start in this beautiful world.
In contrast with the previous paragraph, I am also sure that more than one of you has never heard about the data engineering that comes before every machine learning project in the real world. Usually we collect data from external people and, unfortunately, we will find a lot of issues: missing data, wrong values (for example, a name in the age column), and so on. So let me introduce you a little to the data engineering world, and more concretely to ETL.
ETL is the acronym for Extract, Transform and Load. So that is it: first of all, we need to extract the data from its source, a task that is not always trivial, since sometimes the information is just in a plain text file. After that, you must take a look at your data and see what you have and what shape it is in (we will dive deeper into the transformation part later, don't worry). Finally, after extraction and transformation, you must store the data in the form that is best to work with. Even if you think the data arrived in good shape, you should spend a few minutes of your time considering the possibilities you have; a better layout may be feasible.
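As a minimal sketch of those three stages, assuming a tiny made-up CSV of clients (the file contents, table name, and "drop unparseable rows" rule are all illustrative choices, not a prescribed pipeline):

```python
import csv
import io
import sqlite3

# Extract: parse raw text (an in-memory CSV stands in for an external file)
raw = "name,age\nAlice,34\nBob,29\nCarol,oops\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and drop rows we cannot repair (Carol's age is not a number)
clean = [{"name": r["name"], "age": int(r["age"])} for r in rows if r["age"].isdigit()]

# Load: store the result somewhere convenient to query later
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clients (name TEXT, age INTEGER)")
con.executemany("INSERT INTO clients VALUES (:name, :age)", clean)
print(con.execute("SELECT COUNT(*) FROM clients").fetchone()[0])  # 2
```

In a real project each stage is usually much bigger (many sources, many cleaning rules, a proper database), but the shape stays the same: read, fix, store.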
By now we already know that we must treat the data before starting our project, and we associate ETL with these preliminary steps. But what do you need to change? Where do you need to look? How can you improve your dataset? Here are some tips:
I hope all of you already know a little about what categorizing or normalizing data means, but I am going to explain these techniques briefly anyway. Of course, this is just an introduction to ETL, and only the beginning.
Categorize data: Sometimes, when you are developing a neural net to classify something, for example whether a client is going to buy a product or not, you will almost always have information about them in plain text; it could be their city or whatever. As you well know, words or characters are invalid input values, so you must categorize them: assign a numerical value to each distinct value of the feature, e.g. New York = 1.
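A simple way to do this by hand is to number each category in order of first appearance (the city names below are made up for illustration):

```python
# Assign each distinct category a numeric code, in order of first appearance
cities = ["New York", "Boston", "New York", "Chicago"]

codes = {}      # category -> numeric value
encoded = []    # the feature column, now numeric
for city in cities:
    if city not in codes:
        codes[city] = len(codes) + 1  # e.g. New York = 1
    encoded.append(codes[city])

print(encoded)  # [1, 2, 1, 3]
```

Keep the `codes` dictionary around: you will need the same mapping to encode new data at prediction time.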
Feature study: In some cases you have a lot of variables describing your objective, but not every feature is always useful. Coming back to the previous example, you are building a model that will determine whether a person will buy something or not. You have a lot of features describing each individual in your database, but not all of them carry the same weight in determining whether the person will buy. Thus, it is good practice to study the variables of your dataset in order to keep the ones most important for inferring the final label.
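One cheap proxy for "how much weight does this feature carry" is its correlation with the label (this is only a first-pass screen, not a full feature-selection method, and the toy data below is entirely made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: two features and a buy / no-buy label
annual_income = [20, 55, 80, 30, 90]   # in thousands
shoe_size     = [42, 38, 44, 41, 39]
bought        = [0, 1, 1, 0, 1]

for name, feat in [("annual_income", annual_income), ("shoe_size", shoe_size)]:
    print(name, round(pearson(feat, bought), 2))
```

Here income correlates strongly with buying while shoe size barely does, so shoe size is a candidate to drop. Correlation only catches linear relationships; tree-based importances or mutual information catch more, but the idea is the same.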
Outliers and missing values: This part might have to be done before the feature study, but for now the order is not as important as the concept. It is common for datasets to have some missing values and some outliers, so if you want to avoid noise in your dataset and improve your results, you must clean them up. There are different ways to address the problem: one is to delete every row that has an outlier or a missing value; a second is to replace every outlier or missing value with the average or the most frequent value of the column; and, last but not least, there is the option of using machine learning to infer the missing values and outliers.
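The second strategy, replacing bad values with the column average, can be sketched like this (the age column and the "plausible range is 0 to 120" rule are assumptions for the example):

```python
# An age column with two missing entries and one obvious outlier (a typo)
ages = [25, None, 31, 27, 250, None, 29]

# Values we trust: present and within a plausible human age range
valid = [a for a in ages if a is not None and 0 <= a <= 120]
mean_age = sum(valid) / len(valid)  # 28.0

# Replace anything missing or implausible with the mean of the trusted values
cleaned = [a if a is not None and 0 <= a <= 120 else mean_age for a in ages]
print(cleaned)  # [25, 28.0, 31, 27, 28.0, 28.0, 29]
```

Note that the mean is computed only from the trusted values; otherwise the 250 outlier would drag the replacement value upward and contaminate the fix.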
Normalize: To finish the process (actually, this process can be much longer, but it is enough for this article), your machine learning model cannot work properly if you do not normalize the data. Normalizing means putting every variable on the same scale. Recovering the previous example, if your features are age and annual income, the algorithm cannot fit correctly because of the huge difference in magnitude between the variables. Thus, to avoid that, you need to put all the variables into the same range, e.g. (-1, 1).
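A basic way to get every feature into (-1, 1) is min-max scaling, a linear map from each column's observed range onto the target range (the age and income values below are made up):

```python
def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly map a list of numbers into the range [lo, hi]."""
    mn, mx = min(values), max(values)
    return [lo + (v - mn) * (hi - lo) / (mx - mn) for v in values]

# Age and annual income live on wildly different scales
ages   = [18, 35, 52, 69]
income = [15_000, 40_000, 90_000, 140_000]

print(minmax_scale(ages))    # endpoints map to -1.0 and 1.0
print(minmax_scale(income))  # same range, so neither feature dominates
```

One caveat: compute the min and max on the training data only, and reuse those same values to scale test or production data, or the scales will silently disagree.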
Hope this article was interesting and gave you some insight into what ETL and data preprocessing are. Those of you who are more familiar with ETL processes might miss one-hot encoding or some other usual techniques, but the purpose of the article was to introduce ETL rather than to specialize in it.
Source: Deep Learning on Medium