What fits you as a data scientist?


Discover your place and get the right direction

Photo by Monty Allen on Unsplash

Data science strives to understand the natural world, which is, by nature, very complicated. But how? By analyzing data, often a significant amount of it (so-called big data), trying to make sense of it and extract knowledge and experience, in order to make decisions and solve problems. For a better understanding of what working in data science and machine learning looks like, please check my introductory article about machine learning and artificial intelligence (link below).

As a data scientist, the first thing to know is the data lifecycle and the steps it is made of.

Data Collection

Nowadays, data collection is an easy task. It is the action of gathering data from various sources: web pages, news, social media, reports, graphs, tables, etc. All of these are sources of digital raw data, ready to be consumed by anyone interested.

Flows of data — from Giphy

In this field, a good data scientist develops an inherent curiosity about the world; being data-driven, he spends enormous amounts of time collecting data to answer the questions of interest. The required skills are:

  • thinking about what data are needed to solve the problem at hand
  • knowing how to collect data from various sources and how to combine them in a structured way
  • knowing some tools or applications for data collection and ETL (Extract, Transform and Load)
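As a tiny sketch of the "combine them in a structured way" point, here is how two hypothetical raw sources (a CSV export and a JSON feed; the column names and values are invented for illustration) could be joined into one table with pandas:

```python
import pandas as pd
from io import StringIO

# Two hypothetical raw sources: a CSV export and a small JSON feed.
csv_source = StringIO("user_id,visits\n1,10\n2,4\n3,7\n")
json_source = StringIO('[{"user_id": 1, "country": "IT"}, {"user_id": 2, "country": "US"}]')

visits = pd.read_csv(csv_source)
profiles = pd.read_json(json_source)

# Combine the sources into one structured table (a tiny "Transform" step of ETL).
combined = visits.merge(profiles, on="user_id", how="left")
print(combined)
```

The left join keeps every row of the main source; users with no matching profile simply get missing values, which the cleaning stage will deal with next.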

Data Cleaning

Once collected, most of the time, raw data are “messy”.

A very messy office with a bunch of documents and raw data to organize — Photo by Wonderlane on Unsplash

Data cleaning is a complex task, which involves:

  • detecting and correcting corrupt or inaccurate records caused by partial or missing data gathering
  • validating data and estimating missing values, based on information about the relevant phenomena and the problem at hand
  • enhancing data via harmonization and normalization
  • transforming data to obtain uniformity and comparability of values in the dataset
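A minimal sketch of two of these tasks, assuming pandas is available (the dataset and its columns are invented): estimating missing values with a simple column-mean imputation, then normalizing each column to the [0, 1] range for comparability.

```python
import pandas as pd

# A hypothetical messy dataset: missing values across two columns.
raw = pd.DataFrame({
    "temperature_c": [21.0, None, 19.5, 22.0],
    "humidity": [0.40, 0.55, None, 0.50],
})

# Estimate missing values with a simple column-mean imputation.
clean = raw.fillna(raw.mean(numeric_only=True))

# Normalize each column to the [0, 1] range for comparability.
normalized = (clean - clean.min()) / (clean.max() - clean.min())
print(normalized.round(2))
```

Mean imputation is only one (very naive) strategy; in practice the estimate should rely on knowledge of the phenomena behind the data, as noted above.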

Exploratory Data Analysis

Exploratory data analysis (EDA) is a collection of techniques for seeing what the data can tell us. In EDA, we use both mathematical models and common sense to grasp the significance of our data.

Graphical representation of data — Photo by Stephen Dawson on Unsplash

As data scientists, we must know what to expect from the data we collect, formulate hypotheses, and “fill the gaps” in the information we have.

There are many tools to help us:

  • Descriptive statistics: to obtain a representation of the data through tables, graphs, summary values, etc.
  • Inferential statistics: to use our collection of data, which is an incomplete representation of reality, to infer and make assumptions about the fundamental characteristics of the phenomena.
  • A deep understanding of the environment, that is to say, the context of the problem we try to solve with data science techniques.
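The first two tools can be sketched in a few lines of Python (the sample here is synthetic, drawn from a known distribution just for illustration): `describe()` gives the descriptive summary, and a normal-approximation confidence interval for the mean is a first taste of inference.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical sample: 500 daily sales values drawn from a normal distribution.
sales = pd.Series(rng.normal(loc=100, scale=15, size=500))

# Descriptive statistics: summarize the sample.
summary = sales.describe()
print(summary[["mean", "std", "min", "max"]])

# A simple inferential step: a 95% confidence interval for the mean,
# using the normal approximation (mean +/- 1.96 * standard error).
se = sales.std() / np.sqrt(len(sales))
ci = (sales.mean() - 1.96 * se, sales.mean() + 1.96 * se)
print(ci)
```

The interval quantifies how far the sample mean may sit from the true mean of the underlying phenomenon, which is exactly the "incomplete representation of reality" point above.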

It’s worth recalling that, in classical machine learning (ML), this phase of the data lifecycle is up to us, the data scientists. In deep learning (DL), and even more in reinforcement learning (RL), it is up to the model, the machine, to cope with that. In DL, during the training phase, the algorithm learns the characteristics of the data provided and adapts to them. In RL, the environment is, even more, an active part of the learning process.

Model Building

Model building is a fundamental part of the ML process. Once we create a model, we can train a machine to learn patterns in our data (the training set) in order to predict unknown or future data.

Be creative … it’s time to model building! — Photo by Jo Szczepanska on Unsplash

In the model building, we try to predict outcomes from the analysis.

Again, some skills are essential here:

  • applying the right learning schema to our data to solve a specific problem (regression, classification, association, clustering, etc.)
  • testing and evaluating the results of model training via defined metrics that measure performance
  • combining multiple techniques and models to get better results in terms of prediction, robustness of the model, etc. (ensemble modeling)
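All three points fit in a short scikit-learn sketch (using the classic Iris toy dataset rather than any real project data): pick a learning schema (classification), evaluate with a defined metric (accuracy on a held-out test set), and use an ensemble model (a random forest is itself an ensemble of decision trees).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a classic toy dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A random forest is a small ensemble of decision trees (ensemble modeling).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Evaluate on held-out data with a defined metric.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Keeping the test set untouched during training is what makes the reported metric an honest estimate of performance on unknown data.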

Model Deployment

When our model is ready and we get good results on the training and test sets (the training and evaluation stage), it’s time to put it into production. This is the final stage, when we get results from data for business, study, research, or maybe just for fun!

Time to get results — Photo by NeONBRAND on Unsplash

We need to know:

  • how to deploy our model on various ready-to-use, state-of-the-art frameworks. Think, for example, of Python tools and libraries such as NumPy, pandas and scikit-learn (ML), TensorFlow, Keras and PyTorch (DL), OpenAI tools for RL, etc.
  • how to get results into the production environment, for optimization, anomaly detection, automation, prediction, etc.
  • how to summarize results for stakeholders
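The simplest form of deployment is serializing the trained model as an artifact that a production service loads to serve predictions. A minimal sketch with scikit-learn and joblib (the file name and the logistic-regression model are just illustrative choices):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model (stand-in for the real training pipeline).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model so a production service can load it later.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# In production: load the artifact and serve predictions on new data.
served = joblib.load(path)
prediction = served.predict(X[:1])
print(prediction)
```

Real deployments add versioning, input validation and monitoring around this core, but the artifact-dump-and-load cycle stays the same.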

So, what’s next?

So far, we talked about the process. But who is needed at every step? Let’s clarify the concepts and the roles involved.

First of all, let’s summarize the whole process with a picture

roles and data pipeline — by the Author

As you can see, a data scientist is expected to do everything from data collection to model deployment; he must be aware of the real problems, and he has to know many techniques for every stage of the process. So, the required skills are:

  • a grasp of SQL and other methods of querying datasets
  • a deep understanding of algebra, statistics and set theory for useful data modeling techniques
  • knowledge of Python, R, Java, C++ or other languages for data cleaning, data manipulation, EDA and visualization
  • the ability to select or combine modeling techniques suitable for solving problems based on the data and expected results
  • knowing how to integrate the data pipeline into a production environment with methods for visualizing and presenting the results, in the form of a web application, reports, commands to machines, etc.
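On the SQL point, even Python’s standard library is enough to practice the basics. A tiny sketch with `sqlite3` and an invented in-memory table:

```python
import sqlite3

# A hypothetical in-memory table to illustrate basic SQL querying.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)],
)

# Aggregate revenue per customer: the kind of query that precedes any modeling.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 80.0), ('bob', 20.0)]
```

The same `GROUP BY` thinking transfers directly to any production database or to pandas’ `groupby`.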

A data engineer is more focused on data collection and data cleaning. In this position, we must be experts in database and data query techniques, besides being able to handle the ETL (Extract, Transform and Load) of data from various sources. Then we must know how to clean data, deal with null or inconsistent values, and apply many more techniques to build a strong foundation of sources for the ML models that follow.

A data analyst works hard on data cleaning and EDA. He masters statistics, both descriptive and inferential, and is always trying to squeeze every single bit of information from the data. His role is crucial for data modeling and for building reliable models that can actually capture the behavior of the environment. In the data science path to knowledge, in my opinion, this can be the first step, from which we can then explore the other phases of the process.

A machine learning engineer knows how to get the most out of the data, based on various techniques and ML algorithms; he masters ML models, hyperparameter optimization, evaluation and metrics, and stays at the cutting edge of research in the field. Besides that, he also knows how to scale and deploy models into production systems.