
DATA ANALYTICS PIPELINE

Guide to Solution Implementation

Machine Learning Modelling Simplified


Solution Implementation is an iterative process which involves:

• Data Collection
• Data Exploration
• Data Handling
• Model building on a test data set
• Initial model iterations
• Validation of the model

Phase 1: Data Collection

If you need hot water, don’t boil the ocean. By the problem statement stage, you should already have determined what sort of data you will need, so do not try to work with the entire database from the start.

On the other hand, you may sometimes find that you need additional sources of data.

Phase 2: Data Exploration

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

It is the process of uncovering valuable insights from large datasets, often with the assistance of advanced statistical analysis and visualization. It involves running descriptive statistics of variables and checking for correlations.

Why EDA?

  • Detecting mistakes
  • Checking assumptions
  • Preliminary selection of appropriate models
  • Determining relationships among the explanatory variables
  • Assessing the direction and rough size of relationships between explanatory and outcome variables
  • Identifying data types
  • Checking for missing values
  • Grouping of data

Types of EDA

Univariate EDA

Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable. Since it’s a single variable it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it. It is of two types:

Non-graphical Methods: They involve just the calculation of summary statistics.

Graphical Methods: They summarize the data in a diagrammatic/pictorial way.
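For instance, a minimal pandas sketch of both methods, assuming a hypothetical numeric column called age, might look like this:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical single-variable data; substitute your own column.
    df = pd.DataFrame({"age": [23, 25, 31, 35, 35, 40, 41, 47, 52, 60]})

    # Non-graphical univariate EDA: summary statistics of one variable.
    print(df["age"].describe())   # count, mean, std, min, quartiles, max
    print(df["age"].skew())       # quick check of asymmetry

    # Graphical univariate EDA: distribution of the same variable.
    df["age"].plot(kind="hist", bins=5, title="Distribution of age")
    plt.xlabel("age")
    plt.show()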

Multivariate EDA

Multivariate data analysis refers to any statistical technique used to analyze data that arises from more than
one variable. This essentially models reality where each situation, product, or decision involves more than a
single variable. Usually the multivariate EDA will be bivariate (looking at exactly two variables), but
occasionally it will involve three or more variables. It is of two types:

Non-graphical Methods: They involve just the calculation of summary statistics.

Graphical Methods: They summarize the data in a diagrammatic/pictorial way.

Note: It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the actual multivariate EDA.
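A comparable multivariate (here bivariate) sketch, again with invented column names, computes a correlation matrix for the non-graphical side and a scatter plot for the graphical side:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical data with more than one variable.
    df = pd.DataFrame({
        "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
        "exam_score": [52, 55, 61, 64, 70, 74, 79, 85],
    })

    # Non-graphical multivariate EDA: pairwise correlations.
    print(df.corr())

    # Graphical multivariate EDA: scatter plot of two variables.
    df.plot(kind="scatter", x="hours_studied", y="exam_score",
            title="Exam score vs. hours studied")
    plt.show()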

Phase 3: Data Handling

As you conduct the EDA, you will notice the parts of the data that need to be ‘cleaned’ to ensure your analysis goes smoothly. So the data handling phase goes hand in hand with the EDA phase.

Step 1: Clean Data


1) Check and convert data types: data types may come in the wrong format, which makes analysis difficult. For example, a column that is meant to hold numeric data may be stored as strings.
2) Drop irrelevant columns, that is, columns which do not contribute to the analysis.
3) Use the .duplicated() function to detect duplicated rows and drop them.
4) Fix inconsistent data entries: inconsistent entries are representations of the same value in different ways, for example due to white space, different letter cases, or punctuation marks.
5) String manipulation: string manipulation is an essential way of obtaining numeric data from strings.
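To make the steps above concrete, here is a rough pandas sketch on a small made-up table; the column names (price, city, weight, notes) are purely illustrative:

    import pandas as pd

    # Hypothetical raw data illustrating the issues listed above.
    df = pd.DataFrame({
        "price": ["100", "250", "250", "100"],                    # numbers stored as strings
        "city": ["Nairobi", "nairobi ", "Nairobi", " NAIROBI"],   # inconsistent entries
        "weight": ["12 kg", "7 kg", "7 kg", "12 kg"],             # numbers buried in strings
        "notes": ["ok", "ok", "ok", "ok"],                        # irrelevant column
    })

    # 1) Check and convert data types.
    print(df.dtypes)
    df["price"] = pd.to_numeric(df["price"])

    # 2) Drop irrelevant columns.
    df = df.drop(columns=["notes"])

    # 4) Fix inconsistent entries: strip white space and normalise letter case.
    df["city"] = df["city"].str.strip().str.title()

    # 5) String manipulation: pull the numeric part out of a string column.
    df["weight_kg"] = df["weight"].str.extract(r"(\d+)", expand=False).astype(int)

    # 3) Detect and drop duplicated rows.
    print(df.duplicated())
    df = df.drop_duplicates()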

Step 2: Handle missing data

1) Drop rows or columns with missing (NaN) values.
2) Fill in missing (NaN) values manually.
3) Imputation is a method of filling in the missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values.
The types of imputation are:
• Mean / median / mode imputation: one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable.
• Substitution: the missing value is replaced with a value from a different source, for example a unit that was not included in the sample.
• Hot deck: a randomly chosen value from an individual in the sample who has similar values on other variables.
• Cold deck: a systematically chosen value from an individual who has similar values on other variables.
• Regression imputation: the predicted value obtained by regressing the missing variable on other variables.
• Stochastic regression imputation: the predicted value from a regression plus a random residual value.
• Interpolation and extrapolation: an estimated value from other observations from the same individual; this usually only works with longitudinal data.
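A brief sketch of the simpler options, dropping rows and mean/mode imputation, using pandas and scikit-learn; the column names are invented for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({
        "income": [3000, np.nan, 4500, 5200, np.nan],
        "gender": ["F", "M", np.nan, "F", "M"],
    })

    # 1) Drop rows (or columns, with axis=1) that contain NaN values.
    dropped = df.dropna()

    # 3) Mean imputation for a quantitative attribute.
    df["income"] = df["income"].fillna(df["income"].mean())

    # 3) Mode imputation for a qualitative attribute, via scikit-learn.
    mode_imputer = SimpleImputer(strategy="most_frequent")
    df[["gender"]] = mode_imputer.fit_transform(df[["gender"]])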

Step 3: Data partitioning

Split the data, ideally in the ratio of 60:20:20, into:
• Train
• Test
• Validate
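With scikit-learn, one way to obtain a 60:20:20 split is to call train_test_split twice; X and y below are placeholder data standing in for your features and target:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data; substitute your own features and target.
    X = np.arange(100).reshape(-1, 1)
    y = np.arange(100)

    # First split off 60% for training, leaving 40% behind.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.6, random_state=42)

    # Then split the remaining 40% in half: 20% test and 20% validation.
    X_test, X_val, y_test, y_val = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)

    print(len(X_train), len(X_test), len(X_val))  # 60 20 20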

Step 4: Handling Outliers
An outlier is a data point that differs significantly from other observations, either because of variability in the measurement or because it indicates experimental error.
1) Delete the outlier values.
2) Transform the variable to reduce the influence of outliers; for example, taking the natural log of a value reduces the variation caused by extreme values.
3) Use imputation only on artificial outliers; mean / median / mode imputation is one of the most frequently used methods.
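One common (though not the only) way to flag outliers is the 1.5 × IQR rule; the sketch below also shows the log transform from point 2, using a made-up salary column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"salary": [30000, 32000, 35000, 38000, 40000, 42000, 500000]})

    # Flag values that fall more than 1.5 * IQR outside the quartiles.
    q1, q3 = df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
    print(df[mask])  # the 500000 row is flagged

    # 1) Delete the outlier rows.
    trimmed = df[~mask]

    # 2) Transform the variable: the natural log compresses extreme values.
    df["log_salary"] = np.log(df["salary"])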

Phase 4: Data Mining

Data mining is the process of turning raw data into useful information. By looking for patterns in large batches of data, you can learn a lot of valuable information. A popular way to structure data mining work is the CRISP-DM process (Cross-Industry Standard Process for Data Mining).

Step 1: Business Understanding
Understanding the project objectives and requirements, then converting this knowledge into a data mining
problem definition and a preliminary plan. This has already been covered in the ‘Problem Statement
Definition’ stage.

Step 2: Data Understanding
Starts with an initial data collection and proceeds with activities in order to get familiar with the data. This is
covered in the ‘Data Handling Phase’.

Step 3: Data Preparation
The data preparation phase covers the preparation of data to construct the final dataset from the initial raw
data. This is covered in the ‘Data Handling Phase’.

Step 4: Modeling
This is where we select the sort of deep learning or machine learning model that would best suit our needs.
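For example, a first modelling pass might be as simple as fitting a baseline scikit-learn classifier to the training split; the public toy dataset and the choice of logistic regression here are only illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # A public toy dataset stands in for your prepared data.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit a simple baseline model on the training set.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("Training accuracy:", model.score(X_train, y_train))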

Step 5: Evaluation
Once the models have been built, they need to be tested to ensure that they generalize to unseen data and are neither underfitted nor overfitted.

Underfitting:
Underfitting is a modeling error which occurs when a function does not fit the data points well enough. It is typically the result of a model that is too simple or that has not been trained enough. An underfitted model is inaccurate because the trend it captures does not reflect the reality of the data.

How to Overcome Underfitting:
• Get more training data.
• Increase the size or number of parameters in the model.
• Increase the complexity of the model or use a more flexible model type.
• Increase the training time until the cost function of the model is minimised.

Overfitting:
Overfitting is a modeling error which occurs when a function is fit too closely to a limited set of data points. It is typically the result of an overly complex model trained on too few data points. An overfitted model is inaccurate because it has effectively memorized the existing data points, including their noise, rather than learning the underlying pattern.

How to Overcome Overfitting:
• Cross-validation
Split your dataset into ‘train’ and ‘test’ folds, build the model on the ‘train’ folds and evaluate it on the held-out ‘test’ fold, rotating the held-out fold across the dataset. Because you know the expected output for the held-out data, you can easily judge how well your model generalizes (see the sketch after this list).
• Regularization
This is a family of techniques that shrinks the coefficient estimates towards zero, which discourages learning an overly complex model.
• Early stopping
When training a learner with an iterative method, stop the training process early, before the model starts memorizing the training set (for example, when the validation error stops improving).
• Pruning
This technique applies to decision trees.
Pre-pruning: Stop ‘growing’ the tree before it perfectly classifies the training set.
Post-pruning: Allow the tree to grow until it perfectly classifies the training set, and then prune it back.
• Dropout
This is a technique where randomly selected neurons are ignored during training.
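As a rough sketch of the first two remedies, the snippet below scores a ridge (L2-regularized) model with 5-fold cross-validation; the toy dataset and the alpha value are placeholders:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Regularization: Ridge shrinks coefficient estimates towards zero.
    model = Ridge(alpha=1.0)

    # Cross-validation: train on 4 folds, validate on the held-out fold, rotate.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print("R^2 per fold:", scores)
    print("Mean R^2:", scores.mean())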

Step 6: Deployment
This means deploying a code representation of the model into an operational system to score or categorize new, unseen data as it arises, and to create a mechanism for using that new information in the solution of the original problem.

Phase 5: Prototyping

Step 1: Begin by creating a basic version of each of the selected models and compare their performance.
Step 2: Now that you have narrowed the field down to a few top-performing models, fine-tune them and compare again.
Step 3: Repeat Step 2 until you have created the desired model.
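Fine-tuning in Step 2 is often done with a hyperparameter grid search; the model and parameter grid below are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fine-tune one of the shortlisted models over a small hyperparameter grid.
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [3, None]},
        cv=5,
    )
    grid.fit(X_train, y_train)

    print("Best parameters:", grid.best_params_)
    print("Cross-validated accuracy:", grid.best_score_)
    print("Test accuracy:", grid.score(X_test, y_test))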

Storytelling

The final step in a data science research project is to communicate the findings to the relevant stakeholders. This is the point where the data scientist or research team needs to communicate the actions that should be taken based on their findings, consolidating them in a report or presentation. Ideally, every research project ends in a deeper understanding that justifies the time invested in the research.
Step 1: Focus on explanatory analysis over exploratory analysis. Explanatory analysis presents an important
finding or recommendation, then explains the process that was taken to get there. Findings that are merely
interesting and not useful are saved for in-depth descriptions of the project, or not included at all.
Step 2: Select the graph that best represents your findings.
Start a visualization by writing out what needs to be communicated, then create exactly that. It is often tempting to create a set of charts and graphs first and only then pull insights and craft a story around what has been created, but leading with the message keeps the visual focused.

What is Color theory?

Color theory encompasses a multitude of definitions, concepts and design applications. However, there are three basic categories of color theory that are logical and useful: the color wheel, color harmony, and the context of how colors are used. Color theories create a logical structure for color.

How to Use Color Theory for Graphs

• Use branded colors for marketing materials or presentations. Using the company’s color scheme
helps you align with your brand and keeps your messaging consistent. It also helps with brand
recognition.
• Gradient colors can be great for showing a pattern. Consider showing your most important values with bars and using color only to show categories.
• If you need more than seven colors in a chart, consider using another chart type or grouping categories together.
• Consider using the same color for the same variables. If you are making a series of charts that
involve the same variable, keep the color for each variable consistent in all the charts.
• Make sure to explain to readers what your colors encode. Every element of your graph should be
explained: What does the height of the bar mean? What does the size of the markers on a map
represent?

• Using grey for less important elements in your chart makes your highlight colors (which should be reserved for your most important data points) stand out even more. Grey is also helpful for general context data and less important annotations.
• Make sure your contrasts are high enough. In addition to having a high contrast ratio, avoid
complementary hues (e.g. red and green, orange and blue) and bright colors for backgrounds.
• Semantic color association. When choosing a color palette, consider the meaning colors carry in the culture of your target audience. If possible, use colors that readers will already associate with your data, e.g. red for danger.
• Use light colors for low values and dark colors for high values. When using color gradients, make sure that the light colors represent low values and the dark colors represent high values.
• Don’t use a gradient color palette for categories, or the other way round. Viewers associate dark colors with “more/high” and light colors with “less/low”, so such a palette will imply a ranking of your categories. If the chart is too colorful, consider another chart type for your data.
• Consider using two hues for a gradient, not just one.
• Consider using diverging color gradients. If you want to emphasize how a variable diverges from a baseline, you may want to consider using a diverging palette.
• Using different lightnesses in your gradients and color palettes has the big advantage that readers
with a color vision deficiency will still be able to distinguish your colors.

Avoid using too much color in your visuals.
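As a small illustration of the ‘grey for context, one highlight color’ advice above, here is a matplotlib sketch with made-up categories and values:

    import matplotlib.pyplot as plt

    # Made-up data: one category we want the reader to focus on.
    categories = ["North", "South", "East", "West"]
    sales = [120, 340, 180, 150]

    # Grey for context, a single highlight color for the key data point.
    colors = ["lightgrey" if c != "South" else "tab:blue" for c in categories]

    plt.bar(categories, sales, color=colors)
    plt.title("South leads regional sales")  # the title states the finding
    plt.ylabel("Sales (units)")
    plt.show()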