Back to Basics: the pipeline

The ‘Back to Basics’ series is aimed at people who are just starting to learn machine learning. It covers a wide range of topics related to machine learning. The complexity level will vary, since some of the topics cover mathematical and statistical underpinnings. The motivation for the series came after spending a fair share of late nights debugging and troubleshooting deep neural networks. It seemed time to take a breather, revisit the basics and shore them up.

As the series takes shape, the hope is to gain deeper insight and create an impetus to pursue new concepts and research in this fast-changing field. This first part lays the groundwork through a scenario analysis that evaluates various aspects of a system that can learn, generalize and predict.

Define the objective

A clear problem statement and an understanding of how the solution will be used indicate the scope of analysis needed and the type of potential solutions. Problem types include optimization, automation, control, and prediction. Solutions can be classified into pattern discovery (unsupervised learning), predictive analytics (supervised learning), or some form of decision-making as seen in reinforcement learning applications.

Not all projects are data science projects, and not all data science projects require machine learning. Machine learning relies heavily on probabilistic reasoning and is not an exact science. Hence, machine learning may not be the right choice in all cases.

If there is one thing that can shut down an ML project faster than a disillusioned sponsor, it is the lack of lots (and lots) of data. Data could be unavailable, inaccurate, incomplete or misleading. In any case, the existence of relevant data should be verified and appropriate data sources identified, along with a storage/sync strategy. It is very important to have a well-defined hypothesis early on to ensure that the model is grounded in solid reasoning rather than in undirected data mining.

Good visualization tools for exploratory data analysis can help demystify complexity while improving insight into the problem domain. This enables a better understanding of the structure of the data and its underlying patterns, along with feature differences, contrasts and correlations. More often than not, data will be multi-dimensional, increasing the complexity level by a couple of notches. Finally, governance processes to maintain data integrity and model currency should not be neglected for too long.

A typical ML pipeline

A typical ML pipeline takes the form of data pre-processing -> training -> verifying performance on out-of-sample data. These steps are typically automated, which helps avoid information leakage between the training data and the data used for evaluation. Specific actions performed here are described in a model definition, which serves as the blueprint of the entire system.
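
The three stages can be sketched as follows. This is a minimal illustration that assumes scikit-learn; the synthetic dataset, the scaler and the logistic regression model are placeholders chosen for the example, not choices the article prescribes. Wrapping pre-processing and the model in a single `Pipeline` means the scaler is fitted on the training split only, which is one way to keep held-out data from leaking into training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real project data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out data that the pipeline never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Pre-processing and the model are fitted together, on training data only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Verify performance on out-of-sample data.
test_accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Calling `pipeline.predict` on new data reapplies the scaling statistics learned from the training split before the model sees the data.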

The rest of the article evaluates some of the design considerations when creating the model. The approach used here can be easily abstracted to create a repeatable, consistent requirements gathering framework that can be applied to any machine learning problem.

Model Definition

A well-thought-out feature identification strategy and an algorithm selection strategy form the pillars of a robust and scalable machine learning model.

Feature Engineering: How well a model generalizes is highly correlated with the predictive strength of its underlying features. The process of identifying features is both complex and time-consuming. Libraries exist that generate scores for feature set optimization and/or separate valid signals from noise. Deep learning provides some capabilities in this regard as well. However, for the most part, feature engineering is a crucial activity and requires some domain expertise to be effective. Below are some design choices to evaluate.

  • Feature encoding: Do features need to be categorized or abstracted for a better representation of the underlying distribution? Are dummy variables needed?
  • Feature interaction: Is there any interaction (positive or negative) between features? Combining or removing features might increase accuracy.
  • PCA: Are features highly correlated or clustered together? Will the model benefit from dimensionality reduction or feature extraction?
  • Noise: Is there noise in the data due to irrelevant features? Will the model benefit from discarding or filtering any features?
  • Size: As the number of features grows, memory and computing power requirements increase. A balance is needed between speed and accuracy.
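
Two of the checks above, dummy-variable encoding and spotting highly correlated features, can be sketched in plain Python. The feature names and values are hypothetical and invented for this example; a real project would more likely use a library such as pandas or scikit-learn for both steps.

```python
import math

def one_hot(values):
    """Map each categorical value to a list of 0/1 dummy variables."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy features, invented for this example.
color = ["red", "green", "blue", "green"]
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]  # same signal, different unit

categories, dummies = one_hot(color)
print(categories)   # ['blue', 'green', 'red']
print(dummies[0])   # [0, 0, 1]

# A correlation near 1 flags a redundant feature that could be dropped
# or merged through dimensionality reduction.
r = pearson(height_cm, height_in)
print(round(r, 6))
```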

Algorithm Selection: Algorithms can be categorized into three major learning types: Supervised (to predict given some input), Unsupervised (to understand underlying patterns) and Reinforcement learning (to select the best course of action given a payoff structure).

  • Algorithm selection: What is the problem type? How will the final product be used by the end-users? Select an appropriate algorithm based on the problem type and usage: Supervised (OLS regression, SVM, decision trees, ensemble methods, Naive Bayes, etc.), Unsupervised (K-means, etc.), Reinforcement (Q-learning, etc.).
  • Neural networks: Will a neural network perform better than a traditional one-shot algorithm? NNs are most commonly used for supervised learning, although this is changing. NNs are very good at separating a non-linear problem space. For a non-linear problem space, either use an appropriate non-linear kernel in a traditional ML algorithm or use a neural network.
  • Hyper-parameter tuning: What parameters need to be fine-tuned? This varies significantly by algorithm. The search itself can be a brute-force optimization or semi-automated with off-the-shelf libraries such as scikit-learn.
  • Model performance metrics: How will the model be evaluated for performance? Use the appropriate metric for the learning type: Supervised (precision, accuracy, recall, confusion matrix, etc.), Unsupervised (Silhouette coefficient, etc.).
  • High Variance: What is the over-fitting strategy? Use an ensemble method such as bagging or boosting, or add more training data. Alternatively, select a model that best fits the true regularities or potentially combine multiple models.
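
Hyper-parameter tuning and metric selection can be sketched with scikit-learn, the library named above. The synthetic dataset, the random forest (a bagging-style ensemble) and the grid values are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate hyper-parameter values (arbitrary choices for this sketch).
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

# Exhaustive search with 3-fold cross-validation on the training split.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on held-out data.
y_pred = search.predict(X_test)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print("best parameters:", search.best_params_)
print("precision:", round(precision, 3), "recall:", round(recall, 3))
print(matrix)
```

The grid search only ever sees the training split, so the precision, recall and confusion matrix reported at the end remain an out-of-sample check.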

Neural networks add a layer of sophistication to increase the predictive strength of the model. Over the past few years, deep learning networks (NNs with more than one hidden layer) have been used quite successfully to improve accuracy substantially. Based on their architecture, NNs can be further classified as MLP (feed-forward multi-layer perceptron), RNN (recurrent NN for sequence data such as time series), CNN (convolutional NN for image analysis), LSTM (long short-term memory NN, a specific type of RNN), etc.

The remaining design choices are applicable for neural networks only. Specific options that are unique to a type of NN such as a CNN or RNN are not covered here.

  • Network Architecture: What is the optimal network architecture? How many hidden layers? How many neurons in each layer? Will the model benefit from a shallow, deep or wide architecture?
  • Parameter initialization: What is the weight initialization strategy? What about the bias term? Starting with all weights at zero fails to break the symmetry between neurons, so they all learn the same thing. Options include small random numbers, Xavier initialization (also known as Glorot initialization), etc.
  • Activations: What activations are used for the hidden and the output layers? Options include ReLU, sigmoid, softmax, tanh, etc.
  • Loss function: Which cost / loss function makes the most sense? Options include mean squared error (MSE), cross-entropy, negative log-likelihood (maximum likelihood estimation), etc.
  • Optimization: Which cost optimization technique should be used? Options include stochastic gradient descent, an old workhorse that needs manual fine-tuning of the learning rate and decay rate, or one of the adaptive optimization techniques such as RMSProp and Adam.
  • Over-fitting: What is the over-fitting strategy? Options include lowering the number of layers and/or neurons, dropout, adding more data, early stopping and regularization.
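
To make a few of these choices concrete, here is a minimal pure-Python sketch of one forward pass through a tiny network, using Xavier uniform initialization, ReLU hidden activations, a softmax output and a cross-entropy loss. The layer sizes and the input vector are made up for illustration; a real project would normally rely on a deep learning framework instead.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Affine layer: output[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

rng = random.Random(0)
w_hidden = xavier_uniform(4, 8, rng)   # 4 inputs -> 8 hidden units
w_output = xavier_uniform(8, 3, rng)   # 8 hidden units -> 3 classes
b_hidden, b_output = [0.0] * 8, [0.0] * 3  # biases start at zero here

x = [0.5, -1.2, 3.0, 0.1]              # one made-up input example
hidden = relu(dense(x, w_hidden, b_hidden))
probs = softmax(dense(hidden, w_output, b_output))
loss = cross_entropy(probs, true_class=2)
print(probs, loss)
```

If `w_hidden` were all zeros instead, every hidden unit would produce the same output and receive the same gradient, which is the symmetry problem that random initialization avoids.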

Data pre-processing

Data pre-processing is a series of transformations that converts raw data into something that best satisfies the model specification. A majority of the rules applied here are derived from the model definition covered in previous sections. In addition, other areas to evaluate are:

  • Bad data: Are there missing values? What are the rules to impute values?
  • Skewed data: Did the exploratory data analysis identify outliers or skewed class distributions? Is there bias due to variations in the training data?
  • Normalization: Is normalization or standardization needed? This is typically required when features with differing scales are combined or data from different sources are combined.
  • Training data size: An adequate data size keeps the risk of over-fitting low. If there is not enough data, is upsampling or data augmentation possible?
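
As a rough illustration of the first and third items, the sketch below imputes missing values with the mean and then standardizes a single feature column. The column name and values are hypothetical, and mean imputation is only one of several possible rules. The statistics are learned from the training data alone and then reused on the test data, in line with the leakage concern raised earlier.

```python
from statistics import mean, pstdev

def fit_column(train_values):
    """Learn imputation and scaling statistics from training data only."""
    observed = [v for v in train_values if v is not None]
    center = mean(observed)
    spread = pstdev(observed) or 1.0  # guard against a constant column
    return center, spread

def transform_column(values, center, spread):
    """Fill missing entries with the training mean, then standardize."""
    filled = [center if v is None else v for v in values]
    return [(v - center) / spread for v in filled]

# Hypothetical feature column; None marks a missing value.
train_age = [20.0, None, 40.0, 60.0]
test_age = [None, 80.0]

center, spread = fit_column(train_age)
train_scaled = transform_column(train_age, center, spread)
test_scaled = transform_column(test_age, center, spread)

print(center)          # 40.0
print(train_scaled)
print(test_scaled)
```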

Training


The ability of a machine learning system to generalize depends upon how well the system is trained. Training involves feeding a pre-processed, sufficiently shuffled subset of the data into a machine learning system, such as a deep neural network, and generating a set of optimized parameters. These parameters are then used to answer questions or predict outcomes on data not seen before.

  • Data splits: What is the split between training, validation and test data? Does a k-fold cross-validation strategy make more sense?
  • Hyper-parameters: What is the optimal batch size and number of epochs?
  • Over-fitting: Is an early stopping strategy needed?
  • Custom tests: Is there a need to create custom tests to ensure that the model does not deviate too far from its hypothesis?
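
Two of these questions, k-fold splitting and early stopping, can be sketched in plain Python. The helper names `k_fold_indices` and `early_stopping_epoch` are hypothetical, and the validation losses are made-up numbers; libraries such as scikit-learn and the common deep learning frameworks provide their own versions of both ideas.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs covering every sample exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle before splitting
    folds = [indices[i::k] for i in range(k)]     # k near-equal folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(sorted(val_idx), len(train_idx))

# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.65, 0.59]
stop_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch)  # 4
```

Note that with a patience of 2 the run stops before reaching the lower loss at the final epoch, which is the usual trade-off when choosing a patience value.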

For the most part, a project will undergo multiple iterations to obtain a model with satisfactory predictive strength. During the iterations, feature sets may get refined, hyper-parameters fine-tuned, algorithms changed, etc. An accurate change history can be very helpful when reflecting on failed tests for particularly difficult problems. Finally, once the model is in production, it should be periodically evaluated, both to ensure a good fit as real-world data changes over time and to support ongoing optimization.


In an attempt to keep this to a readable size, only the most important design areas are covered here. A detailed analysis of each topic will be added and linked from this page as it becomes available. This page should be considered a launching pad for further research into the associated concepts. Finally, reader feedback is much appreciated in case of any errors or gaps.

Source: Deep Learning on Medium