Source: Deep Learning on Medium
By: Óscar D. Lara-Yejas
There’s a huge difference between the purely academic exercise of training Machine Learning (ML) models versus building end-to-end Data Science solutions to real enterprise problems. This article summarizes the lessons learned after two years of our team engaging with dozens of enterprise clients from different industries including manufacturing, financial services, retail, entertainment, and healthcare, among others. What are the most common ML problems faced by the enterprise? What is beyond training an ML model? How to address data preparation? How to scale to large datasets? Why is feature engineering so crucial? How to go from a model to a fully capable system in production? Do I need a Data Science platform if every single data science tool is available in the open source? These are some of the questions that will be addressed, exposing some challenges, pitfalls, and best practices through specific industry examples.
0. ML is not only training models
I’ve realized this is a pervasive misconception. When I interview aspiring data scientists, I usually ask:
“Say you’re given a dataset with certain characteristics with the goal of predicting a certain variable, what would you do?”
To my dismay, their answer is often something along these lines:
“I’ll split the dataset into training/testing, run Logistic Regression, Random Forest, SVM, Deep Learning, XGBoost,… (and a few more unheard-of algorithms), then compute precision, recall, F1-score,… (and a few more unheard-of metrics), to finally select the best model”.
But then, I ask them:
“Have you even taken a look at the data? What if you have missing values? What if you have wrong values/bad data? How do you map your categorical variables? How do you do feature engineering?”
In this article, I go over the seven steps required to succeed in creating an end-to-end machine learning system: data collection, data curation, data exploration, feature extraction, model training, evaluation, and deployment.
1. Gimme some data!
As data scientists, data are evidently our main resource. But sometimes, even getting the data can be challenging and it could take weeks or even months for the data science team to obtain the right data assets. Some of the challenges are:
- Access: most enterprise data are very sensitive, especially when dealing with government, healthcare, and financial industries. Non-disclosure agreements (NDAs) are standard procedure when it comes to sharing data assets.
- Data dispersion: it’s not uncommon to see cases where data are scattered across different units within the organization, requiring approvals from not one but different parties.
- Expertise: having access to the data is often not sufficient as there may be so many sources that only a subject matter expert (SME) would know how to navigate the data lake and provide the data science team with the right data assets. SMEs may also become a bottleneck for a data science project as they’re usually swamped with core enterprise operations.
- Privacy: obfuscation and anonymization have become research areas on their own and are imperative when dealing with sensitive data.
- Labels: having the ground truth or labels available is usually helpful, as it allows applying a wide range of supervised learning algorithms. Yet, in some cases, labeling the data may be too expensive or labels might be unavailable due to legal restrictions. Unsupervised methods such as clustering are useful in these situations.
- Data generators: an alternative when data or labels are not available is to simulate them. When implementing data generators, it is useful to have some information on the data schema, the probability distributions for numeric variables, and the category distributions for nominal ones. If the data are unstructured, Tumblr is a great source for labeled images while Twitter may be a great source for free text. Kaggle also offers a variety of datasets and solutions on a number of domains and industries.
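As a sketch, a simple data generator along these lines takes only a few lines of Python. The schema, distributions, and column names below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

def generate_customers(n, seed=0):
    """Simulate a customer dataset from an assumed schema: numeric
    columns follow a chosen probability distribution, nominal columns
    follow explicit category probabilities."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.normal(45, 12, n).clip(18, 90).round(),
        "income": rng.lognormal(mean=10.5, sigma=0.6, size=n).round(2),
        "state": rng.choice(["FL", "CA", "AZ"], size=n, p=[0.5, 0.3, 0.2]),
        "churned": rng.choice([0, 1], size=n, p=[0.9, 0.1]),  # simulated label
    })

df = generate_customers(1_000)
```

Even a crude generator like this lets the team build and test the full pipeline while waiting for the real data assets to be approved.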
2. Big data is often not so big
This is a controversial one, especially after all the hype from big data vendors in the past decade, emphasizing the need for scalability and performance. Nonetheless, we need to make a distinction between raw data (i.e., all the pieces that may or may not be relevant for the problem at hand) and a feature set (i.e., the input matrix to the ML algorithms). The process of going from the raw data to a feature set is called data preparation and it usually involves:
- Discarding invalid/incomplete/dirty data which, in our experience, could be up to half of the records.
- Aggregating one or more datasets, including operations such as joins and group aggregators.
- Feature selection/extraction, e.g., removing features that may be irrelevant such as unique IDs, and applying other dimensionality reduction techniques such as Principal Component Analysis (PCA).
- Using sparse data representation or feature hashers to reduce the memory footprint of datasets with many zero values.
After all the data preparation steps have been completed, the final feature set, which will be the input of the Machine Learning model, is usually much smaller than the raw data; it is not uncommon to see cases where in-memory frameworks such as R or scikit-learn are sufficient to train models. In cases where even the feature set is huge, big data tools such as Apache Spark come in handy, though they may have a more limited set of algorithms available.
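To illustrate the memory-footprint point above, here is a small sketch comparing a dense matrix with a compressed sparse row (CSR) representation, assuming a feature matrix that is roughly 99% zeros, as is common after one-hot encoding or feature hashing:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = rng.random((10_000, 500))
dense[dense < 0.99] = 0.0                 # keep only ~1% non-zero entries

csr = sparse.csr_matrix(dense)            # store only the non-zeros

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.1f} MB, sparse: {sparse_bytes / 1e6:.1f} MB")
```

On a matrix like this, the CSR form is well over an order of magnitude smaller, which can be the difference between a feature set fitting in memory or not.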
3. You dirty data!
Yes, as if I were telling you something you didn't already know, but I can't emphasize this enough: data are dirty. In most of our engagements, clients are proud and excited to talk about their data lakes, how beautiful those lakes are, and how many insights they can't wait to get out of them. So, as data scientists, our mental picture becomes a pristine, crystal-clear lake.
Nonetheless, when they actually share their data, it looks a lot more like a swamp.
This is where scalable frameworks such as Apache Spark are crucial as all of the data curation transformations will need to be performed on the entire raw data. A few typical curation tasks are:
- Outlier detection: a negative age, a floating point zipcode, or a credit score of zero are just a few examples of invalid data. Not correcting these values may introduce high bias when training the model.
- Missing/incorrect value imputation: the obvious way to address incorrect/missing values is to simply discard them. An alternative is imputation, i.e., replacing missing/incorrect values by the mean, median, or the mode of the corresponding attribute. Another option is interpolation, i.e., building a model to predict the attribute with missing values. Finally, domain knowledge may also be used for imputation. Say we’re dealing with patient data and there’s an attribute indicating whether a patient has had cancer. If such information is missing, one could look into the appointments dataset and find out whether the patient has had any appointments with an oncologist.
- Dummy-coding and feature hashing: these are useful to turn categorical data into numeric, especially for coefficient-based algorithms. Say there's an attribute state which indicates states of the USA (e.g., FL, CA, AZ). Mapping FL to 1, CA to 2, and AZ to 3 introduces a sense of order and magnitude, meaning AZ would be greater than FL and CA would be twice as big as FL. One-hot encoding (also called dummy-coding) addresses this issue by mapping a categorical column into multiple binary columns, one for each category value.
- Scaling: coefficient-based algorithms experience bias when features are in different scales. Say age is given in years within [0, 100], whereas salary is given in dollars within [0, 100,000]. The optimization algorithm may assign more weight to salary, just because it has a higher absolute magnitude. Consequently, normalization is usually advisable; common methods include z-scoring (also known as standardization), when the data are normal, and min-max feature scaling.
- Binning: mapping a real-valued column into different categories can be useful, for example, to turn a regression problem into a classification one. Say you’re interested in predicting arrival delay of flights in minutes. An alternative would be to predict whether the flight is going to be early, on time, or late, defining ranges for each category.
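The curation steps above can be sketched in pandas; the records and column names below are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw records: a missing age, a categorical state,
# and a real-valued flight delay to be binned.
df = pd.DataFrame({
    "age": [34, None, 58, 41],
    "salary": [48_000, 52_000, 95_000, 61_000],
    "state": ["FL", "CA", "AZ", "FL"],
    "delay_min": [-5, 3, 42, 120],
})

# Imputation: replace the missing age by the median of the column.
df["age"] = df["age"].fillna(df["age"].median())

# Dummy-coding: one binary column per category value, no fake ordering.
df = pd.get_dummies(df, columns=["state"])

# Min-max scaling: map age and salary into [0, 1] so neither dominates.
num = df[["age", "salary"]]
df[["age", "salary"]] = (num - num.min()) / (num.max() - num.min())

# Binning: turn the regression target into early/on-time/late classes.
df["delay_class"] = pd.cut(
    df["delay_min"],
    bins=[-float("inf"), 0, 15, float("inf")],
    labels=["early", "on_time", "late"],
)
```

In a real engagement these transformations would run in Apache Spark over the entire raw data; the logic, however, is the same.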
4. It’s all about Feature Engineering
In a nutshell, features are the characteristics from which the ML algorithm will learn. As it is expected, noisy or irrelevant features can affect the quality of the model so it is critical to have good features. A few strategies for feature engineering are:
- Define what you want to predict. What would each instance represent? A customer? A transaction? A patient? A ticket? Make sure each row of the feature set corresponds to one instance.
- Avoid unique IDs. Not only are they irrelevant in most cases, but they can lead to serious overfitting, especially when applying algorithms such as XGBoost.
- Use domain knowledge to derive new features which help measure success/failure. The number of hospital visits may be an indicator of patient risk; the total amount of foreign transactions in the past month may be an indicator of fraud; the ratio of the requested loan amount to the annual income may be an indicator of credit risk.
- Use Natural Language Processing techniques to derive features from unstructured free text. Some examples are LDA, TF-IDF, word2vec, and doc2vec.
- Use dimensionality reduction if there are a very large number of features, e.g., PCA and t-distributed Stochastic Neighbor Embedding (t-SNE).
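As a small sketch of the unique-ID and domain-knowledge points above, using hypothetical loan-application columns:

```python
import pandas as pd

# Hypothetical loan applications: one row per instance (applicant).
apps = pd.DataFrame({
    "applicant_id": [101, 102, 103],          # unique ID: irrelevant, drop it
    "loan_amount": [20_000, 150_000, 8_000],
    "annual_income": [40_000, 60_000, 80_000],
})

features = apps.drop(columns=["applicant_id"])  # avoid overfitting on IDs

# Domain knowledge: loan-to-income ratio as a credit-risk signal.
features["loan_to_income"] = features["loan_amount"] / features["annual_income"]
```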
5. Anomaly detection is everywhere
If I were to pick the single most common ML use case in the enterprise, it would be anomaly detection. Whether we're referring to fraud detection, manufacturing testing, customer churn, patient risk, customer delinquency, system crash prediction, etc., the question is always: can we find the needle in the haystack? This leads to our next topic, which relates to unbalanced datasets.
A few common algorithms for anomaly detection are:
- One-class classification algorithms such as one-class SVM.
- Confidence intervals.
- Classification using over-sampling and under-sampling.
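As a sketch of the first approach, a one-class SVM can be trained on (mostly) normal points only and then asked to flag anything that looks different; the data below are synthetic:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))    # "good" transactions
outliers = rng.normal(8, 1, size=(5, 2))    # a handful of anomalies

# nu upper-bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)

pred = clf.predict(outliers)                # -1 = anomaly, +1 = normal
```

The appeal in the enterprise setting is that no labeled anomalies are required at training time, only a sample of normal behavior.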
6. Data are often unbalanced
Say you have a dataset with labeled credit card transactions. 0.1% of those transactions turn out to be fraudulent whereas 99.9% of them are good normal transactions. If we create a model that says that there’s never fraud, guess what? The model will give a correct answer in 99.9% of the cases so its accuracy will be 99.9%! This common accuracy fallacy can be avoided by considering different metrics such as precision and recall. These are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
TP = total number of instances correctly predicted as positive
TN = total number of instances correctly predicted as negative
FP = total number of instances incorrectly predicted as positive
FN = total number of instances incorrectly predicted as negative
In a typical anomaly detection scenario, we’re after minimizing false negatives — e.g., ignoring a fraudulent transaction, not recognizing a defective chip, or diagnosing a sick patient as healthy — while not incurring a great amount of false positives.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Note precision penalizes FP while recall penalizes FN. A model that never predicts fraud will have zero recall and undefined precision. Conversely, a model that always predicts fraud will have 100% recall but a very low precision — due to a high number of false positives.
I strongly discourage the use of receiver operating characteristic (ROC) curves in anomaly detection. This is because the false positive rate (FPR) — which ROC curves rely on — is heavily biased by the number of negative instances in the dataset (i.e., FP+TN), leading to a potentially small FPR even when there’s a huge number of FP.
FPR = FP/(FP + TN)
Instead, the false discovery rate (FDR) is useful to get a better understanding of the impact of FPs in an anomaly detection model:
FDR = 1 – Precision = FP/(TP+FP)
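Putting the definitions above together with made-up fraud counts shows how the FPR can look tiny even while the FDR reveals that a third of all alerts are false:

```python
def precision_recall_fdr(tp, tn, fp, fn):
    """Precision, recall, and false discovery rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fdr = fp / (tp + fp)        # FDR = 1 - precision
    return precision, recall, fdr

# E.g., 80 frauds caught, 20 missed, 40 false alarms, 9,860 good ones.
p, r, fdr = precision_recall_fdr(tp=80, tn=9_860, fp=40, fn=20)
fpr = 40 / (40 + 9_860)
# p ≈ 0.667, r = 0.8, fdr ≈ 0.333, yet fpr ≈ 0.004 looks deceptively good
```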
7. Don’t predict. Just tell me why!
I have come across several projects where the goal is not to create a model to make predictions in real time but rather to explain a hypothesis or analyze which factors explain a certain behavior. This is to be taken with a grain of salt, given that most machine learning algorithms are based upon correlation, not causation. Some examples are:
- Which factors make a patient fall into high risk?
- Which drug has the highest impact on blood test results?
- Which insurance plan parameter values maximize profit?
- Which characteristics of a customer make them more prone to delinquency?
- What’s the profile of a churner?
One way to approach these questions is by calculating feature importance, which is given by algorithms such as Random Forests, Decision Trees, and XGBoost. Furthermore, algorithms such as LIME or SHAP are helpful to explain models and predictions, even if they come from Neural Networks or other “black-box” models.
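As a sketch of feature importance on synthetic data, with a hypothetical "hospital visits" signal against pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1_000
visits = rng.poisson(3, n)                   # hypothetical risk driver
noise = rng.normal(size=n)                   # irrelevant feature
y = (visits + rng.normal(0, 0.5, n) > 4).astype(int)  # risk depends on visits

X = np.column_stack([visits, noise])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = dict(zip(["visits", "noise"], model.feature_importances_))
```

The forest correctly assigns far more importance to the visits feature, which is exactly the kind of "tell me why" answer these projects are after.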
8. Tune your hyper-parameters
Machine Learning algorithms have both parameters and hyper-parameters. They differ in that the former are directly estimated by the algorithm — e.g., the coefficients of a regression or the weights of the neural network — whereas the latter are not and need to be set by the user — e.g., the number of trees in a random forest, the regularization method in a neural network, or the kernel function of a support vector machine (SVM) classifier.
Setting the right hyper-parameter values for your ML model can make a huge difference. For instance, a linear kernel for an SVM won't be able to classify data that are not linearly separable. A tree-based classifier may overfit if the maximum depth or the number of splits is set too high, or it may underfit if the maximum number of features is set too low.
Finding the optimal values for hyper-parameters is a very complex optimization problem. A few tips are:
- Understand the priorities for hyper-parameters. In a random forest, the number of trees and the max depth may be the most relevant ones whereas for deep learning, the learning rate and the number of layers might be prioritized.
- Use a search strategy: grid search or random search. The latter is preferred.
- Use cross validation: set a separate testing set, split the remaining data into k folds and iterate k times using each fold for validation (i.e., to tune hyper-parameters) and the remaining for training. Finally, compute average quality metrics over all folds.
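The tips above combine naturally in scikit-learn: hold out a separate test set, then let a random search tune hyper-parameters by k-fold cross validation on the rest (the search space below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# A separate testing set is set aside; tuning only sees the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [3, 5, None]},
    n_iter=5, cv=5, random_state=0,      # 5-fold cross validation
)
search.fit(X_train, y_train)
```

The held-out X_test/y_test are touched only once, at the very end, to report the final quality of search.best_estimator_.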
9. Deep Learning: a panacea?
During the past few years, deep learning has been an immense focus of research and industry development. Frameworks such as TensorFlow, Keras, and Caffe now enable rapid implementation of complex neural networks through a high level API. Applications are countless, including computer vision, chatbots, self-driving cars, machine translation, and even games — beating both the top Go human player and the top chess computer in the world!
One of the main premises behind deep learning is its ability to continue learning as the amount of data increases, which is especially useful in the era of big data. This, combined with recent developments in hardware (i.e., GPUs), allows the execution of large deep learning jobs that used to be prohibitive due to resource limitations.
So… does this mean that DL is always the way to go for any Machine Learning problem? Not really. Here’s why:
The results of a neural network model are very dependent on the architecture and the hyper-parameters of the network. In most cases, you’ll need some expertise on network architectures to correctly tune the model. There’s also a significant trial-and-error component in this regard.
As we saw earlier, a number of use-cases require not only predicting but explaining the reason behind a prediction: why was a loan denied? Or why was an insurance policy price increased? While tree-based and coefficient-based algorithms directly allow for explainability, this is not the case with neural networks. In this article some techniques are presented to interpret deep learning models.
In our experience, for most structured datasets, the quality of neural network models is not necessarily better than that of Random Forests and XGBoost. Where DL excels is actually when there’s unstructured data involved, i.e., images, text, or audio. The bottom line: don’t use a shotgun to kill a fly. ML algorithms such as Random Forest and XGBoost are sufficient for most structured supervised problems, being also simpler to tune, run, and explain. Let DL speak for itself in unstructured data problems or for reinforcement learning.
10. Don’t let the data leak
While working on a project to predict arrival delay of flights, I realized my model suddenly reached 99% accuracy when I used all the features available in the dataset. I sadly realized I was using the departure delay as a predictor for the arrival delay. This is a typical example of data leakage, which occurs when any of the features used to create the model will be unavailable or unknown at prediction time. Watch out, folks!
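A minimal guard against this kind of leakage is to maintain an explicit list of features that won't exist at prediction time and drop them before training; the flight columns below are hypothetical:

```python
import pandas as pd

flights = pd.DataFrame({
    "distance": [500, 1_200, 300],
    "dep_delay_min": [10, 45, 0],     # only known after departure
    "arr_delay_min": [12, 50, -3],    # the target
})

# Features unknown or unavailable at prediction time must be excluded.
LEAKY = ["dep_delay_min"]
X = flights.drop(columns=LEAKY + ["arr_delay_min"])
y = flights["arr_delay_min"]
```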
11. Open source gives me everything. Why do I need a platform?
It has never been easier to build a machine learning model. A few lines of R or Python code will suffice, and there are plenty of resources and tutorials online, even for training a complex neural network. Now, for data preparation, Apache Spark can be really useful, even scaling to large datasets. Finally, tools like Docker and plumber ease the deployment of machine learning models through HTTP requests. So it looks like one could build an end-to-end ML system purely using the open source stack. Right?
Well, this may be true for building proofs of concept. A graduate student working on their dissertation would certainly be covered under the umbrella of the open source. For the enterprise, nevertheless, the story is a bit different.
Don’t get me wrong. I’m a big fan of open source myself and there are many fantastic tools, but at the same time, there are also quite a few gaps. These are some of the reasons why enterprises choose Data Science platforms:
a. Open source integration: up and running in minutes, support for multiple environments, and transparent version updates.
b. Collaboration: easily sharing datasets, data connections, code, models, environments, and deployments.
c. Governance and security: not only over data but over all analytics assets.
d. Model management, deployment, and retraining.
e. Model bias: detect and correct a model that’s biased by gender or age.
f. Assisted Data Curation: visual tools to address the most painful task in data science.
g. GPUs: immediate provisioning and configuration for an optimal performance of deep learning frameworks, e.g., TensorFlow.
h. Codeless modeling: for statisticians, subject matter experts, and even executives who don’t code but want to build models visually.
In a quest for a Data Science Platform? Consider trying Watson Studio for free!
About the author
Óscar D. Lara Yejas is Senior Data Scientist and one of the founding members of the IBM Machine Learning Hub. He works closely with some of the largest enterprises in the world on applying ML to their specific use-cases, including healthcare, financial, manufacturing, government, and retail. He has also contributed to the IBM Big Data portfolio, particularly in the Large-scale Machine Learning area, being an Apache Spark and Apache SystemML contributor.
Óscar holds a Ph.D. in Computer Science and Engineering from University of South Florida. He is the author of the book “Human Activity Recognition: Using Wearable Sensors and Smartphones”, and a number of research/technical papers on Big Data, Machine Learning, Human-centric sensing, and Combinatorial Optimization.