Source: Deep Learning on Medium
Choosing a Machine Learning Model
Ever wonder how to apply machine learning algorithms to a problem in order to analyze & visualize data, discover trends & find correlations? In this article, I'm going to discuss common steps for setting up a machine learning model as well as approaches to selecting the right model for your data. This article was inspired by common interview questions I've been asked about how I approach a data science problem & why I choose a particular model. For more on any of these concepts, check out my ultimate guide for data scientists, here!
Some guidelines we follow as data scientists for creating our models:
- Collect Data (Usually tons of it)
- Establish an objective, a hypothesis to test & a timeline to accomplish this
- Check for anomalies or outliers
- Explore for missing data
- Clean the data based off of our constraints, objectives, hypothesis testing
- Perform statistical analysis & initial visualization
- Scale, regularize, normalize, feature engineer, random sample & validate our data for model preparation
- Train & test our model, holding out the validation portion of the data to play the role of unknown data
- Build models & evaluate them with classification/regression metrics, for either supervised or unsupervised learning
- Establish a baseline accuracy & check our current model's accuracy on both the training & testing data
- Double check that we’ve solved the problem & present the results
- Ready the model for deployment & product delivery (AWS, Docker, Buckets, App, Web Site, Software, Flask, etc.)
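The splitting steps above can be sketched in plain Python. This is a minimal sketch, not production code; in practice you'd reach for a library like scikit-learn, & the fractions and seed here are hypothetical choices:

```python
import random

def train_val_test_split(rows, test_frac=0.2, val_frac=0.2, seed=42):
    """Shuffle the rows & carve out testing & validation portions."""
    rows = rows[:]                          # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                    # held out until the very end
    val = rows[n_test:n_test + n_val]       # plays the role of unknown data
    train = rows[n_test + n_val:]           # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

The key point is that the validation rows never touch the training process, so they behave like genuinely unseen data.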
Machine learning tasks can be classified into either supervised learning, unsupervised learning, semi-supervised learning & reinforcement learning. In this article we don’t focus on the last two, however I’ll give some idea of what they’re.
Below are some approaches on choosing a model for machine learning/deep learning tasks:
1. Unbalanced data is relatively common.
We can handle unbalanced data through re-sampling, a methodology of drawing repeated samples from our data to improve the accuracy & quantify the uncertainty of a population parameter. Re-sampling methods often make use of a nested re-sampling technique.
We split our original data into training & testing sets. After finding suitable coefficients for our model with the help of the training set, we can apply that model to the testing set & measure its accuracy. This is the final accuracy before applying the model to unknown data, also known as our validation set, & it gives us better hope of accurate results on that unknown data.
However, if we further divide the training set into its own training & testing subsets, calculate the accuracy on each subset & repeat this for many such subsets, we can pick the model that performs best across those subsets. We hope this model will also give us the best accuracy on our final testing set. Re-sampling is done to improve the accuracy of the model, & there are different ways of re-sampling data, like bootstrapping, cross-validation & repeated cross-validation.
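The nested splitting described above is exactly what k-fold cross-validation does. A minimal sketch of the index bookkeeping, without any ML library (scikit-learn's `KFold` does this for you in practice):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    # Distribute n samples over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# Each of the 10 samples lands in exactly one test fold.
folds = list(k_fold_indices(10, 5))
print(len(folds))  # 5
```

Every sample gets used for testing exactly once & for training k−1 times, which is what lets us average accuracy over the folds.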
2. We can create new features through principal component analysis
Also known as PCA, this technique helps reduce dimensionality by projecting the data onto a smaller set of uncorrelated components. Clustering techniques are also very common in unsupervised machine learning.
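For intuition, here's a tiny sketch of PCA for 2-D points: find the direction of greatest variance from the covariance matrix. This is a toy closed-form version for two dimensions only; for real data you'd use scikit-learn's `PCA` or NumPy's eigendecomposition:

```python
import math

def first_principal_component(points):
    """First principal component of 2-D points, as a unit vector."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n          # var(x)
    c = sum((y - my) ** 2 for _, y in points) / n          # var(y)
    b = sum((x - mx) * (y - my) for x, y in points) / n    # cov(x, y)
    # Largest eigenvalue of [[a, b], [b, c]], in closed form for the 2x2 case.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Its eigenvector; fall back to an axis direction when cov is zero.
    vx, vy = (lam - c, b) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Points lying on the line y = x: the first component is (0.707, 0.707).
pc = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Projecting each point onto this direction gives a single new feature that preserves most of the variance.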
3. We can prevent overfitting & underfitting, & dampen the effect of outliers & noise, through regularization techniques.
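As a concrete illustration, here's L2 (ridge) regularization on a one-feature linear model, fit by gradient descent. The data, learning rate & penalty strength are hypothetical; in practice you'd use scikit-learn's `Ridge`:

```python
def ridge_fit(xs, ys, lam=0.1, lr=0.01, steps=5000):
    """Fit y ~ w*x by gradient descent with an L2 penalty lam * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error, plus the regularization gradient 2*lam*w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w

xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]       # data lying exactly on y = 2x
w_plain = ridge_fit(xs, ys, lam=0.0)      # recovers ~2.0
w_ridge = ridge_fit(xs, ys, lam=1.0)      # coefficient shrunk toward 0
```

The penalty pulls the coefficient toward zero, which is what keeps a model from fitting the noise in more realistic, noisy datasets.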
For more of this check out my ultimate guide for data scientists here!
4. We need to combat the black box A.I. problem
This makes us consider strategies for building interpretable models. According to KDnuggets, black box AI systems for automated decision making, often based on machine learning over big data, map a user's features into a class predicting the behavioral traits of individuals without exposing the reasons why.
This is problematic not only for the lack of transparency, but also for possible biases inherited by the algorithms from human prejudices & collection artifacts hidden in the training data, which may lead to unfair or wrong decisions & incorrect analysis.
5. Understand which algorithms are not sensitive to outliers
We can decide whether we should use randomness in our models, such as random forests, to overcome skewness caused by outliers.
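A quick way to see outlier sensitivity is to compare a sensitive statistic (the mean) with a robust one (the median), since tree-based models rely on order rather than magnitude in much the same way:

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

data = [10, 11, 12, 13, 14]
with_outlier = data + [1000]               # one extreme point added

# The mean is dragged far from the bulk of the data; the median barely moves.
print(mean(data), mean(with_outlier))      # 12.0 176.66...
print(median(data), median(with_outlier))  # 12 12.5
```

Models built on splits & ranks (like random forests) inherit the median's robustness, while distance-based models inherit the mean's sensitivity.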
Machine Learning Models
Most of these models are covered in my study guide for data scientists. The guide gives a good definition of what each of these models are, do, when to be used & a simple verbal example. If you want to access my guide, click here as it is also published & recommended on Medium’s “Towards Data Science.”
1. First approach to predicting continuous values: linear regression is generally the top choice & the most common, for example with housing prices.
2. Binary classification problems usually favor logistic regression models. If you have a two-class classification problem, support vector machines (SVMs) are highly favorable for getting the best result!
3. Multi-class classification: random forests are highly favored, but SVMs come in at a tie. Random forests are better suited to multi-class problems!
For a multi-class problem with SVMs, you would need to reduce the data into multiple binary classification problems. Random forests work well with a mix of numerical & categorical features, even if features are on different scales, meaning you can use the data as it is. SVMs maximize the margin & rely on the concept of distance between points, so it's up to you to decide whether that distance is meaningful!
As a consequence, for SVMs we must one-hot encode (dummy) the categorical features, & min-max or other scaling is highly recommended as a pre-processing step. For most common classification problems, random forests give you the probability of belonging to a class, whereas SVMs give you the distance to the boundary, which you'll still need to convert to a probability somehow if you need one. For those problems where an SVM applies, it will often outperform a random forest model. SVMs give you support vectors, that is, the points in each class closest to the boundary between classes.
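The two pre-processing steps just mentioned are easy to sketch by hand (in practice scikit-learn's `OneHotEncoder` & `MinMaxScaler` handle the edge cases for you):

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max_scale(xs):
    """Rescale numeric values into the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Categories become columns in alphabetical order: blue, green, red.
colors = one_hot(["red", "green", "red", "blue"])
sizes = min_max_scale([10, 20, 30, 40])
```

After these steps, every feature lives on a comparable scale, so the SVM's distance computations treat them fairly.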
4. Easiest categorical model to start off with? Decision trees are seen as the simplest to use & understand. They're also the building blocks of ensemble models such as random forests & gradient boosting.
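The simplest possible decision tree is a one-split "stump". A sketch of how such a split is chosen on a single feature, with hypothetical toy data:

```python
def best_stump(xs, labels):
    """Find the threshold on one feature that best separates two classes."""
    best_t, best_acc = None, 0.0
    for t in sorted(set(xs)):
        # Predict class 1 when x >= t, class 0 otherwise.
        preds = [1 if x >= t else 0 for x in xs]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Labels flip from 0 to 1 at x = 5, so the stump finds that threshold.
t, acc = best_stump([1, 2, 3, 5, 6, 7], [0, 0, 0, 1, 1, 1])
```

A full decision tree just applies this search recursively to each side of the split; random forests & gradient boosting combine many such trees.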
5. Competition-favored models? Kaggle competitions favor random forests & XGBoost! What are gradient boosted trees? They're ensembles of decision trees trained sequentially, with each new tree correcting the errors of the ones before it.
Deep Learning Models
According to Investopedia, deep learning is an artificial intelligence function that imitates the workings of the human brain in processing data & creating patterns for use in decision making.
1. We can use multi-layer perceptrons to capture complex features that can't be easily specified by hand, provided we have a large amount of labelled data!
According to Techopedia, a multi-layer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input & output layers.
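A forward pass through a tiny MLP can be written in a few lines. This is a sketch with hypothetical hand-set weights, just to show the layered structure; real networks learn their weights with a framework like PyTorch or TensorFlow:

```python
def relu(v):
    """Common hidden-layer activation: zero out negative values."""
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron's output is a weighted sum plus bias."""
    return [sum(i * w for i, w in zip(inputs, neuron_w)) + b
            for neuron_w, b in zip(weights, biases)]

def mlp(x, layers):
    """Feed the input forward through each (weights, biases) layer."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)       # no activation on the output layer

# 2 inputs -> 2 hidden units -> 1 output, with hypothetical weights.
hidden = ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
output = ([[1.0, 1.0]], [0.0])
y = mlp([2.0, 1.0], [hidden, output])      # [2.5]
```

Stacking more layers or widening them is all it takes to grow this into a deep network.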
2. For vision-based machine learning tasks such as image classification, object detection, image segmentation or image recognition, we would use a convolutional neural network. CNNs are used in image recognition & processing & are specifically designed to work on pixel data.
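The core CNN operation is sliding a small kernel over the pixel grid. A sketch of a "valid" 2-D convolution (strictly speaking cross-correlation, as in most deep learning libraries), with a hypothetical edge-detecting kernel:

```python
def conv2d(image, kernel):
    """Valid 2-D convolution of a grid of pixels with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge kernel fires only on the dark-to-bright boundary.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d(image, edge))  # [[0, 18, 0], [0, 18, 0]]
```

A CNN learns the kernel values instead of hand-picking them, & stacks many such filters followed by pooling layers.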
3. For sequence modeling tasks such as language translation or text classification, recurrent neural networks are favored.
RNNs come into the picture when a model needs context to provide the right output for a given input. Sometimes the context is the single most important thing the model needs to predict the most appropriate output. In other neural networks, all the inputs are independent of each other.
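That context lives in the hidden state, which an RNN carries from one time step to the next. A scalar sketch with hypothetical weights (real RNNs use weight matrices & learn them):

```python
import math

def rnn(inputs, wx, wh, b, h0=0.0):
    """Scalar RNN: the hidden state carries context from earlier inputs."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(wx * x + wh * h + b)   # new state depends on the old one
        states.append(h)
    return states

# The same input value (1.0) at steps 1 & 3 produces different states,
# because the hidden state remembers what came before it.
states = rnn([1.0, 0.5, 1.0], wx=1.0, wh=0.5, b=0.0)
```

This dependence on earlier inputs is exactly what feedforward networks like the MLP above lack, & it's why RNNs suit sequences.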