Original article was published on Artificial Intelligence on Medium
Automate the Machine Learning Model Implementation with Sklearn Pipeline
In this tutorial, we will see how to speed up the model implementation step in Machine Learning algorithm development.
When working on Machine Learning problems, we often want to preprocess our data and then test our model with several different classifiers to choose the best one. Fitting each classifier individually on the training data and then testing it is tedious, and it involves a large amount of redundant code. There is also a subtler problem: if your workflow uses cross-validation and your preprocessing includes an operation like normalization or standardization, applying that operation to the full training set before cross-validation lets the statistics of each held-out fold leak into the data the model is trained on. Wouldn’t it be nice if there was a single solution to all these problems?
Well, there is! Scikit-Learn provides a Pipeline class (in the sklearn.pipeline module) that offers an easy way to tackle the above problems.
Pipeline is a class that sequentially applies a list of transforms followed by a final estimator. Its purpose is to assemble several steps that can be cross-validated together while setting different parameters.
Now let’s look at an implementation that shows how we can use Pipeline to make the task at hand easier.
Here I’m going to use the Iris dataset, standardize it, and test a number of classifiers to see which one gives the best result. The classifiers are KNN, SVM, Random Forest, and Logistic Regression.
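To make the cross-validation point concrete, here is a minimal sketch (not the article's original code). The step names "scaler" and "clf", the choice of KNN, and cv=5 are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, object) pair; the last step is the final estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", KNeighborsClassifier()),
])

# The whole pipeline is cloned and fitted per fold, so the scaler only
# ever sees the training part of each fold (cv=5 is an assumed value).
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Because the scaler lives inside the Pipeline, the held-out part of each fold never contributes to the mean and variance used for scaling.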
First, we need to import all the required modules.
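The imports below are one plausible set for this walkthrough, inferred from the steps the article describes. The exact list in the original post may have differed.

```python
# Data and splitting
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Preprocessing and the Pipeline itself
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# The four classifiers compared in this tutorial
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Evaluation metric
from sklearn.metrics import accuracy_score
```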
Now, we will load the Iris dataset from Scikit-Learn and split it into train and test sets.
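A sketch of this step could look as follows. The 80/20 split (test_size=0.2) and random_state=42 are assumptions, since the article does not state which values it used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the 150-sample, 4-feature Iris dataset.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing (assumed ratio);
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```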
Next, we will make a list of classifier names and a list of the corresponding Scikit-Learn estimators, and finally zip them together. This step ensures that we can pass all the classifiers, along with their names, to our Pipeline-building function in a single shot.
One thing to note here is that we are going to use the default values of all the hyperparameters for all the classifiers.
Now, we will define a function that takes the zipped classifiers along with the train and test data as input, builds a Pipeline of StandardScaler and each classifier, and passes that Pipeline to the fit_classifier() function, which we will define shortly.
Finally, we define the fit_classifier() function mentioned earlier. This function receives a Pipeline along with the train and test data. It fits the Pipeline to the training data with pipeline.fit(), then computes the predictions and the accuracy score.
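The pairing of names and estimators might be sketched like this. The variable names (names, classifiers, zipped_clf) are hypothetical. Wrapping zip() in list() is a choice made here because a bare zip object can only be iterated once in Python 3.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Display names, in the same order as the estimators below.
names = ["KNN", "SVM", "Random Forest", "Logistic Regression"]

# Estimators created with their default hyperparameters.
classifiers = [
    KNeighborsClassifier(),
    SVC(),
    RandomForestClassifier(),
    LogisticRegression(),
]

# A list of (name, estimator) pairs that can be looped over repeatedly.
zipped_clf = list(zip(names, classifiers))
```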
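A minimal sketch of fit_classifier() is shown below, assuming it returns the accuracy rather than printing it. The original post may have handled the output differently.

```python
from sklearn.metrics import accuracy_score


def fit_classifier(pipeline, X_train, y_train, X_test, y_test):
    """Fit the pipeline on the training data and return its test accuracy."""
    # fit() fits the scaler on X_train, transforms X_train,
    # and then fits the classifier on the scaled data.
    model = pipeline.fit(X_train, y_train)
    # predict() applies the already-fitted scaler to X_test first.
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
```

Note that the test data is only ever transformed with the statistics learned from the training data, which is exactly the behavior we want from the Pipeline.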
Now it’s time to test our algorithm.
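For reference, here is a condensed, self-contained sketch of the whole run, with the two helper functions folded into a single loop. The split parameters, the random_state values, and the variable names are assumptions, and the accuracies you get will depend on them.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split settings
)

# Classifiers with default hyperparameters (random_state added for repeatability).
classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(),
}

results = {}
for name, clf in classifiers.items():
    # Standardize, then classify; the scaler is fitted on X_train only.
    pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    pipeline.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))

# Print each classifier with its accuracy on the held-out test set.
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```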
After running the above, we get the following results.
Finally, we will look at the classifiers and their respective accuracy scores.
Running the above gives the following result.