Automated machine learning (AutoML) is a hot new field with the goal of making it easy to select machine learning algorithms, their parameter settings, and the pre-processing methods that improve their ability to detect complex patterns in big data.
The Tree-Based Pipeline Optimization Tool (TPOT) was one of the very first AutoML methods and open-source software packages developed for the data science community. TPOT was developed by Dr. Randal Olson while a postdoctoral student with Dr. Jason H. Moore at the Computational Genetics Laboratory of the University of Pennsylvania and is still being extended and supported by this team.
TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. TPOT makes use of the Python-based scikit-learn library as its ML menu.
Reference : Github url: https://github.com/EpistasisLab/tpot
Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems humans do not know how to solve, directly. Free of human preconceptions or biases, the adaptive nature of EAs can generate solutions that are comparable to, and often better than the best human efforts.*
With the right data, computing power and machine learning model you can discover a solution to any problem, but knowing which model to use can be challenging for you as there are so many of them like Decision Trees, SVM, KNN, etc.
That’s where genetic programming can be of great use and provide help. Genetic algorithms are inspired by the Darwinian process of Natural Selection, and they are used to generate solutions to optimization and search problems in computer science.
Broadly speaking, Genetic Algorithms have three properties:
- Selection: You have a population of possible solutions to a given problem and a fitness function. At every iteration, you evaluate how to fit each solution with your fitness function.
- Crossover: Then you select the fittest ones and perform crossover to create a new population.
- Mutation: You take those children and mutate them with some random modification and repeat the process until you get the fittest or best solution.
To install tpot on your system, you can run the command
>>> pip install tpot
on command line terminal . It is built on top of several existing Python libraries
or Click on the link to download.
Hand Written Digits Datasets
Below is a minimal working example with the the optical recognition of handwritten digits dataset.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_splitdigits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25, random_state=42)tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_pipeline.py file and look similar to the following:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)features = tpot_data.drop('target', axis=1) training_features, testing_features, training_target, testing_target =train_test_split(features, tpot_data['target'], random_state=42)# Average CV score on the training set was: 0.9799428471757372exported_pipeline = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)# Fix random state for all the steps in exported pipelineset_param_recursive(exported_pipeline.steps, 'random_state', 42)exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
Hope the above explanation would give you clear summary on AutoML.