Original article was published on Artificial Intelligence on Medium
With the information I just obtained from the graph, I already have an idea of the kind of classifier I am going to use: a Naive Bayes classifier. This machine learning classifier performs extremely well when the features are roughly normally distributed within each class (do not believe any developer who mocks it!). If the class distributions sit well apart from each other, even better: it will be much easier to distinguish among the three classes.
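To make the idea concrete, here is a minimal from-scratch sketch of how a Gaussian Naive Bayes classifier can decide between classes: it estimates a mean and a variance per feature and per class, then picks the class whose normal densities best explain a new sample. The function names and the toy data are my own assumptions for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate per-class prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(col) for col in columns],
            # Small constant avoids division by zero for constant features.
            "vars": [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log of the normal density N(x | mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict_one(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(label):
        params = model[label]
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["vars"])
        )
    return max(model, key=score)


# Toy data (made up for illustration): one feature, three well-separated classes.
X_toy = [[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [9.0], [9.1], [8.9]]
y_toy = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(toy_model, [1.1]), predict_one(toy_model, [5.1]))
```

The further apart the class means are relative to their variances, the more one class's density dominates the others, which is why well-separated distributions make the job easier.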
Given that this is a very simple dataset and the data is already in numerical form, I personally do not think I need to do any preprocessing. If you are a beginner, know that preprocessing means preparing the data for your model (for example, converting categorical data into numerical codes).
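For readers who have never encoded categorical data, a small sketch may help. The column below is hypothetical and is not part of this dataset; it shows two common encodings in plain Python. In practice, scikit-learn's `OrdinalEncoder` and `OneHotEncoder` in `sklearn.preprocessing` cover the same ground.

```python
# Hypothetical categorical column, made up for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Integer encoding: map each distinct category to an integer code.
categories = sorted(set(colors))  # alphabetical order: blue, green, red
code_of = {category: code for code, category in enumerate(categories)}
label_encoded = [code_of[color] for color in colors]

# One-hot encoding: one 0/1 column per category, which avoids implying
# an order between categories that does not exist.
one_hot_encoded = [
    [1 if color == category else 0 for category in categories]
    for color in colors
]

print(label_encoded)
print(one_hot_encoded[0])
```

Integer codes are compact, but a model may read them as ordered quantities; one-hot columns cost more space and avoid that problem.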
The only thing I am going to do is extract the labels from the dataset so that I can feed them to the model.
I will now split X (features) and y (labels) into train and test sets. By default, I will use a 0.2 proportion for the test set.
X_train, X_test, y_train, y_test = v.split(0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
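The `v.split` call above is a helper whose internals are not shown here. For readers without it, a roughly equivalent split can be sketched with scikit-learn's `train_test_split`. The stand-in arrays below (150 samples, 4 features, 3 balanced classes) are an assumption made only so the snippet runs on its own, and `random_state` and `stratify` are my additions rather than something the helper is known to do.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumed shape): 150 samples, 4 numeric features, 3 classes.
X = np.arange(150 * 4, dtype=float).reshape(150, 4)
y = np.repeat([0, 1, 2], 50)

# test_size=0.2 holds out 20% of the rows for testing; random_state makes the
# shuffle reproducible; stratify=y keeps class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Stratifying is a design choice worth considering on small datasets, since an unlucky shuffle can otherwise leave one class underrepresented in the test set.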
6. Machine Learning Model
It is now time to create my AI:
Creating the Model
I will be using the scikit-learn library, one of the best open-source machine learning libraries.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
Training the Model
I will use my training samples to find the rules that link X to y. Then I will predict labels for X_test and compare them with the actual labels, which the model has never seen: y_test.
clf = clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_predict))
100%! Astonishing result!
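For intuition about what that score measures, accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on made-up labels (not the article's data); the helper name `accuracy` is my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels for illustration: one mistake out of ten predictions.
y_true_demo = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred_demo = [0, 1, 2, 1, 1, 0, 1, 2, 0, 1]
print(accuracy(y_true_demo, y_pred_demo))
```

On a small test set, a single misclassification moves the score by several points, so a perfect score from one split deserves a second look.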
I have only split the dataset once, but to get a more reliable estimate of the model's performance, I can use cross-validation to test the model on 10 different splits, each one using different data taken from the dataset.
v.statistics.cross_validation(clf, X_train, y_train, 10)
Accuracy: 0.96 (+/- 0.09)
Depending on which data ends up in train and test for each split, the accuracy ranges from 85% to 100%, with an average of 96%. The result can vary between runs; the best average I obtained was 98% after a few attempts.
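The cross-validation helper used above is not shown in detail, so here is a sketch of a comparable 10-fold run with scikit-learn's `cross_val_score`. The synthetic data from `make_blobs` is a stand-in I am assuming for illustration, and reporting the mean plus or minus two standard deviations is my guess at what the helper prints, based on the format of its output.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption): 150 samples, 4 features, 3 classes.
X_demo, y_demo = make_blobs(
    n_samples=150, n_features=4, centers=3, random_state=0
)

# cv=10 trains on nine folds and scores on the held-out fold, ten times over,
# so every sample is used for testing exactly once.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10, scoring="accuracy")

# Mean accuracy with a rough spread of two standard deviations.
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

The spread matters as much as the mean here: a wide interval signals that the score depends heavily on which samples land in each fold.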