Machine Learning Models for Detecting Diabetes.

Original article was published by Shreyak on Becoming Human: Artificial Intelligence Magazine


In this blog post, I’m going to use a diabetes dataset that I got from Kaggle. I’ll show you how to analyse the data and apply different machine learning classification models.

I used four different ML models to predict diabetes.

1. RandomForest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees.
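To make the "mode of the classes" idea concrete, here is a minimal sketch of majority voting. The votes are made up for illustration, and `majority_vote` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical class votes from five decision trees for three patients
# (1 = diabetic, 0 = not diabetic). These numbers are made up for illustration.
tree_votes = [
    [1, 0, 1],  # tree 1
    [1, 0, 0],  # tree 2
    [0, 0, 1],  # tree 3
    [1, 1, 1],  # tree 4
    [1, 0, 0],  # tree 5
]

def majority_vote(votes_per_tree):
    """Return the most common class for each sample across all trees."""
    per_sample = zip(*votes_per_tree)  # regroup the votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

forest_prediction = majority_vote(tree_votes)
print(forest_prediction)  # [1, 0, 1]
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes, so this only illustrates the general idea.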

2. SVC

SVC (support vector classifier) is scikit-learn’s support vector machine for classification. It is a supervised algorithm that looks for the decision boundary separating the classes with the largest margin; with a kernel such as RBF, that boundary can be non-linear. Because it relies on distances between samples, it is sensitive to feature scale, so standardising the features beforehand usually helps.

3. KNN

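In pattern recognition, the k-nearest neighbours algorithm (k-NN) is a non-parametric method, first developed by Evelyn Fix and Joseph Hodges and later expanded by Thomas Cover, used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression: in classification, it is the class most common among the k neighbours; in regression, it is the average of their values.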
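Here is a minimal from-scratch sketch of k-NN classification, just to show the mechanics. The points and labels are made up, and `knn_predict` is a hypothetical helper; the actual model later in this post uses scikit-learn's KNeighborsClassifier.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up (glucose, bmi) pairs; 1 = diabetic, 0 = not diabetic.
train_points = [(85, 22.0), (90, 24.5), (95, 23.0),
                (160, 33.0), (170, 35.5), (155, 31.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (150, 30.0)))  # 1
print(knn_predict(train_points, train_labels, (88, 23.0)))   # 0
```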


4. Decision Tree

Decision Tree algorithm belongs to the family of supervised learning algorithms. … The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).
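To see what those "simple decision rules" look like, here is a small sketch that fits a tree on made-up data and prints its rules with scikit-learn's `export_text`. The toy values and the names `X_toy`, `y_toy` and `toy_tree` are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up [glucose_conc, bmi] rows; 1 = diabetic, 0 = not diabetic.
X_toy = [[85, 22.0], [90, 24.5], [95, 23.0],
         [160, 33.0], [170, 35.5], [155, 31.0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_tree.fit(X_toy, y_toy)

# Print the learned if/else rules as plain text.
rules = export_text(toy_tree, feature_names=["glucose_conc", "bmi"])
print(rules)
```

On data this cleanly separated, a single threshold is enough, so the tree has only one split.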

So now let’s jump into the code.

Import the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline in the notebook
%matplotlib inline

Load Dataset

data = pd.read_csv("Datasets/pima-data.csv")

Understand the dataset

data.head()

Get the correlations between all the features

data.corr()

Convert the Boolean diabetes column to integers (1/0)

diabetes_map = {True: 1, False: 0}
data['diabetes'] = data['diabetes'].map(diabetes_map)
data.head()
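As a quick check of what the mapping does, here is a tiny sketch on a made-up DataFrame (`toy` is a hypothetical name). It also shows `astype(int)`, which should give the same result for a clean Boolean column.

```python
import pandas as pd

# Made-up Boolean target column, for illustration only.
toy = pd.DataFrame({"diabetes": [True, False, True]})

# Option 1: explicit mapping, as used in this post.
toy["diabetes_mapped"] = toy["diabetes"].map({True: 1, False: 0})

# Option 2: cast the Boolean column directly to integers.
toy["diabetes_cast"] = toy["diabetes"].astype(int)

print(toy)
```

One difference worth knowing: with `map`, any value missing from the dictionary becomes NaN, which can help you spot unexpected entries.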

Analysing The Data

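from scipy.stats import norm  # normal distribution to overlay on each plot

# Note: sns.distplot is deprecated in recent seaborn releases.
plt.figure(figsize=(12, 7))
sns.distplot(data['glucose_conc'], kde=True, fit=norm)
plt.figure(figsize=(12, 7))
sns.distplot(data['insulin'], kde=True, fit=norm)
plt.figure(figsize=(12, 7))
sns.distplot(data['age'], kde=True, fit=norm)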

Split the dataset into training and testing

from sklearn.model_selection import train_test_split
feature_columns = ['num_preg', 'glucose_conc', 'diastolic_bp', 'insulin', 'bmi', 'diab_pred', 'age', 'skin']
predicted_class = ['diabetes']
X = data[feature_columns].values
y = data[predicted_class].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
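The split above does not stratify by class. As an optional variation (not what I used for the results in this post), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. Here is a sketch on made-up data; the `_toy` and `_tr`/`_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 of them positive.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=10, stratify=y_toy
)

# Both parts keep the original 30% share of positives.
print(y_tr.mean(), y_te.mean())
```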

Apply SimpleImputer for missing data

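from sklearn.impute import SimpleImputer
# Treat zeros as missing values and replace them with the column mean.
fill_values = SimpleImputer(missing_values=0, strategy="mean")
X_train = fill_values.fit_transform(X_train)
# Reuse the means learned from the training set; do not refit on the test set.
X_test = fill_values.transform(X_test)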
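Here is a small sketch of what the imputer does, using made-up arrays (`train`, `test` and `imputer` are hypothetical names). The column means are learned from the training rows only and then reused to fill the zeros in the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values; 0 marks a missing measurement.
train = np.array([[100.0, 30.0],
                  [0.0, 20.0],
                  [140.0, 0.0]])
test = np.array([[0.0, 40.0],
                 [90.0, 0.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
train_filled = imputer.fit_transform(train)  # learn the means and fill train
test_filled = imputer.transform(test)        # fill test with the train means

print(imputer.statistics_)  # [120.  25.]
print(test_filled)
```

Fitting on the training data only keeps information from the test set out of the preprocessing step.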

Applying different Classification Algorithms

1. RandomForest

# RandomForest Algorithm
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)
random_forest_model.fit(X_train, y_train.ravel())
randomforest_prediction = random_forest_model.predict(X_test)

2. SVC

# SVC Algorithm
from sklearn.svm import SVC
svc_model = SVC(kernel='rbf', random_state=0)
svc_model.fit(X_train, y_train.ravel())
svc_prediction = svc_model.predict(X_test)

3. KNN

# KNeighborsClassifier Algorithm
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_classifier.fit(X_train, y_train.ravel())
knn_prediction = knn_classifier.predict(X_test)

4. Decision Tree

# Decision Tree Algorithm
from sklearn.tree import DecisionTreeClassifier
decisiontree_classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
decisiontree_classifier.fit(X_train, y_train.ravel())
decisiontree_prediction = decisiontree_classifier.predict(X_test)

Analysing the accuracy of all the 4 models

from sklearn import metrics
print("Accuracy RandomForest = {0:.3f}".format(metrics.accuracy_score(y_test, randomforest_prediction)))
print("Accuracy SVC = {0:.3f}".format(metrics.accuracy_score(y_test, svc_prediction)))
print("Accuracy KNN = {0:.3f}".format(metrics.accuracy_score(y_test, knn_prediction)))
print("Accuracy Decision Tree = {0:.3f}".format(metrics.accuracy_score(y_test, decisiontree_prediction)))
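Accuracy alone can hide how a model treats each class, which matters when most patients are non-diabetic. Here is a sketch with made-up labels (not results from this dataset) showing how a confusion matrix and recall add detail; `y_true` and `y_pred` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels for ten patients, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
recall = recall_score(y_true, y_pred)   # share of actual diabetics found

print(accuracy)  # 0.7
print(cm)        # [[6 1], [2 1]]
print(recall)    # 1 of 3 diabetics detected
```

In this toy case the accuracy is 70%, yet only one of the three diabetic patients is detected. The same functions can be applied to y_test and any of the four prediction arrays above.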

You can find the complete source code along with dataset here.

Based on the accuracy scores above, the RandomForest model performs best on this dataset, with an accuracy of 73.6%.


