Visualization and understanding: Iris Dataset



In machine learning, we often do not know where to start or how to analyze a dataset we know nothing about. Do not worry: this post will guide you, not only through learning from a dataset but also through visualizing it and working with its features on different platforms.

There are many practice datasets available, covering regression, binary classification, multiclass classification, NLP and more. Some of them are the Iris dataset, the Loan Prediction dataset, the Boston Housing dataset, the Wine Quality dataset, the Breast Cancer dataset, etc. Here let us try to understand one of the most versatile, easy and well-studied datasets in the pattern-recognition literature: the IRIS DATASET. I tried it on different platforms, using classical machine learning algorithms, neural networks and TensorFlow, and the results varied a little each time. One question may be arising in your mind: what actually is the Iris dataset?

About IRIS Dataset:-
It is also known as a toy dataset because it is easy to understand and fits in a single CSV file. The Iris dataset describes flowers of three different species, which serve as our labels: Iris-setosa, Iris-virginica and Iris-versicolor.
The dataset contains 150 samples, 50 per species, and four features: the length and width of the sepals and petals. These measurements were originally used to build a linear discriminant model to classify the species. The dataset is often used in data mining, classification and clustering examples and to test algorithms. It is also a common and famous example for unsupervised learning: studying the same kind of object (here, the iris flower) and grouping the samples into its three species without using the labels.
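As a quick sanity check, here is a minimal sketch (it uses scikit-learn's built-in copy of the dataset rather than the downloadable CSV) that confirms the 150 samples, four features and three species:

```python
# Minimal sketch: inspect the shape and labels of the Iris dataset.
# Uses scikit-learn's bundled copy instead of the downloadable CSV.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4) -> 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.target[:5])      # integer-encoded species labels
```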

Fig.1- IRIS Types

There is a usual workflow for any dataset: understand and visualize it first, then train, test and analyze models on it. So first, let us see how you can visualize the data.

VISUALIZATION:-
Visualizing and analyzing the data is the most vital step for any beginner, because only then can you understand what the plots are telling you. Plots fall into two broad groups: univariate plots and multivariate plots. For building them, Python provides libraries such as seaborn and matplotlib. Seaborn is a data visualization library built on top of matplotlib; it offers a high-level interface for drawing attractive, informative statistical graphics, while matplotlib handles the underlying plotting, including the axis labels for each feature. Many kinds of plots are available depending on our needs: box plots, histograms, pair plots, scatter plots, 3D plots, violin plots (which show the density of each feature), and so on. By visualizing and analyzing the data we can see how the species differ across the available features. After this step, we proceed to modelling.
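As an example, here is a minimal sketch of the kind of plots described above. It assumes the Kaggle version of the file ("Iris.csv" with an "Id" column and a "Species" column); column names differ slightly between sources.

```python
# Sketch: pair plot of all four features coloured by species,
# plus a violin plot of one feature, using seaborn on top of matplotlib.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("Iris.csv")                      # Kaggle-style CSV assumed
sns.pairplot(df.drop(columns=["Id"]), hue="Species")
plt.show()

sns.violinplot(x="Species", y="PetalLengthCm", data=df)  # density per species
plt.show()
```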

Fig.2 - Plot showing the data points moving towards the centroids as the clusters form

PART-I
Using a Machine Learning Algorithm

Here we are going to use the K-means clustering algorithm. K-means gathers similar points into clusters, while clusters of different types are separated by some distance. Clustering is a data analysis technique for discovering structure in the data: in other words, we try to find homogeneous subgroups such that the data points in each cluster are as similar as possible according to a similarity measure, such as Euclidean-based distance or correlation-based distance. It is an unsupervised learning technique, because we infer the structure of the data by grouping the points into distinct subgroups without using labels. Now let us focus on the K-means algorithm. K-means chooses centroids that minimize the inertia, i.e. the within-cluster sum-of-squares criterion; inertia can be read as a measure of how internally coherent the clusters are.

K-means Clustering Algorithm –
1. It is an iterative algorithm that partitions the dataset into K pre-defined, distinct, non-overlapping subgroups, where each data point belongs to exactly one group.
2. Each cluster has a centroid, and the movement of the data points at every step can be stored and visualized using the seaborn and matplotlib libraries, which looks quite striking.
3. The less variation there is within a cluster, the more homogeneous the data points in that cluster are.
4. In K-means, K is a pre-defined input.
5. Working-
(i) First, initialize the centroids by shuffling the dataset and randomly selecting K points as the initial centroids.
(ii) Compute the squared distance between every data point and each centroid; the distance used is the Euclidean distance.
(iii) Assign each point to the cluster of its nearest centroid.
(iv) Recompute each centroid as the average of all data points assigned to that cluster; this is the minimization step.
(v) Repeat steps (ii)-(iv) until the centroids no longer change.
6. Because the initial centroids are chosen at random, the algorithm may get stuck in a local optimum. To avoid this, run the algorithm with several different centroid initializations. (This does not happen in the Iris case.)
7. It is implemented in the scikit-learn library; a minimal sketch is given after this list.
8. It can be applied to numeric or continuous data with a small number of dimensions, and to any scenario where we want to form groups from an unstructured collection, e.g. clustering documents such as text files, images, Word files, PDF files, etc.
The movement of the data points is shown in the plots below.

Fig.3 — Plots of the species

9. Understanding the K value-
(i) K=1 is the worst case: it gives the largest total variation.
(ii) Every time we add a cluster, the total variation decreases; if the number of clusters equals the number of data points, the variation drops to 0.
10. Clustering of this kind is also used in the recommendation systems of services such as Amazon and Netflix.
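Here is a minimal sketch of K-means on Iris with scikit-learn. The inertia_ attribute is the within-cluster sum of squares discussed above, and n_init re-runs the algorithm with different centroid initializations to avoid poor local optima:

```python
# Sketch: K-means clustering of the Iris measurements (the labels are ignored).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                      # 150 x 4 feature matrix

kmeans = KMeans(n_clusters=3,             # K is pre-defined, as noted above
                n_init=10,                # several random initializations
                random_state=7)
clusters = kmeans.fit_predict(X)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)

# Petal length vs. petal width, coloured by cluster, with the final centroids.
plt.scatter(X[:, 2], X[:, 3], c=clusters)
plt.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 3],
            marker="x", s=100, c="red")
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
```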

Major points to get started:

1. Create a dataset.
This includes importing all the required libraries: matplotlib, sklearn for model selection, and pandas for loading the CSV file of the Iris dataset.
Before that, you can download the dataset from Kaggle or the UCI Machine Learning Repository. After importing the modules, open/load the CSV file with pandas. Remember that the path must match the file's actual location, otherwise it will not be loaded.
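A minimal sketch of this step, assuming the file has been saved as "Iris.csv" in the working directory (adjust the path to wherever you downloaded it):

```python
# Sketch: load the Iris CSV with pandas and take a first look at it.
import pandas as pd

df = pd.read_csv("Iris.csv")            # path must match the file location
print(df.shape)                         # expect 150 rows
print(df.head())                        # first few samples
print(df["Species"].value_counts())     # 50 samples per species
```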

2. Build the model
(i) Before building, you must understand what your model is supposed to do for you. We covered this part in the visualization step explained earlier.
(ii) After this, you need to choose the algorithm you want the model to use. Here we choose between the K-means algorithm and a Support Vector Machine; the working of K-means is explained above, and the SVM-based model is built below. Before proceeding directly, we can also compare Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbours (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB) and Support Vector Machines (SVM); a sketch of this comparison follows below. KNN and SVM have almost the same cross-validation score, but SVM comes out slightly ahead in accuracy and error.
(iii) Then build the correlation matrix of the features and display it as a heatmap with the seaborn library.
(iv) In the visualization we observe that, with sepal length vs. sepal width as features, Iris-setosa forms a homogeneous cluster (with a single outlier), while Iris-virginica and Iris-versicolor form one heterogeneous, overlapping cluster. When petal length vs. petal width is chosen instead, there is only a very small overlap between Iris-virginica and Iris-versicolor. So the three species are easiest to distinguish when petal length vs. petal width is used.
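A minimal sketch of the comparison in (ii), using 10-fold cross-validation over the six algorithms (the exact scores will vary slightly from run to run):

```python
# Sketch: compare six classifiers on Iris with k-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
models = {
    "LR":   LogisticRegression(max_iter=1000),
    "LDA":  LinearDiscriminantAnalysis(),
    "KNN":  KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB":   GaussianNB(),
    "SVM":  SVC(gamma="auto"),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```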

3. Train the model
(i) First we split the dataset into training and test sets. Usually we keep 30% of the data for testing and the remaining 70% for training; the random seed is taken as 7 here. After splitting the data, we use the fit function to train the model.
(ii) Earlier we only visualized the raw data points. Now we can measure accuracy, in comparison with the earlier visual impression, by taking both sepal and petal measurements (length and width) as features.
(iii) Then we form the confusion matrix for the predictions using the sklearn.metrics module, which must be imported.
(iv) We can also train models on any combination of the features, i.e. any combination of petal and sepal lengths and widths; a sketch covering this step follows below.
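A minimal sketch of the split-and-fit step, also showing how to train on only the petal features as mentioned in (iv). The SVM is used as the final model, as suggested by the comparison above:

```python
# Sketch: 70/30 train/test split with seed 7, then fit an SVM
# on all four features and on the petal features only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)      # 30% held out for testing

svm_all = SVC(gamma="auto").fit(X_train, y_train)           # all 4 features
svm_petal = SVC(gamma="auto").fit(X_train[:, 2:], y_train)  # petal length/width only

print("All features, test accuracy:  ", svm_all.score(X_test, y_test))
print("Petal features, test accuracy:", svm_petal.score(X_test[:, 2:], y_test))
```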

4. Make predictions.
(i) We give Iris-setosa, Iris-virginica and Iris-versicolor the one-hot labels [1,0,0], [0,1,0] and [0,0,1] respectively. From these labels we can read off which species the model predicts.
(ii) Predictions are made and then summarized in the confusion matrix, from which each class can easily be identified; the average accuracy is recorded over all observations, as sketched below.
Thus, our model makes predictions with good accuracy.
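A self-contained sketch of the prediction step, including the confusion matrix from sklearn.metrics (shown with integer class labels; the one-hot form [1,0,0]/[0,1,0]/[0,0,1] is what the Keras model in Part II uses):

```python
# Sketch: predictions and confusion matrix on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=7)

model = SVC(gamma="auto").fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))   # rows: true species, cols: predicted
```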

PART-II
Using Deep Learning Model

We are about to create the simplest kind of neural network, a plain artificial neural network; this is the oldest form of neural network.
Each unit acts like a neuron: neural networks are brain-inspired systems intended to loosely replicate the way we humans learn. They consist of input and output layers, as well as (in most cases) one or more hidden layers of units that transform the input into something the output layer can use. So let us get started with the libraries we are going to use to implement the model on the Iris dataset.

Loading the dataset (CSV file) is done in the same way as described above.

(i) A very well-known deep learning library, and the one we use here, is Keras; it is easy to digest. Keras is a high-level neural network API, written in Python and capable of running on top of TensorFlow, Microsoft CNTK or Theano. It allows easy and fast prototyping (through user friendliness, modularity and extensibility) and supports both convolutional and recurrent networks, as well as combinations of the two. The core data structure of Keras is the model, a way to organize layers. The simplest type of model is the Sequential model, a linear stack of layers, which is what we use here too.

(ii) For preprocessing, model selection and scoring we use the sklearn library.

(iii) To train the model we first encode the labels and then pass the data to the model through the .fit() function; fitting is the same as training. Once trained, the model can be used to make predictions, usually with a .predict() call. The labels are transformed from a class vector (integers from 0 to nb_classes) into a binary class matrix, for use with the categorical_crossentropy loss. With this loss the network is trained to output a probability over the C classes for each sample; it is the standard loss for multi-class classification.
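A minimal sketch of this encoding step, assuming the Kaggle CSV with a "Species" column; to_categorical performs the class-vector-to-binary-matrix conversion (in very old Keras versions it lives under keras.utils.np_utils instead):

```python
# Sketch: encode the string species labels as integers, then one-hot encode
# them for use with the categorical_crossentropy loss.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

df = pd.read_csv("Iris.csv")
X = df[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].values
y_text = df["Species"].values             # "Iris-setosa", "Iris-versicolor", ...

encoder = LabelEncoder()
y_int = encoder.fit_transform(y_text)     # class vector: 0, 1, 2
y_onehot = to_categorical(y_int)          # binary class matrix, e.g. [1., 0., 0.]
print(y_onehot[:3])
```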

(iv) Now we need to create a neural network with at least three layers for better results. The activation functions we use are ReLU and softmax. Softmax is used for the output layer; it is a normalized exponential function that takes a vector of K real numbers as input and normalizes it into a probability distribution of K probabilities. ReLU (rectified linear unit) is used for the hidden layer and is defined mathematically as y = max(0, x). ReLU is the most commonly used activation function in neural networks, especially in CNNs; if you are unsure which activation function to use, ReLU is usually a good first choice.
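A minimal sketch of such a network: a Sequential model with one ReLU hidden layer and a 3-unit softmax output layer. The layer size (8 units) and the Adam optimizer are illustrative choices, not necessarily the ones in the original code:

```python
# Sketch: a small feed-forward network for the 4-feature, 3-class Iris problem.
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation="relu"))   # hidden layer, ReLU
    model.add(Dense(3, activation="softmax"))             # output layer, softmax
    model.compile(loss="categorical_crossentropy",        # multi-class loss
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

build_model().summary()
```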

(v) With the help of KerasClassifier we train the model in batches of batch_size samples. The batch size should be small rather than large, since we have only 150 samples; we also set the number of epochs. An epoch is a hyperparameter defined before training: one epoch means the entire dataset has been passed forward and backward through the network exactly once. A validator such as KFold splits the dataset into folds and shuffles it using a pre-defined seed.

(vi) We can estimate the result through the cross-validation score, which gives both a mean and a standard deviation; here they come out to 97.33% and 4.42% respectively.
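A minimal sketch of this evaluation. It assumes an older Keras installation where the scikit-learn wrapper lives at keras.wrappers.scikit_learn (newer setups provide the equivalent wrapper in the separate scikeras package); the batch size, epoch count and layer sizes are illustrative:

```python
# Sketch: wrap the Keras model for scikit-learn and cross-validate with KFold.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier  # scikeras.wrappers in newer setups
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
y_onehot = to_categorical(y)      # binary class matrix for categorical_crossentropy

def build_model():
    model = Sequential()
    model.add(Dense(8, input_dim=4, activation="relu"))
    model.add(Dense(3, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model

seed = 7
np.random.seed(seed)
estimator = KerasClassifier(build_fn=build_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, y_onehot, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))
```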

Additional Information

We can also use TensorFlow directly to determine the result, and we can watch the loss on the dataset fall at each step. In TensorFlow we use the GradientDescentOptimizer; the loss may start around 2.5, but do not panic, it decreases to about 0.07, and the result is slightly better here: training accuracy 97.1% and validation accuracy 97.8%.
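A rough sketch of the same idea: the original used the TF1-era GradientDescentOptimizer, which corresponds to plain SGD in TensorFlow 2.x (tf.keras.optimizers.SGD); the learning rate, layer sizes and epoch count below are illustrative:

```python
# Sketch: training with plain gradient descent (SGD) and watching the loss fall.
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",   # integer labels here
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=200, batch_size=5,
                    validation_data=(X_val, y_val), verbose=0)
print("Final training loss:", history.history["loss"][-1])
print("Training accuracy:", history.history["accuracy"][-1])
print("Validation accuracy:", history.history["val_accuracy"][-1])
```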

For the code and for any further queries, contact me by mail at spranjal13@gmail.com, and feel free to review it. This was just the start; there is a lot more to explore.