Predicting a Pulsar Star using different Machine Learning Algorithms

Source: Deep Learning on Medium

Pulsar Star

INTRODUCTION

Pulsars are spherical, compact objects that are about the size of a large city but contain more mass than the sun. Scientists are using pulsars to study extreme states of matter, search for planets beyond Earth’s solar system and measure cosmic distances. Pulsars could also help scientists find gravitational waves, which could point the way to energetic cosmic events like collisions between supermassive black holes. Discovered in 1967, pulsars are fascinating members of the cosmic community.

Pulsars radiate two steady, narrow beams of light in opposite directions. Although the light from the beam is steady, pulsars appear to flicker because they also spin. It’s the same reason a lighthouse appears to blink when seen by a sailor on the ocean: As the pulsar rotates, the beam of light may sweep across the Earth, then swing out of view, then swing back around again. To an astronomer on the ground, the light goes in and out of view, giving the impression that the pulsar is blinking on and off. The reason a pulsar’s light beam spins around like a lighthouse beam is that the pulsar’s beam of light is typically not aligned with the pulsar’s axis of rotation.

Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a ‘candidate’, is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems.

This blog applies several Machine Learning classification algorithms to predict whether a candidate is a pulsar star. The dataset used for building the model is available at www.kaggle.com.

USING VARIOUS MACHINE LEARNING ALGORITHMS

Diagram depicting different ML algorithms

We will be using Supervised Learning in our model to reach the desired output. In general, supervised learning occurs when a system is given input and output variables with the intention of learning how they are mapped together, or related. The goal is to produce a mapping function accurate enough that, when new input is given, the algorithm can predict the output. This is an iterative process: each time the algorithm makes a prediction, it is corrected or given feedback until it achieves an acceptable level of performance.

Our problem calls for deploying Classification algorithms. Briefly, Classification builds a model from the training set and its known class labels, and then uses that model to predict the categorical class labels of new data. There are a number of classification models, including logistic regression, decision tree, random forest, gradient-boosted tree, multi-layer perceptron, one-vs-rest, and Naive Bayes.

Dataset

Let us take a look at the dataset that we are using for running various Classification algorithms.

Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive). Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarized below:

  1. Mean of the integrated profile. (DECIMAL)
  2. Standard deviation of the integrated profile. (DECIMAL)
  3. Excess kurtosis of the integrated profile. (DECIMAL)
  4. Skewness of the integrated profile. (DECIMAL)
  5. Mean of the DM-SNR curve. (DECIMAL)
  6. Standard deviation of the DM-SNR curve. (DECIMAL)
  7. Excess kurtosis of the DM-SNR curve. (DECIMAL)
  8. Skewness of the DM-SNR curve. (DECIMAL)
  9. Class (INTEGER)
Attribute Information

Next, let us import the necessary libraries required for building our model(s).

Now, let us load the dataset.

Here, the dataset in form of a csv file is loaded from the user’s drive.
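
As a minimal sketch of this loading step, the snippet below parses a CSV with the nine-column layout described earlier using only the Python standard library. The column names and the two data rows are made up for illustration, and the in-memory string stands in for the real file, whose name is not given here.

```python
import csv
import io

# Two made-up rows in the nine-column layout described above
# (eight features, then the class label). Names and values are illustrative only.
SAMPLE_CSV = """mean_ip,std_ip,kurt_ip,skew_ip,mean_dm,std_dm,kurt_dm,skew_dm,target_class
120.5,48.2,0.25,-0.10,3.1,20.4,8.0,75.0,0
35.0,30.1,3.90,18.50,60.2,70.3,1.1,0.5,1
"""

def load_candidates(file_obj):
    """Return (features, labels) parsed from a CSV file object."""
    reader = csv.reader(file_obj)
    next(reader)  # skip the header row with the column names
    features, labels = [], []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        features.append([float(v) for v in row[:-1]])  # first eight columns
        labels.append(int(row[-1]))                    # final column is the class
    return features, labels

X, y = load_candidates(io.StringIO(SAMPLE_CSV))
print(len(X), "rows,", len(X[0]), "features, labels:", y)
```

To read an actual file, the same function can be passed an object returned by `open(...)` instead of the `io.StringIO` wrapper.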

So, this is how our dataset looks.

Sample of the dataset

Data Preprocessing

We observe that no column contains null values, all attributes share the same numeric datatype, and their values fall within a workable range. Thus, it saves us from deploying any of the Data Preprocessing techniques like missing value imputation, normalization, standardization, scaling, etc.
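
A check of this kind can be sketched in plain Python. The helper `column_summary` below is a hypothetical name, and the small `rows` list is made-up data; it simply counts missing entries and reports the value range for each column.

```python
import math

def column_summary(rows):
    """Return, per column, the number of missing values and the min/max of the rest."""
    n_cols = len(rows[0])
    summary = []
    for j in range(n_cols):
        values = [r[j] for r in rows]
        # Treat both None and NaN as missing.
        present = [v for v in values if v is not None and not math.isnan(v)]
        summary.append({
            "missing": len(values) - len(present),
            "min": min(present),
            "max": max(present),
        })
    return summary

# Made-up data: the second column has one missing value.
rows = [[1.0, 5.0], [2.0, float("nan")], [3.0, 7.0]]
for j, info in enumerate(column_summary(rows)):
    print("column", j, info)
```

If every column reports zero missing values, no imputation step is needed.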

Information of the values in the columns of attributes

Correlation

Data correlation is the way in which one set of data may correspond to another set. In ML, think of how your features correspond with your output.

The kind of correlation matters when choosing a model. One cannot use linear regression to model a nonlinear dataset, and the opposite is also true: if you have a linearly correlated dataset, you need a simple model like linear regression, and even the best CNN will give you a poor result.

It becomes very hard to figure out how data correlates if you have more than two features. Data visualization can help find how individual features may correlate with the output.

We have plotted a heat map to check how the attributes are related to each other.

Heat-map depicting correlation between attributes

We do not find any linear correlation between the attributes.
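
The numbers behind such a heat map are pairwise correlation coefficients. The sketch below assumes the Pearson coefficient, which is a common default for correlation heat maps; the function names and the toy columns are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a list of columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Toy columns: the second is a scaled copy of the first, the third runs the other way.
cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in correlation_matrix(cols):
    print([round(v, 2) for v in row])
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate none.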

Splitting the Dataset

We split our dataset into a training set (75% of the rows) and a testing set (the remaining 25%).
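
A 75%-25% split can be sketched with the standard library alone. The function name, the fixed seed and the toy data below are assumptions made for illustration; a library helper would normally do this job.

```python
import random

def split_75_25(features, labels, seed=0):
    """Shuffle row indices and return X_train, X_test, y_train, y_test."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    cut = int(len(indices) * 0.75)         # first 75% of shuffled rows go to training
    train_idx, test_idx = indices[:cut], indices[cut:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: 8 rows with one feature each, label = parity of the feature.
toy_X = [[i] for i in range(8)]
toy_y = [i % 2 for i in range(8)]
X_train, X_test, y_train, y_test = split_75_25(toy_X, toy_y)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Shuffling the indices rather than the rows keeps each feature row paired with its own label.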

LOGISTIC REGRESSION

Logistic regression uses a linear equation with independent predictors to compute a value, which can lie anywhere between negative infinity and positive infinity. We need the output of the algorithm to be a class variable, i.e. 0 (no) or 1 (yes). Therefore, we squash the output of the linear equation into the range (0, 1) using the sigmoid function, and read the result as the probability of class 1.
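
The squashing step can be sketched as follows. The weights, bias and the 0.5 threshold are made-up values for illustration, not parameters learned from the pulsar data.

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Probability of class 1: sigmoid of the linear equation."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict_class(weights, bias, x, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical weights for a two-feature model.
print(predict_proba([0.8, -0.4], 0.1, [1.0, 2.0]))
print(predict_class([0.8, -0.4], 0.1, [1.0, 2.0]))
```

Here the linear equation evaluates to 0.1, so the probability is slightly above 0.5 and the predicted class is 1.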

Code for Logistic Regression

Evaluating the model

Evaluating your machine learning algorithm is an essential part of any project. Most of the time we use classification accuracy to measure the performance of a model, and here too we will use the accuracy score to evaluate ours.

Accuracy score, or Classification Accuracy, is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
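
This ratio is easy to compute by hand. The sketch below uses a hypothetical helper named `accuracy` and made-up label lists to show the calculation.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: three of the four predictions match the true class.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```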

Accuracy for Logistic Regression

DECISION TREE CLASSIFIER

A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it does so recursively, a process called recursive partitioning.
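
A single partitioning step can be sketched as below. The sketch assumes Gini impurity as the split criterion, which is one common choice and not necessarily the one used for the results in this blog; the function names and toy data are illustrative.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_threshold(values, labels):
    """Try each observed value as a 'value <= t' rule; return (t, weighted impurity)."""
    best = (None, float("inf"))
    n = len(values)
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:   # a split must send rows to both sides
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Toy feature that separates the two classes cleanly at value <= 3.
print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree repeats this search on each resulting subset until the nodes are pure or a stopping rule is reached.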

Let us use the decision tree classifier on our model.

Code for Decision Tree Classifier

Evaluating our model

Accuracy for Decision Tree Classifier

RANDOM FOREST CLASSIFIER

A random forest classifier creates a set of decision trees, each from a randomly selected subset of the training set. It then aggregates the votes from the different decision trees to decide the final class of the test object. It is an ensemble algorithm. Ensemble algorithms are those which combine more than one algorithm, of the same or different kinds, for classifying objects.
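
The two ingredients described here, random subsets and vote aggregation, can be sketched as follows. The sketch assumes the subsets are drawn with replacement (bootstrap sampling), and the per-tree votes are made-up values rather than outputs of trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class predicted by most trees (ties go to the class seen first)."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
print(bootstrap_sample(list(range(10)), rng))   # one random training subset

# Made-up predictions from five trees for a single candidate.
print(majority_vote([1, 0, 1, 1, 0]))
```

Each tree in the forest would be trained on its own bootstrap sample, and the forest's prediction is the result of the vote.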

Code for Random Forest Classifier

Here, n_estimators = Number of decision trees used.

Evaluating our model

Accuracy for Random Forest Classifier

SUPPORT VECTOR MACHINE

SVM can be formally defined as, “A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two dimensional space this hyperplane is a line dividing a plane in two parts where in each class lay in either side.”

Code for SVM Classifier

In an SVM, the hyperplane is learned by transforming the problem using some linear algebra, and this is where the kernel plays its role. We have used the ‘RBF’ (radial basis function) kernel, a popular kernel function used in various kernelized learning algorithms and, in particular, in support vector machine classification.
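
The RBF kernel itself is a short formula: it measures the similarity of two points as exp(-gamma * squared distance). The sketch below shows it in plain Python; the value of `gamma` and the sample points are assumptions chosen for illustration.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF similarity: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # identical points give the maximum, 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))   # similarity shrinks as points move apart
```

A larger `gamma` makes the similarity fall off faster with distance, so each training point influences a smaller neighbourhood.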

Evaluating our model

Accuracy for SVM Classifier

COMPARISON FOR THE DIFFERENT ALGORITHMS

Accuracy Scores comparison for various models

Observing the above table, it is quite evident that the Random Forest Classifier (n_estimators=45) performs the best among the above models for our dataset, with an Accuracy score of 0.9832402234636871. In our runs, increasing n_estimators further decreased the overall accuracy, which we attribute to overfitting. Since adding trees to a random forest does not usually cause overfitting, this drop may also reflect random variation between runs.

CONCLUSION

The blog was an attempt to explain the various Classification Algorithms that can be used to predict a pulsar star. The attributes selected were quite relevant to the model, with an effective range of values, thus giving us great accuracy without the need for Data Preprocessing techniques like normalization, scaling, etc. 🙂