Machine Learning Guide: Principal Component Analysis (PCA) on Breast Cancer Dataset

How to apply PCA on a real-world dataset

Data has become more valuable than ever with the tremendous advancements in data science. Real-life datasets usually have many features (columns), and some of them may be uninformative or correlated with other features. However, we usually do not know this beforehand, so we tend to collect as much data as possible.

Another drawback is that, as the number of features increases, the performance of a classifier starts to decrease after some point. More features result in more combinations that the model needs to learn in order to accurately predict the target. Therefore, with the same number of observations (rows), models tend to perform better on datasets with fewer features. Moreover, a high number of features increases the risk of overfitting.

In some cases, it is possible to accomplish the task without using all the features. For computational and performance reasons, it is desirable to use as few features as possible. Uninformative features do not provide any predictive power and only add a computational burden.

There are two main ways to reduce the number of features. The first is feature selection, which aims to find the most informative features or to eliminate uninformative ones; it can be done manually or with software tools. The second is to derive new features from the existing ones while keeping as much information as possible. This process is called feature extraction or dimensionality reduction.
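Both approaches are available in scikit-learn. Here is a minimal sketch contrasting the two on the same breast cancer dataset used in this post (SelectKBest with the f_classif score is just one of several selection tools, and feature scaling is omitted here for brevity):

#Feature selection vs. feature extraction (illustrative sketch)
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
X, y = load_breast_cancer(return_X_y=True)
#Selection: keep the 2 original features most related to the target
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)
#Extraction: derive 2 new features as combinations of all 30
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_selected.shape, X_extracted.shape)
(569, 2) (569, 2)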

One of the most widely used dimensionality reduction algorithms is Principal Component Analysis (PCA). PCA is an unsupervised learning algorithm that finds the relations among the features within a dataset. It is also widely used as a preprocessing step for supervised learning algorithms. This post is more of a practical guide than a detailed theoretical explanation, but the core idea is worth sketching briefly.
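Roughly speaking, PCA centers the data, computes the covariance matrix of the features, and projects the observations onto the eigenvectors of that matrix with the largest eigenvalues. A minimal numpy sketch of this idea on toy data (an illustration of the math, not the code we will use below):

#PCA from scratch on toy data (illustrative sketch)
import numpy as np
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
#Center the data: PCA works on deviations from the mean
X_centered = X_toy - X_toy.mean(axis=0)
#Covariance matrix of the 5 features
cov = np.cov(X_centered, rowvar=False)
#Eigenpairs of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
#Sort by decreasing eigenvalue and keep the 2 strongest directions
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
#Project the data onto the principal components
X_reduced = X_centered @ components
print(X_reduced.shape)
(100, 2)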

In this post, I will go over the breast cancer dataset and apply the PCA algorithm to reduce its dimensionality. Let’s start with importing the related libraries:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

Then we read the breast cancer dataset available on scikit-learn:

dataset = load_breast_cancer()
X, y = dataset.data, dataset.target

This dataset includes 30 features computed from cell nuclei and a target variable indicating whether the tumor is benign or malignant:

print("Features {}".format(X.shape))
print("Target {}".format(y.shape))
Features (569, 30)
Target (569,)

The feature names are stored in the feature_names attribute of the dataset object. Here are the first few:
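print(dataset.feature_names[:5])
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness']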

We first create a random forest classifier to predict the target variable using all of the 30 features. Then we apply principal component analysis to reduce the number of features to 2. Thus, we aim to explain the variance of 30 features with only 2 derived features. We will certainly not be able to explain all of the variance in the dataset, but if we get a satisfying result, we can go with 2 features. As always, we import the libraries first:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

We create a random forest classifier. The dataset is split into training and test subsets using the train_test_split function, and the classifier is trained on the training set. One important point to mention is that we also normalize the values. It is a good practice in general, because scale-sensitive models give more importance to features with larger values, and it will be essential for PCA later on, since PCA is driven by the variance of each feature.

#Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#Normalization
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Create a classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=4)
#Train the classifier
clf.fit(X_train, y_train)

The model is trained. We can now measure its performance on training and test sets:

#Training set
clf.score(X_train, y_train)
0.9976525821596244
#Test set
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
0.951048951048951

The model achieved 99% accuracy on the training set and 95% on the test set, which I think is a pretty decent result.

Let’s apply PCA to the dataset now:

#Read the data
X, y = load_breast_cancer(return_X_y=True)
#Normalize
sc = StandardScaler()
X_normalized = sc.fit_transform(X)
#Apply PCA
pca = PCA(n_components=2).fit(X_normalized)
pca_X = pca.transform(X_normalized)

Now we have 2 features instead of 30.

pca_X.shape
(569, 2)

Let’s plot these new features to have an idea if they can be successful to distinguish two different classes in the target variable:

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(10,6))
plt.title("Principal Components", fontsize=18)
#Plot the transformed data (pca_X), not the original features
plt.scatter(pca_X[:,0], pca_X[:,1], c=y, cmap='Paired')

One class is represented with brown and the other with blue. Although there is some overlap, it seems like we are able to make a good distinction between the classes with the principal components. It is important to note that these 2 principal components stand in for all 30 original features.

Let’s apply the same classification model using the principal components. We will try to predict the target variable using the 2 principal components instead of the 30 original features:

X_train, X_test, y_train, y_test = train_test_split(pca_X, y, random_state=1)
clf = RandomForestClassifier(n_estimators=100, max_depth=4)
clf.fit(X_train, y_train)
print("Accuracy on training set {}".format(clf.score(X_train, y_train)))
y_pred = clf.predict(X_test)
print("Accuracy on test set {}".format(accuracy_score(y_test, y_pred)))

The results are not as good as the previous ones, but they are definitely close. Also, the gap between training and test accuracy decreased, which is an indication of reduced overfitting.

The PCA class of scikit-learn provides an attribute called explained_variance_ratio_ which shows how much of the variance in the dataset is explained by each principal component.

pca.explained_variance_ratio_
array([0.44272026, 0.18971182])
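Together, these 2 principal components explain roughly 63% (0.443 + 0.190) of the variance of the 30 original features. If more explained variance is needed, scikit-learn also accepts n_components as a fraction between 0 and 1 and keeps as many components as necessary to reach that share. A minimal sketch on the same normalized data (the exact number of components kept depends on the dataset):

pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_normalized)
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())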