Original article can be found here (source): Artificial Intelligence on Medium

Let’s get into a little more detail and explore what the rotation represents conceptually and what that does for the data scientist performing PCA.

Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction. One of the most popular is principal component analysis, or PCA for short.

PCA in data science is a processing step in a pipeline. We want to reduce the complexity of our data set in preparation for procedures we want to perform later in the pipeline.

PCA is a dimensionality reduction technique that can map high dimensional vectors to a lower-dimensional space.

We want to reduce complexity while retaining the informational meaning. In the case of PCA, the information is in the form of variance: how much the data varies. Reducing complexity also reduces vulnerability to aggregated error.

A dataset can have hundreds, thousands, or more rows and columns. Each column (or row, depending on orientation) can be a feature. Some features are more relevant than others in their predictive power vis-à-vis a target.

A PCA processed matrix can also be part of a constraint for optimization.

Models built from data that include irrelevant features are cumbersome and vulnerable to accumulated errors. These models are also susceptible to overfitting the training data. They are poorly constructed compared to models trained from the most relevant data.

The number of quantities to calculate in a matrix grows quadratically as the dimension of the matrix increases.

For example, a symmetric matrix of dimension 3000 has about 4.5 million distinct elements to calculate and track, while a matrix of dimension 70 has only about 2.5 thousand quantities to estimate. That is a big difference in scale, with far fewer opportunities to introduce estimation error.
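These counts are easy to check. Assuming the figures refer to the distinct entries of a symmetric matrix (such as a covariance matrix), the formula is n(n + 1)/2:

```python
def unique_entries(n):
    """Distinct entries of a symmetric n x n matrix (e.g. a covariance matrix)."""
    return n * (n + 1) // 2

print(unique_entries(3000))  # 4,501,500 -- about 4.5 million
print(unique_entries(70))    # 2,485 -- about 2.5 thousand
```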

PCA is an optimization process that reduces the data set, model, and training complexity.

The trick is to figure out which features of the data are relevant and which are not.

Principal Component Analysis compresses data into a format that retains the data’s core information. That core info is the variance of the original data across its columns or rows.

PCA is a series of calculations that give us a new and unique basis for a data set.

# So why is it unique?

PCA calculates new dimensions and ranks them by their variance content.

The first dimension is the dimension along which the data points are the most spread out. The data points have the most variance along this dimension.

# And what does that mean exactly?

Picture the 2D case. PCA creates a new axis in the plane and calculates the coordinates of our data points along it, projecting each point perpendicularly, by the shortest path, onto the new axis.

PCA chooses the new axis in such a way that the new coordinates are spread out as much as possible: they have maximum variance. The line along which the coordinates are most spread out is also the line that minimizes the perpendicular distance of each point to the new axis.

The basis minimizes reconstruction error. Maximizing variance and minimizing reconstruction error go hand in hand.
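This equivalence can be checked numerically. Here is a minimal sketch with NumPy on synthetic 2D data (the mixing matrix is an arbitrary choice): the variance of the coordinates along the top eigenvector of the covariance matrix equals its largest eigenvalue, and the total variance splits exactly into projected variance plus residual (reconstruction) variance, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data, centered
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix
cov = (X.T @ X) / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
axis = eigvecs[:, -1]                    # direction of the largest eigenvalue

# Variance of the projected coordinates equals the top eigenvalue
proj = X @ axis
assert np.isclose(proj.var(ddof=1), eigvals[-1])

# Total variance = projected variance + residual (reconstruction) variance
residual = X - np.outer(proj, axis)
total = X.var(axis=0, ddof=1).sum()
assert np.isclose(total, proj.var(ddof=1) + residual.var(axis=0, ddof=1).sum())
```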

The translation between these two views uses the Pythagorean theorem.

The squared distance from the origin to the projection, plus the squared distance from the projection to the point, equals the squared distance from the origin to the point.
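A quick numeric check of this identity (the point and axis below are arbitrary choices):

```python
import numpy as np

point = np.array([3.0, 4.0])
axis = np.array([1.0, 1.0]) / np.sqrt(2)   # unit vector for the new axis

proj = (point @ axis) * axis               # foot of the perpendicular
# origin->projection squared + projection->point squared == origin->point squared
lhs = proj @ proj + (point - proj) @ (point - proj)
assert np.isclose(lhs, point @ point)
```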

# Extraction

PCA extracts the patterns represented by the variance in the data and performs dimensionality reduction. At the core of the PCA method is a matrix factorization from linear algebra called eigendecomposition.

Say we have a dataset with 1000 columns. Another way to say it is our dataset has 1000 dimensions. Do we need so many dimensions to capture the variance of the dataset? Most times, we don’t.

We need a fast and easy way to remove features that don’t contribute to the variance. With PCA, we can capture the essence of the data of 1000 dimensions in a much lower number of transformed dimensions.
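As an illustration, here is a sketch with synthetic data: 1000 nominal features that are really driven by only 5 latent factors, so 5 principal components recover essentially all of the variance. (The sample sizes and the latent-factor construction are illustrative assumptions, not part of the article.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 1000 nominal features that are really driven by only 5 latent factors
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 1000))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())  # ~1.0: five PCs capture the essence
```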

# Variance

Each of the 1000 features, represented by the columns or rows, contains a certain amount of variance. Some of the values are higher and some lower than the average.

Features that don’t vary, that stay the same across observations, provide no insight or predictive power.

The more variance a feature contains, the more important the feature: it carries more ‘information’. Variance states how the value of a particular feature varies throughout the data. PCA ranks features by their amount of variance.

# Principal Components

Now that we know the variance, we need to find a transformed feature set that can explain the variance more efficiently. PCA uses the original 1000 features to make linear combinations that extract the variance into new features. These transformed features are the Principal Components (PCs).

The Principal Components are not original features; they are new, mutually uncorrelated features built as linear combinations of the originals. The first PC has the most significant share of the variance explained. The second PC will have the second-highest variance, and so on.
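Two properties of the PC scores are worth verifying directly: they are mutually uncorrelated (their covariance matrix is diagonal), and their variances decrease from the first component onward. A sketch with synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Three correlated features (the mixing matrix is arbitrary)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))

scores = PCA().fit_transform(X)

# Covariance of the PC scores: diagonal, with decreasing variances
cov = np.cov(scores, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
assert np.allclose(off_diag, 0)                 # PCs are uncorrelated
assert np.all(np.diff(np.diag(cov)) <= 1e-9)    # variance ranked high to low
```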

PCA helps you understand whether a small number of components can explain a large portion of the variance across all the data observations.

Say, for example, the first PC explains 60% of the total variance in the data, the second PC explains 25%, and the next four PCs together contain 13% of the variance. In this case, you have 98% of the variance captured by only 6 Principal Components.

Say the next 100 features in total explain another 1% of the total variance. It makes no sense to include 100 more dimensions to get an extra one percent of variance. By taking the top 6 Principal Components, we have reduced the dimensionality from more than 100 to 6.

PCA ranks Principal Components in order of their explained variance. We can select the top components that together explain a sufficiently high ratio of the variance. You can choose that level as an input to the PCA routine.
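Scikit-learn’s PCA supports exactly this: passing a float between 0 and 1 as n_components keeps just enough components to reach that explained-variance ratio. A sketch with synthetic data whose per-feature scales loosely mimic the example above (the scales themselves are an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# A couple of dominant features, a few moderate ones, and 100 tiny ones
scales = np.array([60.0, 25.0] + [3.0] * 4 + [0.1] * 100)
X = rng.normal(size=(500, len(scales))) * scales

# Ask for enough components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)                           # a handful, not 106
print(pca.explained_variance_ratio_.cumsum()[-1])  # >= 0.95
```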

PCA provides insight into how variance is distributed through your data set. PCA creates a reduced data set that is easier to handle in matrix math calculations for optimization. PCA is used in quantitative finance in building risk factor models.

Reducing the dimensionality reduces the effect of error terms that can aggregate. PCA also addresses over-fitting by eliminating superfluous features. If your model over-fits, it will work well on the training data but perform poorly on new data. PCA helps address this.

# Conclusion

PCA is a feature extraction technique. It is an integral part of feature engineering.

We are looking to create better models with more predictive power, but the map is not the territory. We are creating an abstraction that loses some of the original fidelity. The trick is to make the PC matrix as simple as possible, but no simpler.

Prediction is an ideal that we strive to asymptotically approach. We can’t reach perfection. If we strive for perfection, we can attain excellence.

# Python has PCA built into Scikit-learn.

A great feature of the Python ecosystem, if you needed more convincing that it’s a powerful language for data science, is the easy-to-use PCA engine in Scikit-learn. Scikit-learn is a free machine-learning library for Python. Import PCA from Scikit-learn and try it out!
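For instance, a minimal run on the classic Iris dataset, reducing its 4 features to 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # first two PCs carry nearly all the variance
```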

PCA is a crucial tool and component in machine learning. I hope this helps make it more accessible as part of your toolkit.