Understanding Principal Component Analysis

Original article was published by Trist’n Joseph on Artificial Intelligence on Medium

A breakdown of PCA, when to use it, and why it works

Image by Trist’n Joseph

Machine learning (ML) is a subset of artificial intelligence (AI) that gives systems the ability to learn and improve from experience without being explicitly programmed. ML algorithms find patterns in data that generate insight and support data-driven decisions and predictions. These algorithms are used every day to make critical decisions in medical diagnosis, stock trading, transportation, legal matters and much more. It is easy to see why data scientists place ML on such a high pedestal: it provides a medium for high-priority decisions that can guide better business and smarter actions, in real time and without much human intervention.

To learn, ML models use computational methods to extract information directly from data without relying on a predetermined equation. The algorithms are designed to detect a pattern in the data and develop a target function which best maps an input variable, x, to a target variable, y. Note that the true form of the target function is usually unknown. If the function were known, ML would not be needed.

The idea, therefore, is to find the best estimate of this target function by conducting sound inference about the sample data, and then to apply and optimize the ML technique appropriate for the situation at hand.

Image by Trist’n Joseph

Now this task has the potential to seem quite simple; how hard can it be to find a function which uses x and outputs y? And sometimes, given the circumstance, it is quite simple. For example, suppose that we want to predict an individual’s income, and the only supporting variables we have are the number of hours worked per year and their hair colour. Chances are that their hair colour will not have that much influence on their income, whereas the number of hours worked will. Therefore, the function to predict income will have the number of hours worked as the input. Done!

But this is not always the case. Developing a model that can produce accurate predictions is quite difficult because of the relationships between variables. In most ‘real-world’ scenarios, there are multiple input variables which exist concurrently. Each input variable can influence the output variable, but the inputs can also influence each other, and understanding these complex relationships helps build better models. The example presented above is simple because there were only two predictor variables to choose from and one seemed unrelated to the scenario. Even so, variable selection and model fitting are crucial parts of developing the appropriate function.

Image by Trist’n Joseph

Model fitting refers to making an algorithm determine the relationship between the predictors and the outcome so that future values can be predicted. The more predictors a model has, the more it can learn from the data. However, sample data usually contains random noise, and this, along with the number of predictors within the model, can cause the model to learn spurious patterns within the data. If one tries to combat this risk by including fewer predictors, the model may not learn enough information from the data. These issues are known as overfitting and underfitting respectively, and the goal is to determine an appropriate mix between simplicity and complexity.

So, how can we find this balance between simplicity and complexity? This is particularly difficult if there are so many variables that it is not reasonably possible to understand the relationships among them. In cases like this, the idea would be to perform dimensionality reduction. As the name suggests, it involves using various techniques to reduce the number of features within a data set.

This can be done in two main ways: feature exclusion and feature extraction. Feature exclusion refers to keeping only the variables which ‘could be used’ to predict the output, whereas feature extraction refers to developing new features from the existing variables in the data set. Think of feature exclusion as simply dropping or keeping variables which might be included in a model, and feature extraction as creating new (and hopefully fewer) variables from the existing variables.
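The difference between the two approaches can be sketched in a few lines of Python. The rows, the column meanings and the weights below are all made up for illustration; in particular, the weights are arbitrary placeholders and not something PCA produced.

```python
# Hypothetical rows: (hours worked per year, years of experience, hair colour code).
rows = [(1800, 2, 0), (2000, 5, 1), (2200, 9, 2)]

# Feature exclusion: keep a subset of the original columns, unchanged.
kept = [(hours, years) for hours, years, _ in rows]

# Feature extraction: build one new feature from several existing ones.
# These weights are arbitrary placeholders chosen for the illustration.
weights = (0.001, 0.1)
extracted = [weights[0] * hours + weights[1] * years for hours, years, _ in rows]
```

Exclusion leaves two of the original columns in place, while extraction replaces them with a single new column that is a weighted mix of both.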

Image by Trist’n Joseph

Principal component analysis (PCA) is a method of feature extraction which groups variables in a way that creates new features and allows features of lesser importance to be dropped. More formally, PCA is the identification of linear combinations of variables that provide maximum variability within a set of data.

To calculate the components, this method utilizes elements from linear algebra (such as eigenvalues and eigenvectors) to determine what combination would result in maximum variance. The explicit math is not covered within this article, but I will attach suggested material at the end which covers this. Essentially, suppose that the data was plotted on a graph. The PCA method will find the average along each axis (variable) within the data and then shift the points until the centre of the averages is at the origin.
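The centring step described above can be sketched as follows, assuming a small made-up sample with two variables. The data values and the helper name `center` are hypothetical and only serve the illustration.

```python
from statistics import mean

# Hypothetical two-variable sample (values chosen only for illustration).
points = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]

def center(data):
    """Shift the points so that the mean of each variable sits at the origin."""
    means = [mean(column) for column in zip(*data)]
    return [tuple(value - m for value, m in zip(row, means)) for row in data]

centered = center(points)
```

After this shift, the values of each variable sum to zero, so the centre of the data is at the origin while the relative positions of the points are unchanged.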

Next, a straight line through the origin is fit to the data so that it minimizes the sum of squared perpendicular distances between itself and the data points. An equivalent way of determining this best fit line is to find the line through the origin which maximizes the sum of squared distances from the projected points (along the line) to the origin. The two are equivalent because, by Pythagoras' theorem, each point's squared distance to the origin is fixed and equals its squared distance to the line plus the squared distance of its projection from the origin. Once this line is determined, it is referred to as the first principal component.
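To make the two views of the best fit line concrete, here is a rough brute-force sketch. It assumes a small, already centred, made-up data set and simply tries candidate lines through the origin in 0.1 degree steps. Real PCA implementations use eigenvectors or a singular value decomposition instead of a search; the search is only meant to mirror the description above.

```python
import math

# Hypothetical centred data (each variable already has mean zero).
centered = [(-3.0, -2.0), (-1.0, 0.0), (1.0, -1.0), (3.0, 3.0)]

def projected_ss(data, angle):
    """Sum of squared distances from the projected points to the origin."""
    ux, uy = math.cos(angle), math.sin(angle)
    return sum((x * ux + y * uy) ** 2 for x, y in data)

def perpendicular_ss(data, angle):
    """Sum of squared perpendicular distances from the points to the line."""
    ux, uy = math.cos(angle), math.sin(angle)
    return sum((x * uy - y * ux) ** 2 for x, y in data)

# Candidate lines through the origin, from 0 to 179.9 degrees.
angles = [math.radians(step / 10) for step in range(1800)]
best_by_projection = max(angles, key=lambda a: projected_ss(centered, a))
best_by_distance = min(angles, key=lambda a: perpendicular_ss(centered, a))

# Unit vector along the best fit line: the first principal component.
first_pc = (math.cos(best_by_projection), math.sin(best_by_projection))
```

Both searches land on the same angle, which is the point of the equivalence: maximizing the spread of the projected points and minimizing the distances to the line pick out the same line.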

Image by Trist’n Joseph

The slope of this line can be calculated and used to find the mix of variables that maximizes variation. Suppose that there are two variables and the slope of the line is found to be 0.25. This means that for every 4 units covered on one axis, 1 unit is covered on the other axis. Therefore, the optimal mix for these two variables would be 4 parts variable 1 with 1 part variable 2.
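The arithmetic behind the "4 parts to 1 part" mix can be sketched as below. The rescaling to unit length is a common convention for reporting the weights (often called loadings) and is an addition here, not something the example above requires.

```python
import math

slope = 0.25  # slope of the best fit line from the example above

# A direction with this slope: 1 unit along variable 1, 0.25 along variable 2,
# which is the same as 4 parts variable 1 to 1 part variable 2.
direction = (1.0, slope)

# Rescale to unit length so the weights are comparable across components.
length = math.hypot(*direction)
loadings = tuple(component / length for component in direction)
```

The resulting weights are roughly 0.97 for variable 1 and 0.24 for variable 2, which keeps the 4 to 1 ratio while giving the direction a length of one.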

Assuming that the first principal component does not account for 100% of the variation within the data set, the second principal component can be determined. This refers to the linear combination of variables which maximizes variability among all other linear combinations that are orthogonal to the first. Simply, once the first principal component is accounted for, the second principal component maximizes the remaining variability. If we suppose once again that there are two variables, and that the first principal component is already determined, the second principal component will be the line that is perpendicular to the initial best fit line.
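For the two-variable case, the perpendicular direction can be written down directly. The sketch below assumes a first principal component with slope 0.25 (a made-up value) and rotates it by 90 degrees; with more than two variables there are many orthogonal directions, and the second component is the one among them with the largest remaining variability.

```python
import math

# Hypothetical first principal component: unit vector with slope 0.25.
length = math.hypot(1.0, 0.25)
first_pc = (1.0 / length, 0.25 / length)

# In two dimensions, the only orthogonal direction is a 90 degree rotation.
second_pc = (-first_pc[1], first_pc[0])

# Orthogonal directions have a dot product of zero.
dot_product = first_pc[0] * second_pc[0] + first_pc[1] * second_pc[1]
```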

Finally, the amount of variation for each principal component can be determined by dividing the sum of squared distances (from the projected points to the origin) for that component by the sample size minus 1. Recall that the idea is to reduce the dimensions of the data set. Therefore, the percentage of variation explained by each principal component can be found by totalling the variations and then dividing each by the sum. If it is found that the first principal component accounts for ~90% of the variation within the data, it may be reasonable to use only the first principal component going forward.
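The calculation can be sketched for two variables as follows. The data set is made up, and the closed-form angle used to find the first component is a shortcut that only works for two variables; it stands in for the eigenvector calculation a general implementation would use.

```python
import math

# Hypothetical centred data (each variable already has mean zero).
centered = [(-3.0, -2.0), (-1.0, 0.0), (1.0, -1.0), (3.0, 3.0)]

sxx = sum(x * x for x, _ in centered)
syy = sum(y * y for _, y in centered)
sxy = sum(x * y for x, y in centered)

# Closed-form angle of the first principal component (two variables only).
angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(angle), math.sin(angle))
pc2 = (-pc1[1], pc1[0])  # perpendicular to the first component

def variation(data, direction):
    """Sum of squared projected distances divided by the sample size minus 1."""
    ss = sum((x * direction[0] + y * direction[1]) ** 2 for x, y in data)
    return ss / (len(data) - 1)

variations = [variation(centered, pc1), variation(centered, pc2)]
total = sum(variations)
explained = [v / total for v in variations]
```

For this particular sample the first component explains roughly 92% of the total variation, so dropping the second component would lose comparatively little.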

Image by Trist’n Joseph

Although this is great, PCA does have some issues. The most significant one is that the results depend directly on the scale of the variables. If one variable appears to have more variation simply because it is measured on a larger scale than the others, it will dominate the principal components and produce less than ideal results, which is why variables are commonly standardized before PCA is applied. On a similar note, the effectiveness of PCA is sensitive to skewed data with heavy tails. Lastly, PCA can be quite challenging to interpret, especially since the method mixes variables together to maximize variability.
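The scale problem can be illustrated with a small sketch. It assumes a made-up sample of hours worked and income, records income once in thousands and once in single units, and compares the first principal component in each case. The helper names `first_pc` and `standardize` are hypothetical, and the closed-form angle only applies to two variables.

```python
import math
from statistics import mean, stdev

def first_pc(data):
    """Unit-length first principal component of two-variable data."""
    mx = mean(x for x, _ in data)
    my = mean(y for _, y in data)
    sxx = sum((x - mx) ** 2 for x, _ in data)
    syy = sum((y - my) ** 2 for _, y in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle))

def standardize(data):
    """Rescale each variable to mean zero and standard deviation one."""
    columns = list(zip(*data))
    scaled = [[(v - mean(c)) / stdev(c) for v in c] for c in columns]
    return list(zip(*scaled))

# Hypothetical sample: (hours worked per week, income in thousands).
in_thousands = [(20.0, 10.0), (30.0, 30.0), (40.0, 20.0), (50.0, 60.0)]
# The same sample with income recorded in single currency units.
in_units = [(hours, income * 1000) for hours, income in in_thousands]

pc_thousands = first_pc(in_thousands)
pc_units = first_pc(in_units)
pc_std_thousands = first_pc(standardize(in_thousands))
pc_std_units = first_pc(standardize(in_units))
```

With income in single units, the first component points almost entirely along the income axis, even though the underlying data has not changed. After standardizing, both versions of the sample give the same component, with equal weight on the two variables.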

Despite its challenges, PCA is a sound method of feature extraction and dimensionality reduction and should be used to understand the relationships between variables in extremely large data sets.


Applied Multivariate Statistics with R, Daniel Zelterman


