Original article was published by Tony Yiu on Artificial Intelligence on Medium
Correlated Features Help Explain Variance
So why does line fitting help us explain the variance? Think about what the line’s equation represents: it defines the relationship between X and Y. By multiplying X by the slope b1 and adding the intercept b0 (b0 + b1*X), we get our prediction of Y for a given value of X.
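As a minimal sketch of the prediction b0 + b1*X, the snippet below fits a line to synthetic data with NumPy. The data, the "true" coefficients (2 and 3), and the helper name `predict` are assumptions made for illustration; none of them come from the original article.

```python
import numpy as np

# Hypothetical synthetic data: Y depends linearly on X, plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=200)

# np.polyfit returns coefficients from highest degree to lowest: [b1, b0].
b1, b0 = np.polyfit(x, y, deg=1)


def predict(x_new):
    """Prediction of Y for a given X, using the fitted line b0 + b1*X."""
    return b0 + b1 * x_new


y_hat = predict(x)  # one prediction per observed X
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

With this setup the fitted b0 and b1 should land close to the values used to generate the data, though the exact numbers depend on the random noise.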
Of course, this only works if X and Y are correlated (either positively or negatively). Correlation measures the tendency of two things to move together (or in opposite directions, in the case of a negative correlation). For example, people’s height and weight are positively correlated. The taller you are, the more you tend to weigh.
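To make the height and weight example concrete, here is a small sketch that measures correlation with `np.corrcoef`. The heights and weights are simulated, not real measurements, and the numbers used to generate them are arbitrary assumptions.

```python
import numpy as np

# Simulated (not real) heights and weights: taller people tend to weigh more.
rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, size=500)
weight_kg = 70 + 0.9 * (height_cm - 170) + rng.normal(0, 8, size=500)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_pos = np.corrcoef(height_cm, weight_kg)[0, 1]

# Flipping the sign of one variable flips the sign of the correlation,
# which gives a simple example of a negative correlation.
r_neg = np.corrcoef(height_cm, -weight_kg)[0, 1]

print(f"positive correlation: {r_pos:.2f}, negative correlation: {r_neg:.2f}")
```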
So line fitting allows us to infer the value of something we don’t know, Y, from the value of something that we do know, X, thanks to the correlation between X and Y.
We are explaining the variance of Y using the X variables in our linear regression equation. By attributing (a.k.a. explaining) variance this way, we are in effect reducing the variance that remains unexplained: explained variance is variance that we no longer have to worry about.
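One way to put a number on "explained variance" is to split the variance of Y into the part captured by the fitted line and the part left in the residuals. The sketch below does this on synthetic data (the data and variable names are assumptions for illustration). For a least-squares line with an intercept, the two parts add up to the total, and the explained share equals the squared correlation between X and Y.

```python
import numpy as np

# Hypothetical synthetic data for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=300)

# Least-squares line and its residuals.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
residuals = y - y_hat

total_var = np.var(y)              # all the variance in Y
explained_var = np.var(y_hat)      # variance attributed to X via the line
unexplained_var = np.var(residuals)  # variance still left over

# Share of variance explained (R squared) and the X-Y correlation.
r_squared = explained_var / total_var
corr = np.corrcoef(x, y)[0, 1]

print(f"total = {total_var:.3f}, explained = {explained_var:.3f}, "
      f"unexplained = {unexplained_var:.3f}, R^2 = {r_squared:.3f}")
```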
The linchpin assumption is that the X variables are readily measurable, understood, and not themselves a mystery. A common modeling mistake is to forecast something using variables that are themselves not observable in real time. A model that forecasts the future using data from the future is no good, because that data will not exist yet when the forecast has to be made.
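A simple guard against this mistake is to pair each target value only with feature values that were already observed. The sketch below assumes a hypothetical setting where X at time t becomes known at the same moment as Y at time t, so a forecast of Y at time t may only use X from time t-1 or earlier. The arrays are random placeholders, not real data.

```python
import numpy as np

# Placeholder time series: one feature value and one target value per time step.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = rng.normal(size=100)

# Leaky pairing (under the assumption above): x[t] is used to "forecast" y[t],
# even though x[t] is not known before y[t] is.
x_leaky, y_leaky = x, y

# Usable pairing: forecast y[t] from x[t-1], which has already been observed.
x_lagged, y_target = x[:-1], y[1:]

print(len(x_lagged), len(y_target))
```

The lagged version loses one observation, which is the price of only using information available at forecast time.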
Let’s see how this process works. When we start out with just a Y variable, all we know about it is its distribution. In the figure below, all the variance (the green cone) is unexplained and our best guess of the future value of Y is its mean.
Now let’s say we’ve found a feature variable, X, with a positive correlation to Y. The figure below shows how X helps to explain variance. We can segment the observations of Y into two sections based on their X values. For low values of X, Y has an expected value of Mean 1 and a variance approximated by the left red cone. For high values of X, Y has an expected value of Mean 2 and a variance approximated by the right red cone.
Notice how by segmenting this way, the variance (the red cones) has been reduced relative to the original green cone. And our new prediction of Y, either Mean 1 or Mean 2 depending on what X is, is significantly more precise than our previous naive prediction (the mean of Y).
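The two-segment idea can be checked numerically. The sketch below splits synthetic data at the median of X (the split point and the data are assumptions; the article's figure may use a different cut) and compares the variance of Y overall with the variance left inside the two segments.

```python
import numpy as np

# Hypothetical synthetic data with a positive X-Y relationship.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=400)
y = 2.0 + 3.0 * x + rng.normal(0, 2.0, size=400)

# With no feature, the best guess is the overall mean and all variance is unexplained.
total_var = np.var(y)

# Split the observations of Y into low-X and high-X segments.
split = np.median(x)
low, high = y[x <= split], y[x > split]
mean_1, mean_2 = low.mean(), high.mean()

# Size-weighted average of the variance inside each segment.
within_var = (low.size * np.var(low) + high.size * np.var(high)) / y.size

print(f"Mean 1 = {mean_1:.2f}, Mean 2 = {mean_2:.2f}")
print(f"variance before split = {total_var:.2f}, after split = {within_var:.2f}")
```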
But why stop at two segments? If we drew more blue dashed vertical lines and broke the data into even more buckets, we could explain even more variance and generate even more precise predictions.
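The effect of adding buckets can be sketched by repeating the calculation with finer and finer splits. The helper `within_bucket_variance` below is a hypothetical name introduced for this example; it uses equal-width buckets on X, which is one of several reasonable choices.

```python
import numpy as np


def within_bucket_variance(x, y, n_buckets):
    """Size-weighted variance of y around its bucket means (equal-width buckets on x)."""
    u = (x - x.min()) / (x.max() - x.min())  # rescale x to [0, 1]
    # Bucket index per point; the maximum x is clipped into the last bucket.
    idx = np.minimum((u * n_buckets).astype(int), n_buckets - 1)
    total = 0.0
    for b in range(n_buckets):
        members = y[idx == b]
        if members.size:  # skip empty buckets
            total += members.size * np.var(members)
    return total / y.size


# Hypothetical synthetic data for illustration.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(0, 2.0, size=500)

# Unexplained variance for 1, 2, 4, 8 and 16 buckets.
unexplained = {k: within_bucket_variance(x, y, k) for k in (1, 2, 4, 8, 16)}
for k, v in unexplained.items():
    print(f"{k:>2} buckets: unexplained variance = {v:.2f}")
```

One caveat: with very many buckets, each bucket holds only a few points, so part of the apparent improvement comes from fitting noise rather than signal.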
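And that’s what linear regression does, in effect. Rather than literally drawing tons and tons of vertical lines, it fits a single line that gives a separate prediction for every value of X, as if the data had been broken into numerous tiny segments whose means all fall on that line. Among all straight lines, the least-squares fit is the one that maximizes the variance explained and, with it, the precision of our prediction.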
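As a rough comparison, the sketch below sets the unexplained variance of a fitted line against that of a two-bucket split on the same synthetic data. The data is generated from a straight line plus noise, which is an assumption that favors the regression; when the true relationship is not linear, the picture can look different.

```python
import numpy as np

# Hypothetical synthetic data: a linear relationship plus noise.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=500)
y = 2.0 + 3.0 * x + rng.normal(0, 2.0, size=500)

# Unexplained variance after fitting a least-squares line.
b1, b0 = np.polyfit(x, y, deg=1)
line_unexplained = np.var(y - (b0 + b1 * x))

# Unexplained variance after a two-bucket split at the median of X.
split = np.median(x)
low, high = y[x <= split], y[x > split]
two_bucket_unexplained = (low.size * np.var(low) + high.size * np.var(high)) / y.size

# Share of the variance of Y explained by the line.
r_squared = 1.0 - line_unexplained / np.var(y)

print(f"no feature:  {np.var(y):.2f}")
print(f"two buckets: {two_bucket_unexplained:.2f}")
print(f"fitted line: {line_unexplained:.2f} (R^2 = {r_squared:.2f})")
```

The line reaches its low unexplained variance with only two fitted numbers, b0 and b1, instead of one mean per bucket.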