Source: Deep Learning on Medium
Statistical Modeling — The Full Pragmatic Guide
Interpreting Machine Learning Models — Part 2
Continuing Our series of posts on how to interpret Machine Learning algorithms and predictions.
Part 0 (optional) — What is Data Science and the Data Scientist
Part 1 — Introduction to Interpretability
Part 1.5 (optional) — A Brief History of Statistics (May be useful to understand this post)
Part 2 — (this post) Interpreting models of high bias and low variance. Linear Regressions.
Part 3 — Interpreting low bias and high variance models.
Part 4 — Is it possible to resolve the trade-off between bias and variance?
Part 5 — Local Methods of Interpretability.
Part 6 — Global Methods of Interpretability.
In this post we will focus on the interpretation of high-bias, low-variance models. As we explained in the previous post, these algorithms are the easiest to interpret, but they assume several prerequisites about the data. We will choose linear regression to represent this group of algorithms. If you have no idea what linear models are, you might want to check out the article A Brief History of Statistics.
All the code for this post is available on Kaggle.
The purpose here is not to explain what these linear models are or how they work, but how to interpret their parameters and estimates; still, a brief introduction may be helpful. Linear models can be simple regressions like OLS, regularized regressions like Lasso and Ridge, classification models such as Logistic Regression, and even time-series models such as ARIMA. What they all have in common is linear parameters: when we estimate the “weights” of the variables, they are constant at any level. Interestingly, a neural network can also be a linear model if its activations are linear (f(x) = x); such a one-layer network is equivalent to the simple linear regression we will use here, but far less efficient.
Let’s create a theoretical world where we are interested in interpreting the effects of various variables on people’s earnings. In our hypothetical world there is a minimum wage of $1000, and each year of education adds on average $500 to the monthly salary. Because our world is stochastic (not deterministic), there is randomness.
When running a regression model, we get the line that produces the smallest possible error: yhat = x * 496 + 1084. That is, the model was able to “understand” the reality we created: it estimated a slope of ~496 (very close to the $500 we built in) and an intercept of ~1084, and the interpretation in this case is quite straightforward. The model identified the minimum wage (income when education equals zero) and how much one year of education changes people’s income: roughly $500.
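A minimal sketch of this toy world, assuming the exact noise level and sample size (the post doesn't state them), fitted with ordinary least squares via NumPy:

```python
import numpy as np

# Hypothetical re-creation of the toy world: minimum wage ~$1000,
# each year of education adds ~$500 on average, plus random noise.
rng = np.random.default_rng(42)
educ = rng.uniform(0, 16, 1000)
salary = 1000 + 500 * educ + rng.normal(0, 300, 1000)

# OLS fit: design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(educ), educ])
intercept, slope = np.linalg.lstsq(X, salary, rcond=None)[0]
print(round(intercept), round(slope))  # close to 1000 and 500
```

The recovered coefficients land near the true values, just as the post's ~1084 and ~496 did.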
But this case is very simple and very far from reality. Incidentally, it is very similar to the model created by Galton in the nineteenth century; with a single predictor, R² is just the square of the correlation coefficient r. In the real world many variables explain wages, so let’s insert more variables into this model.
In our “world v2” we will have the following behavior, more like reality:
Salaries are explained by three components as follows:
– Grit = random variable ranging from 0 to 16.
– Education = random variable from 0 to 16, plus part of Grit, since grit affects how much you educate yourself.
– Experience = random variable from 0 to 25.
$ = Grit*200 + Experience*200 + Education*300 + random part
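A sketch of this "world v2", assuming a 0.5 mixing weight of grit into education and a specific noise level (neither is stated in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
grit = rng.uniform(0, 16, n)
# Education is partly driven by grit, so the two end up correlated;
# the 0.5 mixing weight is an illustrative assumption.
educ = rng.uniform(0, 16, n) + 0.5 * grit
exper = rng.uniform(0, 25, n)
salary = 200 * grit + 200 * exper + 300 * educ + rng.normal(0, 500, n)

# Correlation matrix: rows/cols are salary, grit, educ, exper.
# This is the matrix a heatmap would display.
corr = np.corrcoef([salary, grit, educ, exper])
print(corr.round(2))
```

Note that grit and education are correlated by construction, while experience is independent of both.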
One way to look at these relationships between variables is through a correlation chart in a heatmap:
Looking at the first column, we might think that the most important variable is Grit, because its correlation with Salary is the highest, and we would say that Experience and Education have almost equal effects.
An alternative way to display the behavior between the variables, one I prefer but still haven’t managed to make popular, is through graphs, where each node is a variable and the color intensity of each edge is the “strength” of the correlation:
We can see more clearly that Salary is the central variable and that Education and Grit are correlated with each other. So, when estimating the correlation between Grit and Salary, we are probably capturing part of the effect of Education, overestimating the effect of Grit and underestimating the effect of Education. We say the correlation is “contaminated”.
How to solve it?
The great trick to interpreting linear regressions is understanding how partial correlations work. If you understand this deeply, you will be halfway to doing causal analysis, which is the subject of another post. To do this, let’s create a “statistical language” with Venn diagrams as follows:
- Each circle represents a variable.
- The size of the circle represents the variance of that variable.
- The intersections between the circles represent the covariance of those variables. We can interpret it as correlation without loss of generality.
How do we read this representation? Basically, Salary has a variance that is explained by Education and Grit, but since Education and Grit are correlated, they explain the same stretch of variance, i.e., there is double counting. When we use partial correlations, what we are doing is throwing away this double count and keeping only the “pure” correlations, which are uncorrelated with every other variable in the model. In this case we discard the 100 that is explained by both Grit and Education and keep only the 200 (Grit -> $) and the 300 (Educ -> $). And that is exactly what linear regression does for us:
Let’s put it to the test. When regressing without the Educ variable or without the Grit variable, we notice that each captures the other’s effect: for the purpose of predicting wages, removing one variable does not hurt much, because since they are correlated, part of the effect is captured by the remaining variable. To interpret the effects, however, we should ideally include all the important variables; otherwise the estimated effects will be contaminated. In the case of the Exp variable (which was constructed to be uncorrelated with the others), the partial correlation is very similar to the ordinary correlation, since there are no joint effects. With Venn diagrams:
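The omitted-variable contamination can be sketched directly, assuming the same illustrative 0.5 grit-to-education weight as before (not stated in the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
grit = rng.uniform(0, 16, n)
educ = rng.uniform(0, 16, n) + 0.5 * grit   # correlated with grit by construction
salary = 200 * grit + 300 * educ + rng.normal(0, 500, n)

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Omitting education: grit's coefficient absorbs part of education's effect
# (expected bias: 300 * cov(educ, grit) / var(grit) = 150, so roughly 350).
b_contaminated = ols(salary, grit)[1]
# Including both: the partial effects come back close to the true 200 and 300.
_, b_grit, b_educ = ols(salary, grit, educ)
print(round(b_contaminated), round(b_grit), round(b_educ))
```

This is the double-counting story in code: the simple regression attributes the shared slice of variance entirely to grit, while the multiple regression throws it away and keeps only the pure effects.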
Statistical modeling and its interpretations.
As we have repeated several times, this model has many prerequisites, so let’s start relaxing these assumptions one by one and interpreting the results.
One of the strongest assumptions is that the return of the variables (X) on the target (y) must be constant, hence the name linear model (the estimated parameters are linear). But what if the behavior is not exactly linear? Do we have to resort to other models? The short answer is no: we can model the problem so that a linear model captures nonlinearities. Let’s go to the examples:
Let’s imagine that the return on education over wages is no longer constant: it actually peaks and then begins to decline. That is, not only does the wage not increase forever, but the speed at which it increases shrinks until it reverses. This is a very plausible hypothesis and can be observed in real data. Estimating a linear model on this new reality gives a rather strange result:
Doesn’t look like a good fit, right? This is very common in real problems: effects get stronger or weaker along the variable. The way we handle it is by adding the education variable twice, a linear (original) part and a quadratic part, so that a linear model can capture nonlinear behavior:
Since the model still estimates partial correlations, to interpret these variables we need to consider both parts of education simultaneously. The estimated linear and quadratic parts are 648 and -32, while the true values were 600 and -30. We can then, for example, compute the education level that maximizes wages by taking the maximum of the curve.
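A sketch of the quadratic trick, using the true coefficients 600 and -30 from the text (noise level and sample size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
educ = rng.uniform(0, 16, 5000)
# Concave return: wages peak and then decline (true parameters 600 and -30).
salary = 1000 + 600 * educ - 30 * educ ** 2 + rng.normal(0, 300, 5000)

# Add education twice: once linear, once squared.
X = np.column_stack([np.ones_like(educ), educ, educ ** 2])
b0, b1, b2 = np.linalg.lstsq(X, salary, rcond=None)[0]

# The education level that maximizes wage is the vertex of the parabola,
# -b1 / (2*b2); with the true parameters that is 600 / 60 = 10 years.
peak = -b1 / (2 * b2)
print(round(b1), round(b2), round(peak, 1))
```

The model remains linear in its parameters even though the fitted curve is not a line.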
Another very common case of nonlinear effects arises when variables have a constant percentage effect rather than a constant nominal effect. An example would be estimating the effect of headcount (X) on the production of a bakery (y). If there is only one employee, productivity is high.
When you hire one more, production increases a lot: they can take turns, and while one serves customers the other restocks, etc. As we add more employees, productivity keeps dropping; the tenth employee is no longer so productive, although he still increases production. We call this a marginally diminishing effect, and one way to model it is by applying the natural logarithm (ln). An extra employee when you have 1 is a 100% increase, whereas an extra employee when you have 10 is only a 10% increase.
In addition to capturing this percentage behavior, taking logs helps mitigate the effects of skewed distributions and outliers, and often turns distributions like this one into something much closer to a normal distribution.
How do we interpret this new variable after the logarithmic transformation? Basically as a percentage change rather than a nominal one. Let’s go to the examples:
When we run the regression on the ln of Salary, two things happen. The first is that R² increases from 0.065 to 0.125 (it doubled!), meaning our modeling is on the right track. But when we look at the estimated value for education, it went from 300 to 0.0062. How do we interpret it? As a percentage change! The new interpretation is: one more year of education, instead of raising the salary by $300, raises it in this model by about 0.62% (100 × 0.0062). We call this a log-level model, and the estimated value becomes a semi-elasticity. If we take the log of both variables, we get a log-log model, and the interpretation would be: a 1% increase in education increases y by the estimated percentage. We call this effect an elasticity (the same price elasticity we always see in pricing teams).
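A log-level sketch; the 8% growth rate per year of education here is an illustrative assumption, not the post's estimated 0.0062:

```python
import numpy as np

rng = np.random.default_rng(2)
educ = rng.uniform(0, 16, 5000)
# Salary grows by a constant *percentage* per year of education
# (8% per year, an assumption for illustration), with multiplicative noise.
salary = 1000 * np.exp(0.08 * educ + rng.normal(0, 0.1, 5000))

# Regress ln(salary) on education: the slope is a semi-elasticity.
X = np.column_stack([np.ones_like(educ), educ])
_, b = np.linalg.lstsq(X, np.log(salary), rcond=None)[0]
print(round(b, 3))  # close to 0.08: one extra year raises salary by ~8%
```

Taking the log of education as well would turn the slope into an elasticity, read as percent change in y per percent change in x.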
Categorical Variables
We already know from other models how to add a categorical variable: we encode it as a dummy (0 or 1) and run the regression with this new variable. Let’s create such a variable in our salary model representing whether or not the individual was born in Brazil. Since our institutions are not the best, for the same individual with the same experience, education and grit, the penalty for living in Brazil is (in our theoretical world) -$1000.
Note that the predicted lines are parallel, indicating that the slope is given entirely by education and that being born in Brazil or not only shifts the whole curve downward.
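A sketch of the dummy-variable setup, using the text's -$1000 penalty (noise level and the 50/50 split between groups are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
educ = rng.uniform(0, 16, n)
brazil = rng.integers(0, 2, n)            # dummy: 1 if born in Brazil, else 0
salary = 1000 + 500 * educ - 1000 * brazil + rng.normal(0, 300, n)

X = np.column_stack([np.ones(n), educ, brazil])
_, b_educ, b_brazil = np.linalg.lstsq(X, salary, rcond=None)[0]
# b_brazil ≈ -1000: the dummy shifts the whole line down;
# the education slope stays ≈ 500 for both groups (parallel lines).
print(round(b_educ), round(b_brazil))
```

The dummy's coefficient is read as a level shift between the two groups, holding the other variables constant.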
Dummies are super powerful and can help control for many complex effects, including time, but that is a matter for the causality post.
Interaction between variables.
Lastly, we need to address a rather complex behavior of nature: variables interact with each other. Take the previous example: is the return on education the same regardless of the other variables? Shouldn’t living in Brazil or not affect that return? If it does, this model, because it is very simple, cannot capture this behavior, and again we need to model it explicitly so that it can.
In our example, the variable Brazil will interact with the variable Education. For the model to capture this behavior, we just need to create a third variable that is the product of the two. Note that we are multiplying by 1’s and 0’s, so the new variable repeats the education column when the observation equals 1 and is zero otherwise. The regression result is as follows:
Allowing the model to interact two variables lets the return on education differ by case. The interpretation now is that although being born in Brazil has a negative level effect (holding everything else constant), the return on education in Brazil is higher (the line is steeper) than outside Brazil. That is, we have an estimated education slope (e.g. 300), an estimated effect of being born in Brazil (e.g. -1000), and an estimated extra education effect for Brazilians (e.g. 50). When we want the return on education in Brazil, we add the 300 (common to everyone) + the 50 extra for being Brazilian.
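The interaction setup above can be sketched with the example coefficients from the text (300, -1000, 50); noise level and group split are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
educ = rng.uniform(0, 16, n)
brazil = rng.integers(0, 2, n)
# Being born in Brazil lowers the level by 1000 but adds 50 to the education slope.
salary = 1000 + 300 * educ - 1000 * brazil + 50 * educ * brazil \
         + rng.normal(0, 300, n)

# The interaction term is just the product of the two columns.
X = np.column_stack([np.ones(n), educ, brazil, educ * brazil])
_, b_educ, b_brazil, b_inter = np.linalg.lstsq(X, salary, rcond=None)[0]
# Return on education in Brazil = b_educ + b_inter ≈ 300 + 50 = 350.
print(round(b_educ), round(b_brazil), round(b_inter))
```

Outside Brazil the slope is just b_educ; inside Brazil the slope is the sum of the two coefficients, which is exactly the "add 300 + 50" reading above.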
All the code and plots are available on Kaggle.