Machine Learning? What the f?

Machine Learning is nothing more than a set of tools for understanding data. These tools fall into two categories of learning: supervised, where we build statistical models to predict or estimate a response (output) based on information we have (inputs), and unsupervised, where we only have a set of inputs, no correct answers to compare against, and the goal is to learn the relationships and structure within that information.

Since there are already plenty of texts, explanations and very good examples about the use of machine learning in our daily lives, I will go straight to the content I missed the most when learning: the mathematical concepts!

I know, I know, it may sound boring, but I guarantee that if you want to do quality work on this topic and be able to have valid discussions and arguments with coworkers and bosses, these concepts are extremely important.

What the f(x)?

Imagine that your company recently ran a marketing campaign to promote a new product. It used TV, Facebook and newspaper advertisements. You have a spreadsheet with all the money invested in each channel, together with the resulting sales of the newly launched product.

We can say that the channels are your features, inputs or regressors, and are called X. The sales result is your response, output or target, and is called Y.

We can assume that there is a relationship between X and Y, where X = (TV, Facebook, Newspaper), or in the generic form, X = (X₁, X₂, …, Xₚ) for p different variables. This relationship can be written as Y = f(X) + ϵ, where f is a fixed but unknown function of X, and ϵ is the random error, which is independent of X and has mean zero. Thus, we can say that f represents the systematic information that X gives us about Y.
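
If you like to see things in code, here is a minimal sketch of this setup in Python. The numbers and the linear form of f below are invented purely for illustration, not taken from any real campaign:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100                                       # number of campaigns observed
X = rng.uniform(0, 100, size=(n, 3))          # spend on (TV, Facebook, Newspaper)

def f(X):
    # A made-up "true" f, assumed linear only for this illustration.
    return 5.0 + 0.05 * X[:, 0] + 0.20 * X[:, 1] + 0.01 * X[:, 2]

eps = rng.normal(0.0, 1.0, size=n)            # random error: mean zero, independent of X
Y = f(X) + eps                                # observed sales = f(X) + error
```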

Therefore, we can say that machine learning is nothing more than a set of approaches to estimate f.

Why do we want to estimate f?

First, because we may want to make predictions about a result (Y) using information (X) that we have in hand. For this we say that the prediction of Y is Ŷ and that f̂ is our estimate for f. Therefore Ŷ = f̂(X). Whenever we put a hat on a symbol, we are referring to an estimate.

The accuracy of Ŷ as a prediction for Y depends on two quantities that we call the reducible error and the irreducible error. Most of the time f̂ will not be a perfect estimate for f, and this inaccuracy will translate into an error. This is the reducible error, because we can decrease it by improving the accuracy of f̂. However, even if we had a perfect estimate of f, so that Ŷ = f(X), we would still have errors in our prediction, because Y is also a function of ϵ, which by definition cannot be predicted using X. So the variability associated with ϵ also affects the accuracy of our prediction, and we call this the irreducible error: no matter how well we estimate f, we cannot reduce the error introduced by ϵ.

Why is the irreducible error greater than zero? Because we always have a limited amount of information at hand, and ϵ may represent variables or variations that were not measured but that matter for predicting Y; since they were not measured, f cannot use them for its prediction.

If we then want to measure the difference between the response Y and our estimate Ŷ, we can compute the expected value, which is nothing more than the average of this difference, squared so that positive and negative errors do not cancel out. Considering then that Ŷ = f̂(X), we have:

E(Y - Ŷ)² = E[f(X) + ϵ - f̂(X)]²

E(Y - Ŷ)² = [f(X) - f̂(X)]² + Var(ϵ)

Where [f(X) - f̂(X)]² is the squared difference between the real value of f and our estimate (the reducible error) and Var(ϵ) represents the variance associated with the error ϵ (the irreducible error). The cross term that appears when expanding the square vanishes because ϵ is independent of X and has mean zero.
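
To see that this decomposition really holds, here is a quick numerical check; the functions f and f̂ below are invented just for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 10, size=n)

f = lambda x: 3.0 + 2.0 * x            # hypothetical "true" f
f_hat = lambda x: 2.5 + 2.1 * x        # a deliberately imperfect estimate f̂

eps = rng.normal(0.0, 2.0, size=n)     # irreducible error, Var(ϵ) = 4
y = f(x) + eps
y_hat = f_hat(x)

mse = np.mean((y - y_hat) ** 2)                # E(Y - Ŷ)²
reducible = np.mean((f(x) - f_hat(x)) ** 2)    # [f(X) - f̂(X)]², averaged over X
irreducible = np.var(eps)                      # Var(ϵ), close to 4

print(mse, reducible + irreducible)            # the two values are nearly identical
```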

The second reason we want to estimate f is to understand how Y is affected by changes in X₁, X₂, …, Xₚ. In this case, even though we estimate f, our goal is not necessarily prediction itself, but understanding how Y changes as a function of X.

How do we estimate f?

Consider a data set with n different observed points. These observations are called the training data, and we will use them to teach our method how to estimate f. Let xᵢⱼ be the value of the jᵗʰ variable for observation i, where i = 1, 2, …, n and j = 1, 2, …, p. We also use yᵢ to represent the response variable for the iᵗʰ observation. Our training set then consists of {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where each xᵢ = (xᵢ₁, xᵢ₂, …, xᵢₚ).
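
In code, the training data is usually just a matrix of inputs and a vector of responses. Something like this, with invented numbers for the marketing example:

```python
import numpy as np

# Each row is one observation x_i = (x_i1, ..., x_ip); each column is one feature.
X_train = np.array([
    [50.0, 20.0, 10.0],   # x1: (TV, Facebook, Newspaper) spend for campaign 1
    [30.0, 35.0,  5.0],   # x2
    [80.0, 10.0, 25.0],   # x3
])                         # shape (n, p) = (3, 3); X_train[i, j] holds x_ij
y_train = np.array([12.0, 9.5, 14.2])   # y_i: sales for the i-th observation
```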

Our goal here is to apply a machine learning method to the training data to estimate the unknown function f. In short, we want to find a function f̂ such that Y ≈ f̂(X) for any observation (X, Y).

The methods for this can be divided into parametric and non-parametric models.

Parametric Model

Parametric models follow a two-step approach:

  1. First, we make an assumption about the form of the function f. We can, for example, assume that the function f is linear and therefore f(X) = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ. Since we assume that f is linear, the problem of estimating f becomes much easier: we no longer need to estimate an arbitrary p-dimensional function, only the coefficients β₀, β₁, β₂, …, βₚ.
  2. After selecting the form of the function f, that is, the model, we need a procedure that will train our model with our training data. In the case of the linear model, the procedure needs to estimate the coefficients β₀, β₁, β₂, …, βₚ so that Y ≈ β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ. The most common approach to fitting this model is referred to as least squares (see the sketch right after this list).
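
Here is a minimal least-squares sketch under the linear assumption, using synthetic data; the coefficients and noise level are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))                       # channel spend
Y = 5.0 + X @ np.array([0.05, 0.20, 0.01]) + rng.normal(0, 1, n)

X_design = np.column_stack([np.ones(n), X])                # add a column of 1s for β₀
beta_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)    # least-squares estimate of the β's
Y_hat = X_design @ beta_hat                                # fitted values, Y ≈ f̂(X)
print(beta_hat)                                            # close to (5.0, 0.05, 0.20, 0.01)
```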

The disadvantage of this approach is that the model we choose will rarely match the true form of f, which is unknown. If we choose a model too far from the real form of f, our estimate will be poor. We can try to choose more flexible models that can fit a wider range of shapes for f. However, more flexible models require a greater number of parameters, increasing the complexity of the model and opening the door to overfitting, which means the model ends up following the errors (the noise) as well as the underlying signal.

Income prediction based on years of education and years of seniority. In this case we assume that the relationship is linear and a plane is drawn. Figure taken from the book Introduction to Statistical Learning, page 22.
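To make the overfitting point concrete, here is a toy example of my own (not from the book): the true f is linear, but a very flexible polynomial also chases the noise and can behave badly on new data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 15))
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)   # true f is linear, plus noise

rigid = np.polyfit(x, y, deg=1)       # 2 parameters: roughly recovers f
wiggly = np.polyfit(x, y, deg=10)     # 11 parameters: also follows the errors

x_new = 11.0                           # a point just outside the training range
print(np.polyval(rigid, x_new), np.polyval(wiggly, x_new))   # the flexible fit can be wildly off
```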

Non-Parametric Model

Non-parametric models do not assume any particular form for the function f; they look for an estimate of f that gets close to the observed data without being too rough or wiggly. This approach can be more advantageous than the parametric one, since it can fit a much wider variety of shapes for f; however, it requires a much larger number of observations.

Income prediction based on years of education and seniority using a non-parametric model. Figure taken from the book Introduction to Statistical Learning, page 24.
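As one example of a non-parametric method (the book's figure uses a smooth spline-type fit; here I use k-nearest-neighbours regression instead, with invented data), the estimate at a new point is built directly from the closest training observations, with no assumed form for f:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 2))        # e.g. years of education, years of seniority
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.2, size=300)   # a non-linear "income"

# No assumption on the shape of f: the estimate at a point is the average
# response of its 10 nearest neighbours in the training data.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
print(knn.predict([[5.0, 3.0]]))             # estimated response at a new observation
```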

Trade-off between accuracy and interpretability

The methods used to estimate f may be less flexible, producing a small number of possible shapes for f, or more flexible, producing a large number of possible shapes for f.
And why would anyone choose a more restrictive method over a more flexible one? Because more restrictive models are much easier to interpret.

Trade-off between interpretability and flexibility of machine learning models. Figure taken from the book Introduction to Statistical Learning, page 25.

The concepts presented here are only a tiny fraction of the vast range of subjects and developments in machine learning and data science, but understanding them is essential for building everything else.