“The devil is in the details,” and Machine Learning is no exception.


We will start with the most basic…

Let me use my favorite algorithm for all problems!

As you can imagine, the same piece won’t fit every puzzle. How many people finish their first tutorial and start using the same algorithm for every use case they can think of? MISTAKE.

Let the data choose the model for you. Get used to the fact that once the data has been processed, you need to feed it to several different models and compare their results, so you know which ones work best and which should be discarded.
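As a rough sketch of what that comparison can look like (using scikit-learn, a toy dataset, and accuracy purely for illustration; swap in your own data, candidates, and metric):

```python
# Compare several candidate estimators under the same cross-validation
# protocol and look at their scores before committing to any one of them.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```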

We continue with…

Outliers are not important!

Excuse me?! Do you understand the context of your use case? Outliers can be crucial or safely ignored, but YOU MUST LOOK AT THE CONTEXT. Is it not important to identify a spike in the company’s sales? You would lose not only money, but trust.

From a more technical point of view, models have different sensitivities to outliers depending on the use case and the algorithm at hand. Compare, for example, an AdaBoost (Adaptive Boosting) model, where outliers are treated as “hard” cases and assigned very large weights, with a Decision Tree, where an outlier might simply end up as one misclassified point.
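Here is a small, purely illustrative experiment on synthetic data: inject a single extreme outlier and see how much each model’s predictions move. The data and the specific numbers are made up; the point is only the difference in behaviour.

```python
# Contrast how one injected outlier shifts the predictions of an AdaBoost
# ensemble versus a single decision tree on synthetic regression data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

# Inject a single extreme outlier.
y_outlier = y.copy()
y_outlier[50] += 20.0

models = [
    ("decision_tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    ("adaboost", AdaBoostRegressor(random_state=0)),
]
for name, model in models:
    clean = model.fit(X, y).predict(X)
    dirty = model.fit(X, y_outlier).predict(X)
    print(f"{name}: max prediction shift = {np.abs(clean - dirty).max():.2f}")
```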

Third… Mean squared error is always great!

Okay, we all know it is a reasonable default, but once we move to real-world problems, this error function is often a poor fit for the use case we are actually trying to solve.

A clear example is fraud detection. Imagine we want to penalize false negatives heavily, because every missed fraud case means money lost. We could use the mean squared error, and it would give us a number, but one far removed from the real cost. And remember, we are talking about money!
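One way to make that asymmetry explicit is a cost-sensitive evaluation function. The sketch below assumes made-up costs (a missed fraud costs far more than an unnecessary review); in practice you would plug in your own business numbers.

```python
# Cost-sensitive metric: false negatives (missed fraud) are weighted much
# more heavily than false positives (an unnecessary manual review).
import numpy as np

COST_FALSE_NEGATIVE = 500.0  # assumed average loss of a missed fraud case
COST_FALSE_POSITIVE = 5.0    # assumed cost of reviewing a false alarm

def fraud_cost(y_true, y_pred):
    """Total monetary cost of the classifier's mistakes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    return (false_negatives * COST_FALSE_NEGATIVE
            + false_positives * COST_FALSE_POSITIVE)

# Two classifiers with the same number of errors, very different costs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
print(fraud_cost(y_true, [0, 1, 0, 0, 0, 0, 0, 0]))  # misses one fraud -> 500.0
print(fraud_cost(y_true, [1, 1, 1, 0, 0, 0, 0, 0]))  # one false alarm  -> 5.0
```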

And when we talk about money, mistakes are not accepted.

Days, hours, months … What do I do with the cyclical features?

When we talk about hour 23, we have to make sure the model sees hour 0 as sitting right next to it. A common mistake is failing to convert cyclical features into a representation that preserves this original structure.

Solution? Compute the sine and cosine components so that each value becomes an (x, y) coordinate on a circle. That way, if we encode the hour of day, 23 always goes hand in hand with 0, as it should.
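A minimal sketch of this encoding for the hour of day (the column name and the use of pandas are just for illustration):

```python
# Encode the hour as sine/cosine coordinates on the unit circle so that
# hour 23 ends up adjacent to hour 0.
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": range(24)})
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Hours 23 and 0 are now close in (sin, cos) space; 23 and 12 are far apart.
p23 = df.loc[23, ["hour_sin", "hour_cos"]].to_numpy()
p0 = df.loc[0, ["hour_sin", "hour_cos"]].to_numpy()
p12 = df.loc[12, ["hour_sin", "hour_cos"]].to_numpy()
print(np.linalg.norm(p23 - p0), np.linalg.norm(p23 - p12))
```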

What about regressions? …

Understanding the coefficients

Linear regression generally returns a p-value for each coefficient. What is the most common mistake? Assuming “higher coefficient, greater importance”. Wrong.

Remember, the scale of a variable completely changes the value of its coefficient. And if the variables or features are collinear, the coefficients can shift from one feature to another. Therefore, the larger our feature set, the higher the probability that some of them are collinear, and the less reliable any isolated interpretation becomes.

As a reminder: if a variable X1 is a linear combination of another variable X2 (i.e., they are collinear), the two are related by the expression X1 = b1 + b2·X2, with b1 and b2 constant, and the correlation coefficient between them will be 1 (or -1 if b2 is negative).

We do want to know the importance or weight of each feature, but the coefficients don’t tell the whole story, only part of it.
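A quick illustration of the scale problem on made-up data: expressing the same feature in cents instead of dollars shrinks its raw coefficient by a factor of 100, even though the fitted model is effectively the same.

```python
# Rescaling a feature changes its coefficient without changing the model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
dollars = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * dollars.ravel() + rng.normal(0, 1, 200)

coef_dollars = LinearRegression().fit(dollars, y).coef_[0]
coef_cents = LinearRegression().fit(dollars * 100, y).coef_[0]
print(coef_dollars, coef_cents)  # ~3.0 vs ~0.03: same fit, very different "importance"
```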

Finally…

Regularize without standardizing, NO!

As you know, learning consists of finding the coefficients that minimize a cost function, and regularization consists of adding a penalty term to that cost function. The penalty is applied to all coefficients on the same footing, so features measured on larger scales end up with artificially small coefficients and are penalized very differently from the rest. However, we sometimes forget how important it is to standardize before regularizing. Imagine the amount of money we would lose if we shipped a model that mixes variables expressed in dollars with others expressed in cents. A DEPOSIT.
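A minimal sketch of doing it right, assuming scikit-learn, ridge regression, and a stand-in dataset: put the scaler and the regularized model in one pipeline so the penalty sees every feature on a comparable scale.

```python
# Standardize features before applying a regularized (L2-penalized) model.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)  # placeholder dataset

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.score(X, y))
```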