Source: Deep Learning on Medium
When we apply the same equation, we get the plane shown in the image above, which misclassifies 3 points in order to maximize the distance. To handle this problem we use the concept introduced earlier in this blog, called squashing. Squashing decreases the impact of extreme points (outliers). Because of squashing, the line/plane is less affected by outliers, which reduces misclassifications.
From this image we can see that only 1 point is misclassified, which is better than what we got above. The squashing effect comes from the underlying mathematical function called ‘SIGMOID’.
The sigmoid f(x) lies between 0 and 1 for all values of x. When we apply the sigmoid to the distances, the contribution of every point is bounded, so outliers located at the extremities can no longer dominate the normal points. The line/plane being fitted is therefore less affected by outliers, resulting in fewer misclassifications. This is what squashing is all about.
This is the equation after applying the sigmoid, which is less prone to outliers.
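To make the squashing effect concrete, here is a minimal sketch in plain Python. The distances are made-up numbers (not from the cats and dogs example), and 100.0 plays the role of the outlier. It compares how much of the total the outlier contributes before and after the sigmoid is applied.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical signed distances from the separating plane; 100.0 is the outlier.
distances = [0.5, 1.0, 2.0, 100.0]
squashed = [sigmoid(d) for d in distances]

for d, s in zip(distances, squashed):
    print(f"distance={d:6.1f}  sigmoid={s:.4f}")

# Share of the total contributed by the outlier, before and after squashing.
raw_share = distances[-1] / sum(distances)
squashed_share = squashed[-1] / sum(squashed)
print(f"outlier share: raw={raw_share:.2f}, squashed={squashed_share:.2f}")
```

Without squashing, the single extreme point accounts for almost the whole sum; after squashing, it counts for roughly as much as any other point.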
LET'S APPLY ANOTHER MATHEMATICAL FUNCTION, RELU, INSTEAD OF SIGMOID TO THE ABOVE CATS AND DOGS PROBLEM WITH OUTLIERS
Before that, let us look at ReLU.
As we can see, f(x) equals x (it grows linearly) for x ≥ 0 and is zero otherwise.
However, when we apply ReLU to the above problem, the resulting line/plane is similar to what we got in logistic regression without squashing, because ReLU simply returns the linear function (wx) when wx ≥ 0 and zero when wx < 0. That means ReLU cannot squash the impact of outliers or, to be more precise, of extreme points. So from the analysis so far we can say that ReLU is more prone to outliers than sigmoid.
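The difference can be seen side by side in a small sketch. The wx values below are hypothetical, with 100.0 standing in for an extreme point; the helper names `relu` and `sigmoid` are introduced here only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: output is bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU: returns z when z >= 0 and 0 otherwise."""
    return max(0.0, z)

# Hypothetical values of wx; 100.0 stands in for an extreme point.
wx_values = [-2.0, 0.5, 2.0, 100.0]
relu_out = [relu(z) for z in wx_values]
sigmoid_out = [sigmoid(z) for z in wx_values]

for z, r, s in zip(wx_values, relu_out, sigmoid_out):
    print(f"wx={z:6.1f}  relu={r:6.1f}  sigmoid={s:.4f}")
```

ReLU passes the extreme value through unchanged, while the sigmoid output stays bounded no matter how large wx becomes.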
Now let's come back to Neural Networks
Let's experiment with neural networks using ReLU and sigmoid on a regression dataset.
I am using the California Housing dataset.
The reason for using this dataset is that it has some extreme points (outliers), which lets us do our analysis better. The analysis is done in two stages.
First stage: apply various neural network architectures to the data and test the resulting error.
Second stage: remove the outliers from the data, then apply the neural networks again and test the resulting error.
Fetching the data set
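import pandas as pd
from sklearn.datasets import fetch_california_housing
d = fetch_california_housing()
# `da` is used below but never defined in the original; assumed to be the feature table
da = pd.DataFrame(d.data, columns=d.feature_names)
Splitting the dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(da, d.target, test_size=0.30)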
Standardizing the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# apply the statistics learned on the training split to the test split
X_test = sc.transform(X_test)
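Standardizing means subtracting each feature's mean and dividing by its standard deviation. Here is a minimal NumPy sketch of that arithmetic on made-up numbers (not the housing data), assuming the statistics are computed on the training split only and then reused for the test split.

```python
import numpy as np

# Made-up feature matrices with 2 features; not taken from the housing data.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training split only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # reuse the training mean and std

print(X_train_scaled.mean(axis=0))  # close to 0 for every feature
print(X_train_scaled.std(axis=0))   # close to 1 for every feature
print(X_test_scaled)
```

Reusing the training statistics keeps the test split on the same scale as the data the model was trained on.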
Stage 1: Applying NN without removing outliers
Model 1: Architecture: Input-output layered NN (1–1)
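from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
# single output neuron; swap 'relu' for 'sigmoid' to run the sigmoid variant
model.add(Dense(1, activation='relu', input_shape=(8,)))
# compile() is missing in the original; MSE is the reported loss, the 'adam' optimizer is an assumption
model.compile(loss='mse', optimizer='adam')
history = model.fit(X_train, y_train, batch_size=32, epochs=600, validation_data=(X_test, y_test))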
Model 1 with ReLU as activation function. At the end of training, MSE (loss) on train is 0.50 and on test is 0.509.
Model 1 with sigmoid as activation function. At the end of training, MSE (loss) on train is 0.389 and on test is 0.402.
As expected, the loss is higher with ReLU than with sigmoid, because we have not removed the outliers from the dataset. ReLU has no squashing property, which explains its larger loss, whereas sigmoid does well by squashing those outliers.
However, Model 1 used only a single neuron and had no hidden layers. If we change the architecture by adding hidden layers, the results may be different. Let's see.
Model 2: Architecture: Input-2 hidden-output layered NN (64–32–16–1)
model = Sequential()
# the Dense layers for the 64–32–16–1 architecture are added here (not shown in the original)
history = model.fit(X_train, y_train, batch_size=32, epochs=600, validation_data=(X_test, y_test))
Model 2 with ReLU as activation function. At the end of training, MSE (loss) on train is 0.2334 and on test is 0.279.
Model 2 with sigmoid as activation function. At the end of training, MSE (loss) on train is 0.257 and on test is 0.309.
Now the results are surprising, since ReLU is performing better than sigmoid. What do you think is the reason behind this? There is a lot to say in answer to this question, so let me put it in the simplest way. As we go deeper into neural networks, the loss does not depend on outliers alone; there are many aspects to consider. The most important one is the vanishing gradient problem, which is mainly observed with sigmoid activation. Vanishing gradient is a phenomenon in backpropagation where the gradients become so small that the weights (w) stay almost constant and the network stops learning.
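A rough way to see why depth hurts sigmoid more than ReLU is to multiply the activation derivatives along a chain of layers. The sketch below is a deliberate simplification and not the networks trained above: it assumes one unit per layer, every weight fixed to 1, and the same positive pre-activation z at each layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Simplifying assumptions: one unit per layer, weights of 1, same z everywhere.
z = 1.0
for depth in (1, 3, 10):
    sig_factor = sigmoid_grad(z) ** depth
    relu_factor = relu_grad(z) ** depth
    print(f"depth={depth:2d}  sigmoid factor={sig_factor:.2e}  relu factor={relu_factor:.0f}")
```

Because the sigmoid derivative never exceeds 0.25, the factor reaching the early layers shrinks quickly as layers are stacked, while the ReLU factor stays at 1 for active units.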
In logistic regression, sigmoid is mainly used for squashing. In neural networks its role is no longer the same: it now acts as an activation function that decides how strongly a particular neuron is activated.
The above image is an example of how a small number of hidden layers and neurons lets the mapping function (blue line) be affected by outliers, which happens mainly with the ReLU activation function and less with sigmoid. Having said that, let's get into stage 2.
Stage 2: Applying NN after removing outliers
Since we have few features, we can analyze each feature individually using box plots to detect outliers in the dataset. The image below shows box plots of 6 features from the dataset.
We can see that the points marked with yellow circles are outliers. So let's see if removing these points from the dataset can reduce the MSE (loss).
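Box plots conventionally draw points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual markers. Here is a minimal sketch of that rule, assuming this default convention was used for the plots above; the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up values for one feature; 34.0 plays the role of an extreme point.
values = np.array([0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 34.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional box-plot fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("fences:", lower_fence, upper_fence)
print("flagged as outliers:", outliers)
```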
Calculating percentiles for each feature
import numpy as np
print('99TH AND 100TH PERCENTILES OF FEATURE AVEBEDRMS:', np.percentile(da.AveBedrms, [99, 100]))
# OUTPUT: 99TH AND 100TH PERCENTILES OF FEATURE AVEBEDRMS: [ 2.12754082 34.06666667]
Similarly for the remaining features: if the 99th and 100th percentiles differ by a large amount, select the 99th percentile as the threshold and remove the points above it.
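The thresholding step can be sketched as follows. The data here is synthetic (a stand-in for a single feature, generated with a fixed seed), so the numbers will not match the housing dataset; in the real analysis the same mask would be built per feature and applied to the rows of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature: mostly small values plus a few extreme ones.
feature = np.concatenate([rng.normal(1.0, 0.1, size=995),
                          [10.0, 15.0, 20.0, 30.0, 34.0]])

p99, p100 = np.percentile(feature, [99, 100])
print("99th and 100th percentiles:", p99, p100)

# Large gap between the two percentiles: use the 99th percentile as the threshold.
keep = feature <= p99
filtered = feature[keep]
print("rows kept:", filtered.size, "of", feature.size)
```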
After removing the extreme points, which make up about 2.4 percent of the whole dataset, we can again split the data into train and test and standardize it.
We then apply the same architectures discussed above to this data.
Model 1 with ReLU as activation function. At the end of training, MSE (loss) on train is 0.431 and on test is 0.395.
Model 1 with sigmoid as activation function. At the end of training, MSE (loss) on train is 0.394 and on test is 0.353.
After removing outliers, Model 1 with ReLU performed significantly better than Model 1 with ReLU in stage 1. Model 1 with sigmoid also improved somewhat, because sigmoid only squashes the impact of outliers and does not completely eliminate it, so removing them still changes the loss.
Model 2 with ReLU as activation function. At the end of training, MSE (loss) on train is 0.242 and on test is 0.246.
Model 2 with sigmoid as activation function. At the end of training, MSE (loss) on train is 0.258 and on test is 0.257.
Model 2 with ReLU seems to perform a little better here, mainly because it converges faster than sigmoid and is less prone to the vanishing gradient problem.
From the whole experiment: ReLU is impacted by outliers when the neural network is not deep. When the architecture goes deeper, ReLU behaves like the other activation functions in this respect, and it even tends to regularize better and converge faster than the others.