Why Activation Functions Are Used in Neural Networks

Source: Deep Learning on Medium


Hi! Today I am going to explain, both mathematically and visually, why we use a non-linear activation function in a neural network. In the upcoming posts of this series, I will explain why the first hidden layer is so important, why we only use linearly separable non-linear activation functions, and how we can approach non-linearly separable non-linear activation functions in a single perceptron.

Let's start. First, we have to know what happens when we do arithmetic operations on linear and non-linear functions.

Linear Addition

When we add two linear functions, we get another linear function:

(2x+1)+(3x+1) = 5x+2

Linear Multiplication

When we multiply two linear functions, we get a non-linear function:

(2x+1)*(3x+1) = 6x²+5x+1

Non Linear Addition

As you can see, adding a non-linear function to another non-linear function gives a non-linear function. Addition does not change the order or shape of the original bell curve; it only shifts it and stretches or shrinks it.

Non Linear Multiplication

(2x²+1) * (3x²+1) = 6x⁴+5x²+1

As you can see, multiplication changes the order and shape of the curve.
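The worked examples above can be checked with a few lines of Python. This is only a minimal sketch: it assumes polynomials are stored as coefficient lists (lowest power first), and the helper names `poly_add`, `poly_mul` and `degree` are my own, not from any library.

```python
# Polynomials as coefficient lists, lowest power first:
#   2x + 1   -> [1, 2]
#   2x^2 + 1 -> [1, 0, 2]

def poly_add(p, q):
    """Add two polynomials coefficient by coefficient."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

def poly_mul(p, q):
    """Multiply two polynomials by convolving their coefficients."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def degree(p):
    """Index of the highest non-zero coefficient (1 means linear)."""
    for k in range(len(p) - 1, -1, -1):
        if p[k] != 0:
            return k
    return 0

lin_a, lin_b = [1, 2], [1, 3]            # 2x+1 and 3x+1
quad_a, quad_b = [1, 0, 2], [1, 0, 3]    # 2x^2+1 and 3x^2+1

lin_sum = poly_add(lin_a, lin_b)         # 5x + 2
lin_prod = poly_mul(lin_a, lin_b)        # 6x^2 + 5x + 1
quad_prod = poly_mul(quad_a, quad_b)     # 6x^4 + 5x^2 + 1

print(degree(lin_sum), degree(lin_prod), degree(quad_prod))
```

Addition keeps the degree at 1, while multiplication raises it to 2 and then to 4, which matches the equations worked out by hand.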

linear + linear = linear

linear * linear = non linear

non linear * linear = non linear

non linear + linear = non linear

If you want to know exactly what linear and non-linear functions are, watch this video.

Before going into neural networks, you need to know logistic regression. I am not going to explain it here; these are the reference links:

https://medium.com/@vigneshgig/machine-learning-classification-using-logistic-regression-mathematical-concept-220c0103f5cc

https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

However, I am going to explain why the logistic function is linearly separable, using the sigmoid function (logistic function).

What is a hyperplane?

According to Wikipedia, in geometry a hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines.
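To make the definition concrete, here is a small sketch that tells you on which side of a hyperplane a point lies. It assumes the hyperplane is written as w·x + b = 0; the names `w`, `b` and `side_of_hyperplane` are my own choices for illustration.

```python
def side_of_hyperplane(w, b, x):
    """Return +1, -1 or 0 depending on the sign of w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

# In 2-dimensional space the hyperplane is a line: x1 + x2 - 1 = 0
line_w, line_b = [1.0, 1.0], -1.0
above = side_of_hyperplane(line_w, line_b, [2.0, 2.0])
below = side_of_hyperplane(line_w, line_b, [0.0, 0.0])

# In 3-dimensional space the hyperplane is a plane: x3 - 0.5 = 0
plane_w, plane_b = [0.0, 0.0, 1.0], -0.5
over_plane = side_of_hyperplane(plane_w, plane_b, [3.0, -1.0, 2.0])
on_plane = side_of_hyperplane(plane_w, plane_b, [3.0, -1.0, 0.5])

print(above, below, over_plane, on_plane)
```

A dataset is linearly separable when one such hyperplane puts every point of one class on the positive side and every point of the other class on the negative side.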

Topology and Manifold

If your dataset is not linearly separable, you can make it linearly separable by mapping the data into a higher-dimensional space (N+1 dimensions), as we can see from the diagram below. In an SVM, the kernel does exactly this job: it makes a non-linearly separable dataset linearly separable by increasing the dimension. I am not going to explain it in detail. If you are interested, check out this blog, http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, and read about the SVM kernel trick.
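Here is a toy sketch of that idea. The four points and the extra feature x² are my own assumptions, chosen only to keep the example small: on a 1-dimensional line no single threshold separates the classes, but after lifting each point to (x, x²) a straight line does.

```python
# Four points on a line: the outer ones are class 1, the inner ones class 0.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [1, 0, 0, 1]

def separable_by_threshold_1d(points, classes):
    """True if a single threshold on x can split the two classes."""
    ordered = [c for _, c in sorted(zip(points, classes))]
    # Walking along the line, the class may change at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# Lift every point from 1 dimension to 2 dimensions: x -> (x, x^2).
lifted = [(x, x * x) for x in xs]

# In the lifted space the horizontal line x2 = 2.5 separates the classes.
predicted = [1 if x2 - 2.5 > 0 else 0 for _, x2 in lifted]

print(separable_by_threshold_1d(xs, labels))  # no threshold works in 1-D
print(predicted == labels)                    # the line works in 2-D
```

A kernel in an SVM plays a similar role to the hand-picked x² feature here, except that it does not need the lifted coordinates to be written out explicitly.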

Why is the non-linear logistic function a linearly separable model?
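A short derivation sketches one way to answer this. It assumes the usual logistic regression notation, which the post does not spell out: weights w, bias b, input x, and a prediction of class 1 when the sigmoid output is at least 1/2.

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad z = w^{\top} x + b \\
\sigma(z) \ge \tfrac{1}{2}
  &\iff 1 + e^{-z} \le 2 \\
  &\iff e^{-z} \le 1 \\
  &\iff z \ge 0 \\
  &\iff w^{\top} x + b \ge 0
\end{aligned}
```

So although the sigmoid curve itself is non-linear, the boundary between the two predicted classes is the set of points where w·x + b = 0, which is a hyperplane. That is why a single logistic unit can only separate data linearly.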

Why do we use a non-linear activation function in a neural network?

Neural Network with No Activation Function

Here I added two hidden layers. After deriving the equation, I still get a linear equation. So with no activation function, a neural network can solve only linear problems; it cannot solve complex non-linear problems. In the real world, most problems are non-linear. Adding more and more hidden layers without an activation function only increases the learning speed on a linear problem. Let me show you an example.

Here I used no hidden layer and no activation function. Let's see how many epochs it takes to reach 100% accuracy.
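The collapse into a single linear equation can be checked numerically. The sketch below is an illustration only: the weights `W1`, `W2`, `W3` and biases `b1`, `b2`, `b3` are made-up integers (so the comparison is exact), not the values from my network.

```python
def matvec(W, x):
    """Matrix-vector product with plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product with plain Python lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]
W2, b2 = [[2, 1, 0], [-1, 1, 1]], [0, 3]
W3, b3 = [[1, -2]], [5]

def deep_linear(x):
    """Two hidden layers, no activation function anywhere."""
    h1 = vec_add(matvec(W1, x), b1)
    h2 = vec_add(matvec(W2, h1), b2)
    return vec_add(matvec(W3, h2), b3)

# Collapse the three layers into one: y = W x + b
W = matmul(W3, matmul(W2, W1))
b = vec_add(matvec(W3, vec_add(matvec(W2, b1), b2)), b3)

def single_linear(x):
    return vec_add(matvec(W, x), b)

inputs = [[0, 0], [1, -1], [4, 7], [-3, 2]]
outputs_match = all(deep_linear(x) == single_linear(x) for x in inputs)
print(W, b, outputs_match)
```

However many hidden layers are stacked this way, the result can always be rewritten as one weight matrix and one bias, so the network can draw only one hyperplane.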

Epoch 242/500
100/100 [==============================] - 0s 150us/step - loss: 0.0312 - acc: 1.0000

It took 242 epochs to reach 100% accuracy.

Now I used 2 hidden layers.

Epoch 51/500
100/100 [==============================] - 0s 250us/step - loss: 0.5813 - acc: 1.0000

As we can see, it took just 51 epochs to reach 100% accuracy.

And now I used 1 hidden layer and the sigmoid activation function.

Epoch 2/500
100/100 [==============================] - 0s 250us/step - loss: 0.5813 - acc: 1.0000
It takes only 2 epochs to reach 100% accuracy.

Activation Function

Here I used one hidden layer with 3 neurons in it.

Instead of solving this equation (o) all at once, we split it and solve the parts separately, so that it is easy to understand how an activation function works in a neural network.
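Splitting the output into its parts might look like the sketch below. It assumes a network with 2 inputs, 3 sigmoid hidden neurons and 1 sigmoid output; all the weights and the input are hypothetical numbers picked for illustration, not the ones from my model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: one (weights, bias) pair per hidden neuron.
hidden_weights = [[0.5, -1.0], [1.5, 0.25], [-0.75, 1.0]]
hidden_biases = [0.0, -0.5, 0.25]
output_weights = [1.0, -2.0, 0.5]
output_bias = 0.1

def forward(x):
    """Compute each hidden neuron separately, then combine them."""
    hidden = []
    for w, b in zip(hidden_weights, hidden_biases):
        z = w[0] * x[0] + w[1] * x[1] + b   # linear part of one neuron
        hidden.append(sigmoid(z))           # non-linear part of one neuron
    z_out = sum(v * h for v, h in zip(output_weights, hidden)) + output_bias
    return hidden, sigmoid(z_out)

hidden, o = forward([1.0, 2.0])
print(hidden, o)
```

Each hidden neuron draws its own line through the input space and bends it with the sigmoid; the output neuron then mixes the three bent responses into one value.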

You can learn many things about neural networks at playground.tensorflow.org, so I recommend playing with it.

XOR Examples

Coding

Without Activation Function

Epoch 1000/1000
400/400 [==============================] - 0s 90us/step - loss: 0.2501 - acc: 0.3600
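A network without an activation function reduces to a single linear decision rule, so it is worth checking how well any such rule can do on XOR. The brute-force sketch below is my own illustration, not the training code behind the log: it tries every combination of `w1`, `w2` and `b` on a coarse grid (the grid range is an arbitrary assumption) against the four clean XOR points.

```python
from itertools import product

xor_points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def count_correct(w1, w2, b):
    """How many XOR points the rule 'w1*x1 + w2*x2 + b > 0' gets right."""
    correct = 0
    for (x1, x2), label in xor_points:
        predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        correct += int(predicted == label)
    return correct

# Candidate values from -3.0 to 3.0 in steps of 0.25.
grid = [i * 0.25 for i in range(-12, 13)]

best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best, "of", len(xor_points), "points correct at best")
```

No straight line gets all four points right, which is the reason a model with no activation function cannot reach 100% accuracy on XOR no matter how long it trains.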

With Activation Function (sigmoid)


Epoch 521/1000
400/400 [==============================] - 0s 113us/step - loss: 0.3231 - acc: 1.0000
Epoch 1000/1000
400/400 [==============================] - 0s 130us/step - loss: 0.0609 - acc: 1.0000

As we can see, a dataset that is not linearly separable in N dimensions can become linearly separable when it is mapped into N+1 dimensions using a kernel, a polynomial function, or a neural network. In a classification problem, a neural network does exactly this: it transforms the non-linearly separable data into a higher-dimensional representation where it becomes linearly separable.
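To see this on XOR, here is a sketch of a tiny sigmoid network with hand-picked weights. These weights are my own assumption for illustration (a trained network would find different ones): three hidden neurons map each 2-dimensional input to a 3-dimensional point, and the output neuron then separates those points with a single plane.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(x1, x2):
    """Map a 2-D input to a 3-D hidden representation."""
    h1 = sigmoid(20 * x1 - 10)             # roughly copies x1
    h2 = sigmoid(20 * x2 - 10)             # roughly copies x2
    h3 = sigmoid(20 * x1 + 20 * x2 - 30)   # roughly x1 AND x2
    return h1, h2, h3

def xor_net(x1, x2):
    h1, h2, h3 = hidden_layer(x1, x2)
    # One plane in the 3-D hidden space: 20*h1 + 20*h2 - 40*h3 - 10 = 0
    return sigmoid(20 * h1 + 20 * h2 - 40 * h3 - 10)

predictions = {(a, b): round(xor_net(a, b)) for a in (0, 1) for b in (0, 1)}
print(predictions)
```

The output neuron is still only a linear separator followed by a sigmoid. What changed is the space it works in: the hidden layer's activation function moved the four points so that one plane is enough.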