Source: Deep Learning on Medium

Hi! Today I am going to explain, both mathematically and visually, why we use non-linear activation functions in neural networks. In the upcoming posts of this series I will explain why the first hidden layer is so important, why we only use linearly separable non-linear activation functions, and how we can approach non-linearly separable non-linear activation functions in a single perceptron.

Let's start. First, we have to know what happens when we do arithmetic operations on linear and non-linear functions.

Linear Addition

When we add two linear functions, we get another linear function:

(2x+1)+(3x+1) = 5x+2

Linear Multiplication

When we multiply two linear functions, we get a non-linear function:

**(2x+1)** * **(3x+1)** = **6x²+5x+1**

Non Linear Addition

(2x²+1)+(3x²+1) = 5x²+2

**As you can see, adding one non-linear function to another generally gives a non-linear function. Addition does not change the order or the shape of the original curve; it only shifts the curve and stretches or shrinks it.**

Non Linear Multiplication

**(2x²+1)** * **(3x²+1)** = **6x⁴+5x²+1**

### As you can see, multiplication changes the order and the shape of the curve

linear + linear = linear

linear * linear = non linear

non linear * linear = non linear

non linear + linear = non linear
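
The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `poly_add`, `poly_mul` and `degree` are made up for this illustration, and each polynomial is stored as a list of coefficients with the constant term first.

```python
def poly_add(a, b):
    """Add two polynomials given as coefficient lists (constant term first)."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (constant term first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def degree(p):
    """Highest power of x that has a non-zero coefficient."""
    return max((i for i, c in enumerate(p) if c != 0), default=0)


f, g = [1, 2], [1, 3]            # 2x + 1 and 3x + 1
p, q = [1, 0, 2], [1, 0, 3]      # 2x^2 + 1 and 3x^2 + 1

linear_sum = poly_add(f, g)          # 5x + 2
linear_product = poly_mul(f, g)      # 6x^2 + 5x + 1
quadratic_sum = poly_add(p, q)       # 5x^2 + 2
quadratic_product = poly_mul(p, q)   # 6x^4 + 5x^2 + 1

for name, poly in [
    ("linear + linear", linear_sum),
    ("linear * linear", linear_product),
    ("non linear + non linear", quadratic_sum),
    ("non linear * non linear", quadratic_product),
]:
    print(name, "-> degree", degree(poly), "coefficients", poly)
```

A degree of 1 means the result is still a straight line; anything higher is a curve. Addition keeps the degree, multiplication raises it.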

If you want to know exactly what linear and non-linear functions are, watch this video.

Before going into neural networks, you need to know logistic regression. I am not going to explain it here; see these references:

https://medium.com/@vigneshgig/machine-learning-classification-using-logistic-regression-mathematical-concept-220c0103f5cc

https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

What I am going to explain is why the logistic function is linearly separable, using the sigmoid function (logistic function).

What is a hyperplane?

According to Wikipedia, in geometry, a **hyperplane** is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines.

Topology and Manifold

If your dataset is not linearly separable, you can make it linearly separable by mapping the data into a higher dimension (N+1), as the diagram below shows. In an SVM, the kernel does exactly this job: it turns a non-linearly separable dataset into a linearly separable one by adding a dimension. I am not going to explain it in detail; if you are interested, check out this blog post, http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, and read about the SVM kernel trick.
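
To make the idea concrete, here is a small sketch in plain Python. The toy data and the helper `single_cut_separates` are assumptions made up for this illustration, and the hand-written lift x -> (x, x²) only stands in for what a kernel does; it is not how an SVM library works internally.

```python
# Toy 1-D dataset: class 1 sits in the middle, class 0 on both sides,
# so no single cut point on the x axis can split the two classes.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [1 if abs(x) <= 1 else 0 for x in xs]


def single_cut_separates(values, labels):
    """Return True if one threshold on `values` splits the classes perfectly.

    After sorting by value, the labels may change at most once.
    Assumes no two points with the same value carry different labels.
    """
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


# In the original 1-D space the classes cannot be split by one cut.
separable_in_1d = single_cut_separates(xs, labels)

# Lift every point into 2-D by adding x^2 as a second coordinate.
lifted = [(x, x * x) for x in xs]

# In the lifted space a horizontal line (a cut on the second coordinate,
# for example x^2 = 2.5) separates the classes.
separable_after_lift = single_cut_separates([p[1] for p in lifted], labels)

print("separable in 1-D:", separable_in_1d)
print("separable after lifting to 2-D:", separable_after_lift)
```

The data did not change; only the space it lives in did. That is the whole trick behind going from N to N+1 dimensions.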

Why is the non-linear logistic function a linearly separable model?
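
One way to see it: the sigmoid is a curve, but it only ever looks at the linear score z = w·x + b, and sigmoid(z) is at least 0.5 exactly when z is at least 0. So the boundary between the two classes is the hyperplane w·x + b = 0. The sketch below checks this on a grid of points; the weights and bias are hypothetical values picked for the illustration, not taken from any trained model.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters of a logistic regression on two features.
weights = [2.0, -1.0]
bias = 0.5


def linear_score(point):
    """The linear part of the model: w . x + b."""
    return sum(w * x for w, x in zip(weights, point)) + bias


# Compare two decision rules on a grid of points:
#   1. threshold the sigmoid output at 0.5
#   2. check which side of the line w . x + b = 0 the point is on
grid = [i / 2 for i in range(-6, 7)]   # -3.0, -2.5, ..., 3.0
agreements = []
for x1 in grid:
    for x2 in grid:
        z = linear_score((x1, x2))
        by_probability = sigmoid(z) >= 0.5
        by_line = z >= 0
        agreements.append(by_probability == by_line)

print("both rules agree on every grid point:", all(agreements))
```

So even though the function itself is non-linear, the classes it separates are split by a straight line (or a hyperplane in higher dimensions).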

**Why do we use a non-linear activation function in a neural network?**

**Neural Network with No Activation Function**

Here I added two hidden layers. After deriving the equation, I still get a linear equation. So without an activation function, a neural network can only solve linear problems; it cannot solve complex non-linear problems, and in the real world most problems are non-linear. Adding more and more hidden layers without an activation function only increases the learning speed on a linear problem. Let me show you an example.
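
The claim that stacked layers without an activation function stay linear can be checked numerically. Below is a minimal sketch in plain Python: the layer sizes and the random weights are assumptions chosen for the illustration, and `matmul` and `add_bias` are small helpers written here, not library functions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add the bias vector to every row of the matrix."""
    return [[v + bv for v, bv in zip(row, bias)] for row in m]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def rand_vector(n):
    return [random.uniform(-1, 1) for _ in range(n)]


random.seed(0)
X = rand_matrix(5, 2)                          # 5 samples, 2 features
W1, b1 = rand_matrix(2, 3), rand_vector(3)     # hidden layer 1
W2, b2 = rand_matrix(3, 3), rand_vector(3)     # hidden layer 2
W3, b3 = rand_matrix(3, 1), rand_vector(1)     # output layer

# Forward pass through two hidden layers and the output, no activation.
h1 = add_bias(matmul(X, W1), b1)
h2 = add_bias(matmul(h1, W2), b2)
out = add_bias(matmul(h2, W3), b3)

# The same network collapsed into ONE linear layer:
#   W = W1 W2 W3   and   b = (b1 W2 + b2) W3 + b3
W = matmul(matmul(W1, W2), W3)
b = add_bias(matmul(add_bias(matmul([b1], W2), b2), W3), b3)[0]
collapsed = add_bias(matmul(X, W), b)

max_diff = max(abs(o[0] - c[0]) for o, c in zip(out, collapsed))
print("largest difference between the two outputs:", max_diff)
```

The difference is only floating-point noise, so the three-layer network and the single linear layer compute the same function. Extra layers add parameters, not expressive power, unless a non-linear activation sits between them.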

Here I used no hidden layer and no activation function. Let's see how long it takes to reach 100% accuracy:

Epoch 242/500

100/100 [==============================] - 0s 150us/step - loss: 0.0312 - acc: 1.0000

It took 242 epochs to reach 100% accuracy.

Now I used 2 hidden layers:

Epoch 51/500

100/100 [==============================] - 0s 250us/step - loss: 0.5813 - acc: 1.0000

As we can see, it took just 51 epochs to reach 100% accuracy.

And now I used 1 hidden layer and a sigmoid activation function:

Epoch 2/500

100/100 [==============================] - 0s 250us/step - loss: 0.5813 - acc: 1.0000

It takes only 2 epochs to reach 100% accuracy.

**Activation Function**

Here I used one hidden layer with 3 neurons.

Instead of solving the output equation (o) all at once, we split it and solve each part separately, so that it is easy to understand how an activation function works in a neural network.
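
Splitting the output equation (o) into parts can be sketched as a forward pass that handles one neuron at a time. Everything numeric here is hypothetical: the input, the weights and the biases are made-up values, and using a sigmoid on the output neuron as well is an assumption for a binary classifier.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# One input sample with two features (hypothetical values).
x = [0.5, -1.0]

# One hidden layer with 3 neurons: one weight row and one bias per neuron.
hidden_weights = [[0.4, -0.6], [0.3, 0.8], [-0.5, 0.2]]
hidden_biases = [0.1, -0.2, 0.05]

# Output neuron: one weight per hidden neuron, plus a bias.
output_weights = [0.7, -0.4, 0.9]
output_bias = 0.0

# Step 1: solve each hidden neuron separately.
hidden = []
for i, (w, b) in enumerate(zip(hidden_weights, hidden_biases), start=1):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear part
    a = sigmoid(z)                                 # non-linear part
    hidden.append(a)
    print(f"hidden neuron {i}: z = {z:.4f}, sigmoid(z) = {a:.4f}")

# Step 2: the output is a linear combination of the hidden activations,
# passed through the sigmoid once more.
z_out = sum(w * a for w, a in zip(output_weights, hidden)) + output_bias
o = sigmoid(z_out)
print(f"output: z = {z_out:.4f}, o = {o:.4f}")
```

Each hidden neuron draws its own straight line through the input space and bends it with the sigmoid; the output neuron then combines those three bent responses into one decision.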

You can learn many things about neural networks at playground.tensorflow.org, so I recommend playing with it.

XOR Examples

**Coding**

Without Activation Function

Epoch 1000/1000

400/400 [==============================] - 0s 90us/step - loss: 0.2501 - acc: 0.3600

With Activation Function(sigmoid)

Epoch 521/1000

400/400 [==============================] - 0s 113us/step - loss: 0.3231 - acc: 1.0000

Epoch 1000/1000

400/400 [==============================] - 0s 130us/step - loss: 0.0609 - acc: 1.0000

As we can see, if we map a non-linearly separable dataset from N dimensions into N+1 dimensions, it can become linearly separable, using a kernel, a polynomial function, or a neural network. In a classification problem, this is exactly what a neural network does: it transforms the non-linearly separable data so that it becomes linearly separable in a higher dimension.
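
The same point can be shown on XOR without training anything. The sketch below is an illustration under stated assumptions: `find_separator` is a brute-force helper written for this example, the weight grid is arbitrary, and the extra feature x1*x2 is picked by hand as a stand-in for what a hidden layer with an activation function would learn on its own.

```python
from itertools import product

# XOR truth table: the label is 1 when exactly one input is 1.
points_2d = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]


def find_separator(points, labels, grid):
    """Brute-force search for a hyperplane that classifies every point.

    Tries every combination of weights and bias taken from `grid` and
    returns the first (weights, bias) that works, or None if nothing
    on the grid separates the two classes.
    """
    dim = len(points[0])
    for *weights, bias in product(grid, repeat=dim + 1):
        predictions = [
            1 if sum(w * x for w, x in zip(weights, p)) + bias > 0 else 0
            for p in points
        ]
        if predictions == labels:
            return weights, bias
    return None


grid = [v / 2 for v in range(-4, 5)]   # -2.0, -1.5, ..., 2.0

# N = 2 dimensions: no line on the grid separates XOR.
separator_2d = find_separator(points_2d, labels, grid)

# N + 1 = 3 dimensions: add the hand-picked feature x1 * x2.
points_3d = [(x1, x2, x1 * x2) for x1, x2 in points_2d]
separator_3d = find_separator(points_3d, labels, grid)

print("separator in 2-D:", separator_2d)
print("separator in 3-D:", separator_3d)
```

In 2 dimensions the search comes back empty, which matches the 36% accuracy of the network without an activation function. In 3 dimensions a flat plane is enough, for example x1 + x2 - 2*x1*x2 - 0.5 = 0.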