Creating Alternative Truths with Sine Activation Function in Neural Networks

Using the sine activation function to train neural networks faster.

Hello! Today, I am going to talk about using the sine function as an activation in neural networks. I will try to answer the questions "What is it?", "How can it change the future of neural networks?", and "What are the weaknesses of this approach?" I will also demonstrate one example at the end.

What is the sine activation function?

As you may have heard from thousands of publications, neural networks typically use monotonic activation functions that map a neuron's output to a value between 0 and 1. This is a reasonable approach, since these activation functions give probability-like outputs. The disadvantage, however, is that there is only one region of "truth" and one region of "wrong" for the outputs. In real life, completely different values may give the same output for an event. So, instead of trying to reach one small domain of values to hit the truth, we should be able to reach multiple domains of values that hit the truth.

The paper "Neural networks with periodic and monotonic activation functions: a comparative study in classification problems" by Josep M. Sopena, Enrique Romero, and Rene Alquezar also argues for this approach of using a sine activation. The figure below demonstrates the multiple probability regions.

Figure from paper

A unit with a non-monotonic activation function can divide the space into more than two regions. If the function is periodic, the number of regions is infinite, and it can be interpreted as a wave front spreading through the space of variables.

As we can see from the figure, the output of the sigmoid function can be 1 only around a single input value, but the output of the sine function can be 1 infinitely many times. Practically, this means that if the neural network should give the output 1 for both the inputs x and x+100, we can fit the model so that y(x)=1 and y(x+100)=1 by using sin(x) and sin(x+100). If we did not use the sine function, the network would have to adjust its weights and biases so that both x and x+100 are mapped into the same narrow range between 0 and 1.
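As a minimal sketch of this idea (my own illustration, using the pi/2 scaling that appears later in the implementation, not code from the paper), the sine activation returns exactly 1 for many different inputs, while the sigmoid only approaches it:

import numpy as np

def sigmoid(x):
    # Monotonic: climbs toward 1 but never repeats an output value
    return 1.0 / (1.0 + np.exp(-x))

def sine_activation(x):
    # Periodic: equals 1 at x = 1, 5, 9, ... once scaled by pi/2
    return np.sin(x * np.pi / 2)

x = np.array([1.0, 5.0, 9.0, 101.0])
print(sigmoid(x))          # monotonically approaches 1: ~0.73, 0.993, 0.9999, ~1.0
print(sine_activation(x))  # [1. 1. 1. 1.] -- exactly 1 at many separate inputs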

How can it change the future of neural networks?

It is well established that neural networks with monotonic activation functions give satisfactory results. Their real problem is training: they train slowly because the number of parameters the network has to adjust reaches into the millions. If we used the sine function as the activation, the number of adjustments the network has to make would be smaller, so training times would decrease significantly. This could reduce the cost of training neural network models.

What are the weak sides of this approach?

Uncertainty and easy over-fitting. Because a network with the sine activation function adjusts its weights quickly and easily, it also over-fits easily. To prevent this, we need to give the model a small learning rate. Mapping the output of the network onto an infinite probability space also increases the uncertainty: the adjustment made for one value may cause another value to map to a hugely different probability.
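To make that sensitivity concrete, here is a small sketch (my own illustration, not from the post): the same shift in a pre-activation value moves a sine output much further than a saturated sigmoid output.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, delta = 3.0, 0.5   # a pre-activation value and a small shift caused by a weight update

# Near saturation the sigmoid output barely moves...
print(sigmoid(x + delta) - sigmoid(x))                           # ~0.018

# ...while the periodic sine output jumps much further for the same shift.
print(np.sin((x + delta) * np.pi / 2) - np.sin(x * np.pi / 2))   # ~0.29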

Implementation

There are surely different possible implementations of the forward and backward operations for the sine activation function. I simply tried the one I had in mind.

import numpy as np

def error_function_for_sin_single(output, y):
    # Sample candidate pre-activation values in a window around the current output
    candidates = np.linspace(output - 1, output + 1, 10000)
    to_search_best = np.sin(candidates * np.pi / 2)
    # Find the candidate whose sine activation is closest to the target y
    index = np.argmin(np.abs(to_search_best - y))
    to_be = candidates[index]
    # The error is how far the current output is from that nearest "to-be" value
    error = to_be - output
    return error

def error_function_for_sin_multiple(output, y):
    derror_doutput = []
    for cnt in range(y.shape[0]):
        derror_doutput.append(error_function_for_sin_single(output[cnt], y[cnt]))
    derror_doutput = np.array(derror_doutput)
    # Halve the errors before handing them to back-propagation
    return derror_doutput / 2

Explanation of the code:

  • Assume that we have an output array (output) such as [o1, o2, o3, …] that has not yet been activated by the sine function, and a "to-be" target array (y).
  • First, we take each element from the arrays with a for loop in the error_function_for_sin_multiple function.
  • Then we call the error_function_for_sin_single function with the elements we took from the arrays.
  • In the error_function_for_sin_single function, we evaluate the sine function around the output value (to_search_best). (I multiply the output by pi/2 inside the sine because I want the value 1 to become pi/2, which the sine then maps to an output of 1.)
  • Then we find the index of the smallest error by calculating the differences between y and to_search_best.
  • The candidate at that index is the value the output actually should be. So we take the difference between this value and the output, which we can feed back to the neural network for back-propagation.
  • After we find the errors, we append them to a list and hand them all to back-propagation, as in the usage sketch below.
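As a rough usage sketch (the numbers below are made-up pre-activation outputs and targets, not values from the post), here is how the errors produced by the functions above would look before being fed to back-propagation:

import numpy as np

# Made-up raw (pre-activation) outputs of the last layer and their "to-be" targets
output = np.array([0.30, 4.60, -0.20])
y      = np.array([1.0,  1.0,   0.0])

errors = error_function_for_sin_multiple(output, y)   # uses the functions defined above
print(errors)  # roughly [0.35, 0.2, 0.1] -- half the distance to the nearest "to-be" value

# Note how 4.60 is pulled toward the nearby solution at 5.0 (where sin(5*pi/2) = 1)
# instead of all the way back to 1.0 -- the "multiple truths" idea in practice.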

Demonstration of fast convergence to the desired values

Data-set: MNIST Digit Dataset

Algorithm: Feed-forward neural network using the sine activation function

Layers: Input = 4704 (I do basic feature extraction, that's why it is not 784), L1 = 512, L2 = 128, Output = 10

Learning rate: 0.0000001

I deliberately over-trained the model to map an image to its label, in order to measure the fewest epochs needed.
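For readers who want to see the setup in code, here is a minimal sketch of the forward pass for the architecture above. It is my own reconstruction, not the author's original code: the feature extraction step is omitted, the hidden layers are assumed to use a sigmoid, and the sine is assumed to be applied at the output layer, since the post does not spell out the exact placement.

import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the experiment: 4704 -> 512 -> 128 -> 10
sizes = [4704, 512, 128, 10]
weights = [rng.normal(0.0, 0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

learning_rate = 0.0000001  # the very small rate used to keep the sine network from over-fitting

def forward(x):
    # Hidden layers with a sigmoid (assumption); sine applied only at the output layer
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    raw = x @ weights[-1] + biases[-1]   # pre-activation outputs, fed to the error functions above
    return raw, np.sin(raw * np.pi / 2)  # sine-activated class scores

features = rng.normal(size=4704)         # stand-in for the extracted MNIST features
raw_output, activated = forward(features)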

With the sine activation function: the error decreased to 0.044 in 13 epochs.

Without the sine activation function: the error decreased to 0.043 in 19 epochs.

The paper I mentioned above also includes several experiments with the sine activation function, one of them on a spiral problem.

The results from the paper:

Standard BP with a simple architecture has not found a solution to this problem (see [Fahlman and Lebiere, 1990]). [Lang and Witbrock, 1988] solved it in 20,000 epochs using standard BP with a complex architecture (2–5–5–5–1 with shortcuts). "Supersab" needed an average of 3,500 epochs, "Quickprop" 8,000, "RPROP" 6,000, and "Cascade Correlation" 1,700. To solve this problem we constructed a network with architecture 2–16–1, using the sine as the activation function in the hidden layer, and the hyperbolic tangent in the output layer. This architecture is the simplest of those used to date to deal with this problem. Results are shown in table 1. The importance of the range of initial weights is clear. With small ranges for the non-linear parameters learning is not possible (see the first two rows in table 1).

The results of the experiment on the spiral problem.

I hope you enjoyed reading this post. While everybody is talking about extremely complex neural networks that solve complex problems, I believe we should still examine the basic algorithms behind neural networks, because changes in the fundamentals bring the greatest impact.