Source: Deep Learning on Medium
We will accomplish this task in 6 parts.
Curating the dataset
Neural networks (NNs) are not very powerful on their own; they need two meaningful datasets: one holding the information available to us (the reviews) and one holding the values we want the NN to predict (the labels). The NN then searches for direct or indirect correlations between these datasets.
Developing a predictive theory
We observe the two datasets and try to develop a theory about the information the NN uses “under the hood” to arrive at its prediction. Some of the theories we can frame are the following:
1. The NN will use the individual characters present in a review to predict the label.
2. The NN will use the entire review for prediction.
However, the first theory has little correlation with the labels. The second theory correlates strongly but does not generalize well.
We can see that some words are present in the positive reviews and other words are present only in the negative reviews. For example, “terrible” is present only in negative reviews and “excellent” only in positive reviews. This suggests a third theory: the NN will use the individual words of a review to predict the label. Now we verify whether this theory is valid.
We count how often each word is used in the entire review dataset, in the positive reviews, and in the negative reviews, separately.
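The counting step can be sketched roughly as follows. The `reviews` and `labels` lists here are hypothetical toy data, and the "POSITIVE"/"NEGATIVE" label strings and the split on single spaces are assumptions, not necessarily the format of the real dataset.

```python
from collections import Counter

# Hypothetical toy data standing in for the real reviews and labels.
reviews = [
    "excellent movie with an excellent cast",
    "terrible plot and terrible acting",
    "the cast was excellent",
    "the plot was terrible",
]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

positive_counts = Counter()  # word -> occurrences in positive reviews
negative_counts = Counter()  # word -> occurrences in negative reviews
total_counts = Counter()     # word -> occurrences in all reviews

for review, label in zip(reviews, labels):
    for word in review.split(" "):
        total_counts[word] += 1
        if label == "POSITIVE":
            positive_counts[word] += 1
        else:
            negative_counts[word] += 1

# Most frequent words overall, most common first.
print(total_counts.most_common(3))
```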
If we look at the most common words in these counters, we see that the words found most frequently in the positive counter are also common in the negative counter, and vice versa. Since we want the words that are used more often in positive reviews than in negative reviews, we calculate the ratio of each word's usage between positive and negative reviews.
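A minimal sketch of the ratio calculation is given here, with 1 added to the denominator to avoid division by 0. The function name, the `min_count` parameter and the toy counts are hypothetical and only serve the illustration.

```python
from collections import Counter

def positive_to_negative_ratios(positive_counts, negative_counts, min_count=1):
    """Return word -> positive/negative usage ratio.

    min_count is a hypothetical threshold that skips very rare words.
    """
    ratios = {}
    total_counts = positive_counts + negative_counts
    for word, count in total_counts.items():
        if count >= min_count:
            # The +1 in the denominator avoids division by zero.
            ratios[word] = positive_counts[word] / float(negative_counts[word] + 1)
    return ratios

# Toy counts, invented for illustration only.
positive_counts = Counter({"excellent": 30, "the": 100, "terrible": 2})
negative_counts = Counter({"excellent": 4, "the": 99, "terrible": 40})

ratios = positive_to_negative_ratios(positive_counts, negative_counts)
```

With these toy counts, a positive word ends up well above 1, a neutral word near 1, and a negative word between 0 and 1.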
The following are the ratios of some words.
We can observe that the ratios are skewed towards the positive labels. Perfectly neutral words have a ratio close to 1. The neutral words are slightly biased because of the 1 that we added to the denominator (to avoid division by 0), but since we are going to discard the neutral words later, this bias will not make much difference.
The magnitudes of the ratios for positive and negative words are not comparable because they are not on the same scale: positive words have ratios above 1 with no upper limit, while negative words are squeezed between 0 and 1. So there is no way to tell whether one word conveys the same magnitude of positive sentiment as another conveys negative sentiment. We take the logarithm of the ratio so that the values become symmetric about 0. It is then easy to compare the words that convey negative sentiment with the words that convey positive sentiment.
Now we have the following values:
We can see that,
a. the words with positive sentiment (“amazing”) have values above 1
b. the words with negative sentiment (“terrible”) have values below -1
c. the words that are neutral (“the”) have values close to 0
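The effect of the logarithm can be illustrated with a small sketch. The ratios are toy values chosen for the example, and skipping words whose ratio is 0 (where the logarithm is undefined) is an assumption of this sketch rather than something stated in this blog.

```python
import math

# Toy ratios, invented for illustration: 4x more positive, neutral, 4x more negative.
ratios = {"amazing": 4.0, "the": 1.0, "terrible": 0.25}

# log(ratio) is undefined for 0, so such words are skipped here (an assumption).
log_ratios = {
    word: math.log(ratio)
    for word, ratio in ratios.items()
    if ratio > 0
}

for word, value in log_ratios.items():
    print(f"{word}: {value:+.3f}")
```

A word that is four times more common in positive reviews and a word that is four times more common in negative reviews now sit at the same distance from 0, on opposite sides.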
Now we want to present the data to our neural network so that it can easily learn to predict the desired output. We could have chosen to output five values (corresponding to the five stars), but it is difficult for the neural network to learn such a hard prediction. So we force it to output only one of two predictions: 0 for negative and 1 for positive.
We create a set of all the unique words in the reviews. Then, for the input to the network, we create layer_0, a vector whose size is equal to the number of unique words in the reviews.
Now we create a lookup table that maps every word to its index in layer_0.
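The vocabulary, the input vector and the lookup table can be sketched as below. The toy `reviews` list and the name `word2index` are assumptions of this sketch, and the vocabulary is sorted only so that the indices are reproducible.

```python
import numpy as np

# Hypothetical toy reviews standing in for the real dataset.
reviews = ["the cast was excellent", "the plot was terrible"]

# Set of all unique words, sorted so that indices are deterministic.
vocab = sorted({word for review in reviews for word in review.split(" ")})

# Input layer: one slot per unique word.
layer_0 = np.zeros((1, len(vocab)))

# Lookup table: word -> position of that word in layer_0.
word2index = {word: index for index, word in enumerate(vocab)}
```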
Now we use the following function to create an input vector for a review.
Finally we convert the label to 1 or 0 depending on whether it is positive or negative.
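A minimal sketch of these two helpers could look like the following. The function names, the toy vocabulary and the "POSITIVE"/"NEGATIVE" label strings are assumptions. The same layer_0 buffer is cleared and reused for every review.

```python
import numpy as np

# Toy vocabulary, invented for illustration.
vocab = ["cast", "excellent", "plot", "terrible", "the", "was"]
word2index = {word: index for index, word in enumerate(vocab)}
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review):
    """Overwrite layer_0 in place with the word counts of a single review."""
    layer_0[:] = 0  # clear the previous review instead of allocating a new array
    for word in review.split(" "):
        if word in word2index:  # ignore words outside the vocabulary
            layer_0[0, word2index[word]] += 1

def get_target_for_label(label):
    """Map a label string to the network target: 1 for positive, 0 for negative."""
    return 1 if label == "POSITIVE" else 0

update_input_layer("the cast was excellent excellent")
```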
In the above implementation we allocate space for only one review at a time, which saves a lot of memory.
Now we create a model that takes the preprocessed data, trains on it and predicts the output.
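One possible shape for such a model is sketched below. This is an assumed architecture, not necessarily the one used in the course: a linear hidden layer, a single sigmoid output, a squared-error gradient, and hypothetical defaults for the hidden size and learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SentimentNetwork:
    """Two-layer network sketch: input -> linear hidden layer -> sigmoid output."""

    def __init__(self, input_nodes, hidden_nodes=10, learning_rate=0.1, seed=1):
        rng = np.random.default_rng(seed)
        self.weights_0_1 = np.zeros((input_nodes, hidden_nodes))
        self.weights_1_2 = rng.normal(0.0, hidden_nodes ** -0.5, (hidden_nodes, 1))
        self.learning_rate = learning_rate

    def forward(self, layer_0):
        layer_1 = layer_0 @ self.weights_0_1            # hidden layer, no activation
        layer_2 = sigmoid(layer_1 @ self.weights_1_2)   # probability of "positive"
        return layer_1, layer_2

    def train_step(self, layer_0, target):
        layer_1, layer_2 = self.forward(layer_0)
        # Gradient of 0.5 * (output - target)^2 through the sigmoid.
        layer_2_delta = (layer_2 - target) * layer_2 * (1 - layer_2)
        layer_1_delta = layer_2_delta @ self.weights_1_2.T
        self.weights_1_2 -= self.learning_rate * layer_1.T @ layer_2_delta
        self.weights_0_1 -= self.learning_rate * layer_0.T @ layer_1_delta
        return float(layer_2[0, 0])

    def predict(self, layer_0):
        return 1 if self.forward(layer_0)[1][0, 0] >= 0.5 else 0

# Toy usage: input 0 stands for a "positive" word, input 1 for a "negative" word.
net = SentimentNetwork(input_nodes=2)
x_pos = np.array([[1.0, 0.0]])
x_neg = np.array([[0.0, 1.0]])
untrained_output = net.forward(x_pos)[1][0, 0]  # no information yet
for _ in range(50):
    net.train_step(x_pos, 1)
    net.train_step(x_neg, 0)
```

Because the first weight matrix starts at zero, the untrained output is 0.5, i.e. no better than a random guess, which is a convenient sanity check before training.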
- To make sure that everything in the defined network is working properly, we make predictions without training. If the accuracy is not close to that of random guessing, we conclude that something in the network is broken, and we go back and fix the error.
- If the model shows no improvement in accuracy within a few steps, there is probably something wrong with a hyperparameter. In reinforcement learning it can happen that we see no improvement during the initial steps, but in this case we have a direct correlation, so we should see some improvement. So we go back and decrease the learning rate by a factor of 10.
- If the learning rate is very low, the training will also slow down.
- 9 times out of 10 it is a change in logic that increases the training speed and accuracy of the model; only very rarely do we need to make a drastic change.
Now we differentiate between noise and signal.
Before going into fancy techniques like regularization and other tricks, we want to look deeper into our data, as that is where the gold is; the neural network can only dig out the gold (find patterns).
We find that most of the filler words are given a high weight, even though they do not provide the neural network with relevant information for prediction. This also misleads the neural network into giving high significance to words that are not significant at all. So we change the way these words are counted and see if there is any change in training.
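The exact change is not reproduced here. One plausible version, assumed in this sketch, is to record only whether a word is present in a review instead of how many times it occurs, so that frequent filler words no longer dominate the input. The `binary` flag and the toy vocabulary are hypothetical.

```python
import numpy as np

# Toy vocabulary, invented for illustration.
vocab = [".", "excellent", "the"]
word2index = {word: index for index, word in enumerate(vocab)}
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review, binary=True):
    """Fill layer_0 for one review, by presence (binary) or by raw count."""
    layer_0[:] = 0
    for word in review.split(" "):
        if word in word2index:
            if binary:
                layer_0[0, word2index[word]] = 1   # presence only
            else:
                layer_0[0, word2index[word]] += 1  # raw count

review = "the film . the cast . the excellent ending"

update_input_layer(review, binary=False)
counted = layer_0.copy()   # filler words "the" and "." dominate

update_input_layer(review, binary=True)
present = layer_0.copy()   # every word contributes equally
```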
After making this change we see that our training accuracy has increased from 61.4% to 84.8%. So here we tackled the wasteful data to improve the accuracy of our model. Now we look into the computational efficiency.
The inefficiency comes from the fact that the majority of the elements in the input layer are zeros. So instead of multiplying the full matrices, we access only the indices of the words that are present and perform the intended multiplication there (which is equivalent to an addition for that particular index). The following code shows the change that led to the increase in efficiency:
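As a sketch of the idea, assuming the input layer holds only 0s and 1s, the hidden layer can be computed by adding up the weight rows of the words that are present. The names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 1000, 10
weights_0_1 = rng.normal(size=(vocab_size, hidden_nodes))

# Indices of the words present in one review (toy values).
active_indices = [3, 41, 999]

# Dense version: build the mostly-zero input vector and multiply.
layer_0 = np.zeros((1, vocab_size))
layer_0[0, active_indices] = 1
dense_layer_1 = layer_0 @ weights_0_1

# Sparse version: just add the weight rows of the active words.
sparse_layer_1 = weights_0_1[active_indices].sum(axis=0, keepdims=True)

print(np.allclose(dense_layer_1, sparse_layer_1))
```

The sparse version touches 3 rows instead of 1000, and the same trick applies to the weight update, where only the rows of the active words need to change.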
Now we get rid of the infrequent words and the words that are very frequent (e.g. the, i… etc.). This gives us an increase of 1% in accuracy. With neural networks there is always a trade-off between speed and accuracy. One example where the speed of going over the entire dataset was prioritized over accuracy is word2vec.
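One way to sketch this filtering, assuming that very frequent filler words are recognised by a log-ratio close to 0, is shown below. The function name and the `min_count` and `polarity_cutoff` thresholds are hypothetical, as are the toy values.

```python
def filter_vocab(total_counts, log_ratios, min_count=2, polarity_cutoff=0.5):
    """Keep words that are frequent enough and far enough from neutral."""
    return sorted(
        word
        for word, count in total_counts.items()
        if count >= min_count
        and abs(log_ratios.get(word, 0.0)) >= polarity_cutoff
    )

# Toy statistics, invented for illustration.
total_counts = {"the": 500, "excellent": 40, "terrible": 35, "zyzzyva": 1}
log_ratios = {"the": 0.02, "excellent": 1.8, "terrible": -2.1, "zyzzyva": 3.0}

kept = filter_vocab(total_counts, log_ratios)
print(kept)
```

Here "the" is dropped for being neutral and "zyzzyva" for being too rare, which leaves a smaller vocabulary and therefore fewer weights to train.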
Finally we see that the weights for all the words that convey positive sentiment are updated in a similar way. This similarity in weights is represented by the following plot in 2-D.
Throughout this blog we saw how to develop and improve a neural network to carry out a specific prediction.
This blog was developed while pursuing the Udacity Deep Learning Nanodegree (Facebook scholarship). The instructor was Andrew Trask (Twitter: @iamtrask).