Original article was published by Muhammad Ardi on Deep Learning on Medium
The training is going to be done using a simple LSTM-based neural network. I chose this type of network because it generally works well for sequential data. Here I decided to create the architecture inside a create_model() function, since it's a lot simpler to call the same function than to declare the exact same model multiple times.
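Since the article's code isn't shown here, a minimal sketch of what create_model() might contain, assuming an embedding layer feeding an LSTM with a sigmoid output (the layer sizes and vocabulary size are my own guesses, not the article's):

```python
import tensorflow as tf
from tensorflow.keras import layers

def create_model(vocab_size=10000, max_len=100):
    # Embedding -> LSTM -> single sigmoid unit for binary classification.
    model = tf.keras.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, 64),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),  # sigmoid output, as the article describes
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["acc"])
    return model
```

Wrapping the architecture in a function like this means each training round can start from a freshly initialized model with a single call.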
Just to recap: so far we have the data folds, stored in the X_fold arrays. We also have a pair of X_train and y_train, taken from the first fold, which we pretend is the only labeled data we know.
Now let’s do the first training: initialize the model, then call its fit() method. Remember that our initial training data consists of 5000 samples. Here I would like to hold out the last 1000 samples for validation, just to check whether the model suffers from overfitting. I also decided to go with only 2 epochs, which turns out to be just right for this case.
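The hold-out split described above can be sketched like this, using placeholder arrays in place of the article's real tokenized data (the real fit() call is shown as a comment, assuming a compiled model from create_model()):

```python
import numpy as np

# Hypothetical stand-ins for the article's arrays: 5000 labeled samples
# (the real X_train holds tokenized text sequences).
X_train = np.zeros((5000, 100))
y_train = np.zeros(5000)

# Hold out the last 1000 samples for validation.
X_tr, y_tr = X_train[:-1000], y_train[:-1000]
X_val, y_val = X_train[-1000:], y_train[-1000:]

# With the compiled model, the first training round would then be:
# model = create_model()
# model.fit(X_tr, y_tr, epochs=2, batch_size=32,
#           validation_data=(X_val, y_val))
```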
After running the code above, we should see the following progress bars. The model looks pretty good: it achieves 82.7% accuracy on the validation data, and more importantly, it’s not overfitting.
125/125 [==============================] - 2s 17ms/step - loss: 0.6734 - acc: 0.5918 - val_loss: 0.5636 - val_acc: 0.7400
125/125 [==============================] - 2s 14ms/step - loss: 0.4379 - acc: 0.8307 - val_loss: 0.4079 - val_acc: 0.8270
Now what’s next? Up to this point we have what I’d call a “semi-trained” model: it has been trained, but only on a small portion of the data. If we think of this model as a steak, it’s kind of “medium-well”. To make it “well done”, we need to train it on the entire data. Therefore, the next step is to predict the next data fold (X_fold) with this “medium-well” model and use those predictions as additional labeled data.
However, we should not append all of the predicted samples to the next training data. Instead, we will apply a filter: predictions with a low confidence score are simply dropped, since there is a good chance they are incorrect. This labeling method is commonly known as pseudo-labeling.
The function below filters out data below a specific confidence score threshold. Note that we use a sigmoid activation function here, so the output value must be somewhere between 0 and 1. By default, the decision boundary used in most cases is 0.5: all outputs larger than 0.5 are mapped to 1 (positive), while the others are mapped to 0 (negative). In this case, however, I want to use thresholds of 0.95 and 0.05. These values essentially say that any positive output with a score less than 0.95 will be dropped, and likewise any negative output with a score greater than 0.05 will be discarded.
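A sketch of this filtering step is shown below; the function name and signature are my own assumptions, not necessarily the article's:

```python
import numpy as np

def filter_confident(X, probs, upper=0.95, lower=0.05):
    # Keep only predictions the model is very sure about;
    # everything between the two thresholds is dropped.
    probs = np.asarray(probs).ravel()
    keep = (probs > upper) | (probs < lower)
    pseudo_labels = (probs[keep] > 0.5).astype(int)
    return X[keep], pseudo_labels

# Tiny demo: only the 0.99 and 0.01 scores survive the thresholds.
X = np.arange(4).reshape(4, 1)
X_new, y_new = filter_confident(X, [0.99, 0.60, 0.40, 0.01])
```

In the demo, the samples scored 0.60 and 0.40 fall in the uncertain middle band and are discarded, while 0.99 becomes a positive pseudo-label and 0.01 a negative one.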
Here’s a graph of the sigmoid activation function that we use in the very last layer of our neural network. The outputs colored in green (>0.95 and <0.05) represent the sample distribution that we are going to use for the next training process.
After running the code above, we can print out the shape of X_new to find out the number of remaining samples. Remember that each fold initially contains 5000 samples, but here we are left with only 1406. This means that 3594 texts were predicted with a relatively low confidence score.
Since we want to use the data in X_new for the next training round, we need to concatenate it with our existing X_train array. Here I’m going to use a join_shuffle() function to do so. All the function does is append the new data and shuffle the result.
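One possible implementation of join_shuffle(), assuming it concatenates the pseudo-labeled samples onto the training set and then shuffles both arrays in unison (the seed parameter is my own addition for reproducibility):

```python
import numpy as np

def join_shuffle(X_train, y_train, X_new, y_new, seed=0):
    # Append the pseudo-labeled data to the existing training set.
    X = np.concatenate([X_train, X_new])
    y = np.concatenate([y_train, y_new])
    # Shuffle samples and labels with the same permutation.
    idx = np.random.default_rng(seed).permutation(len(X))
    return X[idx], y[idx]

# Demo: 3 old samples (label 1) joined with 2 new ones (label 0).
X_all, y_all = join_shuffle(np.ones((3, 2)), np.ones(3),
                            np.zeros((2, 2)), np.zeros(2))
```

Shuffling after the concatenation matters: without it, all the pseudo-labeled samples would sit at the end of the array and end up in the validation slice.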
Finally, our X_train and y_train have been updated, so we can start the second training process. The steps are exactly the same as before; the only difference is that this time we will predict the data in fold 2. Below is the entire process to do so.
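As a recap, one self-training round can be sketched compactly like this, with random numbers standing in for the model's sigmoid outputs on the next fold (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_fold_2 = rng.random((5000, 100))  # next unlabeled fold (placeholder)
probs = rng.random(5000)            # pretend model.predict() sigmoid outputs

# Keep only confident predictions and turn them into pseudo-labels.
keep = (probs > 0.95) | (probs < 0.05)
X_new = X_fold_2[keep]
y_new = (probs[keep] > 0.5).astype(int)

# These would then be appended to X_train/y_train with join_shuffle(),
# and a fresh model from create_model() trained again with fit().
```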