Source: Deep Learning on Medium
Hello, once again this is the second part of the “Demystifying Generative Models” posts so if you haven’t read Part 1 yet, I really urge you to do so here.
In the previous post, we discussed the differences between discriminative and generative models, took a peek to the fascinating world of probabilities and used that knowledge to develop a working Naive Bayes that generates passwords for us. Now, we will change our methodologies a little bit and explore how Deep Learning can help us when probabilities fail.
Not so useful Naive-Bayes
Assuming that you have the context knowledge, of part 1, I will jump straight to the point. The Multinomial Naive Bayes model that was developed was based on the assumption that each feature is independent of each other and it worked quite well! But let’s change the problem formulation slightly and observe how it behaves. Instead of having 4 features, fixed order, and a limited amount of choices per features, let’s just define an alphabet with all the letters, digits and special characters that appear in our initial feature choices. These would be:
Lowercase : [‘a’, ‘b’, ‘c’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘j’, ‘k’, ‘l’, ‘m’, ’n’, ‘o’, ‘p’, ‘r’, ‘s’, ‘t’, ‘y’]
Digits : [‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘7’, ‘8’, ‘9’]
Special : [‘!’, ‘@’, ‘$’, ‘%’, ‘*’, ‘_’]
Uppercase : [‘A’, ‘E’, ‘T’, ‘P’]
To make this new problem as close to the previous one as possible, we will constrain the maximum length of the password to be 15 since this is the max length from our passwords dataset. Additionally, let’s define another imaginary character called <NULL>. Since the model that will be described below will have to fill all 15 positions of the password the <NULL> character is just a way for the model to say put nothing here. Therefore from our perspective, any <NULL> characters will be ignored.
With all these changes, the alphabet now contains a total of 19+8+6+4 + 1= 38 characters (uppercase + lowercase + digits + special characters + null char). As a result, there are 38¹⁵ possible combinations in our sample space! Of course, this is a significantly larger sample space than previously but let’s observe how the Naive-Bayes model behaves in these circumstances.
The list below illustrates passwords generated using the old Naive-Bayes model in the current problem context:
Can you spot the issues here? Since the relies on the conditional independence assumption, each character position is inferred independently, therefore there is not a single word in the passwords that makes sense. This would make it extremely it difficult to remember the password, and it’s not what we want.
Recall that in Part 1, there were constraints on the output format but also on the input. We specifically asked the model to do some sampling based on a finite set of choices which unavoidably makes sense and they were indeed independent! Nonetheless, this time, the sample space is incredibly huge. From all the possible combinations, only a tiny part resembles a word or follows the format that has been defined in the beginning since every letter is depended on the previous ones. To give you an example, the probability of having an “e” in the fourth position of our password sequence should be greater when the previous three letters are “caf” rather than when they are “pas”! Hopefully, this somewhat explains why statistical models such as Naive-Bayes won’t perform well on unstructured data.
Deep models for Representational Learning
Yet, this does not mean that we can’t work with unstructured data. Deep Learning has received an incredible amount of research and industry attention in recent years and it has been evident that unstructured data is an area that Deep Learning models excel at! Undoubtedly, this is because of their ability to model the underlying structure/dependencies of the input data.
But how is this even possible you may wonder. Well, The deep-stacked layering of neurons in a trained deep learning model can be conceived as a mapping from a higher-dimensional space to a lower-dimensional latent space (dimensionality reduction). In such a space, it becomes easier to model the relationships amongst data points and it may be even possible to perform arithmetic operations. For example, a widely used application of latent space representation is the generation of word embeddings using deep learning models.
You can learn more about word embeddings in my older blog post here.
More formally, this is called representation learning and is something that as humans, we use it constantly without even noticing. Still, it is an extremely difficult task for a machine. As an example, recall the last time you attempted to describe a person to someone. Most likely, you mentioned hair colour, body type, clothing, eye colour, skin colour etc… and immediately the person that you had been talking to, could imagine a picture of the person you were describing! All the features that were used for this description are low dimensional features that our brain knows how to translate to a higher level dimension, or in other words, an image! In the same manner, the sensory input of the brain comes in a high-dimensional representation but the human cognitive ability allows us to translate those to lower-dimensional features! Isn’t that exciting??
Autoencoder is a neural network architecture which can be more precisely described as an encoder and a decoder connected together. An encoder is a smaller neural network that maps high dimensional data to a smaller latent space while preserving the principal features of the original data. In the same manner, a decoder is simply a reversed encoder. It takes input from a latent space and attempts to recreate the original data.
Notice that the narrowest part of the network is in the middle. This part is otherwise known as the “bottleneck” of the network and it is essentially what forces the autoencoder to discard some information and store only important features, rather than just passing along the information from input to output.
In a generative modelling context, the network is trained as a whole, but only the decoder part is needed for generating data. The training is performed by minimizing the error of the original input with the reconstructed output.
I will skip the formal description of how the network works in this post since my main purpose is to highlight the differences from probabilistic learning, but Stanford University has an excellent tutorial about autoencoders in case you would like to learn more.
Let’s revisit the password generation example, develop an autoencoder that can generate passwords for us and observer how that compares with the independent Naive Bayes model.
Just a heads up, all the code for the next sections can be found here:
So feel free to skip the rest if you are here for the code, but I would suggest opening the notebook up and following along, as the code will make much more sense if it is read along with the rest of the post.
One Hot Encoding
Initially, the passwords have to be converted to vectors in order to be passed as an input to any kind of deep learning model. One of the most well-known techniques for such a task is one-hot encoding. In general, one hot encoding is suitable for categorical data and in layman terms it represents an item as “one-of-X” where is X is an instance of all the possible categories. In the context of textual data, it is assumed that there is a fixed vocabulary and each token is represented as a sparse vector of the same length as the vocabulary where the number 1 indicated that the token is an instance of the class corresponding to related vector index.
Let’s see an example to clear any confusion. Assuming that my vocabulary includes only 5 characters (“a”, “b”, “c”, “d”, “e”), then the word “bad” in one hot encoding can be represented as:
one_hot_encoding("bad") = [[0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0]]
As described earlier, autoencoders usually follow a standard stacked architecture but there are many parameters that can be tweaked such as the number of stacked layers, the number of cells, the way of connecting the cells etc… In addition, autoencoders can adopt an asymmetric architecture where the decoder has a different structure than the encoder. Whatever the case, the concept remains the same.
For our password generation scenario, I adopted a BiLSTM Autoencoder architecture as illustrated in the figure below. BiLSTM layers are simply the Bidirectional version of normal LSTM layers. The main reason behind this choice was to utilise a model that excels at capturing sequential dependencies, but any other sequential model is applicable to what is described here as well.
For implementing the model above, Keras Functional API with Tensorflow backend was used. More precisely, the passwords are one hot encoded as described above and then pass through 3 BiLSTM layers with 16, 10 and 6 units respectively. Behind the scenes though, the output dimension is doubled since the first layer, for instance, outputs 16 units for one direction and 16 for the reverse direction. Finally, the input passes through a fully connected dense layer with 6 units which is the bottleneck. In other words, by the time a password reaches the bottleneck, it is represented as a 6-dimensional vector. Next, the decoder takes this 6-dimensional vector and attempts to reconstruct the password.
The training of the model has been performed on a dataset of 1000 passwords that have the structure described in Part 1 of this tutorial. After training for 300 epochs, the model reached an accuracy of 95%.
Note that the size of the bottleneck layer has a major role in how accurately the model can reconstruct the original data since it ultimately dictates how much information is maintained during the encoding stage. More complex data would require a larger bottleneck size in order to capture all the nuances that may carry. Nonetheless, considering that the end goal, in this case, is to generate new passwords, 6 dimensions seemed enough and indeed the reconstruction accuracy is acceptable.
Inspecting the results more closely, it is evident that the model did a decent job reconstructing the passwords from just 6 values!
Now is the time for the big act! Let’s see our awesome model in action and generate some passwords. For this task, only the decoder of the model is required since we will effectively sample 6 random values and feed them into the decoder to get a password:
The results are interesting. First of all, it is obvious that the Autoencoder does a better job generating passwords than the Naive Bayes model for sure!
This should be the “Aha!” moment realising how powerful deep learning can become when identifying hidden correlations in unstructured data!
There are some pretty usable passwords in there even with words that were never in the original passwords such as
Yet, most of them are quite off. Passwords 1, 2, 3, 5, 6 and 7 are totally non-sensical. But what is the issue that’s causing this really? Let’s examine what is going on further bu visualizing the latent space.
By using t-SNE (a technique for visualizing high dimensional data) on 200 passwords in the latent space, we can indeed observe that similar passwords cluster together (i.e.
hackerman passwords are all in the top left corner, further below we have
ninja and so forth…).