Original article was published on Artificial Intelligence on Medium
Text is a form of sequential data, and as with other projects involving sequential data, sequential models such as RNNs, LSTMs, and GRUs can be used.
Since the text we are using contains uppercase letters, punctuation, numbers, and symbols, a word-based model would lose access to the original text once we apply the usual preprocessing: removing punctuation, tokenization, stemming or lemmatization, and removing stop words. That information is discarded, so the model cannot produce text that reads like the original.
A character-based model does not have this problem, since every punctuation mark, number, and symbol can be treated as a separate character. In fact, the model can even learn the formatting and punctuation patterns of the original text.
For this project, I used a character-based model.
If you have worked with natural language processing, you have likely come across word embeddings: the process of converting a word into a multi-dimensional vector. The embedding (multi-dimensional vector) captures ideas such as:
- The context words appear in
- Analogies between pairs of words
- The frequency of certain words
- Relationships between similar words
Instead of a sparse one-hot vector, word embeddings give the model a meaningful dense vector to perform calculations on. Character embeddings serve the same purpose, turning a one-hot vector into something the model can work with efficiently.
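To make the idea concrete, here is a minimal NumPy sketch of what an embedding layer does, using a hypothetical five-character vocabulary and a made-up embedding size (neither comes from the original article). The key point is that an embedding is just a learned lookup table, and multiplying a one-hot vector by that table is the same as picking out one row:

```python
import numpy as np

# Hypothetical 5-character vocabulary mapped to integer indices.
vocab = {"a": 0, "b": 1, "c": 2, ".": 3, " ": 4}
embedding_dim = 3  # assumed size, for illustration only

# A one-hot vector for "c" is mostly zeros and carries no meaning by itself.
one_hot = np.zeros(len(vocab))
one_hot[vocab["c"]] = 1.0

# An embedding layer is a learned lookup table: one dense row per character.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

# Multiplying the one-hot vector by the table is equivalent to a row lookup.
dense_vector = one_hot @ embedding_matrix
assert np.allclose(dense_vector, embedding_matrix[vocab["c"]])
```

During training, the rows of `embedding_matrix` are updated by backpropagation just like any other weights, which is what "training my own embeddings" amounts to.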
There are a few pretrained character embeddings online, but I chose to train my own.
The machine learning model I used consisted of the following layers:
- Embedding: Learns an embedding vector for each character and passes these dense representations to the next layer.
- Stacked GRUs: Learn when to remember information and when to forget it, producing an output at each input time step.
- Regular Neural Network Layer: Takes the output of the last GRU at each time step and assigns a probability to each character using the softmax activation function.
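A sketch of this stack in `tf.keras` might look like the following. The layer order follows the list above, but every hyperparameter here (vocabulary size, embedding dimension, GRU units, sequence length) is an assumed placeholder, since the article does not state the actual values:

```python
import tensorflow as tf

# Assumed sizes; the article does not give the real hyperparameters.
vocab_size = 65      # number of distinct characters
embedding_dim = 64
gru_units = 256
seq_length = 100

model = tf.keras.Sequential([
    # Learns a dense vector for each character index.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Stacked GRUs; return_sequences=True emits an output at every time step.
    tf.keras.layers.GRU(gru_units, return_sequences=True),
    tf.keras.layers.GRU(gru_units, return_sequences=True),
    # Dense layer with softmax turns each per-step GRU output into a
    # probability distribution over the vocabulary.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# One batch of 8 sequences of character indices -> per-step probabilities.
dummy = tf.zeros((8, seq_length), dtype=tf.int32)
probs = model(dummy)
print(probs.shape)  # (8, 100, 65)
```

Because the GRUs return a full sequence, the final layer produces a next-character distribution for every position of the input, not just the last one, which matches the shifted-sequence training setup described below.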
To create the inputs and outputs, I concatenated all my training examples into a single string and split that string into many sequences of equal length. One training example consists of a single sequence as input and the same sequence shifted one character forward as the target.
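The input/target construction above can be sketched in a few lines of plain Python. The sample text and sequence length here are stand-ins, not the actual training data or hyperparameters:

```python
# Stand-in for the concatenated training text.
text = "the quick brown fox jumps over the lazy dog. " * 4
seq_length = 10  # assumed value, for illustration

# Split the text into chunks of seq_length + 1 characters; each chunk
# yields an input sequence and a target shifted forward by one character.
inputs, targets = [], []
for start in range(0, len(text) - seq_length, seq_length + 1):
    chunk = text[start : start + seq_length + 1]
    inputs.append(chunk[:-1])   # characters 0 .. seq_length - 1
    targets.append(chunk[1:])   # characters 1 .. seq_length

print(inputs[0])   # 'the quick '
print(targets[0])  # 'he quick b'
```

At every position, the target character is simply the next character of the input, so the model is trained to predict "what comes next" at each time step.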