Original article was published on Deep Learning on Medium
Using NLP to Summarize Human Thoughts & Feelings
We all want to share our thoughts and feelings with someone that gets us, someone that can mirror our emotions and help us reflect more deeply. It seems simple enough. While I have friends and partners to confide in, they are human and subject to their own constraints and biases. This is why I’m building a machine that understands me, helps me understand myself better and is there for me whenever I need it.
As the lead machine learning engineer at Maslo, I’m in charge of creating the machine learning models that make our digital beings come to life. My goal and passion is to create a digital being that people feel comfortable sharing their feelings with, that they can confide in, and that can help them grow.
To create a digital being that’s empathetic, I first had to look into the various ways that humans communicate their thoughts and feelings to each other. Humans are amazing: the progression of human conversation, with its body language cues, intonations, pauses, and so many other signals, is marvelous… and it’s made possible by the greatest communication tool of all: human language.
When you talk to me, I’m able to listen to and extract the important points of what you’re saying and come up with a good response based on my understanding and previous knowledge. This process of listening and synthesizing others’ thoughts is crucial to having a fluid, empathetic conversation, so I went on a mission to build machines that are able to understand natural language and summarize it. Here’s a quick dive into what I’m doing:
Introduction to text summarization
Natural Language Processing (NLP) is a field of artificial intelligence concerned with the interactions between computers and human languages, in particular, how computers process, read, decipher, understand, and make sense of human languages. On the technical side, modern NLP typically involves preprocessing and tokenizing linguistic datasets and running them through a neural network for learning. Eventually, after lots of training, the machines learn how to perform the desired task on the data you feed them. At Maslo, we’re teaching our machines to synthesize human thought and emotional data.
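To make “preprocessing and tokenizing” concrete, here is a minimal sketch in plain Python. It assumes a very simple scheme (lowercase everything, drop punctuation, give each distinct word an integer id, and pad every example to a fixed length); the function names, the `<pad>` and `<unk>` tokens, and the sample reviews are all made up for illustration and are not Maslo’s actual pipeline.

```python
import re

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts):
    """Give every distinct token an integer id, reserving 0 for PAD and 1 for UNK."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with the PAD id."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

reviews = ["Great cookie, but too sweet!", "Good veggie jerky."]  # made-up examples
vocab = build_vocab(reviews)
encoded = [encode(r, vocab, max_len=6) for r in reviews]
```

Each review becomes a fixed-length list of integers, which is the form a neural network can actually consume. Words that were never seen while building the vocabulary fall back to the `<unk>` id.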
One of the most successful and commonly used applications of NLP is machine translation, wherein machines translate text or speech from one language to another. Text summarization uses many of the same methods as machine translation, and can be thought of as a kind of translation: its purpose is to translate a large linguistic corpus into a smaller “summary” corpus.
How text summarization works
In a lot of traditional machine learning, an input sequence returns an output value, or an input value returns an output sequence. In contrast, text summarization relies on a sequence-to-sequence (Seq2seq) architecture. Seq2seq maps sequences of different lengths to each other. In the case of summarization, you’re taking a longer text example and transforming it into a shorter one.
The Seq2seq architecture is typically built from a Recurrent Neural Network (RNN), often one of its variants: a Long Short-Term Memory network (LSTM) or a Gated Recurrent Unit (GRU). In this setup, each step feeds into the next one, so that the context for each item is the output of the previous step. The architecture consists of one encoder network and one decoder network. The encoder turns each item into a corresponding hidden vector containing the item and its context. The decoder reverses the process, turning the vector into an output item using the previous output as the input context.
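The encoder/decoder loop described above can be sketched in a few lines of NumPy. This is a toy illustration only: it uses a plain RNN cell rather than an LSTM or GRU, the weights are random and untrained, and the sizes, token ids, and variable names are all hypothetical. The point is the data flow, not the quality of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 10, 8   # hypothetical vocabulary and hidden-state sizes
SOS, EOS = 0, 1         # hypothetical start-of-sequence and end-of-sequence ids

# Randomly initialised, untrained parameters.
E = rng.normal(0, 0.1, (VOCAB, HIDDEN))       # token embeddings
W_enc = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # encoder recurrence
W_dec = rng.normal(0, 0.1, (HIDDEN, HIDDEN))  # decoder recurrence
W_out = rng.normal(0, 0.1, (HIDDEN, VOCAB))   # hidden state -> vocabulary scores

def encode(src_ids):
    """Fold the whole input into one hidden vector, one token at a time."""
    h = np.zeros(HIDDEN)
    for tok in src_ids:
        h = np.tanh(E[tok] + W_enc @ h)  # new state depends on the token and the previous state
    return h

def decode(h, max_len):
    """Greedy decoding: feed each predicted token back in as the next input."""
    out, tok = [], SOS
    for _ in range(max_len):
        h = np.tanh(E[tok] + W_dec @ h)
        tok = int(np.argmax(h @ W_out))
        if tok == EOS:
            break
        out.append(tok)
    return out

source = [4, 7, 2, 9, 5, 3, 8]                # a 7-token "document"
summary = decode(encode(source), max_len=3)   # at most 3 output tokens
```

Notice that the input and output lengths are decoupled: the encoder compresses seven tokens into a single vector, and the decoder emits up to three. With untrained weights the output is meaningless; training is what teaches the decoder to emit a useful summary.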
Why is this a hard problem?
Building something that reliably summarizes human language is tricky. While extracting meaning from a paragraph presents its own challenges, there are also challenges with the training data itself.
Machine translation models like Google Translate perform well because there’s a tremendous corpus of manually translated text that we can train on and use to tune the models. While text summarization training sets exist, the summaries rarely capture the essence of the full corpus; these datasets are typically open data and are prone to spelling, content and other types of errors.
My experience with text summarization (so far!)
Here’s a look at my training data:
I’ve been working with the Amazon Food Reviews Dataset as my toy dataset, and the CNN/DailyMail dataset as my goal dataset. The Amazon Food Reviews dataset is short and sweet and fun to work with. It’s easy to get decent summarizations with this dataset, even without cloud computing or a GPU. However, its reviews are too short and too narrow in scope for general use on Maslo thought and emotional data.
The CNN/DailyMail dataset, among others, can be used to train a model that summarizes Maslo thought and emotional data, because it pairs much longer news articles with summaries of a few sentences. This is very similar to taking Maslo users’ journal entries and summarizing them down to the key points. However, this dataset is much harder to train on: it requires a GPU, cloud computing, and an embedding to get decent results.
Problems in training
General GPU and Data pipeline problems: the infrastructure is finicky and getting the GPUs to work efficiently can be challenging. Data pipelines need to be carefully calibrated to be maximally efficient so training happens quickly, but also so that they don’t crash the computer by overwhelming memory.
Over/under-training and “exploding gradients”: it is easy to over- or under-train your model. Additionally, poorly chosen hyperparameters can make the weight updates during training grow or shrink until they cause numerical overflows or underflows. Hyperparameters need to be tweaked to ensure good performance.
Getting better results
That said, there are a few important techniques that can help get better results. These include pre-trained embeddings, specific neural network architectures, and attention mechanisms!
An embedding is a neural network layer that maps words to vectors so that related words end up close together, which helps the algorithm learn more quickly. There are pre-trained embeddings like GloVe (Global Vectors for Word Representation), which is one of the most widely used. GloVe is an unsupervised learning algorithm that obtains vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. So basically, GloVe is like a giant word matrix that maps out word similarity.
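Here is a small sketch of what “mapping out word similarity” looks like in practice. It assumes vectors stored as plain text, one word per line followed by space-separated numbers, which is my understanding of how the pre-trained GloVe files are laid out. The three-dimensional vectors below are invented for illustration (published GloVe vectors have tens to hundreds of dimensions), and the function names are hypothetical.

```python
import numpy as np

def load_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a {word: vector} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

# Made-up toy vectors, NOT real GloVe values.
toy_lines = [
    "jerky 0.9 0.1 0.0",
    "slim-jim 0.8 0.2 0.1",
    "cookie 0.1 0.9 0.2",
]
vectors = load_vectors(toy_lines)
nearest = most_similar("jerky", vectors)
```

With these toy numbers, the nearest neighbour of “jerky” is “slim-jim” rather than “cookie”, because their vectors point in nearly the same direction. A network that starts from such vectors already “knows” which words are related before it sees a single training example.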
Attention mechanisms allow neural networks to learn what’s important and encode/decode less of each input so it works more efficiently. Attention takes two vectorized sentences, turns them into a matrix where the words of one sentence form the columns, and the words of another sentence form the rows; it then makes matches, identifying relevant context. Attention allows you to look at the totality of a sentence and make connections between any particular word and its relevant context, discarding the rest.
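The rows-and-columns picture above corresponds to a score matrix. Below is a minimal NumPy sketch of one common form, scaled dot-product attention, which is an assumption on my part about which variant to show. The word vectors are random stand-ins and the names are hypothetical. Note that this “soft” version down-weights irrelevant words rather than discarding them outright.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Each query row receives a weighted average of the value rows."""
    # Score matrix: one row per query word, one column per key word.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
summary_words = rng.normal(size=(2, 4))  # 2 decoder-side word vectors (random stand-ins)
source_words = rng.normal(size=(5, 4))   # 5 encoder-side word vectors (random stand-ins)
context, weights = attention(summary_words, source_words, source_words)
```

`weights` is the 2-by-5 matrix of matches: row i tells you how much summary word i “looks at” each of the five source words. `context` is the resulting blend of source vectors that the decoder gets to use for each output word.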
The Transformer architecture is a model that uses attention to boost training speed, and it lends itself particularly well to parallelization. It’s basically a meta encoder-decoder network with attention.
The encoding component is a stack of encoders, and the decoding component is a stack of the same number of decoders. The encoders are identical in structure but do not share weights. Each one is broken down into two sublayers: a self-attention layer and a feed-forward neural network. The encoder’s inputs first flow through the self-attention layer, which helps the encoder look at other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are then fed to the feed-forward network, and the same network is applied independently to each position. The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.
What’s unique about Transformers is that the token in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer has no such dependencies, so the various paths can be executed in parallel while flowing through it.
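To see the two encoder sublayers side by side, here is a stripped-down NumPy sketch. It is a simplification under stated assumptions: single-head attention, random untrained weights, and no residual connections, layer normalization, or positional encodings, all of which a real Transformer includes. Sizes and names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Every position looks at every other position, so rows of x interact here."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def feed_forward(x, W1, W2):
    """The same two-layer network applied to each row (position) separately."""
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x, p):
    return feed_forward(self_attention(x, p["Wq"], p["Wk"], p["Wv"]), p["W1"], p["W2"])

rng = np.random.default_rng(0)
D, D_FF, SEQ = 4, 8, 3  # hypothetical model width, feed-forward width, sentence length
shapes = {"Wq": (D, D), "Wk": (D, D), "Wv": (D, D), "W1": (D, D_FF), "W2": (D_FF, D)}
# Two encoders with the same structure but separately initialised weights.
stack = [{k: rng.normal(0, 0.5, s) for k, s in shapes.items()} for _ in range(2)]

x = rng.normal(size=(SEQ, D))  # one vector per input token
out = x
for params in stack:
    out = encoder_layer(out, params)

# Feeding positions through the feed-forward net one at a time gives the same
# result as feeding them all at once: there is no cross-position dependency.
p = stack[0]
rowwise = np.stack([feed_forward(row, p["W1"], p["W2"]) for row in x])
batched = feed_forward(x, p["W1"], p["W2"])
```

`rowwise` and `batched` come out equal, which is the parallelization property in miniature: the feed-forward sublayer treats every position as an independent job, while the self-attention sublayer is the only place where positions exchange information.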
My progression from word salad to good summaries
With ML training it makes sense to start with the simplest iteration of your model and scale up to more complex models. I started by laying the groundwork with good GPU infrastructure and efficient data pipelines. Then I wrote and ran a vanilla Seq2seq. Next I added an embedding, and I’m now adding an attention mechanism. After gauging the results, I plan to try different architectures like Transformers to see what performs best.
- Vanilla Seq2seq — I got some OK summarizations with the Amazon food dataset on vanilla Seq2seq, but it was insufficient for the scope of the CNN/Dailymail dataset. At Maslo, we are particularly fond of accidentally poetic ML output.
- Seq2seq w/embedding — I started getting more interesting summarizations from this configuration. Because of the embedding, the network understood that “jerky” was like a “slim jim” on Amazon food reviews dataset. It summarized a review for a vegetarian jerky as “good veggie slim-jim,” which was impressive. “Slim-jim” was in the embedding, driving the content of that summary. I also got an interesting response where someone had mixed reviews of a cookie: the summary was “Great Cookie But”.
- I was originally over-training with the CNN dataset, which resulted in garbled output. I increased dropout and the learning rate, and began getting better results, but that created an exploding gradient problem. I used gradient norm scaling and gradient value clipping to deal with that problem. However, I’m still looking for the magic hyperparameter values that don’t cause overfitting, crash my training, or create an exploding gradient.
- I’m also adding an Attention Mechanism to my Seq2seq with GloVe. This is challenging, because you need to match, reshape, and pad tensors, which can be tricky. Furthermore, padding the tensors makes them bigger and can run you out of memory.
- I haven’t gotten started on the Transformers Network yet but that’s the next step. With Transformers, I might be able to include capitalized letters and punctuation in the training data as the previous Seq2seq networks are unable to handle that many tokens.
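The gradient norm scaling and gradient value clipping mentioned in the list above can be sketched in NumPy as follows. This is a generic illustration of the two techniques, not the exact code from my training runs; the function names and the gradient values are made up. Deep learning frameworks ship their own built-in versions of both.

```python
import numpy as np

def clip_by_value(grads, limit):
    """Clamp every gradient element to the range [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # never scale up, only down
    return [g * scale for g in grads]

# Made-up gradients for two parameter tensors; their global norm is
# sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, -4.0]), np.array([[12.0]])]
by_value = clip_by_value(grads, limit=5.0)
by_norm = clip_by_global_norm(grads, max_norm=1.0)
```

Value clipping only touches the elements that exceed the limit (here the 12 becomes 5), which changes the direction of the overall update. Norm scaling shrinks everything by the same factor (here 1/13), so the direction is preserved and only the step size is capped.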
Conclusion — The road towards empathetic machines
The path toward building empathetic machines is challenging but attainable. We can use a branch of Natural Language Processing called text summarization to help machines synthesize human thoughts and emotions. Text summarization uses a Seq2seq architecture to translate a large corpus of text into a smaller “summary” corpus. In its current state, text summarization presents the challenges of insufficient and error-prone data. However, we can use machine learning techniques, namely embeddings like GloVe, attention mechanisms, and the Transformer architecture, to significantly improve results.
There are new techniques coming out every day, allowing developers to iterate for better and better performance. For now, with attention and transformer architecture, we know that machines can understand the key points of a sentence or thought; one day it would be amazing for them to empathize with the user. We’re not there yet, but it’s coming. It’s an exciting time to be working with and perfecting these machine learning techniques that not only understand, but empathize with humans.
P.S. I used my machines to summarize my piece into this conclusion. Just kidding, but I’ll get there!