Source: Deep Learning on Medium
Is Your NLP Model Robust?
Let me ask you a question. If I tell you that “John killed Mary”, would you infer from that statement that “Mary killed John” is true?
Well, even a five-year-old with minimal language understanding would probably not make this mistake. However, some state-of-the-art NLP systems predict with 92% confidence that “John killed Mary” entails “Mary killed John”.
Let us forgive our NLP models for this silly mistake, since Natural Language Inference is typically a hard task. Let us instead try a simpler task on our NLP models, namely sentiment analysis.
“A great white shark bit me on the beach” — Is this sentence positive or negative?
Try this sentence on a sentiment analysis system. Many sentiment analysis systems, including Google Cloud NLP, Microsoft’s Text Analytics service, and even Stanford’s deep-learning-based sentiment model, predict that it carries a high positive sentiment. However, as any human being will easily determine, this sentence does not carry a positive sentiment.
Why are our state of the art NLP systems so brittle?
Current NLP systems are based on associative learning. Consider a sentiment analysis system for movie reviews. A user sees a movie, likes it, and wants to rate it positively. He writes a review which conveys his positive feeling towards the movie. The cause is his positive sentiment towards the movie, and the manifestation/effect of that cause is the review he writes.
Instead of learning “what constitutes the sentiment of this review”, NLP models are tasked with learning “which reviews are likely to be associated with a positive label and which with a negative label”.
If Correlation Does Not Imply Causation, What Can My Model Do?
NLP models are shown a huge set of sentences labelled as positive or negative. They must implicitly learn which characteristics/features of a sentence associate it with a positive or negative label. The features the model identifies as predictive of the label, based on the training data it has seen, can either be semantically relevant (features which actually reflect the underlying sentiment) or they can be spurious associative patterns.
The NLP model itself does not know which features truly reflect the underlying semantics and which are spurious associative patterns. It simply learns every feature in the training data that is predictive of the associated label, with no idea of whether that feature is relevant or spurious.
To make this idea concrete, think of a sentiment analysis dataset in which all references to Donald Trump occur in negatively labelled sentences. The model ends up learning that the word “Trump” is highly associated with negative sentiment. If it is shown a test sentence containing the word “Trump”, such as “this story is just like Trump”, it is likely to label it as negative in the absence of any other sentiment-bearing words.
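This failure mode is easy to reproduce with a toy associative model. The sketch below (a crude word-count scorer, not any real system's algorithm, trained on a hypothetical five-sentence dataset where every mention of "trump" happens to be negatively labelled) learns exactly the spurious association described above:

```python
from collections import Counter

# Hypothetical toy training set with an assumed bias: every sentence
# mentioning "trump" carries a negative label.
train = [
    ("a wonderful heartfelt story", "pos"),
    ("great acting and a great script", "pos"),
    ("trump rally coverage was chaotic", "neg"),
    ("another trump scandal unfolds", "neg"),
    ("boring plot and weak dialogue", "neg"),
]

# Count how often each word co-occurs with each label.
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())

def predict(text):
    """Score a sentence by summing per-word label co-occurrence counts
    and return the higher-scoring label."""
    scores = {lbl: sum(c[w] for w in text.split()) for lbl, c in counts.items()}
    return max(scores, key=scores.get)

# "trump" alone drags the prediction negative, even though the test
# sentence contains no sentiment-bearing words at all.
print(predict("this story is just like trump"))  # → neg
```

The model never sees any evidence distinguishing a relevant feature from a spurious one; the word "trump" is simply the strongest signal available in the (biased) counts.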
Consider our earlier example: “A great white shark bit me on the beach”. The NLP model has likely seen the words “great” and “white” in positive-sentiment-bearing sentences, so it learns them as predictive of positive sentiment. Given the infrequent occurrence of the word “bit” as a verb, the model may not be able to attribute any sentiment to it. It therefore ends up using the spurious pattern “great white” to wrongly attach a positive label to this sentence.
Is My NLP Model Worse Than a 5 Year Old in Natural Language Inference?
Think about how NLP models wrongly conclude that “John killed Mary” implies “Mary killed John”. The likely reason is yet another spurious surface pattern in the data used to train the model: most examples labelled as entailment exhibit a high degree of lexical overlap between premise and hypothesis. NLP models therefore learn a spurious association between the degree of lexical overlap and the entailment label, which causes them to wrongly predict the example above as entailment.
How can we make our NLP models robust?
Given that NLP models are brittle because of the spurious cues/patterns they pick up from the training data, what can we do to make our models resistant to these shallow statistical patterns? We will discuss this further in our next post. Meanwhile, if you are interested in exploring this topic further, here are a few related papers to look at: