Source: Deep Learning on Medium
What do you learn from context?
The fact that a model like DAN performs as well as the transformer raises a question: do our models actually capture word order, and is ordering as important as we thought?
So what do we learn from context? In this paper, the authors try to understand where contextual representations improve over conventional word embeddings.
Tasks taken for evaluation
The authors introduce a suite of “edge probing” tasks designed to probe the sub-sentential structure of contextualized word embeddings. These tasks are derived from core NLP tasks and encompass a range of syntactic and semantic phenomena.
They use these tasks to explore how contextual embeddings improve on their lexical (context-independent) baselines, focusing on four recent models for contextualized word embeddings: CoVe, ELMo, OpenAI GPT, and BERT.
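To make the setup concrete, here is a minimal sketch of how the input to a probing classifier could be built for one labeled edge. It assumes per-token vectors are already available and uses plain mean pooling over each span. The pooling choice, the function names (`mean_pool`, `probe_input`), and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # half-open token range [start, end)


def mean_pool(token_vectors: List[Vector], span: Span) -> Vector:
    """Average the token vectors inside a span (an assumed, simple pooling)."""
    start, end = span
    if not 0 <= start < end <= len(token_vectors):
        raise ValueError(f"invalid span {span}")
    rows = token_vectors[start:end]
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def probe_input(token_vectors: List[Vector], span1: Span, span2: Span) -> Vector:
    """Features for one edge: the classifier sees only the two pooled spans."""
    return mean_pool(token_vectors, span1) + mean_pool(token_vectors, span2)


# Toy 2-dimensional "embeddings" (hypothetical values) for: The cat ate the fish
tokens = ["The", "cat", "ate", "the", "fish"]
vectors = [[0.0, 1.0], [2.0, 1.0], [4.0, 0.0], [0.0, 1.0], [2.0, 3.0]]

# Predicate "ate" is span (2, 3); argument "the fish" is span (3, 5).
features = probe_input(vectors, (2, 3), (3, 5))
print(features)  # [4.0, 0.0, 1.0, 2.0]
```

A classifier trained on such features would then predict a label for the edge, for example a semantic role. Because it never sees tokens outside the two spans, any sentence-level information has to come from the embeddings themselves.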
ELMo, CoVe, and GPT all follow a similar trend (Table 2 in the paper): they show the largest gains on tasks considered largely syntactic, such as dependency and constituent labeling, and smaller gains on tasks considered to require more semantic reasoning, such as semantic proto-roles (SPR) and Winograd coreference.
How much information is carried over long distances (several tokens or more) in the sentence?
To estimate how much information is carried over long distances, the authors extend the lexical baseline with a convolutional layer, which allows the probing classifier to use local context. As shown in Figure 2 of the paper, adding a CNN of width 3 (±1 token) closes 72% (macro average over tasks) of the gap between the lexical baseline and full ELMo; this rises to 79% with a CNN of width 5 (±2 tokens).
Local context therefore accounts for most of ELMo's gain, but not all of it: the part of the gap that the CNN cannot close suggests that the remaining improvements ELMo brings are largely due to long-range information.
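The "gap closed" percentages above can be read as a simple ratio: how far the local-context model moves from the lexical baseline toward the full model, averaged over tasks. The sketch below works through that arithmetic. The task names and scores are hypothetical placeholders, not numbers from the paper, and the formula is my reading of the phrase rather than a definition quoted from the authors.

```python
from typing import Dict, Tuple


def gap_closed(lexical: float, local: float, full: float) -> float:
    """Fraction of the lexical-to-full gap recovered by the local-context model."""
    if full == lexical:
        raise ValueError("no gap to close: full and lexical scores are equal")
    return (local - lexical) / (full - lexical)


def macro_average(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of the per-task gap-closed fractions."""
    fractions = [gap_closed(*triple) for triple in scores.values()]
    return sum(fractions) / len(fractions)


# Hypothetical F1 scores per task: (lexical baseline, CNN over lexical, full model)
scores = {
    "task_a": (80.0, 89.0, 90.0),  # CNN closes 9 of 10 points
    "task_b": (70.0, 76.0, 80.0),  # CNN closes 6 of 10 points
}

print(f"{macro_average(scores):.2f}")  # 0.75
```

A macro average weights every task equally, so a task with few examples counts as much as a large one.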
The CNN models and the orthonormal encoder perform best when the two spans are close together, but their performance falls off rapidly as the token distance between the spans increases. (The model can access only the embeddings within the given spans, such as a predicate-argument pair, and must predict properties, such as semantic roles, which typically require whole-sentence context.)
The full ELMo model holds up better, with performance dropping only 7 F1 points between d = 0 tokens and d = 8, suggesting the pretrained encoder does encode useful long-distance dependencies.
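One way to run this kind of analysis yourself is to group examples by the distance between their two spans and compute F1 within each group. The sketch below assumes binary labels and defines distance as the number of tokens between the spans. That definition, the function names, and the toy examples are all assumptions for illustration; the paper may bucket distances differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)
Example = Tuple[Span, Span, bool, bool]  # (span1, span2, gold, predicted)


def span_distance(span1: Span, span2: Span) -> int:
    """Tokens separating two spans; 0 if they are adjacent or overlap."""
    (_, first_end), (second_start, _) = sorted([span1, span2])
    return max(0, second_start - first_end)


def f1_by_distance(examples: Iterable[Example]) -> Dict[int, float]:
    """Binary F1 computed separately for each span distance."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span1, span2, gold, pred in examples:
        bucket = counts[span_distance(span1, span2)]
        if gold and pred:
            bucket["tp"] += 1
        elif pred:
            bucket["fp"] += 1
        elif gold:
            bucket["fn"] += 1
    result = {}
    for distance, c in sorted(counts.items()):
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        result[distance] = 2 * c["tp"] / denom if denom else 0.0
    return result


# Hypothetical predictions, not data from the paper.
examples = [
    ((2, 3), (3, 5), True, True),     # d = 0, true positive
    ((2, 3), (3, 4), True, True),     # d = 0, true positive
    ((0, 1), (1, 2), False, True),    # d = 0, false positive
    ((0, 1), (9, 10), True, False),   # d = 8, false negative
    ((0, 2), (10, 12), True, True),   # d = 8, true positive
]

for distance, f1 in f1_by_distance(examples).items():
    print(distance, round(f1, 2))
# 0 0.8
# 8 0.67
```

Comparing these per-distance curves across encoders is what reveals whether a model's advantage survives as the spans move apart.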
Findings of the paper
First, in general, contextualized embeddings improve over their non-contextualized counterparts more on syntactic tasks (e.g. constituent labeling) than on semantic tasks (e.g. coreference), suggesting that these embeddings encode syntax more so than higher-level semantics.
Second, the performance of ELMo cannot be fully explained by a model with access to local context, suggesting that the contextualized representations do encode distant linguistic information, which can help disambiguate longer-range dependency relations and higher-level syntactic structures.