

[Paper] Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens (Rei and Søgaard, 2018 NAACL)

Link to paper.

I’ve recently been working on a project in my lab on metaphor detection using NLP techniques. It’s proven to be a nontrivial problem and, surprisingly to me, an active field of research.

Today I’ll be taking a look at the paper Zero-Shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens, written by Marek Rei and Anders Søgaard and published at NAACL 2018. I won’t include the nitty-gritty details, but will just focus on the methodology. You can check the results and implementation details in the original paper.

The reason that I decided to read this paper is because my team and I are currently trying to formalize our task as a zero-shot learning task. The task setting of this paper doesn’t align with ours 100%, but it never hurts to read.


The authors attempt zero-shot learning at the token level of text. They first train a model to classify sentences, and then use mechanisms like attention to extract information at the token level. This is formulated as a zero-shot learning task, since no token-level labels are given beforehand for training.

1. Introduction

As mentioned in the summary above, this paper tackles issues of sequence labeling when there are no labels for the tokens. For anyone unfamiliar, sequence labeling is basically the task of labeling the tokens in a sequence (e.g. PoS tagging).

As stated in the paper: “instead of training the model directly to predict the label for each token, the model is optimized using a sentence-level objective.”

The authors admit that this approach will not be able to perform better than the supervised models that are directly trained on tokens. However, the motivation of this paper is that it opens up possibilities for making use of text data where token-level information is either unavailable or difficult to deal with.

2. Network Architecture


The overall flow of the architecture (illustrated in the paper) is:

  1. Get the word embedding representation of each word (w_i).
  2. Run the word embeddings through a bidirectional LSTM and concatenate the forward and backward hidden state vectors to get h_i.
  3. Run the concatenation of hidden state vectors through a linear projection followed by a hyperbolic tangent activation function to get the representations e_i. This is run through a linear projection one more time to get \tilde{e_i}.
  4. In more typical situations like machine translation, the attention values a_i are obtained by normalizing the values of \tilde{e_i} via softmax normalization. This stops long sentences from having an “unfair” advantage, since longer sentences would always have larger magnitudes. However, that form of normalization may not be suitable for this setting since it assumes that there is one token with the correct label, which is not the case for sequence labeling. Instead, the authors first run the \tilde{e_i} through a logistic sigmoid function, and perform a softmax-like normalization without the exponential functions.
  5. After the attention values a_i are obtained, they are used to weight the concatenated hidden vector representations c.
  6. c is then run through a linear projection to obtain sentence representation d.
  7. d is run through a final linear projection and passed through a logistic sigmoid function. If the output y of this is above a threshold (0.5 in this case) then the sentence is labeled as “positive.”

The paper gives the corresponding equations for each of these steps.
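Steps 3–5 above (the projections and the sigmoid-based attention normalization) can be sketched in plain Python. The function and parameter names here are my own, and the parameters stand in for weights a real model would learn:

```python
import math

def attention_weights(h, W, b, w):
    """Sketch of steps 3-5: attention weights from BiLSTM hidden states.

    h: list of concatenated hidden-state vectors, one per token
    W, b, w: stand-ins for the learned projection parameters
    """
    # Step 3: e_i = tanh(W h_i + b), then project to a scalar score e~_i
    e_tilde = []
    for h_i in h:
        e_i = [math.tanh(sum(W[r][c] * h_i[c] for c in range(len(h_i))) + b[r])
               for r in range(len(W))]
        e_tilde.append(sum(w_r * e_r for w_r, e_r in zip(w, e_i)))
    # Step 4: logistic sigmoid instead of exp, then a softmax-like
    # normalization, so several tokens can receive high scores at once
    a_tilde = [1.0 / (1.0 + math.exp(-x)) for x in e_tilde]
    total = sum(a_tilde)
    a = [x / total for x in a_tilde]  # normalized weights, sum to 1
    # Step 5: attention-weighted sum of hidden states -> sentence vector c
    c = [sum(a_i * h_i[d] for a_i, h_i in zip(a, h))
         for d in range(len(h[0]))]
    return a_tilde, a, c
```

The key difference from standard softmax attention is in step 4: each unnormalized score a~_i is squashed into (0, 1) independently, so multiple tokens can score close to 1 before normalization.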

Loss Functions

Next, let’s look at how the authors came up with the loss functions. There are a total of three.

The first loss function works purely at the sentence level: was our model able to correctly classify the sentence?

The second and third loss functions deal with token-level information and are a bit trickier than the first. According to the authors, there are a couple of constraints that they use:

  1. “Only some, but not all, tokens in the sentence can have a positive label.”
  2. “There are positive tokens in a sentence only if the overall sentence is positive.”

Remember that \tilde{a_i} are the unnormalized attention weights. The second loss function pushes the smallest of those weights toward 0, which ensures that not all tokens receive positive labels, and the third loss function encourages the model to assign a large attention weight to at least one token in positive sentences (note that if \tilde{y} is 0, the maximum weight is also pushed toward 0).

The final loss function combines these as L = L1 + γ(L2 + L3),

where γ is a hyperparameter that controls how much weight the auxiliary loss functions carry.
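Putting the three losses together, here is a minimal per-sentence sketch in plain Python. The squared-error form, the function name, and the γ value are my assumptions for illustration, and I use the gold sentence label where the post writes \tilde{y}:

```python
def zero_shot_loss(a_tilde, y_pred, y_gold, gamma=0.01):
    """Combined per-sentence loss, following the description above.

    a_tilde: unnormalized (post-sigmoid) attention scores, one per token
    y_pred:  sentence-level prediction in (0, 1)
    y_gold:  gold sentence label, 0.0 or 1.0
    gamma:   weight of the auxiliary losses (value is illustrative)
    """
    L1 = (y_pred - y_gold) ** 2        # sentence-level classification error
    L2 = min(a_tilde) ** 2             # push the smallest score toward 0
    L3 = (max(a_tilde) - y_gold) ** 2  # tie the largest score to the label
    return L1 + gamma * (L2 + L3)
```

Note how the two constraints map directly onto L2 and L3: L2 enforces “not all tokens are positive” by driving at least one score to 0, and L3 enforces “positive tokens only in positive sentences” by tying the largest score to the sentence label.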

The authors also ran experiments with alternative methods, including gradient-based techniques and a simple frequency-based baseline. Their method outperforms the others.

Closing Thoughts

Overall, the paper is well-written. The authors could have included a bit more detail regarding how they came up with the loss function, as that does seem to be an important contribution of theirs, but all in all it was very informative.

I hadn’t thought that zero-shot learning in the sequence-labeling setting could be formulated this way, so +1 for creativity.

This is also the first paper that I decided to do an analysis piece on Medium. I’ve been telling myself that I want to do this, as it’s the perfect way for me to keep myself writing and also study. I’ll try to keep this up, but I definitely need to keep it shorter.