Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols

Serhii Havrylov and Ivan Titov (ML Research Partners from University of Edinburgh/University of Amsterdam)

Language has been an essential tool for human civilization to transfer knowledge to new generations. The origin of language has captivated people’s minds for centuries and has given rise to numerous studies.

However, until recently, almost all mathematical models used to examine its emergence were restricted to low-dimensional, simple observation spaces due to algorithmic and computational limitations. In recent years, the deep learning community has shown considerable interest in this problem. In the following post, we lay out the main contributions to linguistics and machine learning that emerged from our joint research project with the machine learning research team at SAP.

Playing a referential game

One of the most basic challenges of using a language is referring to specific things. Thus, it is not surprising that the referential game is a go-to setting in the learning-to-communicate field. These games consist of a number of confined interactive reasoning tasks and are used to examine the pragmatic inference of machines in a controlled setting. While many extensions to the basic referential game are possible, we decided to proceed with the following setup:

  1. A target image is chosen from a collection of images, along with \(K\) distracting images.
  2. There are two agents: a sender and a receiver.
  3. After seeing the target image, the sender has to come up with a message, represented by a sequence of symbols from a vocabulary of fixed size. The sequence cannot exceed a maximum possible length.
  4. Given the generated message and the set of images consisting of distracting images and the target image, the receiver should identify the correct target image.

Consequently, in order to succeed in this referential game, the sender has to carefully choose words and arrange them in a sequence that makes it easy for the receiver to identify the target image. This setting differs fundamentally from previous studies in the area: our approach, for instance, uses sequences rather than single symbols to form messages, which makes the setting both more realistic and more challenging from a learning perspective.
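To make the setup concrete, the four steps above can be sketched as a single game round. The `sender` and `receiver` callables and the tensor shapes below are hypothetical stand-ins for the actual models described later in the post, not the authors’ code:

```python
import torch

def play_round(sender, receiver, target_image, distractors):
    """One round of the referential game; True if the receiver wins."""
    # 1. The sender sees only the target image and emits a message:
    #    a sequence of symbol indices from a fixed vocabulary.
    message = sender(target_image)

    # 2. The receiver sees the message plus the shuffled candidate set
    #    (the target mixed in with the K distracting images).
    candidates = torch.stack([target_image] + distractors)
    perm = torch.randperm(candidates.size(0))
    candidates = candidates[perm]
    target_pos = (perm == 0).nonzero(as_tuple=True)[0].item()

    # 3. The receiver scores every candidate given the message and
    #    points at the one it believes is the target.
    scores = receiver(message, candidates)
    return scores.argmax().item() == target_pos
```

With any concrete sender and receiver plugged in, the fraction of rounds for which this function returns `True` is exactly the communication success rate discussed below.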


Both agents, sender and receiver, are implemented as recurrent neural networks, namely long short-term memory (LSTM) networks, which are a standard tool for generating and processing sequences. The figure below shows a sketch of the model, where solid arrows represent deterministic computations, dashed arrows depict copying a previously obtained word, and diamond-shaped arrows represent sampling a word from the vocabulary.

The sampling step is probably the most important and most troublesome part of the model. On the one hand, it is crucial because it is where the sender decides what to say next. On the other hand, it is troublesome because it is stochastic. The ubiquitous backpropagation algorithm relies on chains of continuous, differentiable functions in each layer of the neural network. This architecture, however, contains non-differentiable sampling from a discrete probability distribution, which means that we cannot use backpropagation right away.

The visual system of a sender is implemented as a convolutional neural network (CNN). In our case, images are represented by outputs of the penultimate hidden layer of the CNN. As you can see from the figure above, a message is obtained by sequentially sampling until the maximum possible length is reached or the special token “end of a message” is generated.
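A minimal sketch of such a sender may help: CNN features initialize an LSTM that samples one symbol per step, as described above. The layer sizes and vocabulary are hypothetical, this is not the authors’ implementation, and early stopping at the end-of-message token is omitted for brevity:

```python
import torch
import torch.nn as nn

class Sender(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=100, hidden=256, max_len=5):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)  # CNN features -> initial state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)
        self.max_len = max_len

    def forward(self, image_features):
        # Initialize the LSTM state from the penultimate-layer CNN features.
        h = torch.tanh(self.init_h(image_features))
        c = torch.zeros_like(h)
        # Hypothetical start-of-sequence token (index 0).
        token = torch.zeros(image_features.size(0), dtype=torch.long)
        message = []
        for _ in range(self.max_len):
            h, c = self.cell(self.embed(token), (h, c))
            logits = self.out(h)
            # The non-differentiable step: sample the next symbol.
            token = torch.distributions.Categorical(logits=logits).sample()
            message.append(token)
        return torch.stack(message, dim=1)  # (batch, max_len) symbol indices
```

The `Categorical(...).sample()` call is exactly the diamond-shaped arrow in the figure, and the reason plain backpropagation fails for the sender.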


It is relatively easy to learn the behavior of the receiver agent within the context of the referential game. Since it is end-to-end differentiable, gradients of the loss function with respect to its parameters can be estimated efficiently. The real challenge is to learn the sender agent: its computational graph contains sampling, which makes it non-differentiable. As a baseline, we implemented the REINFORCE algorithm. This method provides a simple way of estimating gradients of the loss function with respect to the parameters of a stochastic policy. Even though the estimator is unbiased, it usually has a huge variance, which slows down learning. Fortunately, last year two groups independently discovered a biased but low-variance estimator: the Gumbel-Softmax estimator (GS estimator). It relaxes the original discrete variable into a continuous counterpart. This makes everything differentiable, which allows the application of the backpropagation algorithm. As this topic is quite large and deserves a post of its own, we encourage you to read a blog post from one of the authors of this method.
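A rough sketch of the Gumbel-Softmax trick itself: adding Gumbel noise to the logits and applying a temperature-controlled softmax yields an approximately one-hot sample through which gradients can flow. PyTorch ships this as `torch.nn.functional.gumbel_softmax`; the manual version below shows what happens under the hood, with arbitrary sizes:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    """Draw a differentiable, approximately one-hot sample."""
    # Gumbel(0, 1) noise via the inverse-CDF trick; eps guards log(0).
    gumbels = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    # Lower tau -> closer to a discrete one-hot sample, higher variance of gradients.
    return F.softmax((logits + gumbels) / tau, dim=-1)

logits = torch.randn(1, 10, requires_grad=True)
soft_sample = gumbel_softmax_sample(logits, tau=0.5)  # sums to 1 over the vocabulary
soft_sample.max().backward()                          # gradients now reach the logits
```

The relaxed sample can then stand in for the sender’s discrete word during training, which is what makes the whole model trainable with backpropagation.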

Our findings

The first thing we examined after learning the model was the communication success rate. We consider communication between two agents successful when the target image is identified correctly. As one can see from the figure below, the results using the Gumbel-Softmax estimator (red and blue curves) are better than those of the REINFORCE algorithm (yellow and green curves), except when agents are allowed to communicate only using one word.

We assume that in this relatively simple setting, the variance of REINFORCE is not an issue and its unbiasedness pays off, whereas the bias of the GS estimator pulls it away from the optimal solution. The plot also matches intuition: it clearly shows that using more words allows an image to be described more precisely.

We also investigated how many interactions between the agents are needed to learn the communication protocol. Much to our surprise, the number of updates required to reach training convergence with the GS estimator (green curve) decreases when we let the sender use longer messages. This behavior is slightly counterintuitive, as one might expect a protocol to be harder to learn when the search space of communication protocols is larger. In other words, using longer sequences helps the agents learn a communication protocol faster. However, this is not the case for the REINFORCE estimator (red curve): it usually takes about five times as many updates to converge as the GS estimator, and there is no clear dependency between the number of updates needed to converge and the maximum possible message length.

Moreover, we plotted the perplexity of the encoder, which arguably measures how many options the sender chooses from at each time step while sampling from the probability distribution over the vocabulary. For the GS estimator (green curve), the number of options is relatively high and increases with sentence length, whereas for the REINFORCE algorithm (red curve) the increase in perplexity is not as rapid. This implies redundancy in the encodings: there exist multiple paraphrases that encode the same semantic content.
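For reference, per-step perplexity can be computed as the exponential of the entropy of the sender’s output distribution, i.e. the effective number of equally likely choices at that step. This is a sketch of the metric, not the authors’ exact evaluation code:

```python
import torch

def perplexity(probs, eps=1e-12):
    """probs: (steps, vocab) distribution at each time step."""
    # Entropy in nats per step; eps guards log(0) for zero-probability words.
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    return entropy.mean().exp()

uniform = torch.full((3, 8), 1 / 8)  # 8 equally likely words at every step
print(perplexity(uniform))           # ≈ 8: all options remain open
peaked = torch.eye(8)[:3]            # a deterministic choice at every step
print(perplexity(peaked))            # ≈ 1: effectively a single option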

What does the learned language look like? Aiming to better understand its nature, we inspected a small subset of sentences produced by the model with a maximum message length of five symbols. First, we took a random photo of an object and generated a message. Then we iterated over the dataset and randomly selected images whose messages share prefixes of 1, 2, and 3 symbols with the generated message.

For example, the first row of the left figure, which uses a subset of animal images, shows some samples that correspond to the code (5747 * * * * ). Here “*” stands for any word from the vocabulary or end-of-sentence padding.

However, the images for the ( * * * 5747 * ) code do not seem to correspond to any predefined category. This suggests that word order is crucial in the developed language: word 5747 in the first position encodes the presence of an animal in the image. The same figure shows that the message (5747 5747 7125 * * ) corresponds to a particular species of bears, which suggests that the developed language implements some kind of hierarchical coding. This is particularly interesting because the model was never explicitly constrained to use a hierarchical encoding scheme. Presumably, this scheme helps the model describe unseen images efficiently, although natural language uses other principles to ensure compositionality. The model seems to be generally applicable, as it shows similar behavior for images in the food domain (right image in the figure above).
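The prefix-based inspection described above amounts to grouping messages by their first k symbols. In the sketch below, the image identifiers and most codes are illustrative stand-ins (only word 5747 is taken from the text):

```python
from collections import defaultdict

def group_by_prefix(messages, k):
    """messages: list of (image_id, tuple_of_symbols) pairs."""
    groups = defaultdict(list)
    for image_id, msg in messages:
        groups[msg[:k]].append(image_id)
    return groups

msgs = [("bear1", (5747, 5747, 7125)),
        ("bear2", (5747, 5747, 7125)),
        ("duck1", (5747, 21, 9))]
print(group_by_prefix(msgs, 1))  # all three share prefix (5747,): "animal"
print(group_by_prefix(msgs, 3))  # only the bears share the full code
```

In a hierarchical code, short shared prefixes pick out broad categories and longer ones narrow down to specific ones, which is exactly the pattern observed in the figures.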

In our study, we have shown that agents modeled using neural networks can successfully invent an efficient language that consists of sequences of discrete tokens. We also found that agents can develop a communication protocol more quickly when we allow them to use longer sequences of symbols. Moreover, we observed that the induced language implements a hierarchical encoding scheme and there exist multiple paraphrases that encode the same semantic content. In future work, we would like to extend this approach to modeling goal-oriented dialogue systems.

Chatbots and conversational AI platforms have become increasingly important in the enterprise sphere, especially in the banking, insurance, and telecommunications sectors. However, current approaches to building these technologies still rely on extensive human supervision: humans either need to construct rules or provide examples of successful dialogs, which are then used to train the intelligent assistants. This is hard to scale to complex tasks because quality supervision is expensive and time-consuming. Moreover, human approaches might be inconsistent, or there might be more effective ways to solve the tasks. Our approach holds the potential to substitute or supplement this standard scenario: chatbots could use feedback on task completion as an additional, cost-effective form of supervision. At some point, this might help build successful digital assistants in less time and at lower cost. We also expect that this would enable machines to cope with new scenarios and with changes in existing settings without explicit human intervention or the need for new data sets.

We presented our work at NIPS’17. For more information and the technical details of our study, please check: Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols.

Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols was originally published in SAP Leonardo Machine Learning Research on Medium.
