Learn to Schedule Communication between Cooperative agents

Source: Deep Learning on Medium


In multi-agent environments, one way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this article, we discuss a multi-agent reinforcement learning framework, called SchedNet proposed by Kim et al in ICLR 2019, in which agents learn how to schedule communication, how to encode messages, and how to act upon received messages.

Problem Setup

We consider multi-agent scenarios wherein the task at hand is of a cooperative nature and agents are situated in a partially observable environment. We formulate such scenarios into a multi-agent sequential decision-making problem, such that all agents share the goal of maximizing the same discounted sum of rewards. As we concern a method to schedule communication between agents, we impose two restrictions on medium access:

  1. Bandwidth constraint: The agent can only pass L bits message to the medium every time.
  2. Contention constraint: The agents share the communication medium so that only K out of n agents can broadcast their messages.

We now formalize MARL using DEC-POMDP(DECentralized Partially Observable Markov Decision Process), a generalization of MDP to allow a distributed control by multiple agents who may be incapable of observing the global state. We describe a DEC-POMDP by a tuple <S, A, r, P, 𝛺, O, 𝛾>​, where:

  • s ∈ S is the environment state, which is not available to agents
  • aᵢ ∈ A​ and oᵢ ∈ 𝛺​ are the action and observation for agent ​i ∈ N
  • r: S A^N → R is the reward function shared with all agents
  • P:S A^N → S​ is the transition function
  • O: S ⨉ N → 𝛺​ is the emission/observation probability
  • 𝛾​ denotes the discount factor



Figure.1 Architecture of ScheduleNet with two agents. Each agent has its own observations and networks that do not share with others. We use bold-face fonts to highlight aggregate notations of multiple agents

Before diving into details, we first take a quick look at the architecture(Figure1) to get an overview of what’s going on here. At each time step, each agent receives its observation, and pass the observation to a weight generator and an encoder to produce a weight value w​ and a message m, respectively​. All weight values are then transferred to a central scheduler, which determines which agents’ messages are scheduled to broadcast via a schedule vector c=[cᵢ]ₙ, cᵢ ∈{0, 1}​. The message center aggregates all messages along with the schedule vector ​c and then broadcasts selected messages to all agents. At last, each agent takes an action based on these messages and their own observations.

As we will see next, SchedNet trains all its components through the critic, following the decentralized training and distributed execution framework.

Weight Generator

Let’s start with the weight generator. The weight generator takes observation as input and outputs a weight value which is then used by the scheduler to schedule messages. We train the weight generator through the critic by maximizing Q(s,w)​, an action-value function. To get a better sense of what’s going on here, let’s take the weight generator as a deterministic policy network, and absorb all other parts except the critic into the environment. Then the weight generator and critic will form a DDPG structure. In this setup, the weight generator is responsible for answering the question: “what weight I generate could maximize the environment rewards from here on?”. As a result, we have the following objective

The objective for the weight generator, where we use bold-face fonts to highlight aggregate notations of multiple agents as we did in Figure1.

It is essential to distinguish s from o; s is the environment state, while o is the observation from the viewpoint of each agent.


Back when we described the problem setup, two constraints were imposed on the communication process. The bandwidth limitation L​ can easily be implemented by restricting the size of message m​. We now focus on imposing K​ on the scheduling part.

The scheduler adopts a simply weight-based algorithm, called WSA(Weight-based Scheduling Algorithm), to select K​ agent. We consider two proposals from the paper

  1. ​Top(k): Selecting top k​ agents in terms of their weight values
  2. Softmax(k)​: Computing softmax values​ for each agent i based on their weight values​, and then randomly selecting k​ agents according to this softmax values

The WSA module outputs a schedule vector c=[cᵢ]ₙ, cᵢ ∈{0, 1}​, where each cᵢ​ determines whether the agent ​’s message is scheduled to broadcast or not.

Message Encoder, Message Center, and action selector

The message encoder encodes observations to produce a message ​m. The message center aggregates all messages m, and select which messages to broadcast based on ​c. The resulting message mc​ is the concatenation of all selected messages. For example, if m=[000, 010, 111]​ and c=[101]​, the final message to broadcast is ​mc=[000111]. Each agent’s action selector then chooses an action based on this message and its observation.

We train the message encoders and action selectors via an on-policy algorithm, with the state-value function V(s)​ in the critic. The gradient of its objective is

where 𝜋​ denotes the aggregate network of the encoder and selector, and V​ is trained with the following objective


Two Different Training Procedure?

The authors train the weight generators and action selectors using different methods but with the same data source. Specifically, they train the weight generators using a deterministic policy-gradient algorithm(an off-policy method), while simultaneously training the action selectors using a stochastic policy-gradient algorithm(an on-policy method). This could be problematic in practice since the stochastic policy-gradient method could diverge under the training with off-policy data. The official implementation ameliorates this problem using a small replay buffer of ​ transitions, which, however, may impair the performance of weight generator training.

We could bypass this problem by reparameterizing the critic such that it takes as inputs state s​ and actions a₁, a₂, …​ and outputs the corresponding ​Q-value. In this way, we make both trained with off-policy methods. Another conceivable way is to separate the training process from environment interaction if one insists on stochastic policy-gradient methods. Note that it is not enough to simply separate the policy training since the update of the weight generator could change the environment state distribution.