Decentralized and Scalable Multi-Agent Reinforcement Learning

When we think about training or learning processes in deep learning solutions, we typically visualize centralized models. In those architectures, a set of central nodes collects and curates the datasets used to train models, which are then deployed across different nodes in a network. Even distributed scenarios such as multi-agent reinforcement learning (MARL), which can include tens of thousands of nodes running a model, rely on a handful of centralized nodes for learning.

Centralized learning is conceptually simple to implement but incredibly hard to scale. Imagine an Internet of Things (IoT) scenario with hundreds of thousands of devices collecting data and executing a reinforcement learning model. If each agent needs to collect data, send it to a central server and interact with that server to optimize its learning policy, the complexity of the architecture increases linearly with the number of agents. Furthermore, in many distributed scenarios, we would like agents to learn and optimize their policies in real time, which is almost impossible to achieve with centralized models. Recently, researchers from the artificial intelligence (AI) powerhouse Prowler published a paper introducing a method for what they called “Distributed Actor-Critic Reinforcement Learning”. The proposed learning method is called Diff-DAC. I prefer to refer to it as decentralized learning, as it targets MARL topologies that are not only distributed but also lack central coordinators.

The Task Similarity Learning Principle

Multi-agent reinforcement learning (MARL) scenarios are, practically speaking, among the most complex deep learning architectures to implement. Game theory, distributed programming and unsupervised learning all collide in MARL scenarios to create an incredibly challenging environment for data scientists and developers. Consider a MARL model with hundreds of thousands of nodes that can learn several tasks. In a typical centralized MARL topology, the complexity of the architecture is dictated by two separate factors: the number of nodes and the number of tasks. As more nodes are added to the network, communication with the centralized coordinator becomes more complex. As the agents need to learn new tasks, the central coordinator is forced to coordinate learning policies across an arbitrary number of nodes in the network.

Diff-DAC is based on a very simple but incredibly powerful observation: “Similar tasks in MARL scenarios tend to have similar learning policies”. When adjusting temperatures in a wireless network of thermostats, for instance, or setting meeting agendas via virtual assistants, tasks can be alike enough to be performed using similar policies. I like to call this insight the Task Similarity Learning Principle, and it can lead to powerful optimization models in MARL scenarios.


The Task Similarity Learning Principle basically means that, if an RL agent learns a policy for a specific task, other agents in the network performing similar tasks can reuse that policy. Building on that idea, Diff-DAC structures a MARL topology as a connected graph in which there are paths between nodes performing similar tasks. In that model, each agent learns from data gathered and processed for its own task. It then exchanges learned parameters with only its closest neighbors, so that all agents benefit from their neighbors’ learning processes. In the illustration of the Diff-DAC MARL approach, colors represent the local consensus of learned parameters spreading through the network. Eventually, the network would converge to a single solution (and color) for all the tasks.
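The learn-then-share loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the Diff-DAC algorithm itself: the actor-critic update is replaced by a gradient step on a simple quadratic loss, each agent holds a single scalar parameter, and the ring graph, the uniform averaging weights and every name in the code (ring_neighbors, diffusion_step, run_diffusion) are assumptions made for the example.

```python
def ring_neighbors(n):
    """Hypothetical topology: agent i is linked to itself and its two ring neighbors."""
    return {i: [(i - 1) % n, i, (i + 1) % n] for i in range(n)}


def diffusion_step(theta, targets, neighbors, step_size):
    """One round of local learning followed by neighbor averaging."""
    # Adapt: each agent takes a gradient step on its own task loss,
    # here the stand-in loss 0.5 * (theta - target) ** 2.
    adapted = [th - step_size * (th - tg) for th, tg in zip(theta, targets)]
    # Combine: each agent averages the adapted parameters of its neighborhood only.
    return [
        sum(adapted[j] for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(theta))
    ]


def run_diffusion(targets, step_size=0.05, iterations=500):
    """Run the adapt-and-combine loop from an all-zero starting point."""
    n = len(targets)
    neighbors = ring_neighbors(n)
    theta = [0.0] * n
    for _ in range(iterations):
        theta = diffusion_step(theta, targets, neighbors, step_size)
    return theta


# Six agents, each with a slightly different task (a different target value).
targets = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
final_theta = run_diffusion(targets)
print("spread of task targets:   ", max(targets) - min(targets))
print("spread of learned params: ", max(final_theta) - min(final_theta))
print("network average:          ", sum(final_theta) / len(final_theta))
```

In this toy setting no agent ever sees more than its two neighbors, yet the learned parameters end up clustered around the average of all six targets. With a constant step size the agents settle close to a common value without reaching exact agreement; a smaller step size tightens the cluster.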

The Diff-DAC architecture is completely decentralized. The model replaces a central coordinator with a connected graph in which the agents learn independently and then share some intermediate parameters with their neighbors. By communicating with each other, nearby agents tend towards consensus. As information is diffused across the network, every agent benefits from every other agent’s learning process. Since agents can only communicate with their neighbors, the computational complexity and communication overhead per agent grow linearly with the number of neighbors instead of the total number of agents.
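To make the scaling argument concrete, here is a minimal sketch that counts how many messages the busiest node handles in one exchange round under two topologies. The star graph (a hypothetical "hub" coordinator linked to every agent) stands in for a centralized design and the ring graph for a decentralized one; both graphs and all function names are assumptions chosen for illustration, not topologies taken from the paper.

```python
def ring_topology(n_agents):
    """Decentralized graph: every agent talks only to its two ring neighbors."""
    return {i: {(i - 1) % n_agents, (i + 1) % n_agents} for i in range(n_agents)}


def star_topology(n_agents):
    """Centralized graph: a coordinator node ("hub") is linked to every agent."""
    graph = {i: {"hub"} for i in range(n_agents)}
    graph["hub"] = set(range(n_agents))
    return graph


def busiest_node_load(graph):
    """Number of neighbors the most loaded node exchanges messages with per round."""
    return max(len(neighbors) for neighbors in graph.values())


for n in (10, 1_000, 100_000):
    print(
        f"agents={n:>7}  "
        f"centralized busiest node={busiest_node_load(star_topology(n)):>7}  "
        f"decentralized busiest node={busiest_node_load(ring_topology(n))}"
    )
```

In the star graph the coordinator's load equals the number of agents, while in the ring every agent's load stays at two regardless of network size. A real deployment would use a richer graph than a ring, but the per-agent cost would still depend on the number of neighbors and not on the total number of agents.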

The Results

The Prowler team used OpenAI Gym to benchmark Diff-DAC against a group of state-of-the-art multi-agent reinforcement learning (MARL) algorithms. The experiments were based on classic scenarios such as Cart-Pole Balance or the Inverted Pendulum. In most cases, Diff-DAC was able to match, and usually outperform, the results obtained with the centralized architectures. Even when the centralized models were able to learn the policies faster than Diff-DAC, the latter exhibited better performance and less variance.

Decentralized learning models such as Diff-DAC are going to be key to implementing reinforcement learning scenarios at scale. The emergence of technologies such as blockchains and distributed ledgers, as well as improvements in security models such as homomorphic encryption, is helping to bring decentralized deep learning closer to reality. MARL scenarios seem like an obvious place to start.

Source: Deep Learning on Medium