Smart Incentives and Game Theory in Decentralized, Multi-Agent Reinforcement Learning Systems

Source: Deep Learning on Medium


Throughout history, humans have built many systems that require both autonomous actions and coordinated interactions among their participants. Traffic networks, smart grids and stock markets are examples of such systems that have become fundamental pillars of our societies. The essential characteristic of these systems is that they require their participants to execute autonomous tasks whose impact is felt in an environment shared with other participants. Recreating this type of dynamic in artificial intelligence (AI) agents is extremely challenging. One of the challenges lies in balancing the individual interests of AI agents with those of the entire group. Recently, researchers from AI solutions company Prowler published a paper that details an incentive model for the implementation of multi-agent AI systems.

The Prowler research focuses on a deep learning discipline known as multi-agent reinforcement learning (MARL), which has become the state of the art for the implementation of autonomous, multi-agent, self-learning systems.

Decentralized MARL

Multi-agent reinforcement learning (MARL) is the deep learning discipline that focuses on models in which multiple agents learn by dynamically interacting with a shared environment. While in single-agent reinforcement learning the state of the environment changes solely as a result of the actions of one agent, in MARL scenarios the environment is subjected to the actions of all agents. If we model a MARL environment as a set of agent-action pairs {X1-A1, X2-A2, …, Xn-An}, where Xm is a given agent and Am its action space, then the new state of the environment is the result of the joint action space A1 × A2 × … × An. In other words, the complexity of MARL scenarios grows with the number of agents in the environment.
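To make the joint-action idea concrete, here is a toy sketch (all class and variable names are illustrative, not from any real MARL library) in which the next state and every agent's reward depend on the full tuple of actions drawn from A1 × A2 × … × An:

```python
class MultiAgentGridEnv:
    """Toy MARL environment: the next state depends on the *joint* action
    (a_1, ..., a_n) of all agents, not on any single agent's action."""

    def __init__(self, n_agents, size=5):
        self.n_agents = n_agents
        self.size = size
        self.positions = [0] * n_agents  # shared state: one cell per agent

    def step(self, joint_action):
        # joint_action is one element of A_1 x A_2 x ... x A_n
        assert len(joint_action) == self.n_agents
        for i, a in enumerate(joint_action):  # each a in {-1, 0, +1}
            self.positions[i] = max(0, min(self.size - 1, self.positions[i] + a))
        # each agent's reward depends on everyone's position: a crowding
        # penalty of -1 per other agent sharing the same cell
        rewards = [1 - self.positions.count(p) for p in self.positions]
        return tuple(self.positions), rewards

env = MultiAgentGridEnv(n_agents=3)
state, rewards = env.step((1, 1, 0))  # two agents collide on cell 1
```

Even in this tiny example, the agents' rewards are coupled through the shared state, which is exactly what makes coordination hard as the number of agents grows.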

While MARL systems are intrinsically distributed, we can still distinguish two main types of architecture: centralized and decentralized. Centralized MARL models rely on a controlling authority to manage the rewards for each of the agents. This type of architecture is simpler to implement and makes it relatively trivial to coordinate goals across the different agents, but it is computationally expensive to operate at scale and, most importantly, discourages autonomy. After all, if the rewards of an agent are controlled by a centralized authority, we can't quite claim that the agent is autonomous, can we? 😉 That limitation puts centralized MARL models in direct contradiction with systems in which agents are incentivized to act autonomously. Think about stock markets, in which traders are motivated by individual gains but still need to be mindful of counterparty risk. That type of system is better suited to decentralized MARL models, in which agents act autonomously and coordination happens through incentives.

MARL scenarios have enjoyed their share of success in the last few months, with AI powerhouses like OpenAI building a system that can beat Dota 2 players and DeepMind doing the same for the Quake III game. However, in both scenarios the MARL environment only involved a small number of agents. Until now, MARL methods have struggled when applied to scenarios involving a large number of agents. As the number of agents in a MARL system increases, so does the complexity of the coordination between them. From that perspective, building an incentive model for large-scale MARL systems remains one of the biggest challenges for the implementation of these novel architectures.

Braess’ Paradox and Nash Equilibria

A way to illustrate the challenge of modeling incentives in MARL systems is a paradox outlined by German mathematician Dietrich Braess in 1968. Using the example of congested traffic networks, Braess explained that, counterintuitively, adding a road to a road network can impede its flow (e.g. increase the travel time of each driver); equivalently, closing roads can potentially improve travel times. The official statement of the paradox is as follows:

“For each point of a road network, let there be given the number of cars starting from it and the destination of the cars. Under these conditions, one wishes to estimate the distribution of traffic flow. Whether one street is preferable to another depends not only on the quality of the road, but also on the density of the flow. If every driver takes the path that looks most favourable to them, the resultant running times need not be minimal. Furthermore, it is indicated by an example that an extension of the road network may cause a redistribution of the traffic that results in longer individual running times.”
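The paradox can be reproduced with the standard numerical example usually used to illustrate it (the figures below come from that classic textbook network, not from the Prowler paper): 4,000 drivers travel from S to T over two symmetric routes, each consisting of one congestion-sensitive edge and one fixed-cost edge.

```python
def travel_time(n_drivers=4000):
    """Classic Braess network: n_drivers travel from S to T.
    Variable-cost edges take x/100 minutes (x = drivers on the edge);
    fixed-cost edges take 45 minutes."""
    # Without the extra road, the two symmetric routes S->A->T and S->B->T
    # split the traffic evenly at equilibrium.
    per_route = n_drivers / 2
    time_without = per_route / 100 + 45  # variable edge + fixed edge

    # Adding a zero-cost shortcut A->B makes the route S->A->B->T dominant
    # for every individual driver, so at equilibrium everyone takes it and
    # both variable edges carry the full load.
    time_with = n_drivers / 100 + 0 + n_drivers / 100

    return time_without, time_with

before, after = travel_time()
print(before, after)  # 65.0 minutes before the new road, 80.0 after
```

Every driver is responding optimally to the others in both cases, yet the "improved" network leaves everyone 15 minutes worse off.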

The Braess paradox seems to challenge the gold standard of multi-agent systems: the Nash equilibrium. Remember the 2001 movie A Beautiful Mind, in which Russell Crowe explained the foundations of the Nash equilibrium using a picturesque example of friends at a bar trying to gain the interest of an attractive woman:

If we all go for the blonde and block each other, not a single one of us is going to get her. So then we go for her friends, but they will all give us the cold shoulder because no one likes to be second choice. But what if none of us goes for the blonde? We won’t get in each other’s way and we won’t insult the other girls. It’s the only way to win. -A Beautiful Mind (2001)

What happens if we add a second beautiful blonde to that scenario? In theory, it should be an optimization, as now the group has more options. However, if the second blonde is more attractive than the original one, she might cause all the participants to compete against each other even more aggressively, causing even further delays (whatever that means in this scenario 😉). That's a textbook instance of the Braess paradox, which arises because the Nash equilibrium assumes that agents respond optimally to one another, which is not always the case in real-world multi-agent systems.
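The bar scenario can be turned into a tiny game and its pure-strategy Nash equilibria enumerated by brute force. The payoff numbers below are purely hypothetical, chosen to loosely follow the movie's logic (being the lone suitor of the blonde is best, approaching a friend is a safe second):

```python
from itertools import product

# Two friends, two choices: "B" = approach the blonde, "F" = approach a friend.
# Hypothetical payoffs: both pick B -> they block each other (0 each);
# B alone -> best outcome (3); F -> a safe 2 either way.
payoff = {
    ("B", "B"): (0, 0),
    ("B", "F"): (3, 2),
    ("F", "B"): (2, 3),
    ("F", "F"): (2, 2),
}

def pure_nash(payoff, actions=("B", "F")):
    """Brute-force pure-strategy Nash equilibria: profiles where no player
    can gain by unilaterally deviating."""
    eqs = []
    for a1, a2 in product(actions, repeat=2):
        u1, u2 = payoff[(a1, a2)]
        if all(payoff[(d, a2)][0] <= u1 for d in actions) and \
           all(payoff[(a1, d)][1] <= u2 for d in actions):
            eqs.append((a1, a2))
    return eqs

print(pure_nash(payoff))  # [('B', 'F'), ('F', 'B')]
```

Amusingly, under any payoffs like these, the outcome the movie celebrates (nobody goes for the blonde) is not a Nash equilibrium at all: either friend would profit by unilaterally deviating towards her.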

The Braess paradox is incredibly relevant to MARL architectures, since any optimization of the underlying model can impact the way intelligent agents react to it. From that perspective, MARL systems should gravitate towards states in which small changes in the incentives translate into disproportionately positive outcomes.

Smart Incentives

Prowler tackled the problem of optimizing incentives in MARL architectures using a novel approach that splits the problem into two parts. One part computes the agents’ best-response policies for a given set of reward functions. The other part finds the best set of modifications to the reward functions (or incentives) given the agents’ joint response. This approach decomposes the problem in a way that decentralizes the computation, because the agents compute their best-response policies themselves. It involves multi-agent reinforcement learning to compute the Nash equilibrium and Bayesian optimization to compute the optimal incentive, within a simulated environment.

The Prowler architecture uses both MARL and Bayesian optimization in a clever ensemble to optimize the incentives in the network of agents.

  • MARL is used to simulate the agents’ actions and produce the Nash equilibrium behavior by the agents for a given choice of parameter by the meta-agent.
  • Bayesian optimization is used to select the parameters of the game that lead to more desirable outcomes. Bayesian optimization is well suited to expensive, noisy objective functions, which matches the dynamics of the system.

Prowler’s smart incentive model relies on an incentive designer that chooses the reward functions within a simulated game, played by the agents, that models their joint behavior. The goal of the incentive designer is to modify the set of agent reward functions for the sub-game so as to induce behavior that maximizes system performance. Using feedback from the simulated sub-game in response to changes to the agents’ reward functions, the incentive designer can compute precisely the modifications to the agents’ rewards that produce desirable equilibria among the self-interested agents of the real-world game. The simulated environment avoids the need for costly acquisition of feedback data from real-world environments while ensuring that the generated agent behavior is consistent with real-world outcomes.
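A minimal sketch of this bi-level loop follows. Everything here is illustrative rather than Prowler's actual method: iterated best response stands in for the MARL equilibrium computation, and a simple grid search over candidate incentives stands in for Bayesian optimization. Agents pick one of two regions, prefer uncrowded ones, and the designer tunes a bonus that draws them towards an under-served region:

```python
def best_response_equilibrium(n_agents, bonus, iters=200):
    """Inner loop: self-interested agents repeatedly best-respond until the
    region counts stabilize -- a stand-in for the MARL equilibrium step."""
    counts = [n_agents, 0]  # everyone starts in region 0
    for _ in range(iters):
        moved = False
        for src in (0, 1):
            dst = 1 - src
            if counts[src] == 0:
                continue
            # reward of a region = base desirability - crowding + incentive
            r_src = 10 - counts[src] + (bonus if src == 1 else 0)
            r_dst = 10 - (counts[dst] + 1) + (bonus if dst == 1 else 0)
            if r_dst > r_src:  # a single agent profits by switching regions
                counts[src] -= 1
                counts[dst] += 1
                moved = True
        if not moved:  # no profitable deviation left: equilibrium reached
            break
    return counts

def design_incentive(n_agents=100, target=70, candidates=range(0, 81, 10)):
    """Outer loop: the incentive designer scores each candidate reward
    modifier in simulation (grid search standing in for Bayesian
    optimization) and keeps the one whose equilibrium best matches the
    desired occupancy of region 1."""
    def miss(bonus):
        counts = best_response_equilibrium(n_agents, bonus)
        return abs(counts[1] - target)
    return min(candidates, key=miss)
```

With no incentive the agents settle into an even 50/50 split, while a bonus of 40 shifts the equilibrium to the desired 30/70 distribution; the designer never dictates any agent's action, it only reshapes the rewards the agents respond to.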

Smart Incentives in Action

Prowler applied its smart incentive techniques to several fascinating MARL problems. In one scenario, a MARL model attempts to distribute 2,000 self-interested agents, each seeking to locate itself at desirable points in space over some time horizon. The desirability of a region changes with time and decreases with the number of agents located within its neighborhood. For example, if the agents are taxi drivers in a fleet, each driver (and their colleagues) may cluster around a football stadium when they know the game is due to finish and fans will need a lift home. While that behavior might benefit some individual drivers, it is conducive to traffic congestion and leaves other parts of the city without adequate taxi coverage.

Using the smart incentive model, the incentive designer introduces a reward modifier that incentivizes agents to adopt a desired distribution. The result is that the 2,000 drivers distributed themselves in an optimal manner, maximizing territory coverage.

MARL systems are one of the most fascinating areas of research in the deep learning space. As those architectures drive towards decentralization, the need for robust incentive models will become more relevant. Efforts like Prowler’s smart incentives are definitely a step in the right direction.