Decentralized Reinforcement Learning

Original article was published by Meet Gandhi on Artificial Intelligence on Medium


Detailed overview of a new paradigm in Reinforcement Learning

Many organizations in the world, such as biological ecosystems, governments and corporations, are physically decentralized yet unified in their functionality. For instance, a financial institution operates under a global policy of maximizing profit and therefore appears to be a single entity; however, this entity abstraction is an illusion, since a financial institution is composed of individual human agents solving their own optimization problems, with or without collaboration. In deep reinforcement learning, the process that maximizes the objective function parameterizes the policy as a function from states to actions, and the policy parameters are tuned using the gradients of the defined objective function. This approach is called the monolithic decision-making framework, because the policy function’s learnable parameters are coupled globally through a single objective function.

Note how actions are chosen passively by the agent in the monolithic decision-making framework.

Having covered a brief background of the centralized reinforcement learning framework, let us move on to some promising decentralized reinforcement learning frameworks.

Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

A class of decentralized RL algorithms that establishes the relationship between the society (the global policy, or super-agent) and its agents (the actions) at different levels of abstraction is proposed in [1]. At the highest level of abstraction, a societal decision-making framework is defined to relate the optimization problem solved locally by each agent to the optimization problem solved globally by the society. At each state, the agents bid in a local auction; the auction’s winner transforms the state and then sells the transformed state to the other agents, propagating a chain of local economic transactions. This raises the question of which characteristics of the auction mechanism and the society allow the global solution to emerge implicitly from the agents merely optimizing their auction utilities. At the second level of abstraction, this question is answered through the cloned Vickrey society, which ensures that the agents’ dominant strategy equilibrium matches the optimal policy of the society; the truthfulness property of the Vickrey auction, detailed in [2], guarantees this outcome. At the third level of abstraction, a class of decentralized reinforcement learning algorithms is proposed that uses the agents’ auction utilities as optimization objectives, thereby learning a societal policy that is global in space and time while keeping the credit assignment for the agents’ learnable parameters local in space and time. The fourth level of abstraction concerns implementing the cloned Vickrey society with the proposed decentralized reinforcement learning algorithms and arriving at an optimal set of design choices, titled the credit-conserving Vickrey implementation, that performs best at both the global and local levels.

Note how actions actively choose when to activate in the societal decision-making framework.

Societal Decision-making

Let us set up the societal decision-making framework by relating Markov Decision Processes (MDPs) and auction mechanisms under a unifying notation.

In the MDP (global) environment, the input space is the state space S and the output space is the action space A. An agent represents a policy π: S → A. The transition function Τ: S × A → S, the reward function r: S × A → ℝ and the discount factor γ are used to specify an objective function as follows:
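The referenced equation is not reproduced above; in standard notation, the discounted objective presumably takes the form:

```latex
J(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad a_t = \pi(s_t),\; s_{t+1} = T(s_t, a_t)
```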

At a given state s, the agent performs the task of maximizing J(π) and hence finding the optimal action defined as follows:
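The equation itself is missing here; the optimal action at state s is presumably the greedy action under the optimal Q-function:

```latex
a^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a),
\qquad Q^{*}(s, a) = r(s, a) + \gamma \max_{a'} Q^{*}\big(T(s, a), a'\big)
```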

In the auction (local) environment, the input space is the single auction item s and the output space is the bidding space 𝔅. Each of the N agents bids for the auction item in competition using its respective bidding policy ψᵢ: {s} → 𝔅. Let b be the vector of bids made by the N agents. The utility for each agent can then be defined using vₛ, the vector of each agent’s valuation of item s, and the auction mechanism, which comprises the allocation rule X: 𝔅ⁿ → [0,1]ⁿ and the pricing rule P: 𝔅ⁿ → ℝⁿ. The utility function can thus be specified as follows:
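The utility equation is not reproduced; from the definitions above, for each agent i it presumably reads:

```latex
u_i(b) = v_{s,i}\, X_i(b) - P_i(b)
```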

In auction theory, an auction is Dominant Strategy Incentive Compatible (DSIC) if, independent of the other players’ bidding strategies, bidding one’s own valuation is optimal. The Vickrey auction [2], which sets Pᵢ(b) to the second-highest bid and Xᵢ(b) = 1 if agent i wins and 0 if agent i loses, is DSIC, meaning that the dominant strategy equilibrium occurs when every agent bids truthfully. This maximizes the social welfare ∑ᵢ vᵢ·Xᵢ(b), thereby selecting the bidder with the highest valuation. Because the Vickrey auction is DSIC, agents need not take other agents into account when performing their optimization, so a decentralized reinforcement learning algorithm built on the Vickrey auction operates non-cooperatively.
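The second-price mechanism and its truthfulness property can be sketched in a few lines (a minimal illustration, not the paper’s implementation; function names are ours):

```python
# Minimal sketch of a Vickrey (second-price, sealed-bid) auction, matching
# the utility u_i = v_i * X_i(b) - P_i(b) described above. Requires >= 2 bids.

def vickrey_auction(bids):
    """Return (winner index, price paid): the winner pays the second-highest bid."""
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner = order[0]
    price = bids[order[1]]  # second-highest bid
    return winner, price

def utility(valuation, bid, other_bids):
    """Utility of an agent (index 0) bidding `bid` against fixed `other_bids`."""
    winner, price = vickrey_auction([bid] + list(other_bids))
    return valuation - price if winner == 0 else 0.0

# Truthful bidding (bid == valuation) is a dominant strategy: deviating
# never increases utility, regardless of the other bids.
valuation, others = 10.0, [6.0, 8.0]
truthful = utility(valuation, valuation, others)  # wins, pays 8 -> utility 2.0
overbid = utility(valuation, 12.0, others)        # still wins, still pays 8
underbid = utility(valuation, 7.0, others)        # loses -> utility 0.0
```

Note that overbidding changes nothing (the price is set by others’ bids) and underbidding can only forfeit a profitable win, which is exactly the DSIC property.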


“A primitive (agent) ω is a tuple (ψ, ϕ), where ψ: S→𝔅 represents the bidding policy and ϕ: S→S represents the transformation. A society is a group of primitives denoted as ω^{1:N}.

Optimal bidding (as described in [3]): Assuming that at each state s the local auction allocates Xᵢ(b)=1 if agent i wins and Xᵢ(b)=0 if agent i loses. Then all primitives ω^i bidding their respective optimal societal Q-values as a whole produce an optimal global policy.

Dominant Strategies for Optimal Bidding (as described in [1]): If the valuations vᵢ for each state s are the most favourable societal Q-values, then the optimal global policy of the society coincides with the distinctive dominant strategy equilibrium of the primitives under the Vickrey mechanism.” — [1]
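The optimal-bidding result above can be illustrated with a toy sketch (this is not the authors’ implementation; the toy chain MDP, the hand-set Q-values, and all names are hypothetical, and the second-price payment is omitted for brevity):

```python
# Each primitive bids its societal Q-value for the current state; the highest
# bidder wins the auction and its transformation produces the next state.

def run_society(state, primitives, q_values, steps):
    """primitives: list of transformation functions phi_i;
    q_values: dict mapping (state, i) -> primitive i's Q-value bid."""
    trajectory = [state]
    for _ in range(steps):
        bids = [q_values.get((state, i), 0.0) for i in range(len(primitives))]
        winner = max(range(len(bids)), key=lambda i: bids[i])  # highest bid wins
        state = primitives[winner](state)  # winner transforms the state
        trajectory.append(state)
    return trajectory

# Toy chain MDP on integers: two primitives, "increment" and "decrement".
primitives = [lambda s: s + 1, lambda s: s - 1]
# Hand-set Q-values that always favour incrementing.
q = {(s, 0): 1.0 for s in range(10)}
```

Running `run_society(0, primitives, q, 3)` walks the chain upward, since the incrementing primitive always holds the higher Q-value and therefore wins every auction.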

Economic Transactions Perspective

The market economy perspective on the chain of local economic transactions.

Hence, currency in this system is grounded in the rewards received, and wealth is redistributed according to what future primitives decide to bid for the fruits of the information-processing labour undertaken by past primitives in transforming one state into another.

The cloned Vickrey society. Source: Figure 1 in [1]
A succinct explanation of the preceding figure describing the cloned Vickrey society mechanism. Source: Figure 1 in [1]

On-Policy Decentralized Reinforcement Learning Algorithm

Source: Section 6 in [1]

Having covered the inner workings of the decentralized reinforcement learning algorithm built on the cloned Vickrey auction mechanism, readers who want to know more about the experiments performed with the algorithm should refer to Section 7, titled Experiments, of [1].

The decentralized reinforcement learning algorithm described in [1] is non-cooperative owing to its use of the cloned Vickrey society. In the succeeding section, we will get familiar with another decentralized reinforcement learning algorithm, this time in a cooperative control setting.

Deep Decentralized Reinforcement Learning for Cooperative Control

Formal Problem Definition

Given a discrete-time system f: X × U → X controlled by N agents, denoted as

Source: Equation (1) in [4]
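Equation (1) is not reproduced above; given the definitions that follow, it presumably takes the form:

```latex
x_{k+1} = f(x_k, u_k),
\qquad u_k = (u_{1,k}, \dots, u_{N,k}) \in U
```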

where xₖ ∈ X ⊆ ℝⁿ denotes the state of the system at time step k,
u_{i,k} ∈ Uᵢ ⊆ ℝ the control of agent i ∈ ℕ = {1,…, N}, and U = U₁ × U₂ × … × U_{N} the joint control space. Each agent i ∈ ℕ receives a reward r_{i,k} from his own reward function gᵢ: X × U → ℝ based on the current state xₖ and the joint control uₖ.

Source: Equation (2) in [4]
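Equation (2) is likewise not reproduced; from the surrounding definitions it presumably reads:

```latex
r_{i,k} = g_i(x_k, u_k)
```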

The objective of each agent is to adapt his control law πᵢ: X → Uᵢ in order to maximize his long-term discounted reward under the tuple of control laws
Π = (π₁, π₂,…, π_{N}), with γᵢ ∈ [0,1) as the discount factor:

Source: Equation (3) in [4]
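Equation (3) is not shown; the long-term discounted reward of agent i presumably takes the standard form:

```latex
V_i^{\Pi}(x_0) = \sum_{k=0}^{\infty} \gamma_i^{\,k}\, g_i(x_k, u_k),
\qquad u_{j,k} = \pi_j(x_k)
```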

Hence, the deterministic game is defined by the tuple
G = (X, f, U₁,…, U_{N}, g₁,…, g_{N}, γ₁,…, γ_{N}), and the problem can be stated as:

Problem Statement

“Given the game G, each agent i∈ N knows the system dynamics f and his own reward function gᵢ. Furthermore, each agent i∈ N receives his current reward r_{i,k} at time step k and is able to deduce the previous controls u_{j,k−1},
∀ j∈ N\{i} of other agents but has no access to the current controls u_{j,k}, other agents’ control laws πⱼ, their reward functions gⱼ or actual rewards. In this setting, each agent i∈ N aims at adapting his control law πᵢ in order to maximize Vᵢ^{π} as defined above.” — [4]

Decentralized Cooperative Control Method

To solve the problem posed in the previous section, Deep Decentralized Reinforcement Learning for Cooperative Control proposes a time-dependent Temporal Experience Replay (TER) to account for the non-stationary environment composed of several agents, instils the known dynamics of the system through Imagined Experience Replay (IER), and uses variable learning rates with Impact Q-Learning (IQL) to induce coordination. These mechanisms are modular, so let us first introduce each of them and then combine them.

Temporal Experience Replay (TER)

The basic notion behind Temporal Experience Replay is to favour more recent experiences when sampling from the replay buffer, using a prioritization factor analogous to that of Prioritized Experience Replay. Since the focus here is on recency, we introduce a temporal prioritization factor τ based on the time that has passed since a state transition was collected.

Source: Equation (7) and (8) in [4]

Hence, a new experience tuple is generated by appending the current time step, so that the temporal prioritization factors above can be computed.

Source: Equation (9) in [4]

However, the TER method described above leads to two major problems in practice: (1) biasing too heavily towards recent experiences may cause the agent’s policy to overfit, and (2) since τ must be recomputed for every experience tuple at every time step, the computational complexity of the sampling process grows, making TER infeasible in practice. To overcome these issues, a two-step sampling process is introduced. First, a macro-batch 𝔅 of size B is drawn uniformly at random from the entire experience replay buffer M. Then, a mini-batch T of size t, with t ≪ B, is sampled from 𝔅 using the temporally prioritized probabilities described above. This solves both problems: computing temporal priorities only within 𝔅 of size B reduces the computational complexity of sampling, while drawing the macro-batch uniformly at random prevents recent experiences from being overemphasized in the agent’s policy.

Moreover, TER must also account for the agents’ varying exploration rates εₖ at different stages of training, in order to adapt to the other agents’ policy changes: once the other players’ policies start to converge, it is more effective to lower the exploration. Hence, an additional exploration-rate dependency of B is proposed, yielding a time-dependent macro-batch size Bₖ. Early in training, i.e. εₖ ≈ 1, experiences should be sampled uniformly at random, which is achieved by setting Bₖ close to the mini-batch size t; later in training, i.e. as εₖ → 0, Bₖ should approach the final macro-batch size B. This yields the following macro-batch size:

Source: Equation (10) in [4]
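The two-step sampling can be sketched as follows. Since the exact temporal priorities of Eqs. (7)–(8) and the interpolation of Eq. (10) are not reproduced in the text, the recency weighting and the macro-batch-size formula below are illustrative assumptions, not the paper’s equations:

```python
import random

def macro_batch_size(B, t, eps):
    """Assumed interpolation: near the mini-batch size t when eps ~ 1 (early
    training), approaching the final macro-batch size B as eps -> 0."""
    return round(t + (1.0 - eps) * (B - t))

def ter_sample(memory, B, t, eps, now):
    """memory: list of (timestep, experience) tuples.
    Step 1: uniform macro-batch; step 2: recency-prioritized mini-batch."""
    Bk = macro_batch_size(B, t, eps)
    macro = random.sample(memory, min(Bk, len(memory)))  # uniform macro-batch
    weights = [1.0 / (1 + now - ts) for ts, _ in macro]  # favour recent (assumed form)
    return random.choices(macro, weights=weights, k=t)   # prioritized mini-batch
```

Temporal priorities are thus only ever computed for the Bₖ tuples in the macro-batch rather than for the whole buffer, which is the source of the claimed complexity reduction.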

Imagined Experience Replay (IER)

The modifications TER makes to experience replay stabilize the training process, making the agents less sensitive to varying environmental dynamics, while avoiding learning that is biased too strongly towards recent experiences; together these keep the agents adaptable to the non-stationary environmental dynamics. So far we have implicitly assumed that the environmental dynamics are decoupled from the other agents’ behaviour, which is generally reasonable given that the agents are independent and decentralized. However, since the system dynamics are known, it is possible to render the environment stationary by marginalizing out the other agents’ controls. This is the basic notion behind Imagined Experience Replay (IER). Note that IER is used purely to simulate experiences that might not occur under the normal operation of the environment. The regular experience tuple mentioned earlier can be replaced, using the underlying system dynamics f and reward function gᵢ, as follows:

Source: Equation (12) in [4]

where u_{-i} = (u₁,…, u_{i-1}, u_{i+1},…, u_{N}), i.e. the controls of all agents other than agent i. An imagined experience can then be simulated by replacing the other agents’ controls u_{-i} with 0 as follows:

Source: Equation (13) in [4]

Imagined experiences are not actually stored in the experience replay memory; instead, an exploration-rate-dependent probability determines whether an imagined experience is computed in addition to the observed one. Early in the training process, agents mostly explore with random controls, so it is impossible to deduce the other agents’ policies from sampled observations; this is when applying IER to stabilize training is most useful, because IER simulates experiences in which only agent i interacts with the environment, free of the exploration noise of the other agents. Consequently, the probability of simulating an imagined experience at the current time step k is proportional to the current exploration rate εₖ, i.e. P(k) ∼ εₖ.
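The construction of a single imagined experience per Eq. (13) can be sketched as below, assuming access to the known dynamics f and reward function gᵢ; the exact tuple layout in [4] may differ, and `f`/`g_i` here are placeholders for the known model:

```python
def imagined_experience(x, u_i, i, n_agents, f, g_i, k):
    """Build agent i's imagined transition with all other controls zeroed."""
    u = [0.0] * n_agents   # u_{-i} := 0, so only agent i appears to act
    u[i] = u_i
    x_next = f(x, u)       # imagined next state from the known dynamics
    r_i = g_i(x, u)        # imagined reward for agent i
    return (x, u_i, r_i, x_next, k)
```

With toy dynamics such as `f = lambda x, u: x + sum(u)`, the imagined transition depends only on agent i’s own control, which is exactly how IER removes the other agents’ exploration noise.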

Later in the training process, the agents’ policies begin to converge, and coordination between the agents and adaptation to the partners’ policies matter more than exploration. Hence, in the later stages IER can be used to induce coordination between agents by generating experiences that simulate cooperation; the corresponding sampling probability is proposed as P_{coord}(k) ∼ (1 − εₖ). To induce coordination, imagined experiences must be generated through IER such that the final algorithm covers mixed cooperative-competitive task types. The following three scenarios are proposed to achieve this:

First scenario: Control of agent i is discarded, so that agent i can observe how other agents behave by themselves and whether the resulting environment transitions are beneficial. Source: Equation (14) in [4]
Second scenario: Control of agent i is set to be equal to the average of all other agents’ joint controls. Source: Equation (15) in [4]
Third scenario: Control of all other agents is set to be equal to the control of agent i. Source: Equation (16) in [4]

By generating imagined experiences from these three scenarios, the prospective possibilities for coordination are exhausted. The first scenario lets the agents evaluate whether remaining idle is acceptable, while the second and third compute the effect of agent i replicating the average control of the other agents, and of all other agents adopting the control of agent i, respectively.
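The three scenarios can be sketched compactly (a hedged illustration of Eqs. (14)–(16): zeroing agent i’s control in the first scenario and the list layout are our assumptions, not the paper’s exact formulation):

```python
def coordination_controls(controls, i):
    """Return the three imagined joint controls for agent i."""
    n = len(controls)
    others = [u for j, u in enumerate(controls) if j != i]
    avg = sum(others) / len(others)
    idle = [0.0 if j == i else u for j, u in enumerate(controls)]   # (14): drop u_i
    coop1 = [avg if j == i else u for j, u in enumerate(controls)]  # (15): u_i := mean of others
    coop2 = [controls[i]] * n                                       # (16): all copy u_i
    return idle, coop1, coop2
```

Each of the three joint controls would then be pushed through the known dynamics and reward function, exactly as in the basic IER construction, to produce the imagined experiences χ_{idle}, χ_{coop1} and χ_{coop2}.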

Impact Q-Learning (IQL)

The previous modifications to experience replay combated the non-stationary environment problem by stabilizing it. An additional mechanism often used in multi-agent reinforcement learning is variable learning rates. Deep Decentralized Reinforcement Learning for Cooperative Control proposes a novel approach to multi-agent credit assignment using variable learning rates, tying the learning rate to an agent’s individual contribution to the retrospectively observed state transitions. To this end, a quantity called the impact factor is introduced, which quantifies the agent’s contribution to the joint controls and hence to the observed state transitions.

Impact factor for agent i at time step k. Source: Equation (17) in [4]

Assuming that in IQL the agents share the same control space U₁ = … = U_{N} and that all agents’ controls influence the system equally, the update rule can be written as follows:

IQL update rule. Source: Equation (18) in [4]

where 0 ≪ β ≪ σ ≪ α ≪ 1. Under this update rule, when a suitably positive experience is observed, the Q-value estimate is increased strongly only if the agent can be credited for the occurrence of the event. Conversely, when a negative experience occurs, the agent is strongly discouraged from revisiting the corresponding state-action pair only if it is at least partly responsible. This accountability-driven learning behaviour is exactly what agents need to overcome the credit assignment challenge in a multi-agent setting.
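The idea behind the impact-dependent learning rate can be sketched as follows. Equations (17)–(18) are not reproduced in the text, so the impact threshold, the concrete rates, and the rule for picking between α, σ and β below are illustrative assumptions, not the paper’s exact update:

```python
def iql_learning_rate(impact, td_error, alpha=0.1, sigma=0.01, beta=0.001):
    """Assumed selection rule, respecting 0 << beta << sigma << alpha << 1:
    a large update only when the agent plausibly caused the outcome."""
    if td_error >= 0:
        # positive experience: credit the agent only under high impact
        return alpha if impact > 0.5 else sigma
    # negative experience: discourage strongly if at least partly responsible
    return alpha if impact > 0.5 else beta

def iql_update(q, impact, td_error):
    """One Q-value update with the impact-dependent learning rate."""
    return q + iql_learning_rate(impact, td_error) * td_error
```

The key property is the asymmetry: a low-impact agent barely moves its Q-value for outcomes it did not cause, while a high-impact agent receives the full update in either direction.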


After covering the modular components TER, IER and IQL, we are now ready to present the final algorithm:

Deep Decentralized RL Algorithm for cooperative control. Source: [4]

The only element added here and not explained in the earlier sections is the coordination coefficient ψ_{i,k} = sgn(). If ψ_{i,k} comes out to be 1, agent i and the others are said to be coordinating positively, so it is unnecessary to simulate the coordination experiences χ_{idle}, χ_{coop1} and χ_{coop2} described under IER. Moreover, the learning rate σ, which is normally used for λ_{high} ≥ λ_{i,k} ≥ λ_{low}, is replaced by the larger learning rate α; agents are thereby encouraged to emphasize cooperative experiences during training. If the agents coordinate negatively, ψ_{i,k} equals −1. In that case, in addition to the sampled experience χₖ, the artificial imagined experiences χ_{idle}, χ_{coop1} and χ_{coop2} are simulated, and the agent trains on all of them. Here the lowest learning rate β is employed, because the experiences on which the agent trains never actually happened but are imagined solely for the purpose of coordination. Thus, the computationally strenuous task of simulating many coordination experiences is greatly reduced and focused mostly on the occurrences promising the highest expected training progress.

To learn more about the experiments and simulations performed with this algorithm, I would urge you to refer to Section 4, titled Results, in [4].

At this point, we have covered two promising methodologies aiming for Decentralized Reinforcement Learning, one in a non-cooperative setting (“Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions”) and another one in a cooperative setting (“Deep Decentralized Reinforcement Learning for Cooperative Control”).


  1. Chang, Michael, et al. “Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions.” arXiv preprint arXiv:2007.02382 (2020).
  2. Vickrey, William. “Counterspeculation, Auctions, and Competitive Sealed Tenders.” The Journal of Finance, vol. 16, no. 1, 1961, pp. 8–37. JSTOR.
  3. Baum, Eric B. “Toward a Model of Mind as a Laissez-Faire Economy of Idiots.” ICML, 1996, pp. 28–36.
  4. Köpf, Florian, et al. “Deep Decentralized Reinforcement Learning for Cooperative Control.” arXiv preprint arXiv:1910.13196 (2019).