# LEARNING WHEN TO COMMUNICATE AT SCALE IN MULTIAGENT COOPERATIVE AND COMPETITIVE TASKS (ICLR 2019)

## Introduction

Recent work shows that continuous communication enables efficient learning with backpropagation in multiagent scenarios, but it is restricted to fully cooperative tasks (with a global reward). This work proposes a model named IC3Net that controls continuous communication with a gating mechanism and uses individualized rewards for each agent, achieving better performance and scalability while fixing credit-assignment issues.

The main contribution of this work is a gating mechanism that lets agents select which targets to communicate with. As the authors explain, this mirrors real life, where humans choose whom to communicate with in various scenarios. Accordingly, the model can work in any kind of scenario, whether competitive or cooperative. Additionally, the model is shown to perform far better than other models as the number of agents increases.

## Settings

This model can handle almost all categories of scenarios: cooperative, mixed-cooperative, and competitive. In a cooperative scenario, all agents work together to maximize a global reward; in a mixed-cooperative scenario, agents cooperate but each maximizes its own reward; in a competitive scenario, agents compete with each other to improve their own rewards.
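As a toy illustration (my own, not from the paper — the actual reward functions are task-specific), the three reward structures above can be contrasted for a set of agents with given per-agent payoffs:

```python
# Hypothetical reward structures for J agents, given per-agent payoffs.
# These are illustrative only; the paper's tasks define their own rewards.

def cooperative_rewards(payoffs):
    """Global reward: every agent receives the sum of all payoffs."""
    total = sum(payoffs)
    return [total] * len(payoffs)

def mixed_rewards(payoffs):
    """Mixed-cooperative: each agent maximizes its own payoff."""
    return list(payoffs)

def competitive_rewards(payoffs):
    """Zero-sum style: an agent's payoff is offset by the others' average."""
    n = len(payoffs)
    return [p - (sum(payoffs) - p) / (n - 1) for p in payoffs]

payoffs = [1.0, 2.0, 3.0]
coop = cooperative_rewards(payoffs)   # all agents share one return
comp = competitive_rewards(payoffs)   # gains of one offset the others
```

The point of the contrast: under the cooperative (global) reward every agent sees the same scalar, which is what makes credit assignment hard; the individualized rewards in the mixed and competitive cases avoid that.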

## Model

Before presenting the IC3Net model, an independent controller model is introduced. The policy of each agent $j$ is given by the equations below:

$h_{j}^{t+1}, s_{j}^{t+1} = \text{LSTM}(e(o_{j}^{t}), h_{j}^{t}, s_{j}^{t}) \tag{1}$

$a_{j}^{t} = \pi(h_{j}^{t}), \tag{2}$

where $o_{j}^{t}$ is the observation of agent $j$ at time $t$, $e(\cdot)$ is an encoder function parameterized by a fully-connected neural network, and $\pi$ is the agent's action policy. Additionally, $h_{j}^{t}$ and $s_{j}^{t}$ are the hidden and cell states of the LSTM, respectively.

Although each agent is controlled by its own LSTM, the models are identical and the parameters are shared across agents. This avoids sensitivity to the particular ordering (permutation) of the agents.
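As a rough sketch (my own, not the paper's code), the independent controller's per-step computation — encode the observation, update the LSTM state, then apply the action policy — might look like the following, with arbitrary placeholder dimensions and randomly initialized shared weights:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hid, n_actions = 4, 8, 3

# One shared set of parameters, used by every agent.
W_enc = rng.normal(0, 0.1, (hid, obs_dim))       # encoder e(.)
W_lstm = rng.normal(0, 0.1, (4 * hid, 2 * hid))  # LSTM gate weights
b_lstm = np.zeros(4 * hid)
W_pi = rng.normal(0, 0.1, (n_actions, hid))      # action policy head pi

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, s):
    """Standard LSTM cell: input, forget, output, candidate gates."""
    z = W_lstm @ np.concatenate([x, h]) + b_lstm
    i, f, o, g = np.split(z, 4)
    s_new = sigmoid(f) * s + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(s_new)
    return h_new, s_new

def policy_step(o, h, s):
    """One controller step: h, s = LSTM(e(o), h, s); action probs = pi(h)."""
    h, s = lstm_step(W_enc @ o, h, s)
    logits = W_pi @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return h, s, probs

# Each agent keeps its own (h, s), but all agents reuse the same weights.
h = np.zeros(hid)
s = np.zeros(hid)
o = rng.normal(size=obs_dim)
h, s, probs = policy_step(o, h, s)
```

Parameter sharing is what makes the controller permutation-invariant: swapping two agents only swaps their `(h, s)` states, not the function being applied.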

IC3Net extends the independent controller model by allowing agents to communicate their internal state, gated by a discrete action. The policy of each agent $j$ is given by the equations below:

$g_{j}^{t+1} = f^{g}(h_{j}^{t}) \tag{3}$

$h_{j}^{t+1}, s_{j}^{t+1} = \text{LSTM}(e(o_{j}^{t}) + c_{j}^{t}, h_{j}^{t}, s_{j}^{t}) \tag{4}$

$c_{j}^{t+1} = \frac{1}{J-1}\, C \sum_{j' \neq j} h_{j'}^{t+1} \odot g_{j'}^{t+1} \tag{5}$

$a_{j}^{t} = \pi(h_{j}^{t}), \tag{6}$

where $c_{j}^{t}$ is the communication vector for agent $j$, $C$ is a linear transformation matrix that maps the gated average hidden state to a communication tensor, $J$ is the number of agents currently alive in the system, and $f^{g}(\cdot)$ is a simple network consisting of a softmax layer over 2 actions (communicate or not) on top of a linear layer with a non-linearity. The binary action $g_{j}^{t}$ specifies whether agent $j$ wants to communicate with the others, acting as a gate when the communication vector is computed. Both the action policy $\pi$ and the gate function $f^{g}$ are trained with REINFORCE (policy gradient).
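A minimal numpy sketch of the gated averaging in equation (5) — dimensions, weights, and gate values here are arbitrary placeholders, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
J, hid = 4, 8
C = rng.normal(0, 0.1, (hid, hid))   # linear transformation matrix C
H = rng.normal(size=(J, hid))        # hidden states h_{j'}^{t+1}, one row per agent
g = np.array([1, 0, 1, 1])           # binary gate actions g_{j'}^{t+1}

def comm_vectors(H, g, C):
    """Eq. (5): c_j = C * (1/(J-1)) * sum over j' != j of g_{j'} * h_{j'}."""
    J = H.shape[0]
    gated = H * g[:, None]                 # zero out non-communicating agents
    total = gated.sum(axis=0)
    # For each agent j, exclude its own (gated) hidden state from the average.
    avg = (total[None, :] - gated) / (J - 1)
    return avg @ C.T                       # apply C row-wise: one c_j per agent

c = comm_vectors(H, g, C)                  # shape (J, hid)
```

Note how agent 1 (gate 0) still *receives* communication from the others; its gate only prevents it from *sending* its hidden state into the average.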

In [1], the fully interconnected agents are regarded as one big network trained with a global reward to make decisions for all agents. This work instead conceptually allocates each agent its own big network, with parameters shared across the networks. Each network consists of multiple LSTMs, each processing the observation of a single agent, but only one LSTM in each network needs to output an action, since the network controls only one agent. Appealing as this view is, it is difficult to implement directly. Therefore, in practice, a single big network is trained, but with an individual reward for each agent rather than the global reward of [1].
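What replacing the global return with individual returns means for training can be sketched with a REINFORCE gradient on a softmax action head (`reinforce_grad` is a hypothetical helper, not from the paper; a real implementation would also use a baseline):

```python
import numpy as np

def reinforce_grad(logits, action, agent_return):
    """Gradient of -log pi(a) * R w.r.t. the logits of a softmax policy.

    In [1], `agent_return` would be the same global return for every
    agent; here each agent supplies its own individual return.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = probs.copy()
    grad[action] -= 1.0          # d(-log pi(a)) / d(logits)
    return grad * agent_return   # scale by this agent's own return

logits = np.array([0.2, -0.1, 0.5])
grad = reinforce_grad(logits, action=2, agent_return=1.5)
```

Because each agent's gradient is scaled by its own return, an agent that contributed nothing no longer gets credit for the whole team's reward — which is exactly the credit-assignment fix described above.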

To sum up, there are two benefits of this model:

1. Because each agent can choose whether and with whom to communicate, the model works in cooperative, mixed-cooperative, and competitive scenarios alike.
2. Thanks to the individual reward for each agent, the credit assignment problem (how to attribute each agent's contribution under a global reward) is resolved.

The overall architecture of the model is shown in the figure below:

## Reference

[1] Sukhbaatar, Szlam, and Fergus. "Learning Multiagent Communication with Backpropagation." NIPS 2016.

[2] Singh, Jain, and Sukhbaatar. "Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks." ICLR 2019.

Written on January 30, 2019