Learning Attentional Communication for Multi-Agent Cooperation (NIPS 2018)

Introduction

Communication can help multiple agents learn to cooperate with each other. However, when the number of agents is large, some of the shared information may be useless and may even impair the learning of cooperation. For this reason, this work proposes an attentional communication model that filters out unnecessary shared information for cooperative decision making. Inspired by recurrent models of visual attention, this work introduces an attention unit that receives the encoded local observation and action intention of an agent and determines whether the agent needs to cooperate with others in its observable field. If so, the agent, called an initiator, selects collaborators to form a communication group.

A bidirectional LSTM is used as the communication channel connecting the agents within a communication group. The LSTM unit takes their hidden states (i.e. the encodings of local observation and action intention) and returns thoughts that guide the agents toward coordinated strategies. Unlike CommNet [1] and BiCNet [2], which take the arithmetic mean and weighted mean of hidden states respectively, this LSTM unit selectively outputs the information critical for cooperative decision making, enabling dynamic communication.

This is claimed to be the first work that successfully applies attentional communication to multi-agent reinforcement learning.

Settings

The model is shown to be successful in cooperative environments (i.e. cooperation with a global reward), mixed cooperative environments (i.e. cooperation with local rewards), and competitive environments (i.e. competition with local rewards).

All agents share the parameters of the policy network, the critic network, the attention unit, and the communication channel. Therefore, the model can easily be deployed to large-scale multi-agent systems.

Model

First, each agent is assumed to live in a partially observable environment, where each agent $i$ receives a local observation $o_{t}^{i}$ correlated with the state $s_{t}$ at time $t$. A policy network takes the local observation as input and extracts a hidden layer as a thought, which encodes both the local observation and the action intention, represented as $h_{t}^{i} = \mu_{I}(o_{t}^{i}; \theta^{\mu})$. Every $T$ time steps, the attention unit takes $h_{t}^{i}$ as input and determines whether communication is necessary. If so, the agent, called an initiator, selects other agents, called collaborators, in its observable field to form a communication group. The communication group stays the same for $T$ time steps, and the value of $T$ can be tuned as a hyperparameter controlling when and for how long the attention unit lets agents communicate. The communication channel connects the agents of the communication group, takes the thought of each agent as input, and outputs an integrated thought that guides the agents to generate coordinated actions. The integrated thought $\tilde h_{t}^{i}$ is merged with $h_{t}^{i}$ and fed into the rest of the policy network, which then returns the action $a_{t}^{i} = \mu_{II}(h_{t}^{i}, \tilde h_{t}^{i}; \theta^{\mu})$. Here $\mu_{I}$ and $\mu_{II}$ are the two partitions of the actor (policy) network.
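As a concrete illustration, here is a minimal PyTorch sketch of this forward pass. The module names (`ActorPartI`, `ActorPartII`), the layer sizes, and the zero integrated thought for a non-communicating agent are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

OBS_DIM, THOUGHT_DIM, ACT_DIM = 32, 64, 5  # illustrative sizes

class ActorPartI(nn.Module):
    """mu_I: encodes the local observation o_t^i into a thought h_t^i."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(),
            nn.Linear(128, THOUGHT_DIM))

    def forward(self, obs):
        return self.net(obs)

class ActorPartII(nn.Module):
    """mu_II: maps the thought merged with the integrated thought to an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * THOUGHT_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACT_DIM), nn.Tanh())

    def forward(self, h, h_tilde):
        # h_tilde is assumed to be zeros for an agent that is not communicating.
        return self.net(torch.cat([h, h_tilde], dim=-1))

mu_I, mu_II = ActorPartI(), ActorPartII()
obs = torch.randn(OBS_DIM)
h = mu_I(obs)                       # thought h_t^i
h_tilde = torch.zeros(THOUGHT_DIM)  # replaced by the channel output when communicating
action = mu_II(h, h_tilde)          # a_t^i = mu_II(h_t^i, h~_t^i)
```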

As for the attention model, the attention unit considers only the encoding of the local observation and the action intention of an agent to decide whether communication would benefit cooperation. In detail, the attention unit takes the thought representation as input and produces the probability that the agent should attend to the agents within its observable field (i.e. the probability of communication). Unlike existing work such as CommNet and BiCNet, which always communicates among all agents, this work conducts communication dynamically, only when it is necessary.
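A minimal sketch of such an attention unit, assuming it is a small MLP classifier over the thought with a 0.5 decision threshold (both the architecture and the threshold are assumptions, not specified here):

```python
import torch
import torch.nn as nn

THOUGHT_DIM = 64  # must match the actor's thought size

class AttentionUnit(nn.Module):
    """Maps a thought h_t^i to the probability that communication is needed."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(THOUGHT_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, h):
        return self.net(h).squeeze(-1)

attn = AttentionUnit()
h = torch.randn(THOUGHT_DIM)
becomes_initiator = attn(h) > 0.5  # evaluated once every T time steps
```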

Communication

When an initiator selects its collaborators, it only considers the agents in its observable field and ignores those it cannot perceive. Within the observable field of an initiator, there exist three types of agents: other initiators, agents that have already been selected by other initiators, and agents that have not been selected.

In this work, the communication bandwidth is restricted, which means each initiator can select only $m$ collaborators. The initiator first chooses collaborators from agents that have not been selected, then from agents selected by other initiators, and finally from other initiators. When an agent is selected by multiple initiators, it participates in more than one communication group.
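This selection priority could be sketched as follows; `select_collaborators`, the agent objects, and their `is_initiator`/`is_selected` flags are hypothetical bookkeeping for illustration, not the paper's API.

```python
def select_collaborators(initiator, observable_agents, m):
    """Pick up to m collaborators from the initiator's observable field,
    preferring agents not yet in any group, then agents already selected
    by other initiators, and finally other initiators."""
    others = [a for a in observable_agents if a is not initiator]
    unselected = [a for a in others if not a.is_initiator and not a.is_selected]
    taken      = [a for a in others if not a.is_initiator and a.is_selected]
    initiators = [a for a in others if a.is_initiator]
    group = (unselected + taken + initiators)[:m]
    for a in group:
        a.is_selected = True  # an agent may end up in several groups
    return group
```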

If an agent $k$ is selected by two initiators $p$ and $q$ sequentially, agent $k$ first communicates with $p$'s group. The communication channel integrates their thoughts: $\{ \tilde h_{t}^{p}, \dots, \tilde h_{t}^{k'} \} = g(h_{t}^{p}, \dots, h_{t}^{k})$. Then agent $k$ communicates with $q$'s group: $\{ \tilde h_{t}^{q}, \dots, \tilde h_{t}^{k''} \} = g(h_{t}^{q}, \dots, \tilde h_{t}^{k'})$. An agent shared by different groups can fill the information gap between the individual groups: it propagates information throughout the groups and leads to coordinated decisions across all of them.

The bidirectional LSTM unit acts as the communication channel. It selectively integrates the information of the agents within a group through its gating functions, rather than integrating it by arithmetic mean or weighted mean [1,2].
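Below is a sketch of the channel $g$ as a bidirectional LSTM, together with the sequential two-group update from the previous paragraph. The projection back to the thought size and the toy group composition are assumptions for illustration.

```python
import torch
import torch.nn as nn

THOUGHT_DIM = 64

class CommChannel(nn.Module):
    """g: integrates the thoughts of one group with a bidirectional LSTM."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(THOUGHT_DIM, THOUGHT_DIM,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * THOUGHT_DIM, THOUGHT_DIM)  # back to thought size

    def forward(self, thoughts):             # (group_size, THOUGHT_DIM)
        out, _ = self.lstm(thoughts.unsqueeze(0))
        return self.proj(out.squeeze(0))     # integrated thoughts h~

g = CommChannel()
h = {name: torch.randn(THOUGHT_DIM) for name in "pjkql"}  # toy thoughts

# p's group (initiator p, collaborator j, shared agent k) communicates first,
# which updates k's thought to h~_k'.
h["p"], h["j"], h["k"] = g(torch.stack([h["p"], h["j"], h["k"]]))

# q's group then uses k's *updated* thought, so information from p's group
# propagates into q's group through the shared agent k.
h["q"], h["k"], h["l"] = g(torch.stack([h["q"], h["k"], h["l"]]))
```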

The figure describing the whole process is shown below:

[Figure: step-by-step illustration of the whole process, from the attention unit to group formation and communication]

References

[1] Learning Multiagent Communication with Backpropagation

[2] Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games

[3] Learning Attentional Communication for Multi-Agent Cooperation

Written on February 4, 2019