
(Paper) Multi-task Deep Reinforcement Learning with PopArt (2018)

Posted at 2018-10-19

Source

Introduction

There have been numerous cases of reinforcement learning agents performing at superhuman level on various games. However, while such an agent retains the same network architecture throughout, it needs to be trained from scratch on each specific game it is trying to conquer. Despite the progress DeepMind has made with IMPALA, whose aim is to create an agent that can multi-task, its final performance only reaches a 59.7% median human normalised score across 57 Atari games, and a 49.4% mean human normalised score across 30 DeepMind Lab levels (DmLab-30).

One particular reason is that different games have different reward scales, causing the agent to tunnel-vision on the tasks that yield higher rewards. For example, if game A can only provide a maximum reward of +1 per step while game B offers +1000, the model will inevitably be more inclined to solve game B simply because its rewards are more salient.

To alleviate this issue, a method named PopArt was proposed, in which the rewards (more precisely, the value targets) of each game are rescaled so that every game carries roughly equal learning weight. By combining PopArt with IMPALA, a single agent is now capable of playing the 57 Atari games with above-human median performance (110% median human normalised score), and reaches a 72.8% mean human normalised score on DmLab-30.

Implementation

The goal is to have a single agent learn N different environments $D_1, \dots, D_N$. This can be formalised as a single large MDP whose state space is the union of the per-task state spaces, $\mathcal{S} = \bigcup_i \mathcal{S}_i$, where $i$ is the task index. The task index is only used at training time, not during testing.

The agent is an actor-critic that uses a deep neural network for both its policy and its value function. The value function is trained towards a return target, and the policy is updated with the REINFORCE policy gradient, using $v_\theta(S_t)$ as a baseline to reduce variance:

$$\Delta\theta \propto \left(G_t^{v} - v_\theta(S_t)\right)\,\nabla_\theta\, v_\theta(S_t) \qquad (1)$$

$$\Delta\theta \propto \left(G_t^{\pi} - v_\theta(S_t)\right)\,\nabla_\theta \log \pi_\theta(A_t \mid S_t) \qquad (2)$$

Multi-step returns are used to further reduce the variance: $G_t^{v}$ and $G_t^{\pi}$ are estimates of $v^\pi(S_t)$ and $q^\pi(S_t, A_t)$, respectively.
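To make these two updates concrete, here is a minimal NumPy sketch for a linear value function and softmax policy; the names (`phi`, `G_v`, `G_pi`, `alpha`) and the linear parameterisation are illustrative assumptions, not the paper's deep-network implementation:

```python
import numpy as np

def actor_critic_update(theta_v, theta_pi, phi, action, G_v, G_pi, alpha=1e-3):
    """One actor-critic step in the spirit of Eqs. (1)-(2), for linear function
    approximation on a feature vector phi of the state S_t."""
    v = theta_v @ phi                       # baseline v_theta(S_t)

    # Value update: move v_theta(S_t) towards its return target G_t^v.
    theta_v = theta_v + alpha * (G_v - v) * phi

    # Policy update (REINFORCE with baseline) for a softmax policy over logits.
    logits = theta_pi @ phi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_log_pi = -np.outer(probs, phi)     # d log pi(a|s) / d theta_pi ...
    grad_log_pi[action] += phi              # ... plus the term for the taken action
    theta_pi = theta_pi + alpha * (G_pi - v) * grad_log_pi
    return theta_v, theta_pi
```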

The agent follows the IMPALA distributed architecture (several actors running in different environments, with a single GPU learner updating on all rollouts using a deep convolutional ResNet followed by an LSTM recurrent layer), and applies importance sampling to correct each error term of the n-step return. This makes the updates unbiased, but at the cost of high variance; to reduce the variance, the importance sampling ratios are clipped. This leads to the v-trace return:

$$G_t^{v} = v_\theta(S_t) + \sum_{k=t}^{t+n-1} \gamma^{\,k-t} \left(\prod_{i=t}^{k-1} c_i\right) \rho_k\, \delta_k, \qquad \delta_k = R_{k+1} + \gamma\, v_\theta(S_{k+1}) - v_\theta(S_k) \qquad (3)$$

with clipped importance ratios $\rho_k = \min\!\left(\bar\rho,\ \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}\right)$ and $c_i = \min\!\left(\bar c,\ \frac{\pi(A_i \mid S_i)}{\mu(A_i \mid S_i)}\right)$, where $\mu$ here denotes the actors' behaviour policy. Following this, the returns used by the value and policy updates in (1) and (2) are the v-trace return $G_t^{v}$ itself and

$$G_t^{\pi} = R_{t+1} + \gamma\, G_{t+1}^{v}. \qquad (4)$$
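As a rough illustration of the clipping, a v-trace target for an n-step rollout could be computed as below; this is a simplified NumPy sketch (the argument names, and folding $\rho_k$ into the temporal-difference error, are my choices, not IMPALA's actual implementation):

```python
import numpy as np

def vtrace_returns(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets G_t^v for one rollout of length n.

    rewards         : R_{t+1} .. R_{t+n},                  shape (n,)
    values          : v_theta(S_t) .. v_theta(S_{t+n-1}),  shape (n,)
    bootstrap_value : v_theta(S_{t+n})
    rhos            : importance ratios pi/mu per step,    shape (n,)
    """
    n = len(rewards)
    values_ext = np.append(values, bootstrap_value)
    clipped_rho = np.minimum(rho_bar, rhos)     # rho_k, clipped at rho_bar
    clipped_c = np.minimum(c_bar, rhos)         # c_i, clipped at c_bar

    # rho_k * delta_k, with delta_k = R_{k+1} + gamma v(S_{k+1}) - v(S_k)
    deltas = clipped_rho * (rewards + gamma * values_ext[1:] - values_ext[:-1])

    # Backward recursion: (G_k^v - v(S_k)) = rho_k delta_k + gamma c_k (G_{k+1}^v - v(S_{k+1}))
    targets = np.zeros(n)
    acc = 0.0
    for k in reversed(range(n)):
        acc = deltas[k] + gamma * clipped_c[k] * acc
        targets[k] = values_ext[k] + acc
    return targets
```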

The network uses PopArt normalisation to derive a scale-invariant algorithm for the actor-critic agent. In order to normalise both the baseline and the policy gradient updates, the value function $v_{\mu,\sigma,\theta}(S)$ is parameterised as a linear transformation of a suitably normalised value prediction $n_\theta(S)$. The output of the deep neural network is the normalised value prediction:
$$v_{\mu,\sigma,\theta}(S) = \sigma\, n_\theta(S) + \mu \qquad (5)$$
$\mu$ and $\sigma$ are updated with (6) in order to track the mean and standard deviation of the value targets.
$$\mu_t = (1-\beta)\,\mu_{t-1} + \beta\, G_t^{v}, \qquad \nu_t = (1-\beta)\,\nu_{t-1} + \beta\, \left(G_t^{v}\right)^2, \qquad \sigma_t = \sqrt{\nu_t - \mu_t^2} \qquad (6)$$
The fixed decay rate $\beta$ determines the horizon used to compute the statistics. $n_\theta(S)$, $\mu$ and $\sigma$ are then used to normalise the losses of both the value and the policy updates of the actor-critic. This results in the scale-invariant updates:
$$\Delta\theta \propto \left(\frac{G_t^{v} - \mu}{\sigma} - n_\theta(S_t)\right)\nabla_\theta\, n_\theta(S_t) \qquad (7)$$

$$\Delta\theta \propto \left(\frac{G_t^{\pi} - \mu}{\sigma} - n_\theta(S_t)\right)\nabla_\theta \log \pi_\theta(A_t \mid S_t) \qquad (8)$$
The normalised value targets are non-stationary, since they depend on the statistics $\mu$ and $\sigma$. PopArt normalisation compensates for this by updating the weights $W$ and bias $b$ of the last layer of the normalised value network ($n_\theta(S) = W f_\theta(S) + b$, where $f_\theta$ is the rest of the network), so that the unnormalised value estimates $v_{\mu,\sigma,\theta}$ are preserved under any change in the statistics $\mu \to \mu'$ and $\sigma \to \sigma'$:
$$W' = \frac{\sigma}{\sigma'}\,W, \qquad b' = \frac{\sigma\, b + \mu - \mu'}{\sigma'} \qquad (9)$$
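A minimal sketch of this bookkeeping for a single task, assuming the value head ends in a plain linear layer with weights `w` and bias `b` (the class name, the decay rate default, and the small epsilon are illustrative assumptions, not the authors' code):

```python
import numpy as np

class PopArt:
    """Tracks mu/sigma of the value targets and keeps unnormalised outputs fixed."""

    def __init__(self, beta=3e-4):
        self.beta = beta          # decay rate (illustrative value)
        self.mu = 0.0             # first moment of the targets
        self.nu = 1.0             # second moment of the targets

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalise(self, target):
        return (target - self.mu) / self.sigma      # feeds updates (7) and (8)

    def unnormalise(self, n_value):
        return self.sigma * n_value + self.mu       # Eq. (5)

    def update(self, target, w, b):
        """target: G_t^v; w, b: last linear layer of the value head."""
        old_mu, old_sigma = self.mu, self.sigma
        # Eq. (6): exponential moving averages of the first two moments.
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        # Eq. (9): rescale the last layer so that sigma' n'(S) + mu' = sigma n(S) + mu.
        w = w * old_sigma / self.sigma
        b = (old_sigma * b + old_mu - self.mu) / self.sigma
        return w, b
```

The normalised targets $(G_t^{v}-\mu)/\sigma$ and $(G_t^{\pi}-\mu)/\sigma$ returned by `normalise` are what enter updates (7) and (8).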

A single pair of normalisation statistics is not enough for the multi-task RL setting. Let $D_i$, $i \in \{1, \dots, N\}$, be the environments, and let $\pi(A \mid S)$ be a task-agnostic policy (it does not know which task it is facing) that takes a state $S$ from any of the environments $D_i$ and maps it to a probability distribution over the shared action space $\mathcal{A}$. The multi-task value function $v(S)$ has $N$ outputs, one for each task. $v$ can be parameterised as before using Equation 5, except that the statistics $\mu$ and $\sigma$ are now vectors, and the same holds for the value function, which becomes

$$v_{\mu,\sigma,\theta}(S) = \sigma \odot n_\theta(S) + \mu, \qquad n_\theta(S) = W f_\theta(S) + b \qquad (10)$$
where $\odot$ denotes element-wise multiplication, and $W$ and $b$ are the weights and biases of the last fully connected layer producing $n_\theta(S)$ from the torso output $f_\theta(S)$. The scale-invariant updates from Equations 7 and 8 now become:

$$\Delta\theta \propto \left(\frac{G_t^{v,i} - \mu_i}{\sigma_i} - n_\theta^{\,i}(S_t)\right)\nabla_\theta\, n_\theta^{\,i}(S_t) \qquad (11)$$

$$\Delta\theta \propto \left(\frac{G_t^{\pi,i} - \mu_i}{\sigma_i} - n_\theta^{\,i}(S_t)\right)\nabla_\theta \log \pi_\theta(A_t \mid S_t) \qquad (12)$$

where the targets $G_t^{v,i}$ and $G_t^{\pi,i}$ use the value estimates for environment $D_i$ for bootstrapping.

$W$ and $b$ are updated in the same way as before whenever $\mu$ and $\sigma$ are updated:

$$w_i' = \frac{\sigma_i}{\sigma_i'}\,w_i, \qquad b_i' = \frac{\sigma_i\, b_i + \mu_i - \mu_i'}{\sigma_i'} \qquad (13)$$
where $w_i$ is the $i$-th row of the matrix $W$, and $\mu_i$, $\sigma_i$, $b_i$ are the $i$-th elements of the corresponding parameter vectors. Only the values, not the policy, are conditioned on the task index, so that the agent can act without prior knowledge of which task it has been given. This is possible because the values are only used to reduce the variance of the policy updates during training and are not needed for action selection.
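The per-task variant only touches the statistics, the row $w_i$, and the bias $b_i$ of the active task; a hedged sketch in the same style as above (illustrative names, not the paper's implementation):

```python
import numpy as np

class MultiTaskPopArt:
    """One (mu_i, sigma_i) pair and one value-head row per task."""

    def __init__(self, num_tasks, beta=3e-4):
        self.beta = beta
        self.mu = np.zeros(num_tasks)     # per-task first moments
        self.nu = np.ones(num_tasks)      # per-task second moments

    def sigma(self, i):
        return np.sqrt(max(self.nu[i] - self.mu[i] ** 2, 1e-8))

    def update(self, i, target, W, b):
        """i: task index; target: G_t^{v,i}; W: (N x m) last-layer weights; b: (N,) biases."""
        old_mu, old_sigma = self.mu[i], self.sigma(i)
        # Eq. (6), applied only to the statistics of the active task i.
        self.mu[i] = (1 - self.beta) * self.mu[i] + self.beta * target
        self.nu[i] = (1 - self.beta) * self.nu[i] + self.beta * target ** 2
        # Eq. (13): rescale only row w_i and bias b_i so that v_i(S) is preserved.
        W[i] = W[i] * old_sigma / self.sigma(i)
        b[i] = (old_sigma * b[i] + old_mu - self.mu[i]) / self.sigma(i)
        return W, b
```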

To summarise, gradients do not propagate into the statistics $\mu$ and $\sigma$, which are updated exclusively as in Equation 6. The weights $W$ and biases $b$ of the last layer of the value function are updated according to Equations 11 and 13. Note the order: the actor-critic updates (11) are applied first, then the statistics are updated with (6), and finally the last layer is rescaled with (13).

Results

The experiments are performed on Atari-57 and DmLab-30. In addition, a variant of Atari-57 without reward clipping is also evaluated. Atari has a particularly diverse reward structure, while DmLab offers a diverse set of tasks. The IMPALA baseline reported in this paper is also stronger than the one in the original IMPALA paper, as the agent has been equipped with a larger action set (e.g. giving it finer control when rotating).
Note that the performance is the median performance of a single agent across all games, rather than the median of a set of individually trained agents.

PopArt-IMPALA was further extended to incorporate pixel control in order to improve data efficiency.
Figure 5 of the paper shows that the pixel-control-augmented PopArt-IMPALA matches the original PopArt-IMPALA with only a fraction of the data. On top of this, since both PopArt and pixel control add only a very small computational cost, the overall cost of training IMPALA agents is greatly reduced. Furthermore, other orthogonal advances in deep RL could be combined to improve performance further, in the same spirit as Rainbow.


Conclusion

PopArt's adaptive normalisation has been shown to allow a single agent to train across multiple tasks with very different reward scales. On top of the performance gain, data efficiency is also improved, and further augmentations such as pixel control boost it even more. Results might improve further when incorporating other orthogonal methods.
