
Paper: Making Efficient Use of Demonstrations to Solve Hard Exploration Problems (2019)


Introduction

Main Challenges:

  • Sparse rewards - Rewards are rarely observed, making exploration hard.
  • Partially observable environments - Force the agent to use memory and reduce how much general information a single demonstration trajectory can convey.
  • Highly variable initial conditions - Generalizing across different initial conditions is difficult.

Hard-Eight Task Suite

  1. Baseball
  2. Drawbridge
  3. Navigate Cubes
  4. Push Blocks
  5. Remember Sensor
  6. Throw Across
  7. Wall Sensor
  8. Wall Sensor Stack

Recurrent Replay Distributed DQN from Demonstrations (R2D3)

  • Outperforms behavioral cloning and, on some tasks, even the expert demonstrations it learns from.
  • Uses expert demonstrations to guide agent exploration.

Implementation

The overall system design of R2D3 is shown in Figure 1. There are multiple actors, each with a copy of the behavior policy, streaming their experience into a shared, globally prioritized agent replay buffer. The actors periodically update their network weights from the learner (as in R2D2). The learner is an n-step Dueling Double DQN, which also updates the priorities in the replay buffer.

Figure 1. R2D3 architecture.

For each entry in the replay memory, instead of the usual $(s, a, r, s')$ transition, we store fixed-length ($m = 80$) sequences of $(s, a, r)$, with adjacent sequences overlapping by 40 time steps and never crossing episode boundaries.
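
As a rough illustration of this storage scheme, here is a minimal sketch (not the authors' code; the transition tuples are simplified placeholders, and leftover tails shorter than 80 steps are simply dropped here):

```python
import numpy as np

SEQ_LEN = 80   # fixed sequence length m
OVERLAP = 40   # adjacent sequences share 40 time steps

def episode_to_sequences(episode):
    """Split one episode, given as a list of (s, a, r) tuples, into
    overlapping fixed-length sequences. Sequences never cross episode
    boundaries; any tail shorter than SEQ_LEN is dropped in this sketch."""
    stride = SEQ_LEN - OVERLAP  # start a new sequence every 40 steps
    return [episode[start:start + SEQ_LEN]
            for start in range(0, len(episode) - SEQ_LEN + 1, stride)]

# Toy usage: a 200-step episode yields sequences starting at t = 0, 40, 80, 120.
toy_episode = [(np.zeros(4), 0, 0.0) for _ in range(200)]
print(len(episode_to_sequences(toy_episode)))  # -> 4
```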

The input consists of a stack of 4 frames, and instead of reward clipping, the algorithm adopts an invertible value-function rescaling $h(x) = \mathrm{sign}(x)\left(\sqrt{|x|+1} - 1\right)$ (with inverse $h^{-1}(x) = \mathrm{sign}(x)\left((x + \mathrm{sign}(x))^2 - 1\right)$).
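
A direct NumPy implementation of this rescaling and its inverse (a small sketch; the related R2D2 formulation also adds a small linear term $\epsilon x$, which is omitted here to match the inverse given above):

```python
import numpy as np

def h(x):
    """Invertible value rescaling used instead of reward clipping."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0)

def h_inv(x):
    """Inverse transform, so that h_inv(h(x)) == x."""
    return np.sign(x) * ((x + np.sign(x)) ** 2 - 1.0)

values = np.array([-100.0, -1.5, 0.0, 2.0, 500.0])
assert np.allclose(h_inv(h(values)), values)
```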

This leads to the n-step targets for the Q-value function:

$$\hat{y}_t = h\!\left(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n\, h^{-1}\!\left(Q(s_{t+n}, a^{*}; \theta^{-})\right)\right), \qquad a^{*} = \arg\max_{a} Q(s_{t+n}, a; \theta),$$

where $\theta^-$ are the target network parameters, copied from the online network parameters $\theta$ every 2500 learner steps.
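
Reusing `h` and `h_inv` from the snippet above, the target computation for a single time step could look like this (a sketch; the values of $n$ and $\gamma$ in the toy usage are illustrative, not taken from the paper):

```python
import numpy as np

def n_step_target(rewards, q_online_tn, q_target_tn, gamma, n):
    """Rescaled n-step double-DQN target for a single time step t.

    rewards:      r_t, ..., r_{t+n-1}                        (length n)
    q_online_tn:  Q(s_{t+n}, ., theta)  over all actions -> selects a*
    q_target_tn:  Q(s_{t+n}, ., theta-) over all actions -> evaluates a*
    """
    a_star = int(np.argmax(q_online_tn))                 # double DQN action selection
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    bootstrap = gamma ** n * h_inv(q_target_tn[a_star])  # undo rescaling before summing
    return h(n_step_return + bootstrap)

# Toy usage with illustrative values.
target = n_step_target(rewards=[0.0, 0.0, 1.0, 0.0, 0.0],
                       q_online_tn=np.array([0.1, 0.4, 0.2]),
                       q_target_tn=np.array([0.0, 0.3, 0.1]),
                       gamma=0.99, n=5)
```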

The replay prioritization uses a mixture of the max and mean absolute n-step TD-errors $\delta_i$ over the sequence, $p = \eta\, \max_i |\delta_i| + (1-\eta)\, \overline{|\delta|}$, with $\eta$ being the priority exponent.
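
In code, the per-sequence priority could be computed as follows (a sketch; the default value of $\eta$ is a placeholder):

```python
import numpy as np

def sequence_priority(td_errors, eta=0.9):
    """Priority of a sequence: a mix of the max and mean absolute
    n-step TD-errors over its time steps (eta is the priority exponent)."""
    abs_delta = np.abs(np.asarray(td_errors))
    return eta * abs_delta.max() + (1.0 - eta) * abs_delta.mean()
```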

There is a second, demo replay buffer, which contains prioritized records of expert demonstrations. The learner then samples batches from the two replay buffers simultaneously.

The hyperparameter $p$ is the demo ratio, which denotes the proportion of data in each batch that is sampled from the demo replay buffer. The sampling is done at the batch level, meaning samples within one batch can come from different sources.
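
A minimal sketch of this batch-level mixing (uniform sampling stands in for the prioritized sampling actually used; buffer contents and the demo ratio in the toy usage are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(agent_buffer, demo_buffer, batch_size, demo_ratio):
    """Fill each batch slot from the demo buffer with probability demo_ratio,
    otherwise from the agent buffer, so one batch can mix both sources."""
    from_demo = rng.random(batch_size) < demo_ratio
    return [demo_buffer[rng.integers(len(demo_buffer))] if d
            else agent_buffer[rng.integers(len(agent_buffer))]
            for d in from_demo]

# Toy usage: with demo_ratio = 1/16, roughly 4 of 64 samples are demonstrations.
agent_buffer = ["agent_seq_%d" % i for i in range(1000)]
demo_buffer = ["demo_seq_%d" % i for i in range(100)]
batch = sample_mixed_batch(agent_buffer, demo_buffer, batch_size=64, demo_ratio=1 / 16)
```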


Experiment

Input feature

The input feature representation is computed using the architecture shown in Figure 2.
Figure 2. Architecture for input feature representation.

The input frame of size 96x72 is fed into a ResNet, and its output is concatenated with the previous action $a_{t-1}$, the previous reward $r_{t-1}$, and other proprioceptive features $f_t$, such as accelerations, whether the avatar's hand is holding an object, and the hand's relative distance to the avatar.
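
As a rough sketch of how these inputs could be combined into a single vector for the recurrent head (the one-hot action encoding and the dimensions are assumptions, not taken from the paper):

```python
import numpy as np

def build_input_feature(frame_embedding, prev_action, num_actions,
                        prev_reward, proprioceptive):
    """Concatenate the ResNet frame embedding with the one-hot previous
    action a_{t-1}, the previous reward r_{t-1}, and the proprioceptive
    features f_t."""
    one_hot_action = np.eye(num_actions)[prev_action]
    return np.concatenate([frame_embedding, one_hot_action,
                           [prev_reward], proprioceptive])

# Toy shapes: 512-dim frame embedding, 10 discrete actions, 6 proprioceptive values.
feature = build_input_feature(np.zeros(512), prev_action=3, num_actions=10,
                              prev_reward=0.0, proprioceptive=np.zeros(6))
print(feature.shape)  # (529,) = 512 + 10 + 1 + 6
```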

Baselines

  1. Behavior Cloning (BC) - Supervised imitation learning on expert trajectories.
  2. Recurrent Replay Distributed DQN (R2D2) - R2D3 without demonstrations. Off-policy SOTA.
  3. Deep Q-learning from Demonstrations (DQfD) - Replaces the recurrent value function of R2D3 with a feedforward reactive network (Figure 3). SOTA for learning from demonstrations (LfD).


Figure 3. (a) is the recurrent head used by R2D3, whereas (b) is the feedforward head used by DQfD.

Setup

100 expert demonstrations per task.

R2D3, R2D2 and DQfD
* Adam optimizer
* Learning rate 2×10$^{-4}$
* Distributed training with 256 parallel actors, at least 10 billion actor steps for all tasks.

BC
* Adam optimizer
* Learning rate swept over {10$^{-5}$, 10$^{-4}$, 10$^{-3}$}
* Trained for 500k learner steps.

An agent is considered successful if it solves at least 75% of its final 25 episodes. Note that a successful agent may still fail depending on the environment randomization.
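
For concreteness, the success criterion could be checked like this (a trivial sketch):

```python
def is_successful(episode_solved):
    """episode_solved: chronological list of booleans, one per episode.
    Successful means solving at least 75% of the final 25 episodes."""
    final = episode_solved[-25:]
    return sum(final) / len(final) >= 0.75
```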

Result

From Figure 4, it can be gathered that:
1. None of the baseline algorithms succeed in any of the eight environments.
2. R2D3 learns six out of the eight tasks, exceeding human performance in four of them.
3. All algorithms fail to solve two of the tasks: Remember Sensor and Throw Across, which are the most demanding in terms of memory requirement for the agent.
Figure 4. Reward vs actor steps curves for R2D3 and baselines on the Hard-Eight task suite.

From Figure 5, it can be deduced that lower demo ratios outperform higher ones.
Figure 5. R2D3 success rate vs demo ratio.

At 5 billion actor steps (well before R2D3 solves the task), R2D3 is already sufficiently guided by the expert demonstrations to perform actions that lead to a successful outcome (Figure 6).
Figure 6. Guided exploration behavior in the Push Blocks task.

R2D3 significantly outpaces R2D2 on the Baseball task (Figure 7), eventually surpassing average human performance.
Figure 7. Guided exploration behavior in the Baseball task.


R2D3 was able to exploit a bug in the Wall Sensor Stack task. This exploit was not present in any of the demonstrations, although the authors do not mention whether the other baselines were able to find and exploit the same bug.

Conclusion

Combining agent self-exploration with expert demonstrations achieves good results in partially observable environments with sparse rewards and highly variable initial conditions, even though the proportion of demonstration data used during learning is generally minute.
