
A Genealogy of Reinforcement Learning

Last updated at 2018-10-26, posted at 2018-10-23

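I wanted to organize reinforcement learning so that I could dig into it more deeply, so I put together a genealogy that records, for each algorithm:

when it appeared
who proposed it
(what problem the algorithm solved)
what its name abbreviates
who its parent is

(I may have mixed some things up.)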

Figure: Genealogy of the major reinforcement learning algorithms (the number in the upper left of each box is its year of birth)

TD(λ)(Sutton, 1984;1988)

Temporal Differences
Sutton, Richard S. "Learning to predict by the methods of temporal differences." Machine learning 3.1 (1988): 9-44.

Q-learning(Watkins, 1989)

Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8.3-4 (1992): 279-292.

REINFORCE(Williams, 1992)

REward Increment = Nonnegative Factor Offset Reinforcement
Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine learning 8.3-4 (1992): 229-256.

SARSA(Rummery, 1994)

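Abbreviation of State-Action-Reward-State-Action (the name appears in a note in the paper)
Rummery, Gavin A., and Mahesan Niranjan. On-line Q-learning using connectionist systems. Vol. 37. Cambridge, England: University of Cambridge, Department of Engineering, 1994.
cf. sugulu, Qiita - [For reinforcement learning beginners] Learning SARSA and Monte Carlo methods from a simple implementation example [Pole balancing with CartPole: complete in one file]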
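
The five elements in the name are exactly what one tabular SARSA update consumes. The following is a minimal sketch under that reading; the function name, the dictionary-based Q-table, and the default `alpha` and `gamma` are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular SARSA update to q in place and return the TD error."""
    # On-policy target: bootstrap from the action actually chosen in s_next.
    target = r if done else r + gamma * q[(s_next, a_next)]
    td_error = target - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


# Toy usage with hypothetical state and action labels.
q = defaultdict(float)
q[("s1", "right")] = 1.0
sarsa_update(q, "s0", "right", 0.5, "s1", "right", alpha=0.5, gamma=0.9)
```

Q-learning differs only in the target: it bootstraps from the maximum Q-value over actions in the next state instead of from the action that was actually taken.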

CEM(Rubinstein, 1997)

Cross Entropy Method
An evolutionary algorithm
Reference 1: http://web.mit.edu/6.454/www/www_fall_2003/gew/CEtutorial.pdf
Reference 2: https://esc.fnwi.uva.nl/thesis/centraal/files/f2110275396.pdf
Reference 3: http://learning.mpi-sws.org/mlss2016/slides/2016-MLSS-RL.pdf
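
To make the "evolutionary" character concrete, here is a minimal sketch of the cross-entropy method on a one-dimensional toy objective. It assumes a Gaussian sampling distribution that is refit to the top-scoring samples each round; the function name, the objective, and every hyperparameter are illustrative assumptions.

```python
import random
import statistics


def cross_entropy_method(score, mean=0.0, std=5.0, population=50,
                         elite_frac=0.2, iterations=30, seed=0):
    """Maximize score(x) over a scalar x with a Gaussian sampling distribution."""
    rng = random.Random(seed)
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample a population of candidates from the current distribution.
        samples = [rng.gauss(mean, std) for _ in range(population)]
        # Keep the highest-scoring candidates (the elites).
        elites = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the distribution to the elites; the small constant keeps std > 0.
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-6
    return mean


# Toy objective with its maximum at x = 3.
best = cross_entropy_method(lambda x: -(x - 3.0) ** 2)
```

In a reinforcement learning setting, `x` would typically be a policy parameter vector and `score` the return of an episode run with those parameters.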

AL(Baird Ⅲ, 1999)

Advantage Learning
Baird III, Leemon C. Reinforcement learning through gradient descent. No. CMU-CS-99-132. CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE, 1999.

AC(Sutton et al., 2000)

Actor-Critic
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.

NFQ(Riedmiller, 2005)

Neural Fitted Q Iteration
Riedmiller, Martin. "Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method." European Conference on Machine Learning. Springer, Berlin, Heidelberg, 2005.

REPS(Peters et al., 2010)

Relative Entropy Policy Search
Peters, Jan, Katharina Mülling, and Yasemin Altun. "Relative Entropy Policy Search." AAAI. 2010.

DQN(Mnih et al., 2013)

Deep Q-Networks
Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
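cf. sugulu, Qiita - [For reinforcement learning beginners] Learning Q-learning, DQN, and DDQN from simple implementation examples [Pole balancing with CartPole: complete in one file, using Keras]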

DPG(Silver et al., 2014)

Deterministic Policy Gradient
Silver, David, et al. "Deterministic policy gradient algorithms." ICML. 2014.

DDQN(van Hasselt et al., 2015)

Double DQN
Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-Learning." AAAI. Vol. 2. 2016.
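cf. sugulu, Qiita - [For intermediate reinforcement learning users] Learning Dueling Network DQN from an implementation example [Pole balancing with CartPole: complete in one file]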
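
What Double DQN changes relative to DQN is the bootstrap target: the online network selects the next action and the target network evaluates it. The sketch below compares the two targets for a single transition. It assumes the Q-values for the next state are already available as plain lists; the function names and numbers are illustrative.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state (three actions).
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [0.3, 0.1, 0.9]
y_dqn = dqn_target(1.0, next_q_target, gamma=0.9)
y_double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
```

Here the Double DQN target is the smaller of the two, because the action preferred by the online network is not the one the target network rates highest.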

Dueling DQN(Wang et al., 2015)

An evolution of Double DQN
Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." arXiv preprint arXiv:1511.06581 (2015).
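
The dueling architecture splits the network head into a state-value stream and an advantage stream and recombines them into Q-values. A minimal sketch of the mean-subtracted combination step follows; it assumes the two streams have already been computed, and the function name and inputs are illustrative.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a), subtracting the mean advantage."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Hypothetical stream outputs for one state with three actions.
q_values = dueling_q_values(2.0, [1.0, 2.0, 3.0])
```

Subtracting the mean makes the decomposition identifiable: adding a constant to every advantage leaves the resulting Q-values unchanged.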

PAL(Bellemare et al., 2015)

Persistent Advantage Learning
Bellemare, Marc G., et al. "Increasing the action gap: New operators for reinforcement learning." AAAI. 2016.

TRPO(Schulman et al., 2015)

Trust Region Policy Optimization
Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

DDPG(Lillicrap et al., 2015)

Deep DPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).

PER(Schaul et al., 2015)

Prioritized Experience Replay
Schaul, Tom, et al. "Prioritized experience replay." arXiv preprint arXiv:1511.05952 (2015).
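cf. sugulu, Qiita - [For intermediate reinforcement learning users] Learning prioritized experience replay DQN from an implementation example [Pole balancing with CartPole: complete in one file]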
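
As a rough illustration of proportional prioritization, the sketch below turns TD errors into sampling probabilities and the matching importance-sampling weights. It uses a plain list instead of the sum-tree used in practical implementations; the function names and the `alpha`, `beta`, and `eps` defaults are illustrative assumptions.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i)) ** (-beta), normalized so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


# Hypothetical TD errors for three stored transitions.
probs = sampling_probabilities([0.1, 1.0, 2.0])
weights = importance_weights(probs)
batch = random.Random(0).choices(range(len(probs)), weights=probs, k=4)
```

Transitions with a large TD error are replayed more often, and the weights scale their updates down to compensate for the resulting sampling bias.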

Gorila(Nair et al., 2015)

General Reinforcement Learning Architecture
Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning." arXiv preprint arXiv:1507.04296 (2015).

NAF(Gu et al., 2016)

Normalized Advantage Function
Extends Double DQN so that problems with continuous action spaces can also be solved
Gu, Shixiang, et al. "Continuous deep q-learning with model-based acceleration." International Conference on Machine Learning. 2016.

A3C(Mnih et al., 2016)

Asynchronous Advantage Actor-Critic
Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. 2016.
cf. sugulu, Qiita - [Reinforcement learning] Learning A3C by implementing it [Pole balancing with CartPole: complete in one file]

A2C(Mnih et al., 2016)

Advantage Actor-Critic
Idea: https://blog.openai.com/baselines-acktr-a2c/
Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International conference on machine learning. 2016.

UNREAL(Jaderberg et al., 2016)

UNsupervised REinforcement and Auxiliary Learning
Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks." arXiv preprint arXiv:1611.05397 (2016).

GAIL(Ho et al., 2016)

Generative Adversarial Imitation Learning
Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.

ACER(Wang et al., 2016)

Actor-Critic with Experience Replay
Wang, Ziyu, et al. "Sample efficient actor-critic with experience replay." arXiv preprint arXiv:1611.01224 (2016).

HER(Andrychowicz et al., 2017)

Hindsight Experience Replay
Andrychowicz, Marcin, et al. "Hindsight experience replay." Advances in Neural Information Processing Systems. 2017.
cf. ishizakiiii, Qiita - Reinforcement learning that also learns from failure: understanding the HER algorithm and trying it on OpenAI Gym's new robots
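
The "learning from failure" idea can be sketched as a relabeling step: transitions from an episode that missed its goal are stored again as if the state actually reached had been the goal. The sketch below uses only the final achieved state as the substitute goal. The tuple layout, the function names, and the sparse reward are illustrative assumptions.

```python
def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 otherwise (an assumed reward convention)."""
    return 0.0 if achieved == goal else -1.0


def hindsight_relabel(episode, reward_fn=sparse_reward):
    """Relabel (state, action, next_state, goal) tuples with the final achieved state.

    Returns (state, action, next_state, new_goal, new_reward) tuples.
    """
    final_achieved = episode[-1][2]
    relabeled = []
    for state, action, next_state, _original_goal in episode:
        reward = reward_fn(next_state, final_achieved)
        relabeled.append((state, action, next_state, final_achieved, reward))
    return relabeled


# A failed episode on a number line: the goal was 5, but the agent only reached 2.
episode = [(0, +1, 1, 5), (1, +1, 2, 5)]
extra_transitions = hindsight_relabel(episode)
```

The relabeled transitions are added to the replay buffer alongside the originals, so even a failed episode yields at least one transition with a non-negative reward.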

C51(Bellemare et al., 2017)

A special case of Rainbow
A Distributional Perspective on Reinforcement Learning
Bellemare, Marc G., Will Dabney, and Rémi Munos. "A distributional perspective on reinforcement learning." arXiv preprint arXiv:1707.06887 (2017).

Rainbow (Hessel et al., 2017)

Hessel, Matteo, et al. "Rainbow: Combining improvements in deep reinforcement learning." arXiv preprint arXiv:1710.02298 (2017).
cf. Jun Okumura, SlideShare - From DQN to Rainbow: Latest Trends in Deep Reinforcement Learning

[Images: table1, table2, table4]

PPO(Schulman et al., 2017)

Proximal Policy Optimization
Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
cf. sugulu, Qiita - [Reinforcement learning] Learning PPO by implementing it [Pole balancing with CartPole: complete in one file]

From Song et al., 2018
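
For orientation, the clipped surrogate objective at the core of PPO can be sketched for a single sample as follows. The function name and the `clip_eps` default are illustrative assumptions; an actual implementation averages this quantity over a batch and maximizes it with respect to the policy parameters.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_logp - old_logp)  # probability ratio r = pi_new / pi_old
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# With a negative advantage the same ratio is not clipped, so the penalty is kept.
penalized = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

Taking the minimum removes the incentive to push the probability ratio far from 1, which is how PPO approximates the trust-region constraint of TRPO with a first-order method.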

ACKTR(Wu et al., 2017)

Actor Critic using Kronecker-Factored Trust Region
Wu, Yuhuai, et al. "Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation." Advances in neural information processing systems. 2017.

SDQN(Metz et al., 2017)

Sequential DQN
Metz, Luke, et al. "Discrete sequential prediction of continuous actions for deep RL." arXiv preprint arXiv:1705.05035 (2017).

QR-DQN(Dabney et al., 2017)

Quantile Regression DQN
Dabney, Will, et al. "Distributional reinforcement learning with quantile regression." arXiv preprint arXiv:1710.10044 (2017).

IQN(Dabney et al., 2018)

Implicit Quantile Network
Dabney, Will, et al. "Implicit Quantile Networks for Distributional Reinforcement Learning." arXiv preprint arXiv:1806.06923 (2018).

Ape-X(Horgan et al., 2018)

Distributed Prioritized Experience Replay
Reference: [Deep reinforcement learning] "The strongest method of 2018 (?)" Ape-X: implementation and explanation, Qiita.

References

GitHub - Deep Reinforcement Learning Papers
