Details of the paper
Title: Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method
Published Year: 2005
Author: Martin Riedmiller
Link: http://ml.informatik.uni-freiburg.de/former/_media/publications/rieecml05.pdf
Introduction
This paper introduced a novel approach that combines classical Q-learning with a multi-layer perceptron (MLP), and it is one of the origins of later developments in Deep Reinforcement Learning.
At its core, the method is function approximation: an MLP is used to represent the Q-function efficiently.
The author first summarises the advantages and disadvantages of this approach as reported in earlier work [Tes92, Lin92, Rie00].
- Advantage: generalisation. Because an MLP generalises globally across the state space, it can represent the Q-function more compactly than local approximation approaches.
- Disadvantage: danger of divergence in learning. Because every weight update changes the value estimates globally, convergence of the learning process is not assured.
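For reference, the classical on-line Q-learning update that NFQ starts from can be written in the cost-minimisation form used in the paper (c is the immediate transition cost and γ the discount factor):

$$ Q_{k+1}(s, a) = (1 - \alpha)\, Q_k(s, a) + \alpha \left( c(s, a) + \gamma \min_b Q_k(s', b) \right) $$

In NFQ, Q is not a table but the output of an MLP that takes the state-action pair as input, and the on-line update above is replaced by batch supervised training (see the sketch in the Algorithm section).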
Basic Concept
The basic idea underlying NFQ is the following: instead of updating the neural value function on-line (which leads to the problems described in the previous section), the update is performed off-line, considering an entire set of transition experiences. Experiences are collected in triples of the form (s, a, s') by interacting with the (real or simulated) system. Here, s is the original state, a is the chosen action and s' is the resulting state. The set of experiences is called the sample set D.
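As a minimal sketch of this collection step (the environment interface, function names and random-exploration scheme are assumptions for illustration, not the paper's code):

```python
import random

def collect_transitions(env, actions, num_episodes, max_steps=100):
    """Build the sample set D of (s, a, s') triples by interacting with a
    (real or simulated) system exposing reset() and step(action) -> next_state."""
    D = []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = random.choice(actions)   # e.g. random exploration
            s_next = env.step(a)
            D.append((s, a, s_next))
            s = s_next
    return D
```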
Algorithm
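The main loop of NFQ alternates between generating a pattern set from D (target = immediate cost plus γ times the minimal Q-value of the successor state) and training the MLP on that set (with Rprop in the paper). Below is a rough Python sketch of this loop; scikit-learn's MLPRegressor stands in for the Rprop-trained network, and the cost function, discount factor and action set are placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(D, actions, c, gamma=0.95, n_iterations=20):
    """Neural Fitted Q Iteration (sketch).

    D       -- list of (s, a, s_next) transition triples (the sample set)
    actions -- finite list of discrete actions
    c       -- immediate cost function c(s, a, s_next)
    """
    # MLP over the joint (state, action) input; a stand-in for the
    # Rprop-trained multi-layer perceptron used in the paper.
    q_net = MLPRegressor(hidden_layer_sizes=(5, 5), max_iter=2000)
    fitted = False

    def min_q(s_next):
        # min over actions of the current Q estimate for the successor state
        x = np.array([np.append(s_next, b) for b in actions])
        return float(np.min(q_net.predict(x)))

    for _ in range(n_iterations):
        inputs, targets = [], []
        for (s, a, s_next) in D:
            cost = c(s, a, s_next)
            # target: immediate cost plus discounted minimal cost-to-go
            target = cost + gamma * min_q(s_next) if fitted else cost
            inputs.append(np.append(s, a))
            targets.append(target)
        # off-line, batch supervised training on the whole pattern set
        q_net.fit(np.array(inputs), np.array(targets))
        fitted = True
    return q_net
```

Because the targets are recomputed against the latest network before each supervised fit, this is batch (fitted) value iteration rather than on-line bootstrapping, which is what gives NFQ its stability and data efficiency.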

Benchmarking
- avoidance control task: keep the system somewhere within the 'valid' region of state space. Pole balancing is typically defined as such a problem, where the task is to avoid that the pole crashes or the cart hits the boundary of the track.
- reaching a goal: the system has to reach a certain area in state space. As soon as it gets there, the task is immediately finished. Mountaincar is typically defined as getting the cart to a certain position up the hill.
- regulator problem: the system has to reach a certain region in state space and has to be actively kept there by the controller. This corresponds to the problems typically tackled with methods of classical control theory.
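To make the distinction concrete, here is one possible way to encode these three classes as immediate cost functions for NFQ (unit cost on failure, zero cost in the goal/target region, and a small transition cost elsewhere); the structure is typical for NFQ-style setups, but the concrete predicates and values are illustrative assumptions:

```python
C_TRANS = 0.01   # small cost for a regular transition (illustrative value)

def cost_avoidance(s_next, failed):
    # avoidance control: any non-failure state is acceptable,
    # a forbidden state (pole crashed / cart off the track) costs 1
    return 1.0 if failed(s_next) else C_TRANS

def cost_goal_reaching(s_next, in_goal_region):
    # reaching a goal: zero cost once the goal region is entered (episode ends)
    return 0.0 if in_goal_region(s_next) else C_TRANS

def cost_regulator(s_next, in_target_region):
    # regulator problem: zero cost only while the controller keeps the
    # system inside the target region; leaving it costs C_TRANS again
    return 0.0 if in_target_region(s_next) else C_TRANS
```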
Empirical Results
- Pole Balancing Task
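As an illustration of how the sketches above could be exercised on a pole-balancing task (this uses the modern gymnasium CartPole-v1 environment as a convenient stand-in; it is not the simulated plant, parameters or results reported in the paper):

```python
import random
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")
actions = [0, 1]                       # push cart left / right

def cartpole_cost(s, a, s_next):
    # failure when the cart position or pole angle leaves the valid range
    x, _, theta, _ = s_next
    return 1.0 if (abs(x) > 2.4 or abs(theta) > 0.2095) else 0.01

# collect random-exploration transitions in the same (s, a, s') triple format
D = []
for _ in range(50):
    s, _ = env.reset()
    done = False
    while not done:
        a = random.choice(actions)
        s_next, _, terminated, truncated, _ = env.step(a)
        D.append((s, a, s_next))
        s, done = s_next, terminated or truncated

q_net = nfq(D, actions, c=cartpole_cost)   # NFQ sketch from the Algorithm section

def greedy_action(s):
    # resulting controller: pick the action with minimal predicted cost-to-go
    q = q_net.predict(np.array([np.append(s, b) for b in actions]))
    return actions[int(np.argmin(q))]
```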


Conclusion
The author concludes that by reusing stored transition experiences for off-line batch updates (experience replay), the divergence issues discussed above can be avoided and data-efficient training can be achieved.
References
[BM95] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann, 1995.
[EPG05] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[Gor95] G. J. Gordon. Stable function approximation in dynamic programming. In A. Prieditis and S. Russell, editors, Proceedings of the ICML, San Francisco, CA, 1995.
[Lin92] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
[LP03] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In H. Ruspini, editor, Proceedings of the IEEE International Conference on Neural Networks (ICNN), pages 586–591, San Francisco, 1993.
[Rie00] M. Riedmiller. Concepts and facilities of a neural reinforcement learning control architecture for technical process control. Journal of Neural Computing and Application, 8:323–338, 2000.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.
[Tes92] G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.