Introduction
This is a branch article of my series on Deep Reinforcement Learning.
The main article is here:
https://qiita.com/Rowing0914/items/9c5b4ffeb15f4fc12340
In this article, I would like to summarise a famous research paper, Deep Q-Network (DQN).
I would also like to share an implementation!
Hope you will like it!
DQN in a nutshell
Mnih et al. (2015) introduced a novel approach that combines Q-learning with a deep neural network used as a function approximator. Before explaining the algorithm, let me briefly cover the three points that made their achievement great:
- Stabilisation/convergence of training with experience replay (a minimal buffer sketch follows this list)
- Domain-knowledge-free model (no need to feed in any prior information or domain knowledge)
- Generalisation (DQN performs very well on most of the Atari games)
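To make the experience-replay idea concrete, here is a minimal sketch of a replay buffer in Python. The class name and default sizes are my own choices for illustration, not code from the paper or the repository linked below.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and samples random minibatches,
    which breaks the correlation between consecutive frames."""

    def __init__(self, capacity=100_000):
        # Old transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Uniformly sample a minibatch of stored transitions for one update.
        return random.sample(self.buffer, batch_size)
```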
Model architecture
Input (84x84x4 image) -> conv -> ReLU -> conv -> ReLU -> conv -> ReLU -> FC -> ReLU -> FC (linear) -> output (action-value function, i.e. one Q-value per action; note there is no softmax, since Q-values are not probabilities)
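To make the shapes concrete, here is a minimal sketch of this network in PyTorch (the framework is my own choice; neither the paper nor the repository linked below uses it). The layer sizes follow the Nature 2015 paper: three convolutional layers, a 512-unit fully-connected layer, and a linear output with one value per action.

```python
import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Sketch of the Nature-2015 DQN architecture (illustrative only)."""

    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 9x9x64
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 7x7x64
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # linear output layer, no softmax
        )

    def forward(self, x):
        # x: stacked frames of shape (batch, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(x))
```

Feeding `torch.zeros(1, 4, 84, 84)` into `DQNNetwork(num_actions=4)` returns a tensor of shape `(1, 4)`, i.e. one estimated action value per legal action.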
Important Techniques
Reward clipping: rewards are clipped to +1 for any positive score change, -1 for any negative one, and 0 when the score is unchanged.
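In code this clipping is just a sign function; here is a one-line sketch (assuming NumPy):

```python
import numpy as np

def clip_reward(reward):
    # +1 for any positive score change, -1 for any negative one, 0 otherwise.
    return float(np.sign(reward))
```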
Frame skip: denoted by $k$, this parameter controls how often the agent looks at the screen and decides on an action.
Since I got stuck at this point in the paper, let me elaborate a bit more.
With the frame-skip parameter, the agent selects a new action only on every $k$-th frame.
During the skipped frames, it just repeats the same action over and over again!
But how are the frames actually separated and processed in the game?
There is a good example here, quoted from Seita's research blog.
First, please take a look at the chunk of images below.
These are the raw screenshots of the Atari game.
With frame skipping, we effectively skip and ignore the three consecutive frames between decisions.
However, you might wonder whether it is okay for the agent to lose such a significant amount of information about the state of the game. Indeed, I had the same question! The answer is that it is okay, because the implementation at least takes the rewards earned during the skipped frames into account by summing them up. I recommend referring to this repository for the implementation:
https://github.com/spragunr/deep_q_rl
As for the reward handling, it is done in the method named _step in ale_experiment.py, as Daniel mentions in his article:
def _step(self, action):
    """ Repeat one action the appropriate number of times and return
    the summed reward. """
    reward = 0
    for _ in range(self.frame_skip):
        reward += self._act(action)
    return reward
This summed reward is then used in the method run_episode in the same script:
def run_episode(self, max_steps, testing):
    """Run a single training episode.

    The boolean terminal value returned indicates whether the
    episode ended because the game ended or the agent died (True)
    or because the maximum number of steps was reached (False).
    Currently this value will be ignored.

    Return: (terminal, num_steps)
    """
    self._init_episode()

    start_lives = self.ale.lives()

    action = self.agent.start_episode(self.get_observation())
    num_steps = 0
    while True:
        reward = self._step(self.min_action_set[action])
        self.terminal_lol = (self.death_ends_episode and not testing and
                             self.ale.lives() < start_lives)
        terminal = self.ale.game_over() or self.terminal_lol
        num_steps += 1

        if terminal or num_steps >= max_steps:
            self.agent.end_episode(reward, terminal)
            break

        action = self.agent.step(reward, self.get_observation())
    return terminal, num_steps
Anyhow, with frame skipping we get more efficient learning, because the agent has to make far fewer decisions (and forward passes) per game frame.
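If you want to reproduce this behaviour outside deep_q_rl, here is a minimal sketch of a frame-skip wrapper. It assumes an environment whose step() returns (observation, reward, done, info), as in the classic Gym interface, and it is my own illustration rather than code from the paper or the linked repository.

```python
class FrameSkipWrapper:
    """Repeat each chosen action `skip` times and sum the rewards
    earned in between, so no reward is lost during skipped frames."""

    def __init__(self, env, skip=4):
        self.env = env
        self.skip = skip

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward  # keep the reward from every skipped frame
            if done:
                break  # stop repeating once the episode ends
        return obs, total_reward, done, info
```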
FYI
If you don't implement frame skipping, the gameplay looks like this:
https://www.youtube.com/watch?v=N813o-Xb6S8
In contrast, if you do, it looks like this:
https://www.youtube.com/watch?v=p88R2_3yWPA
So the movement looks a bit choppy!
Implementation