[Review] UCL_RL Lecture05 Model Free Control

previous post is here Lecture04


  1. Introduction
  2. On-policy monte carlo control
  3. On-policy Temporal Difference learning
  4. Off-policy learning
  5. summary


1. Introduction

model free control can solve the issues in MDP like,

  1. MDP model is unknown, but experience can be sampled
  2. MDP model is known, but is too big to use, except by samples

2. On and Off-policy Learning

1. On-policy (on-line) Learning: updating the policy while we are following it during the process, “learn on the job”
2. Off-policy learning: “look over someone’s shoulder”, learn from sample $µ$

nice explanation

on-policyの強化学習手法は価値の改善と方策の改善の両方を行い、その両方が相互依存関係にある。つまり改善された価値に基づいて新たな方策をつくり、その方策に基づいて価値を更新していくというループ構造である。これを一般化方策反復(generalized policy iteration; GPI)と呼ぶ。

$\epsilon$-Greedy Exploration
This method allows us to choose most of time optimal policies, but with the probability of $\epsilon$ we instead select an action at random.
Non Greedy action: $\frac{\epsilon}{|A(s)|}$
Greedy Action: $(1 - \epsilon) + \frac{\epsilon}{|A(s)|}$

\pi(a|s) = \left\{
\frac{\epsilon}{m} + (1 - \epsilon) & (if\space a^* = argmax_{a \in A} \space Q(s,a)) \\
\frac{\epsilon}{m} & (otherwise)

As for greedy algorithm, we just stick to the best one, and never try finding better one.
However, regarding to $\epsilon$-Greedy Exploration, like above, with the probability $(1 - \epsilon)$, we take greedy action, and otherwise choose an action at random.
By applying this to $\epsilon$-Greedy Policy Improvement, we get this theorem.
For any $\epsilon$-Greedy policy π, the $\epsilon$-Greedy policy $\pi^`$ with respect to $q_{\pi}$ is an improvement,

v_{\pi^`}(s) \geq v_{\pi}(s)\\

on line 3, it is saying max must be equal or at least better than the weighted average of sum of greedy policy!

q_{\pi}(s, \pi^`(s)) &= \sum_{a \in A} \pi^`(a|s) q_{\pi}(s, a)\\
&= \frac{\epsilon}{m} \sum_{a \in A}q_{\pi}(s, a) + (1 - \epsilon) max_{a \in A} \space q_{\pi}(s, a)\\
&\geq \frac{\epsilon}{m} \sum_{a \in A}q_{\pi}(s, a) + (1 - \epsilon) \sum \frac{\pi(a|s) - \frac{\epsilon}{m}}{1 - \epsilon} \space q_{\pi}(s, a)\\
&= \sum_{a \in A} \pi(a|s) q_{\pi}(s, a)\\
&= v_{\pi} (s)

GLIE(Greedy in the Limit with Infinite Exploration)
1. All state-action value pairs are explored infinitely many times: $\lim_{k \to \infty} N_k(s,a) = \infty$
2. The policy converges on a greedy policy:

\lim_{k \to \infty} \pi _k(a|s) = 1 \space (a = argmax_{a \in A} \space Q_k(s,a))

Monte Carlo Control
Simply saying, update $Q_t(s,a)$(action-value functions) gradually towards $R_t$.

Let $R_{total}$ be total reward.
at t = T
$R^{total} = r_t$

At t = T-1
$R_{t-1}^{total} = r_{t-1} + γr_{t}$
update $Q(s_{t-1}, a_{t-1})$ towards $R_{t-1}^{total}$

At t = T-2
$R_{t-2}^{total} = r_{t-2} + γr_{t-1} + γ^2r_{t}$
update $Q(s_{t-2}, a_{t-2})$ towards $R_{t-2}^{total}$
over and over again till the beginning!!

Screen Shot 2018-03-24 at 13.53.57.png

Psuedo Code
Screen Shot 2018-03-24 at 8.18.03.png

3. On-Policy Temporal Difference Learning

Updating Action Value Functions with SARSA
Screen Shot 2018-03-24 at 14.09.26.png

$Q(s,a) \leftarrow Q(s,a) + \alpha \space (R + \gamma \space Q(s',a') - Q(s,a))$

Psuedo Code
Screen Shot 2018-03-24 at 14.10.23.png

Under the conditions of GLIE, this converges.

n-Step SARSA
n-step SARSA updates Q-return towards the n-step Q-return.
n = 1 => $q^1_t = R_{t+1} + \gamma Q(S_{t+1})$
n = 2 => $q^2_t = R_{t+1} + \gamma R_{t+2} + \gamma ^2 Q(S_{t+1})$
n = $\infty$

q^{\infty}_t = R_{t+1} + \gamma R_{t+2} + \cdots \gamma ^n Q(S_{t+n})\\
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \space (q^n_t \space - \space Q(S_t, A_t))

Screen Shot 2018-03-24 at 14.29.59.png


Screen Shot 2018-03-26 at 23.25.14.png

Off-Policy Learning

What makes off policy different from on-policy learning??
First of all, there's no reason that an agent has to do the greedy action; Agents can explore or they can follow options. This is not what separates on-policy from off-policy learning.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.
The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a″. It estimates the return for state-action pairs assuming the current policy continues to be followed.
The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.

Importance Sampling

PRML: Importance Sampling
Screen Shot 2018-03-26 at 23.58.04.png
Screen Shot 2018-03-26 at 23.58.18.png

Expectation Formula
$E[f(x)] = \int f(x)p(x) dx$
Importance Sampling Formula
$ ∫f(x)π(x)dx=∫f(x)\frac{π(x)}{q(x)}q(x)dx$
$q(x)$ is called proposal distribution which is approximation of original distribution.
$\frac{π(x)}{q(x)}$ is called importance weight: to modify the biases arising in the difference of $q(x)$ from original distribution($π(x)$).

Screen Shot 2018-03-26 at 23.54.42.png

*if you want study further, please check Likelihood weighted Sampling.

Importance Sampling for Off-Policy Monte-Carlo

Screen Shot 2018-03-27 at 22.38.40.png

Weighted importance sampling for off-policy learning with linear function approximation
Authors: A. Rupam Mahmood, Hado van Hasselt, Richard S. Sutton

Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).

Importance Sampling for Off-Policy TD

Screen Shot 2018-03-27 at 22.46.28.png


Screen Shot 2018-03-29 at 0.18.12.png
Screen Shot 2018-03-29 at 0.18.42.png
Screen Shot 2018-03-29 at 0.19.00.png
Screen Shot 2018-03-29 at 0.19.17.png

Relationship Between DP and TD

Screen Shot 2018-03-29 at 0.20.28.png
Screen Shot 2018-03-29 at 0.20.40 1.png