
Solving OpenAI Gym's Copy-v0 with Q-learning

Posted at 2017-08-12

Overview

This is a solution to the Copy-v0 task in OpenAI Gym1.

It continues the article below, this time using Q-learning.

Solution

  • Q-learning
Q(s,a) \leftarrow Q(s,a) + \alpha \left\{ r(s,a,s') + \gamma\max_{a'}Q(s',a') - Q(s,a) \right\} \\
r(s,a,s') = \mathbb{E}[R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']
  • $\epsilon$-greedy exploration
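
The $\epsilon$-greedy rule can be sketched as follows. This is a minimal illustration, not the article's code; the `epsilon_greedy` helper name and the toy Q-table are assumptions for the example:

import random

def epsilon_greedy(Q, x, sample_action, epsilon=0.01):
    # With probability epsilon, or when state x has no recorded values,
    # fall back to a random action; otherwise act greedily w.r.t. Q.
    if random.random() < epsilon or x not in Q or not Q[x]:
        return sample_action()
    return max(Q[x].items(), key=lambda kv: kv[1])[0]

With epsilon=0 and Q = {"s": {"left": 0.1, "right": 0.9}}, the greedy branch always picks "right" in state "s" and falls back to sample_action() in unseen states.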

Derivation of Q-learning2

\begin{align}
Q(s,a) &= r(s,a)+\gamma \sum_{s' \in S} p(s'|s,a) \max_{a' \in A(s')} Q(s',a') & \\
       &\simeq  r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') & (\because s' \sim p(s'|s,a)\text{, assuming successor states other than the sampled } s' \text{ are unlikely}) \\
       &\simeq  (1-\alpha)Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') \right\} & (\because \text{exponential smoothing}) \\
       &=  Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right\} &
\end{align}
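
A quick numeric check, with toy values assumed for illustration, that the smoothed form and the incremental form in the last two lines agree:

alpha, gamma = 0.3, 0.9
q_sa = 1.0        # current estimate Q(s,a)
r = 2.0           # observed reward r(s,a)
max_next = 3.0    # max_{a'} Q(s',a')

target = r + gamma * max_next                    # 4.7
smoothed = (1 - alpha) * q_sa + alpha * target   # (1-alpha)Q + alpha*target
incremental = q_sa + alpha * (target - q_sa)     # Q + alpha*(target - Q)
# both forms give 2.11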

Code3

import numpy as np
import gym
from gym import wrappers

def run(alpha=0.3, gamma=0.9):
    Q = {}  # Q-table: state -> {action: value}
    env = gym.make("Copy-v0")
    env = wrappers.Monitor(env, '/tmp/copy-v0-q-learning', force=True)
    Gs = []
    for episode in range(10**6):
        x = env.reset()
        X, A, R = [], [], []  # states, actions, rewards
        done = False
        while not done:
            # epsilon-greedy: explore with probability 0.01, or when x is unseen
            if np.random.random() < 0.01 or x not in Q:
                a = env.action_space.sample()
            else:
                a = max(Q[x].items(), key=lambda kv: kv[1])[0]
            X.append(x)
            A.append(a)
            if x not in Q:
                Q[x] = {}
            if a not in Q[x]:
                Q[x][a] = 0.0
            x, r, done, _ = env.step(a)
            R.append(r)
        T = len(X)
        # terminal step: no bootstrap term
        x, a, r = X[-1], A[-1], R[-1]
        Q[x][a] += alpha * (r - Q[x][a])
        # sweep backwards so later Q-values are already updated
        # when used as bootstrap targets
        for t in range(T - 2, -1, -1):
            x, nx, a, r = X[t], X[t + 1], A[t], R[t]
            Q[x][a] += alpha * (r + gamma * max(Q[nx].values()) - Q[x][a])
        G = sum(R)  # episode return
        print("Episode: %d, Return: %d" % (episode, G))
        Gs.append(G)
        # Copy-v0 is solved when the mean return over 100 episodes exceeds 25
        if np.mean(Gs[-100:]) > 25.0:
            break

if __name__ == "__main__":
    run()
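
The backward sweep at the end of run can be traced on a toy three-step episode where only the last step is rewarded; the state and action names here are hypothetical, chosen only to show the reward propagating backwards:

alpha, gamma = 0.5, 0.9
X = ['s0', 's1', 's2']
A = ['a', 'a', 'a']
R = [0.0, 0.0, 1.0]
Q = {x: {'a': 0.0} for x in X}

# terminal step first (no bootstrap), then sweep backwards
Q[X[-1]][A[-1]] += alpha * (R[-1] - Q[X[-1]][A[-1]])
for t in range(len(X) - 2, -1, -1):
    x, nx, a, r = X[t], X[t + 1], A[t], R[t]
    Q[x][a] += alpha * (r + gamma * max(Q[nx].values()) - Q[x][a])
# Q['s2']['a'] = 0.5, Q['s1']['a'] = 0.225, Q['s0']['a'] = 0.10125

Because the sweep runs from the end of the episode to the start, each update bootstraps from a successor value that has already absorbed the terminal reward, which speeds up propagation compared with a forward pass.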

Score

Episode: 30229, Return: 29

References

  1. G. Brockman et al., OpenAI Gym, 2016.

  2. J. Scholz, Markov Decision Processes and Reinforcement Learning, 2013.

  3. namakemono, OpenAI Gym Benchmark, 2017.
