
Solving OpenAI Gym's Copy-v0 with Q-learning

Posted at 2017-08-12

Overview

This is a solution to the Copy-v0 task in OpenAI Gym1.

It continues the article below, this time using Q-learning.

Solution

  • Q-learning
Q(s,a) \leftarrow Q(s,a) + \alpha \left\{ r(s,a,s') + \gamma\max_{a'}Q(s',a') - Q(s,a) \right\} \\
r(s,a,s') = \mathbb{E}[R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']
  • $\epsilon$-greedy exploration
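
The $\epsilon$-greedy rule can be sketched as follows. This is a minimal illustration, not the article's code; the `epsilon_greedy` helper name and the toy Q-table are assumptions for the example:

import random

def epsilon_greedy(Q, x, sample_action, epsilon=0.01):
    # With probability epsilon, or when state x has no recorded values,
    # fall back to a random action; otherwise act greedily w.r.t. Q.
    if random.random() < epsilon or x not in Q or not Q[x]:
        return sample_action()
    return max(Q[x].items(), key=lambda kv: kv[1])[0]

With epsilon=0 and Q = {"s": {"left": 0.1, "right": 0.9}}, the greedy branch always picks "right" in state "s" and falls back to sample_action() in unseen states.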

Derivation of Q-learning2

\begin{align}
Q(s,a) &= r(s,a)+\gamma \sum_{s' \in S} p(s'|s,a) \max_{a' \in A(s')} Q(s',a') & \\
       &\simeq  r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') & (\because s' \sim p(s'|s,a)\text{, assuming successor states other than the sampled } s' \text{ are unlikely}) \\
       &\simeq  (1-\alpha)Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') \right\} & (\because \text{exponential smoothing}) \\
       &=  Q(s,a) + \alpha \left\{ r(s,a)+\gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right\} &
\end{align}
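
A quick numeric check, with toy values assumed for illustration, that the smoothed form and the incremental form in the last two lines agree:

alpha, gamma = 0.3, 0.9
q_sa = 1.0        # current estimate Q(s,a)
r = 2.0           # observed reward r(s,a)
max_next = 3.0    # max_{a'} Q(s',a')

target = r + gamma * max_next                    # 4.7
smoothed = (1 - alpha) * q_sa + alpha * target   # (1-alpha)Q + alpha*target
incremental = q_sa + alpha * (target - q_sa)     # Q + alpha*(target - Q)
# both forms give 2.11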

Code3

import numpy as np
import gym
from gym import wrappers

def run(alpha=0.3, gamma=0.9):
    Q = {}  # Q-table: state -> {action: value}
    env = gym.make("Copy-v0")
    env = wrappers.Monitor(env, '/tmp/copy-v0-q-learning', force=True)
    Gs = []
    for episode in range(10**6):
        x = env.reset()
        X, A, R = [], [], []  # states, actions, rewards
        done = False
        while not done:
            # epsilon-greedy: explore with probability 0.01, or when x is unseen
            if np.random.random() < 0.01 or x not in Q:
                a = env.action_space.sample()
            else:
                a = max(Q[x].items(), key=lambda kv: kv[1])[0]
            X.append(x)
            A.append(a)
            if x not in Q:
                Q[x] = {}
            if a not in Q[x]:
                Q[x][a] = 0.0
            x, r, done, _ = env.step(a)
            R.append(r)
        T = len(X)
        # terminal step: no bootstrap term
        x, a, r = X[-1], A[-1], R[-1]
        Q[x][a] += alpha * (r - Q[x][a])
        # sweep backwards so later Q-values are already updated
        # when used as bootstrap targets
        for t in range(T - 2, -1, -1):
            x, nx, a, r = X[t], X[t + 1], A[t], R[t]
            Q[x][a] += alpha * (r + gamma * max(Q[nx].values()) - Q[x][a])
        G = sum(R)  # episode return
        print("Episode: %d, Return: %d" % (episode, G))
        Gs.append(G)
        # Copy-v0 is solved when the mean return over 100 episodes exceeds 25
        if np.mean(Gs[-100:]) > 25.0:
            break

if __name__ == "__main__":
    run()
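
The backward sweep at the end of run can be traced on a toy three-step episode where only the last step is rewarded; the state and action names here are hypothetical, chosen only to show the reward propagating backwards:

alpha, gamma = 0.5, 0.9
X = ['s0', 's1', 's2']
A = ['a', 'a', 'a']
R = [0.0, 0.0, 1.0]
Q = {x: {'a': 0.0} for x in X}

# terminal step first (no bootstrap), then sweep backwards
Q[X[-1]][A[-1]] += alpha * (R[-1] - Q[X[-1]][A[-1]])
for t in range(len(X) - 2, -1, -1):
    x, nx, a, r = X[t], X[t + 1], A[t], R[t]
    Q[x][a] += alpha * (r + gamma * max(Q[nx].values()) - Q[x][a])
# Q['s2']['a'] = 0.5, Q['s1']['a'] = 0.225, Q['s0']['a'] = 0.10125

Because the sweep runs from the end of the episode to the start, each update bootstraps from a successor value that has already absorbed the terminal reward, which speeds up propagation compared with a forward pass.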

Score

Episode: 30229, Return: 29

References

  1. G. Brockman et al., OpenAI Gym, 2016.

  2. J. Scholz, Markov Decision Processes and Reinforcement Learning, 2013.

  3. namakemono, OpenAI Gym Benchmark, 2017.
