
Learn by Building! Deep Reinforcement Learning_1

Last updated at 2019-12-18, posted at 2019-12-18

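# Deep Reinforcement Learning: Hands-On Programming with PyTorch

I'm Harima, a first-year master's student in a science and engineering graduate program.
I'm writing up what I study here as rough notes. Apologies if they are hard to read.
If you can explain the parts I don't understand, I would appreciate it.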

Implementation code (GitHub)
https://github.com/YutaroOgawa/Deep-Reinforcement-Learning-Book

Chap.2 Implementing reinforcement learning on a maze task

2.1 How to use Python

2.2 Implementing the maze and the agent

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Set up a 5x5 inch figure for the 3x3 maze
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()

# Draw the walls in red
plt.plot([1, 1], [0, 1], color='red', linewidth=2)
plt.plot([1, 2], [2, 2], color='red', linewidth=2)
plt.plot([2, 2], [2, 1], color='red', linewidth=2)
plt.plot([2, 3], [1, 1], color='red', linewidth=2)

# Label the states S0 to S8, plus the start and goal cells
plt.text(0.5, 2.5, 'S0', size=14, ha='center')
plt.text(1.5, 2.5, 'S1', size=14, ha='center')
plt.text(2.5, 2.5, 'S2', size=14, ha='center')
plt.text(0.5, 1.5, 'S3', size=14, ha='center')
plt.text(1.5, 1.5, 'S4', size=14, ha='center')
plt.text(2.5, 1.5, 'S5', size=14, ha='center')
plt.text(0.5, 0.5, 'S6', size=14, ha='center')
plt.text(1.5, 0.5, 'S7', size=14, ha='center')
plt.text(2.5, 0.5, 'S8', size=14, ha='center')
plt.text(0.5, 2.3, 'START', ha='center')
plt.text(2.5, 0.3, 'GOAL', ha='center')

# Set the plot range and hide the ticks and tick labels
# (booleans are used here; recent matplotlib no longer accepts 'off')
ax.set_xlim(0, 3)
ax.set_ylim(0, 3)
plt.tick_params(axis='both', which='both', bottom=False, top=False,
                labelbottom=False, right=False, left=False, labelleft=False)

# Draw the agent as a green circle at its current position, S0
line, = ax.plot([0.5], [2.5], marker="o", color='g', markersize=60)

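This draws the overall layout of the maze.

・The rule that determines how the agent acts is called the **policy**

・It is written as $\pi_\theta(s,a)$

・The probability of taking action $a$ in state $s$ follows the policy $\pi$, which is determined by the parameter $\theta$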

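# Rows are the states S0 to S7 (S8 is the goal, so it needs no policy).
# Columns are the actions up, right, down, left.
# 1 means the move is possible; np.nan means a wall blocks it.
theta_0 = np.array([[np.nan, 1, 1, np.nan],
                    [np.nan, 1, np.nan, 1],
                    [np.nan, np.nan, 1, 1],
                    [1, 1, 1, np.nan],
                    [np.nan, np.nan, 1, 1],
                    [1, np.nan, np.nan, np.nan],
                    [1, np.nan, np.nan, np.nan],
                    [1, 1, np.nan, np.nan]
                    ])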

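・Convert the parameter $\theta_0$ to obtain the policy $\pi_\theta(s,a)$


def simple_convert_into_pi_from_theta(theta):
    # Convert theta into a policy by taking a simple ratio within each row

    [m, n] = theta.shape
    pi = np.zeros((m, n))
    for i in range(0, m):
        # nansum ignores the NaN (wall) entries when summing the row
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])

    pi = np.nan_to_num(pi)  # replace NaN with 0

    return pi

pi_0 = simple_convert_into_pi_from_theta(theta_0)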

・The probability of moving toward a wall is 0

・All other directions are equally likely

pi_0
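As a quick sanity check on the two points above, the following sketch rebuilds the same table and confirms that wall directions get probability 0 and every row sums to 1. The names `theta_check`, `normalize_rows`, and `pi_check` are hypothetical and exist only for this illustration; the vectorised row division is my own shorthand for the loop used in these notes.

```python
import numpy as np

# Same layout as theta_0: rows are states S0-S7, columns are up, right, down, left.
theta_check = np.array([[np.nan, 1, 1, np.nan],
                        [np.nan, 1, np.nan, 1],
                        [np.nan, np.nan, 1, 1],
                        [1, 1, 1, np.nan],
                        [np.nan, np.nan, 1, 1],
                        [1, np.nan, np.nan, np.nan],
                        [1, np.nan, np.nan, np.nan],
                        [1, 1, np.nan, np.nan]])


def normalize_rows(theta):
    """Hypothetical helper: divide each row by its NaN-ignoring sum."""
    pi = theta / np.nansum(theta, axis=1, keepdims=True)
    return np.nan_to_num(pi)  # NaN (wall) entries become probability 0


pi_check = normalize_rows(theta_check)

print(pi_check[0])           # S0 can go right or down
print(pi_check[3])           # S3 can go up, right, or down
print(pi_check.sum(axis=1))  # row sums
```

S0 has two open directions, so each gets 0.5; S3 has three, so each gets 1/3.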

・Now that the initial policy is ready, move the agent according to the policy $\pi_{\theta_{0}}(s,a)$

・Keep moving the agent until it reaches the goal

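def get_next_s(pi, s):
    direction = ["up", "right", "down", "left"]

    # Choose a direction according to the probabilities in pi[s, :]
    next_direction = np.random.choice(direction, p=pi[s, :])

    # States are numbered row by row on the 3x3 grid,
    # so moving up or down changes the state number by 3
    if next_direction == "up":
        s_next = s - 3
    elif next_direction == "right":
        s_next = s + 1
    elif next_direction == "down":
        s_next = s + 3
    elif next_direction == "left":
        s_next = s - 1

    return s_next

def goal_maze(pi):
    s = 0  # start at S0
    state_history = [0]  # list of the states visited so far

    while True:  # loop until the goal is reached
        next_s = get_next_s(pi, s)
        state_history.append(next_s)

        if next_s == 8:  # S8 is the goal
            break
        else:
            s = next_s

    return state_history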

・Check the path the agent took to the goal and how many steps it needed in total

state_history = goal_maze(pi_0)

print(state_history)
print("Number of steps taken to solve the maze: " + str(len(state_history) - 1))
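A single run only shows one random path, so it can help to look at many runs. The sketch below repeats the random walk 1000 times and reports the shortest and mean step counts. It is self-contained and uses hypothetical names (`pi_random`, `STEP`, `run_episode`) and a fixed seed, none of which come from the original code; the probabilities are the uniform policy written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility

# Uniform random policy over the open directions (up, right, down, left), S0-S7.
pi_random = np.array([[0, 0.5, 0.5, 0],
                      [0, 0.5, 0, 0.5],
                      [0, 0, 0.5, 0.5],
                      [1/3, 1/3, 1/3, 0],
                      [0, 0, 0.5, 0.5],
                      [1, 0, 0, 0],
                      [1, 0, 0, 0],
                      [0.5, 0.5, 0, 0]])

# Change in state number for each action index: up, right, down, left.
STEP = np.array([-3, 1, 3, -1])


def run_episode(pi, rng):
    """Walk from S0 until S8 is reached and return the visited states."""
    s = 0
    history = [s]
    while s != 8:
        a = rng.choice(4, p=pi[s])
        s = int(s + STEP[a])
        history.append(s)
    return history


steps = [len(run_episode(pi_random, rng)) - 1 for _ in range(1000)]
print("shortest:", min(steps))
print("mean:", np.mean(steps))
```

The shortest possible route is S0, S3, S4, S7, S8, which is 4 steps, so no run can be shorter than that.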

・Visualize the path of state transitions

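from matplotlib import animation
from IPython.display import HTML


def init():
    # Clear the agent marker for the background frame
    line.set_data([], [])
    return (line,)


def animate(i):
    # Draw the agent at the position of the i-th visited state
    state = state_history[i]
    x = (state % 3) + 0.5  # column of the state -> x coordinate
    y = 2.5 - (state // 3)  # row of the state -> y coordinate
    line.set_data([x], [y])  # set_data expects sequences, not scalars
    return (line,)


anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(
    state_history), interval=200, repeat=False)

HTML(anim.to_jshtml())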
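The mapping from a state number to plot coordinates inside `animate` is easy to check on its own. The helper name `state_to_xy` below is hypothetical; it simply isolates the two lines that compute `x` and `y`.

```python
def state_to_xy(state):
    """Hypothetical helper: centre of cell `state` on the 3x3 plot."""
    x = (state % 3) + 0.5   # column index -> x, shifted to the cell centre
    y = 2.5 - (state // 3)  # row index -> y, flipped because S0 is at the top
    return x, y


for s in (0, 4, 8):
    print(s, state_to_xy(s))
```

S0 maps to (0.5, 2.5) in the top-left cell, S4 to (1.5, 1.5) in the centre, and S8 to (2.5, 0.5) in the bottom-right cell, matching the labels drawn on the maze.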

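2.3 Implementing policy iteration

・Consider how to train the policy so that the agent heads straight for the goal

#### (1) Policy iteration
A strategy that gives more weight to the actions taken in successful cases

#### (2) Value iteration
A strategy that also assigns a value (priority) to positions (states) other than the goal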

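def softmax_convert_into_pi_from_theta(theta):
    # Convert theta into a policy with the softmax function

    beta = 1.0  # inverse temperature
    [m, n] = theta.shape
    pi = np.zeros((m, n))

    exp_theta = np.exp(beta * theta)  # NaN (wall) entries stay NaN

    for i in range(0, m):
        # Normalize each row, ignoring the NaN entries in the sum
        pi[i, :] = exp_theta[i, :] / np.nansum(exp_theta[i, :])

    pi = np.nan_to_num(pi)  # replace NaN with 0

    return pi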

・Compute the policy $\pi_{\theta_0}$

pi_0 = softmax_convert_into_pi_from_theta(theta_0)
print(pi_0)
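Because every open entry of `theta_0` is 1, softmax gives the same uniform probabilities here as the simple ratio did. The difference only shows once the parameters differ. The sketch below applies the same formula to one hypothetical row for several values of `beta`; the helper `softmax_row` and the example values are assumptions made for illustration.

```python
import numpy as np


def softmax_row(theta_row, beta=1.0):
    """Softmax over the non-NaN entries of one row; NaN entries get probability 0."""
    exp_theta = np.exp(beta * theta_row)
    return np.nan_to_num(exp_theta / np.nansum(exp_theta))


# Hypothetical parameters for one state: only right (1.0) and down (2.0) are open.
row = np.array([np.nan, 1.0, 2.0, np.nan])

for beta in (0.1, 1.0, 10.0):
    print(beta, softmax_row(row, beta))
```

A small `beta` keeps the two open actions close to equally likely, while a large `beta` pushes almost all of the probability onto the action with the larger parameter.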

・Modify the "get_next_s" function from 2.2

・Return the chosen action as well as the next state

def get_action_and_next_s(pi, s):
    direction = ["up", "right", "down", "left"]
    # Choose a direction according to the probabilities in pi[s, :]
    next_direction = np.random.choice(direction, p=pi[s, :])

    # The action index matches the column order of theta and pi
    if next_direction == "up":
        action = 0
        s_next = s - 3
    elif next_direction == "right":
        action = 1
        s_next = s + 1
    elif next_direction == "down":
        action = 2
        s_next = s + 3
    elif next_direction == "left":
        action = 3
        s_next = s - 1

    return [action, s_next]

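・Also modify the "goal_maze" function, which moves the agent until it reaches the goal

def goal_maze_ret_s_a(pi):
    s = 0  # start at S0
    s_a_history = [[0, np.nan]]  # list of [state, action] pairs

    while True:  # loop until the goal is reached
        [action, next_s] = get_action_and_next_s(pi, s)
        s_a_history[-1][1] = action  # record the action taken in the current state

        # The action for the next state is not known yet, so store NaN for now
        s_a_history.append([next_s, np.nan])

        if next_s == 8:  # S8 is the goal
            break
        else:
            s = next_s

    return s_a_history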

s_a_history = goal_maze_ret_s_a(pi_0)
print(s_a_history)
print("Number of steps taken to solve the maze: " + str(len(s_a_history) - 1))

(The output is long, so it is omitted.)

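Updating the policy with the policy gradient method

・The policy gradient method updates the parameter $\theta$ according to the equations below, where $N(s_i,a_j)$ is the number of times action $a_j$ was taken in state $s_i$, $N(s_i,a)$ is the total number of actions taken in state $s_i$, $P(s_i,a_j)$ is the probability of taking $a_j$ in $s_i$ under the current policy, $\eta$ is the learning rate, and $T$ is the total number of steps taken to reach the goal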


$$
\begin{aligned}
\theta_{s_i,a_j} &= \theta_{s_i,a_j} + \eta \cdot \Delta\theta_{s_i,a_j} \\
\Delta\theta_{s_i,a_j} &= \{ N(s_i,a_j) - P(s_i,a_j) N(s_i,a) \} / T
\end{aligned}
$$

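def update_theta(theta, pi, s_a_history):
    eta = 0.1  # learning rate
    T = len(s_a_history) - 1  # total number of steps taken to reach the goal

    [m, n] = theta.shape
    delta_theta = theta.copy()  # NaN (wall) entries stay NaN

    for i in range(0, m):
        for j in range(0, n):
            if not(np.isnan(theta[i, j])):
                # History entries where the agent was in state i
                SA_i = [SA for SA in s_a_history if SA[0] == i]
                # History entries where action j was taken in state i
                SA_ij = [SA for SA in s_a_history if SA == [i, j]]

                N_i = len(SA_i)  # total number of actions taken in state i
                N_ij = len(SA_ij)  # number of times action j was taken in state i

                delta_theta[i, j] = (N_ij - pi[i, j] * N_i) / T

    new_theta = theta + eta * delta_theta

    return new_theta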

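This is the part I really don't understand!
new_theta = theta + eta * delta_theta
Why is this an addition!?

Shouldn't episodes that take many steps (and so are unlikely to be the shortest path) be subtracted instead?

Please explain this to me if you can.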
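One possible reading, offered as a sketch rather than a definitive answer: the sign is already carried by `delta_theta`, so the update can always be an addition. For a softmax policy with `beta = 1`, the quantity `N_ij - pi[i, j] * N_i` is the derivative of the log-probability of the actions taken in the episode with respect to `theta[i, j]`. It is positive for actions taken more often than the policy expected and negative for actions taken less often. Dividing by `T` then makes updates from long episodes small and updates from short episodes large. The helper `delta_for_state` and both episodes below are hypothetical.

```python
import numpy as np


def delta_for_state(actions_taken, pi_row, T):
    """(N_ij - pi_ij * N_i) / T for one state, given the actions taken there."""
    N_i = len(actions_taken)
    N_ij = np.array([actions_taken.count(j) for j in range(len(pi_row))])
    return (N_ij - pi_row * N_i) / T


pi_s0 = np.array([0.0, 0.5, 0.5, 0.0])  # S0: right (1) and down (2) equally likely

# Hypothetical short episode: 4 steps in total, S0 visited once, action "down".
short = delta_for_state([2], pi_s0, T=4)

# Hypothetical long episode: 20 steps in total, S0 visited 3 times
# with the actions right, right, down.
long_ = delta_for_state([1, 1, 2], pi_s0, T=20)

print("short episode:", short)
print("long episode :", long_)
```

In the short episode "down" gets +0.125 and "right" gets -0.125. In the long episode "right" is still increased, but only by 0.025, and "down" is decreased by 0.025. Long, wandering episodes are therefore not subtracted; they just count for less, and over many episodes the actions on short paths accumulate the larger increases.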

・Update the parameter $\theta$ and observe how the policy $\pi_{\theta}$ changes

new_theta = update_theta(theta_0, pi_0, s_a_history)
pi = softmax_convert_into_pi_from_theta(new_theta)
print(pi)

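・Repeat the maze search and the update of the parameter $\theta$ until the agent can clear the maze in a straight run

・Stop when the sum of the absolute changes in the policy $\pi$ falls below $10^{-4}$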

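stop_epsilon = 10**-4  # stop once the policy changes by less than this

theta = theta_0
pi = pi_0

is_continue = True
while is_continue:
    s_a_history = goal_maze_ret_s_a(pi)  # explore the maze with the current policy
    new_theta = update_theta(theta, pi, s_a_history)  # update theta
    new_pi = softmax_convert_into_pi_from_theta(new_theta)  # update the policy

    # Sum of the absolute changes over all entries of the policy
    if np.sum(np.abs(new_pi - pi)) < stop_epsilon:
        is_continue = False
    else:
        theta = new_theta
        pi = new_pi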

The original code also had "print" calls inside the loop, but I cut them to keep this short.

np.set_printoptions(precision=3, suppress=True)
print(pi)
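To see how the step count changes while the policy is being trained, the loop can record the length of each episode. The following is a self-contained sketch under my own assumptions, not the book's code: it uses a fixed seed, a hypothetical cap of 500 episodes so that it always finishes quickly, and count tables in place of the list comprehensions in `update_theta`. The names `softmax_policy`, `run_episode`, `update`, and `steps_log` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility

# Rows: states S0-S7. Columns: up, right, down, left. NaN marks a wall.
theta = np.array([[np.nan, 1, 1, np.nan],
                  [np.nan, 1, np.nan, 1],
                  [np.nan, np.nan, 1, 1],
                  [1, 1, 1, np.nan],
                  [np.nan, np.nan, 1, 1],
                  [1, np.nan, np.nan, np.nan],
                  [1, np.nan, np.nan, np.nan],
                  [1, 1, np.nan, np.nan]])
STEP = np.array([-3, 1, 3, -1])  # change in state number per action


def softmax_policy(theta, beta=1.0):
    """Row-wise softmax that gives probability 0 to NaN (wall) entries."""
    exp_theta = np.exp(beta * theta)
    return np.nan_to_num(exp_theta / np.nansum(exp_theta, axis=1, keepdims=True))


def run_episode(pi):
    """Return the (state, action) pairs taken on the way from S0 to S8."""
    s, history = 0, []
    while s != 8:
        a = int(rng.choice(4, p=pi[s]))
        history.append((s, a))
        s = int(s + STEP[a])
    return history


def update(theta, pi, history, eta=0.1):
    """theta + eta * (N_ij - pi_ij * N_i) / T, computed with count tables."""
    T = len(history)                        # number of steps in the episode
    N_ij = np.zeros_like(pi)
    for s, a in history:
        N_ij[s, a] += 1                     # times action a was taken in state s
    N_i = N_ij.sum(axis=1, keepdims=True)   # actions taken in each state
    return theta + eta * (N_ij - pi * N_i) / T  # wall entries stay NaN


pi = softmax_policy(theta)
steps_log = []
max_episodes = 500  # hypothetical cap, not part of the original loop

for episode in range(max_episodes):
    history = run_episode(pi)
    steps_log.append(len(history))
    theta = update(theta, pi, history)
    new_pi = softmax_policy(theta)
    change = np.sum(np.abs(new_pi - pi))
    pi = new_pi
    if change < 1e-4:  # same stopping rule as in the loop above
        break

print("episodes run:", len(steps_log))
print("mean steps, first 50 episodes:", np.mean(steps_log[:50]))
print("mean steps, last 50 episodes:", np.mean(steps_log[-50:]))
```

With only 500 episodes the policy has usually not converged yet, but the mean step count in the later episodes is typically lower than in the early ones. Raising `max_episodes` lets the run continue until the stopping rule is met.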

・Visualize the result

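from matplotlib import animation
from IPython.display import HTML


def init():
    # Clear the agent marker for the background frame
    line.set_data([], [])
    return (line,)


def animate(i):
    # Draw the agent at the position of the i-th visited state
    state = s_a_history[i][0]
    x = (state % 3) + 0.5  # column of the state -> x coordinate
    y = 2.5 - (state // 3)  # row of the state -> y coordinate
    line.set_data([x], [y])  # set_data expects sequences, not scalars
    return (line,)

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(
    s_a_history), interval=200, repeat=False)

HTML(anim.to_jshtml())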

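・With the softmax function, a policy can still be derived even when the parameter $\theta$ takes negative values

・Using the policy gradient theorem, the update rule for the parameter $\theta$ can be obtained with the policy gradient method

・REINFORCE is an algorithm that implements the policy gradient theorem approximately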
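The first point above can be checked with a small example. The parameter values below are hypothetical; they are chosen so that the simple ratio used in 2.2 produces invalid probabilities while softmax still works.

```python
import numpy as np

theta_row = np.array([2.0, -1.0])  # hypothetical parameters; one is negative

# Simple ratio: divide each entry by the row sum (here the sum is 1.0).
ratio = theta_row / np.sum(theta_row)

# Softmax: exponentiate first, so every term is positive before normalising.
exp_theta = np.exp(theta_row)
softmax = exp_theta / np.sum(exp_theta)

print("ratio  :", ratio)
print("softmax:", softmax)
```

The ratio gives 2 and -1, which are not valid probabilities, whereas the softmax values are both positive and sum to 1.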

This time I ran into a point that I could not understand.
I would be grateful if someone could explain it to me.
