More than 1 year has passed since last update.

強化学習の勉強: 迷路探索問題（方策反復法）

Last updated at 2022-05-06Posted at 2022-05-03

強化学習の勉強: 迷路探索問題（方策反復法）

近年、注目されている深層強化学習をロボットにも応用したいと考えており、そのためにまずは強化学習について理解を深め、深層強化学習を学ぶうえで必要なことを身に着けることが目的。

強化学習

手法の概要

「認知」した状況の下で最適な「行動」を「判断」すること、すなわち「制御」を学習する方法

判断：
- 所与の状況（例えば、前方の車に接近したとき）において、最適な行動（ブレーキを踏んで減速する）が何かを判断すること
- 行動が最適かどうかを知るには、その状況における行動の「価値」を知る必要がある

特徴

価値はあらかじめ定量的に与えられているわけではない
- 試行錯誤を通して獲得するしかない
ある行動（ブレーキを踏むなど）に対して、直接的に正解を定義することが難しい場合でも、行動のもたらす結果に応じて「報酬」を返すことで間接的に評価することはできる

方策とは

定義：
エージェントがどのように行動するかを定めたルール
$\pi_{\theta}(s, a)$ ... 状態 $s$ のときに行動 $a$ を採用する確率はパラメータ $\theta$ で決まる方策に従う

状態 $s$（例）：

迷路
- エージェントが迷路内のどの位置にいるのか
ロボット
- 現在の姿勢を再現するための関節の角度や速度
囲碁・将棋
- 盤面のコマの位置や種類

行動 $a$（例）：

迷路
- 上，右，下，左へ移動する４種類の行動
ロボット
- 各関節のモータをそれぞれどの程度回転させるか
囲碁・将棋
- 次の手でどの位置にどのコマを置くのか

方策 $\pi$（例）：

関数
- もっとも簡単なものとしては表形式表現がある
ニューラルネットワーク
- 深層強化学習ではこちらの表現が使用される

パラメータ $\theta$（例）：

方策が関数の場合
- 関数内のパラメータ
ニューラルネットワークの場合
- 素子間の結合パラメータ

強化学習アルゴリズム：方策反復法

⭐うまくいったケースの行動を重要視する作戦

パラメータ $\theta$ から方策 $\pi_{\theta}(s, a)$ を求める

従来

$$
P(\theta_{i})=\frac{{\theta_i}}{\sum^{N_a}_{j=1}{\theta_j}}
$$

ソフトマックス

$$
P(\theta_{i})=\frac{e^{\beta \theta_i}}{\sum^{N_a}_{j=1}e^{\beta \theta_j}}
$$

ソフトマックスを使うことにより，$\theta$ が負の値になっても方策を導出できる
ステップごとに状態 $a$ と採用した行動を記録

方策勾配法に従い方策を更新

\begin{cases}

\theta_{s_i, a_j} = \theta_{s_i, a_j} + \eta \Delta\theta_{s, a_j} \\
\Delta\theta_{s, a_j} = \frac{N(s_i, a_j) - P(s_i, a_j)N(s_i, a)}{T}

\end{cases}

パラメータの説明
$N(s_i, a_j)$：
　　状態 $s_i$ で行動 $a_j$ を採用した回数
$\eta$：
　　学習率（１回の学習で更新される大きさ）
　　大きすぎても小さすぎてもうまくいかない
$T$：
　　ゴールまでにかかった総ステップ数
必要なパラメータ
1. 現在の $\theta$
2. 方策 $\pi$
3. $(s, a)$ の履歴
4. $N(s_i, a)$ と $N(s_i, a_j)$
　これらは$(s, a)$ の履歴からとる
5. 総ステップ $T$
　$(s, a)$ の大きさそのもの

終了条件は適当（$\Delta \theta_{a, s_j} \ge$ 閾値など）

実装

迷路の作成

maze_rl_map.py

"""
迷路探索問題で強化学習を学ぶ
"""
import matplotlib.pyplot as plt


class MAZE():
    def __init__(self) -> None:
        ### 迷路作成 ###############
        self.fig = plt.figure(figsize=(5, 5))    # 5x5のグリッド図を作成（1区画を1マスとする）
        self.ax = plt.gca()                      # get current axis 今はplt.subplot(111)と同じである。つまりは、左上のマスの操作ができる。エージェントの初期位置を描画するために用意

        # 赤い壁を描く（赤い壁は通ることができないという定義）：直線描画で表現
        plt.plot([1,1], [0,1], color='red', linewidth=2)        # plt.plot(x, y, color, linewidth)   xデータ（x座標）, yデータ（y座標）, 線色, 線幅
        plt.plot([1,2], [2,2], color='red', linewidth=2)
        plt.plot([2,2], [2,1], color='red', linewidth=2)
        plt.plot([2,3], [1,1], color='red', linewidth=2)

        # 状態を表す文字S0~S8を描く
        plt.text(x=0.5, y=2.5, s='S0', size=14, ha='center')
        plt.text(x=1.5, y=2.5, s='S1', size=14, ha='center')
        plt.text(x=2.5, y=2.5, s='S2', size=14, ha='center')
        plt.text(x=0.5, y=1.5, s='S3', size=14, ha='center')
        plt.text(x=1.5, y=1.5, s='S4', size=14, ha='center')
        plt.text(x=2.5, y=1.5, s='S5', size=14, ha='center')
        plt.text(x=0.5, y=0.5, s='S6', size=14, ha='center')
        plt.text(x=1.5, y=0.5, s='S7', size=14, ha='center')
        plt.text(x=2.5, y=0.5, s='S8', size=14, ha='center')
        plt.text(x=0.5, y=2.3, s='START', size=10, ha='center')
        plt.text(x=2.5, y=0.3, s='GOAL', size=10, ha='center')

        # 描画範囲の設定とメモリを消す設定
        self.ax.set_xlim(0, 3)
        self.ax.set_ylim(0, 3)
        plt.tick_params(axis='both', which='both', bottom=False, top=False, labelbottom=False, right=False, left=False, labelleft=False)
    def set_start(self):
        # 現在地S0に緑丸を描画する
        line, = self.ax.plot([0.5], [2.5], marker='o', color='lightgreen', markersize=60)    # のちに更新するためにaxで戻り値としてlineを受け取っている。lineにアクセスして座標変更が可能（代入文）←コンマが必要
        return line                                                                                # 代入文：https://docs.python.org/ja/3/reference/simple_stmts.html#assignment-statements

    def show(self):
        plt.show()

if __name__ == '__main__':
    maze = MAZE()
    line = maze.set_start()
    maze.show()

エージェントの作成

"""
迷路探索問題で強化学習を学ぶ
"""
from maze_rl_map import MAZE      # 作成した迷路をモジュール化しているためインポート
import numpy as np

# 迷路の作成
maze = MAZE()
line = maze.set_start()     # 動き回るエージェントの座標を変更できる変数を取得
# line.set_data([1, 2])
# maze.show()


### エージェントの実装 ####################
class Agent():
    def __init__(self) -> None:
        # 進めるルールを定義
        # 行：状態S0~S7（S8はゴールであるから方策不要）、列：選択（↑, →, ↓, ←）
        self.theta_0 = np.array([
                            [np.nan,    1,      1,      np.nan],    # S0: ↑ 不可    → 可    ↓ 可    ← 不可
                            [np.nan,    1,      np.nan, 1],         # S1: ↑ 不可    → 可    ↓ 不可  ← 可
                            [np.nan,    np.nan, 1,      1],         # S2: ↑ 不可    → 不可  ↓ 可    ← 可
                            [1 ,        1,      1,      np.nan],    # S3: ↑ 可      → 可    ↓ 可    ← 不可
                            [np.nan,    np.nan, 1,      1],         # S4: ↑ 不可    → 不可  ↓ 可    ← 可
                            [1,         np.nan, np.nan, np.nan],    # S5: ↑ 可      → 不可  ↓ 不可  ← 不可
                            [1,         np.nan, np.nan, np.nan],    # S6: ↑ 可      → 不可  ↓ 不可  ← 不可
                            [1,         1,      np.nan, np.nan],    # S7: ↑ 可      → 可    ↓ 不可  ← 不可
                            ])

    # 方策パラメータ（ルール）から行動方策piを導く
    def simple_convert_into_pi_from_theta(self, theta):
        """単純に割合（その行動をとる確率）を計算する"""

        [m, n] = theta.shape    # thetaの行列サイズを取得
        pi = np.zeros((m, n))

        for i in range(m):
            pi[i, :] = theta[i, :] / np.nansum(theta[i, :]) # 割合計算（各箇所をその要素合計で割る）
        
        pi = np.nan_to_num(pi)

        return pi

    # 1 step移動後の状態sを求める
    def get_next_s(self, pi, s):
        direction = ['up', 'right', 'down', 'left']

        next_direction = np.random.choice(direction, p=pi[s, :])    # pi[s,:]の確率に従ってdirectionが選択される

        if next_direction == 'up':
            s_next = s - 3
        elif next_direction == 'right':
            s_next = s + 1
        elif next_direction == 'down':
            s_next = s + 3
        elif next_direction == 'left':
            s_next = s - 1
        
        return s_next


    # 迷路内をエージェントがゴールするまで移動させる
    def goal_maze(self, pi):
        s = 0       # スタート状態S0
        state_history = [0] # エージェントの移動した道を記録

        while True:
            next_s = self.get_next_s(pi, s)
            state_history.append(next_s)

            if next_s == 8:
                break
            else:
                s = next_s
        
        return state_history

if __name__ == '__main__':
    agent = Agent()
    pi_0 = agent.simple_convert_into_pi_from_theta(theta=agent.theta_0)     # 初期の方策
    state_history = agent.goal_maze(pi_0)                                   # ゴールするまで1つの方策でランダム動き回る
    print(state_history)
    print(pi_0)

出力

エージェントの移動した道の記録

[0, 3, 6, 3, 6, 3, 6, 3, 0, 3, 0, 3, 4, 7, 4, 3, 4, 3, 0, 3, 6, 3, 4, 7, 8]

ランダムであるため、実行するたびに結果は異なる。
ただ、ランダムといえど、以下の方策に従っている。
（更新されないため、ランダム性が変わらない）

方策

[[0.         0.5        0.5        0.        ]
 [0.         0.5        0.         0.5       ]
 [0.         0.         0.5        0.5       ]
 [0.33333333 0.33333333 0.33333333 0.        ]
 [0.         0.         0.5        0.5       ]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.5        0.5        0.         0.        ]]

GIFによる動作の記録

maze_rl_gif.py

"""
迷路探索問題で強化学習を学ぶ
"""
from maze_rl_agent_random import Agent      # 作成したエージェントをモジュール化しているためインポート
from maze_rl_map import MAZE         # 作成した迷路をモジュール化しているためインポート
from matplotlib import animation as ani
from os.path import join, dirname, abspath

### 動いている様子を可視化 #############
class GIF():
    def __init__(self, fig, line, state_history) -> None:
        self.fig = fig
        self.state_history = state_history
        self.line = line

    def init_func(self):
        """背景画像の初期化"""
        line = self.line.set_data([], [])
        return (line,)

    def animate(self, i):
        """フレームごとの描画内容"""
        state = self.state_history[i]
        x = (state%3) + 0.5         # 状態のx座標は、3で割ったあまり + 0.5
        y = 2.5 - int(state/3)         # 状態のy座標は、2.5 - 3で割った商

        line = self.line.set_data(x, y)
        return (line,)

    def create(self, file_name="maze_random.gif"):
        anim = ani.FuncAnimation(self.fig,  self.animate, init_func=self.init_func, frames=len(self.state_history), interval=200, repeat=False)

        save_path = dirname(abspath(__file__))
        anim.save(f"{save_path}/{file_name}")

if __name__ == '__main__':
    # 迷路の作成
    maze = MAZE()
    line = maze.set_start()     # 動き回るエージェントの座標を変更できる変数を取得

    # エージェント
    agent = Agent()
    pi_0 = agent.simple_convert_into_pi_from_theta(theta=agent.theta_0)     # 初期の方策
    state_history = agent.goal_maze(pi_0)                                   # ゴールするまで1つの方策でランダム動き回る

    # 記録
    gif = GIF(maze.fig, line, state_history)
    gif.create(file_name="maze_random.gif")

学習を考慮したエージェントの作成

方策の更新のためのメソッドを追加。その他、部分的に変更を加えている。説明はコメントに示している。

maze_rl_agent_softmax.py

"""
迷路探索問題で強化学習を学ぶ
"""
from maze_rl_map import MAZE      # 作成した迷路をモジュール化しているためインポート
import numpy as np

# 迷路の作成
maze = MAZE()
line = maze.set_start()     # 動き回るエージェントの座標を変更できる変数を取得


### エージェントの実装 ####################
class Agent():
    def __init__(self) -> None:
        # 進めるルールを定義
        # 行：状態S0~S7（S8はゴールであるから方策不要）、列：選択（↑, →, ↓, ←）
        self.theta_0 = np.array([
                            [np.nan,    1,      1,      np.nan],    # S0: ↑ 不可    → 可    ↓ 可    ← 不可
                            [np.nan,    1,      np.nan, 1],         # S1: ↑ 不可    → 可    ↓ 不可  ← 可
                            [np.nan,    np.nan, 1,      1],         # S2: ↑ 不可    → 不可  ↓ 可    ← 可
                            [1 ,        1,      1,      np.nan],    # S3: ↑ 可      → 可    ↓ 可    ← 不可
                            [np.nan,    np.nan, 1,      1],         # S4: ↑ 不可    → 不可  ↓ 可    ← 可
                            [1,         np.nan, np.nan, np.nan],    # S5: ↑ 可      → 不可  ↓ 不可  ← 不可
                            [1,         np.nan, np.nan, np.nan],    # S6: ↑ 可      → 不可  ↓ 不可  ← 不可
                            [1,         1,      np.nan, np.nan],    # S7: ↑ 可      → 可    ↓ 不可  ← 不可
                            ])

    # 方策パラメータ（ルール）から行動方策piをソフトマックス関数で導く
    def softmax_convert_into_pi_from_theta(self, theta):
        """単純に割合（その行動をとる確率）を計算する"""

        beta = 1.0
        [m, n] = theta.shape    # thetaの行列サイズを取得
        pi = np.zeros((m, n))

        exp_theta = np.exp(theta * beta)

        for i in range(m):

            pi[i, :] = exp_theta[i, :] / np.nansum(exp_theta[i, :])
        pi = np.nan_to_num(pi)

        return pi

    # 1 step移動後の状態sを求める
    def get_next_s_and_action(self, pi, s):
        direction = ['up', 'right', 'down', 'left']

        next_direction = np.random.choice(direction, p=pi[s, :])    # pi[s,:]の確率に従ってdirectionが選択される

        if next_direction == 'up':
            action = 0
            s_next = s - 3
        elif next_direction == 'right':
            action = 1
            s_next = s + 1
        elif next_direction == 'down':
            action = 2
            s_next = s + 3
        elif next_direction == 'left':
            action = 3
            s_next = s - 1
        
        return s_next, action


    # 迷路内をエージェントがゴールするまで移動させる
    def goal_maze(self, pi):
        s = 0       # スタート状態S0
        s_a_history = [[0, np.nan]] # エージェントの移動した道を記録

        while True:
            next_s, action = self.get_next_s_and_action(pi, s)
            s_a_history[-1][1] = action
            s_a_history.append([next_s, np.nan])

            if next_s == 8:
                break
            else:
                s = next_s
        
        return s_a_history
    
    # 方策パラメータの更新
    def update_theta(self, theta, pi, s_a_history):
        eta = 0.1   # 学習率
        T = len(s_a_history) - 1    # ゴールまでの総ステップ数

        [m, n] = theta.shape
        delta_theta = theta.copy()

        # delta_thetaを要素ごとに求める
        for i in range(m):
            for j in range(n):
                if not(np.isnan(theta[i, j])):    # thetaがnanでないとき

                    SA_i = [SA for SA in s_a_history if SA[0] == i]     # 状態がiのものだけを抽出

                    SA_ij = [SA for SA in s_a_history if SA == [i, j]]  # 状態がiで行動jをしたものだけを抽出

                    N_i = len(SA_i)
                    N_ij = len(SA_ij)
                    
                    delta_theta[i, j] = (N_ij - pi[i, j] * N_i) / T
        
        new_theta = theta + eta * delta_theta
    
        return new_theta


if __name__ == '__main__':
    agent = Agent()
    pi_0 = agent.softmax_convert_into_pi_from_theta(theta=agent.theta_0)     # 初期の方策
    s_a_history = agent.goal_maze(pi_0)                                   # ゴールするまで1つの方策でランダム動き回る
    new_theta = agent.update_theta(agent.theta_0, pi_0, s_a_history)
    pi = agent.softmax_convert_into_pi_from_theta(new_theta)
    print(pi)

出力

1だけ更新された方策

[[0.         0.50073529 0.49926471 0.        ]
 [0.         0.49852942 0.         0.50147058]
 [0.         0.         0.49926471 0.50073529]
 [0.3343142  0.33333237 0.33235342 0.        ]
 [0.         0.         0.5        0.5       ]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.49926471 0.50073529 0.         0.        ]]

はじめの方策とは異なっていることがわかる。

学習とその結果の記録

mazw_rl_learning_test.py

"""
迷路探索問題で強化学習を学ぶ
"""
from maze_rl_map import MAZE      # 作成した迷路をモジュール化しているためインポート
from maze_rl_agent_softtmax import Agent      # 作成した迷路をモジュール化しているためインポート
import numpy as np


### 方策勾配法で迷路を解く ###########
stop_epsilon = 10**(-4)     # 10^-4よりも方策に変化が少なくなったら学習終了

agent = Agent()
theta_0 = agent.theta_0
pi_0 = agent.softmax_convert_into_pi_from_theta(theta=agent.theta_0)


# 初期値で初期化
theta = theta_0
pi = pi_0


is_continue = True      # ループさせるフラグ
count = 1               # 学習回数（エピソード）カウント
while is_continue:
    s_a_history = agent.goal_maze(pi)

    # 更新
    new_theta = agent.update_theta(theta, pi, s_a_history)
    new_pi = agent.softmax_convert_into_pi_from_theta(new_theta)

    print(f"方策の変化：{np.sum(np.abs(new_pi - pi))}")
    print(f"迷路を解くのにかかったステップ数：{len(s_a_history) - 1}")

    if np.sum(np.abs(new_pi - pi)) < stop_epsilon:
        is_continue = False
    else:
        theta = new_theta
        pi = new_pi
        count += 1

print(pi)

from maze_rl_gif import GIF
maze = MAZE()
state_history = [s_a[0] for s_a in s_a_history]
# print(state_history)
print(f"学習回数（エピソード）：{count}")
gif = GIF(maze.fig, maze.set_start(), state_history)

gif.create("maze_learning.gif")

出力

更新された最終の方策

[[0.         0.00978349 0.99021651 0.        ]
 [0.         0.27440914 0.         0.72559086]
 [0.         0.         0.45538819 0.54461181]
 [0.01013356 0.97761958 0.01224687 0.        ]
 [0.         0.         0.98097083 0.01902917]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.01357813 0.98642187 0.         0.        ]]

かなり方策が更新されている。
結果的には、4ステップでゴールできる。

結果

ここで結果を示し、考察していく。
まず、上記のgif画像から、無駄なく、迷いなくゴールへ向かっていることがわかる。
実際にこの移動を決めている方策について見ていくこととする。

方策

[[0.         0.00978349 0.99021651 0.        ]
 [0.         0.27440914 0.         0.72559086]
 [0.         0.         0.45538819 0.54461181]
 [0.01013356 0.97761958 0.01224687 0.        ]
 [0.         0.         0.98097083 0.01902917]
 [1.         0.         0.         0.        ]
 [1.         0.         0.         0.        ]
 [0.01357813 0.98642187 0.         0.        ]]

上の行から準備S0, S1, S2, ..., S7の行動「上(↑), 右(→), 下(↓), 左(←) 」を表している。

考察

確率90%以上を含む部分に着目すると、わかりやすい。
まず、S0。S0においては、下（↓）が99.0%であり、次の遷移はS3となる。S3においては、右（→）が97.7%であり、次の遷移はS4となる。S4においては、下（↓）が98.0%であり、次の遷移はS7となる。S7においては、右（→）が98.6%であり、ゴールする。

90%以上の確率を含む状態は

S0 → S3 → S4 → S7

これら以外の部分が90%以上を含まない。つまり、初期値に近いというのは、途中からその状態になることが少なく、更新される機会が少なくなることによるものと考えられる。

感想

いま、ロボットに強化学習を用いて面白いことができないかなという期待と興味だけで、強化学習を独学している。勉強しているが、なかなか理解している気になれなかったため、久しぶりにQiitaに投稿してみた。やはり投稿となると、整理していくため理解が深まった気がする。本当にしたいのは、深層強化学習だが、それ以前に強化学習の考えを触れることが必要と今回は迷路探索をまとめてみた。ただ、強化学習のプログラムを通して、Pythonの知らない文法事項にも触れることができた。

次は、強化学習ではおなじみの価値反復法（報酬の考えをもつもの）のアルゴリズムであるQ学習について学び、そのままDQNなどに触れられたら良いなと思っている。

早く強化学習を使いこなせるようになりたい。

参考文献

「作りながら学ぶ深層強化学習 PyTorchによる実践プログラミング」
小川雄太郎　著　　マイナビ出版

「現場で使える！Python 深層強化学習入門　強化学習と深層強化学習による探索と制御」
伊藤多一、今津善充、須藤広大、仁ノ平将人、川崎悠介、酒井　裕企、魏崇哲　著　　翔泳社出版

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up