Mathematics Used in Reinforcement Learning and How to Code It

Posted at 2022-04-09

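Introduction¹

　This article assumes high-school mathematics and picks out the points where people most often stumble when doing reinforcement learning. If you get stuck on something not covered here, reviewing high-school mathematics may turn up the answer.
　The environment is Google Colab with TensorFlow 2.x.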

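Notation	Meaning
P(A|B)	Probability of A given B
x ~ P(x)	The random variable x follows the distribution P
[a, b]	Real interval that includes both a and b
(a, b]	Real interval that excludes a and includes b
A := B	A is defined as B
clip(A, min, max)	A if min ≤ A ≤ max; min if A < min; max if max < A
θ_{t+1} ← θ_t + α∇J(θ_t)	Update θ_{t+1} to θ_t + α∇J(θ_t)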

Commonly Used Mathematics

① Probability distributions

a_t \sim \pi(\cdot | x_t)

Meaning: the random variable a_t follows the distribution \pi(a_t | x_t).

The dot (\cdot) is shorthand for a_t.
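As a rough illustration of what "a_t is drawn from π(· | x_t)" means in code, here is a minimal sketch that turns policy scores into probabilities and samples one action. It uses plain Python rather than TensorFlow to stay dependency-free. The names `softmax`, `sample_action`, and the `logits` values are hypothetical and chosen only for this example.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Draw a_t ~ pi(. | x_t) from a categorical distribution."""
    probs = softmax(logits)
    actions = range(len(probs))
    return rng.choices(actions, weights=probs, k=1)[0]

# Hypothetical policy output for one state x_t with three possible actions.
logits = [2.0, 0.5, -1.0]
rng = random.Random(0)  # fixed seed so the draw is reproducible
probs = softmax(logits)
a_t = sample_action(logits, rng)
```

Here `probs` plays the role of π(· | x_t): one probability per action, and `a_t` is a single draw from it.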

② Conditional probability

P(s',r|s,a)

Meaning: the probability that, when action a is taken in state s, reward r is received and the next state is s'.
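For a small, fully known environment, P(s', r | s, a) can be written out as a lookup table. The sketch below assumes a made-up two-state environment. The state names, rewards, and probabilities are hypothetical and only show the shape of the data.

```python
# Hypothetical dynamics table: (s, a) -> {(s_next, r): probability}.
P = {
    ("s0", "right"): {("s1", 0.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "right"): {("goal", 1.0): 0.9, ("s1", 0.0): 0.1},
}

def transition_prob(s_next, r, s, a):
    """Return P(s', r | s, a); outcomes that are not listed have probability 0."""
    return P[(s, a)].get((s_next, r), 0.0)

def expected_reward(s, a):
    """E[r | s, a] = sum over (s', r) of r * P(s', r | s, a)."""
    return sum(r * p for (_, r), p in P[(s, a)].items())
```

For every fixed pair (s, a), the probabilities over all (s', r) outcomes must sum to 1, which is what makes this a conditional distribution.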

③ Clipping²

value if clip_value_min ≤ value ≤ clip_value_max
clip_value_min if value < clip_value_min
clip_value_max if clip_value_max < value

import tensorflow as tf

# Clip every element of t into the range [-1, 1].
t = tf.constant([[-10., -1., 0.], [0., 2., 10.]])
t2 = tf.clip_by_value(t, clip_value_min=-1, clip_value_max=1)
t2.numpy()

array([[-1., -1.,  0.],
       [ 0.,  1.,  1.]], dtype=float32)

We can see that the output satisfies the following:
value if -1 ≤ value ≤ 1
-1 if value < -1
1 if 1 < value
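Clipping appears in reinforcement learning in the PPO clipped surrogate objective, where the probability ratio is clipped to [1 - ε, 1 + ε] before it is multiplied by the advantage. The following is a minimal scalar sketch of one term of that objective. The function names are hypothetical, and `epsilon=0.2` is an assumed default for illustration only.

```python
def clip(value, min_value, max_value):
    """Scalar version of the clip rule: keep value inside [min_value, max_value]."""
    return max(min_value, min(value, max_value))

def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """One sample of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return min(unclipped, clipped)
```

With a positive advantage, a ratio above 1 + ε gains nothing beyond the clipped value. With a negative advantage, the minimum keeps the more pessimistic of the two terms.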

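④ Gradient

Definition

Scalar field f(x,y,z) \\
\nabla f = \operatorname{grad} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z})

The gradient points in the direction in which f increases most steeply, and its magnitude is the rate of that increase.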

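import tensorflow as tf

x = tf.ones((2, 2))

with tf.GradientTape() as t:
  t.watch(x)  # x is a constant tensor, so it must be watched explicitly
  y = tf.reduce_sum(x)
  z = tf.multiply(y, y)

# Derivative of z with respect to the original input tensor x
dz_dx = t.gradient(z, x)
dz_dx.numpy()  # every entry is 2 * y = 8.0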
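One way to sanity-check an automatic gradient is to compare it against a finite-difference estimate. The sketch below does this in plain Python for the same function, z = (sum of x)², with the 2x2 tensor of ones flattened into a list. The helper name `numerical_gradient` and the step size `h` are assumptions made for this example.

```python
def f(x):
    """z = (sum of all entries)^2, the same function as in the tape example."""
    return sum(x) ** 2

def numerical_gradient(func, x, h=1e-5):
    """Central-difference estimate of each partial derivative of func at x."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        plus[i] += h
        minus = list(x)
        minus[i] -= h
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad

x = [1.0, 1.0, 1.0, 1.0]  # the 2x2 tensor of ones, flattened
grad = numerical_gradient(f, x)
```

Each partial derivative is 2 * (sum of x) = 8, so every entry of `grad` should come out very close to 8.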

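Policy gradient theorem

The policy gradient theorem used in policy gradient methods is as follows, where d(s) is the state distribution under the policy \pi and Q(s,a) is the action-value function.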

\nabla J(\theta)=\sum_{s \in S}d(s)\sum_{a \in A}\boldsymbol{\nabla} \pi(a|s;\theta)Q(s,a)
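To make the formula concrete, the sketch below evaluates the inner sum over actions for a single state with a softmax policy, then applies the update rule θ ← θ + α∇J(θ) from the notation table. Everything here is an illustrative assumption: the one-state setting (so d(s) = 1), the softmax parameterisation, the action values `q`, and the step size `alpha`.

```python
import math

def softmax(theta):
    """Softmax policy pi(a; theta) over a finite set of actions."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def objective(theta, q):
    """J(theta) = sum_a pi(a; theta) * Q(a) for a single state."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))

def policy_gradient(theta, q):
    """sum_a grad pi(a; theta) * Q(a), using the softmax derivative
    d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k))."""
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for k in range(n):
        for a in range(n):
            indicator = 1.0 if a == k else 0.0
            grad[k] += pi[a] * (indicator - pi[k]) * q[a]
    return grad

theta = [0.0, 0.0, 0.0]
q = [1.0, 2.0, 3.0]  # hypothetical action values Q(s, a)
alpha = 0.5          # assumed step size
history = [objective(theta, q)]
for _ in range(50):
    g = policy_gradient(theta, q)
    theta = [t + alpha * gk for t, gk in zip(theta, g)]  # gradient ascent
    history.append(objective(theta, q))
```

Starting from a uniform policy, the ascent steps shift probability toward the action with the largest Q value, so `history` rises from 2.0 toward 3.0.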

⑤ Product

\prod_{k=1}^{n} a_k = a_1 \times a_2 \times a_3 \times \cdots \times a_n \\
= a_1 \cdots a_n

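import tensorflow as tf

x = tf.constant([1.0, 10.3, 26.9, 2.8, 166.32, 62.3])
b = 6  # number of leading elements to multiply

def _infinite_product(x, max_num):
    # Despite the name, this is a finite product of the first max_num elements.
    num = 1.0
    for i in range(max_num):
        num *= x[i]
    return num

a = _infinite_product(x, b)
a.numpy() # 8038593.5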
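The same product can be cross-checked without TensorFlow using `math.prod` from the Python standard library (Python 3.8 or later). This is only a sketch for comparison. Because Python floats are double precision, the last digits differ slightly from the float32 result above. TensorFlow also provides `tf.reduce_prod` for tensors.

```python
import math

values = [1.0, 10.3, 26.9, 2.8, 166.32, 62.3]

# Product of all elements in double precision.
product = math.prod(values)

# The product over an empty range is defined as 1, the multiplicative identity.
empty_product = math.prod([])
```

The empty-product convention matters whenever the upper index of a product can fall below its lower index, as happens for the first term of a running product.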

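V-trace target³

The V-trace target used in IMPALA is as follows.

\mathcal{R}^n V(x_s) := V(x_s) + \mathbb{E}_{\mu}\left[\sum_{t=s}^{s+n-1} \gamma^{t-s}\left(\prod_{i=s}^{t-1} c_i\right)\delta_t V\right]

Here \delta_t V := \rho_t (r_t + \gamma V(x_{t+1}) - V(x_t)) is the importance-weighted temporal difference, c_i and \rho_t are truncated importance sampling weights, and the expectation is taken over trajectories generated by the behaviour policy \mu.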
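For a single sampled trajectory, the sum inside the target can be computed with one pass over the n steps. The sketch below follows the definitions in the IMPALA paper, with δ_t V = ρ_t (r_t + γ V(x_{t+1}) − V(x_t)), ρ_t = min(ρ̄, π/μ), and c_i = min(c̄, π/μ). The function name, argument layout, and default truncation levels of 1.0 are assumptions made for illustration, not the reference implementation.

```python
def vtrace_target(values, rewards, ratios, gamma, rho_bar=1.0, c_bar=1.0):
    """n-step V-trace target for the first state of one trajectory.

    values  : V(x_s), ..., V(x_{s+n})            (length n + 1)
    rewards : r_s, ..., r_{s+n-1}                (length n)
    ratios  : pi(a_t | x_t) / mu(a_t | x_t)      (length n)
    """
    n = len(rewards)
    target = values[0]
    discount = 1.0  # gamma^(t - s)
    trace = 1.0     # product of c_i for i = s .. t-1 (empty product = 1)
    for t in range(n):
        rho = min(rho_bar, ratios[t])  # truncated weight for the TD error
        c = min(c_bar, ratios[t])      # truncated weight for the trace
        delta = rho * (rewards[t] + gamma * values[t + 1] - values[t])
        target += discount * trace * delta
        discount *= gamma
        trace *= c
    return target
```

When every ratio equals 1 (the on-policy case), the temporal differences telescope and the target reduces to the ordinary n-step return. When the first ratio is 0, every term vanishes and the target stays at V(x_s).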

1. https://nthu-datalab.github.io/ml/slides/Notation.pdf
2. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, "Proximal Policy Optimization Algorithms" (2017), OpenAI.
3. L. Espeholt, K. Kavukcuoglu, et al., "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" (2018).