More than 3 years have passed since last update.

UCB 方策について

Last updated at 2019-12-07Posted at 2019-12-07

UCB方策について

ucb1.py

import random
import numpy as np
import matplotlib.pyplot as plt

class Bernoulli_class():
        def Bernoulli(self, p):
            if (p < random.random()):
                return 0
            else:
                return 1

class ucb1_class():
    def ucb1(self, round, simple_slot_rate):
        m = len(simple_slot_rate)
        temp_reward = 0
        large_t = [0 for i in range(m)]
        total_reward = [0 for i in range(m)]
        sample_mean = [0 for i in range(m)] # 標本平均：評価値
        adj_term = [0 for i in range(m)]    # ucb に使う
        evaluation = [0 for i in range(m)]  # ucb に使う
        ave_reward = [0 for i in range(round)]  # グラフの計算用

        for t in range(1, m+1):
            """
            初期化ラウンド
            """
            r = Bernoulli_class()
            reward = r.Bernoulli(simple_slot_rate[t-1])    # 選択したアームの確率で, 0か1を返す　→　報酬を得る
            large_t[t-1] += 1
            total_reward[t-1] += reward    # 総報酬に, 実行したアームからの報酬を追加する
            sample_mean[t-1] = total_reward[t-1] / large_t[t-1]  # 総報酬を実行回数で割る。　つまり標本平均を求める
            adj_term[t-1] = np.sqrt(np.log(t) / (2*large_t[t-1]))
            evaluation[t-1] = sample_mean[t-1] + adj_term[t-1]
            temp_reward += reward
            ave_reward[t-1] = temp_reward / (t)

        for t in range(m+1, round):
            """
            それ以降のラウンド
            """
            play = np.argmax(evaluation)

            r = Bernoulli_class()
            reward = r.Bernoulli(simple_slot_rate[play])    # 選択したアームの確率で, 0か1を返す　→　報酬を得る

            large_t[play] += 1
            total_reward[play] += reward    # 総報酬に, 実行したアームからの報酬を追加する
            sample_mean[play] = total_reward[play] / large_t[play]  # 総報酬を実行回数で割る。　つまり標本平均を求める
            for i in range(len(simple_slot_rate)):
                adj_term[i] = np.sqrt(np.log(t) / (2*large_t[i]))
                evaluation[i] = sample_mean[i] + adj_term[i]
            temp_reward += reward
            ave_reward[t] = temp_reward / (t+1)
        plt.plot(range(round), ave_reward, label="UCB1")
        plt.legend(loc='lower right')
        return

if __name__ == '__main__':
    m = 5
    round = 800
    simple_slot_rate = [0.1 for i in range(m)]
    simple_slot_rate[random.randrange(m)] = 0.9

    ucb = ucb1_class()
    ucb_one = ucb.ucb1(round, simple_slot_rate)

    plt.show()

各パーツの紹介

ベルヌーイ分布

ucb1.py

class Bernoulli_class():
        def Bernoulli(self, p):
            if (p < random.random()):
                return 0
            else:
                return 1

(役割)
確率pで1を, 確率(1-p)で0を返す関数.

(概要)
引数として確率pを受け取る.フィールドにマトがあって, 弾丸をN = 100発打ったとする.
p < random.random() つまり確率1-pで0を返す.
それ以外の場合, つまり確率pでは1を返す.

(考え方の例)
フィールドにマトがあって, 弾丸をN = 100発打ったとする.
p = 0.3 ならば, 30発前後がマトに当たる(実際の当たり数は前後する)
弾丸数をN = 1000, 1000000, 10000000000, ... と増やしていくと,
当たり数は, N×p に近づく.
N = ∞ のとき, 当たり数はマトの面積に等しくなる(点の集まりが面になる)
これを分布として考えるのがベルヌーイ分布

UCB方策

ucb1.py

class ucb1_class():
    def ucb1(self, round, simple_slot_rate):
        m = len(simple_slot_rate)
        temp_reward = 0
        large_t = [0 for i in range(m)]
        total_reward = [0 for i in range(m)]
        sample_mean = [0 for i in range(m)] # 標本平均：評価値
        adj_term = [0 for i in range(m)]    # ucb に使う
        evaluation = [0 for i in range(m)]  # ucb に使う
        ave_reward = [0 for i in range(round)]  # グラフの計算用

        for t in range(1, m+1):
            """
            初期化ラウンド
            """
            r = Bernoulli_class()
            reward = r.Bernoulli(simple_slot_rate[t-1])    # 選択したアームの確率で, 0か1を返す　→　報酬を得る
            large_t[t-1] += 1
            total_reward[t-1] += reward    # 総報酬に, 実行したアームからの報酬を追加する
            sample_mean[t-1] = total_reward[t-1] / large_t[t-1]  # 総報酬を実行回数で割る。　つまり標本平均を求める
            adj_term[t-1] = np.sqrt(np.log(t) / (2*large_t[t-1]))
            evaluation[t-1] = sample_mean[t-1] + adj_term[t-1]
            temp_reward += reward
            ave_reward[t-1] = temp_reward / (t)

        for t in range(m+1, round):
            """
            それ以降のラウンド
            """
            play = np.argmax(evaluation)

            r = Bernoulli_class()
            reward = r.Bernoulli(simple_slot_rate[play])    # 選択したアームの確率で, 0か1を返す　→　報酬を得る

            large_t[play] += 1
            total_reward[play] += reward    # 総報酬に, 実行したアームからの報酬を追加する
            sample_mean[play] = total_reward[play] / large_t[play]  # 総報酬を実行回数で割る。　つまり標本平均を求める
            for i in range(len(simple_slot_rate)):
                adj_term[i] = np.sqrt(np.log(t) / (2*large_t[i]))
                evaluation[i] = sample_mean[i] + adj_term[i]
            temp_reward += reward
            ave_reward[t] = temp_reward / (t+1)
        plt.plot(range(round), ave_reward, label="UCB1")
        plt.legend(loc='lower right')
        return

(役割)
選択肢を実行して評価し, 報酬を効率よく積み重ねるにはどの選択肢を実行すればよいかを決定する

(概要)
引数として round, simple_slot_rate を受け取る
round は実行回数(上記の弾丸数Nにあたる)で, simple_slot_rate は各選択肢が持つ確率変数(上記の確率pにあたる)を格納した配列になっている

1)：選択肢を1回ずつ実行して「評価値」を計算する(初期ラウンド)
2)：評価値最大の選択肢を実行する(それ以降ラウンド)
3)："3)"を繰り返す

「評価値(evaluation)」 = 「標本平均(sample_mean)」 + 「補正項(adj_term)」となっている
この補正項の計算がUCB1方策の特徴であり, 実行回数が少ない選択肢ほど補正項の値は大きくなる

(考え方の例)
1)：とりあえずぜんぶやってみよう
2)：「評価」 = 「見たもの・感じたこと」 + 「偏見・先入観・イメージ」
3)：「評価」のズレを直していこう

実行用の変数・関数の使い方の例

ucb1.py

if __name__ == '__main__':
    m = 5
    round = 800
    simple_slot_rate = [0.1 for i in range(m)]
    simple_slot_rate[random.randrange(m)] = 0.9

    ucb = ucb1_class()
    ucb_one = ucb.ucb1(round, simple_slot_rate)

    plt.show()

(役割)
ucb1 を実行し, グラフに出力する

(概要)
m：選択肢の数
round：実行回数
simple_slot_rate：各選択肢が持つ確率変数

(考え方の例)
ucb1 を実行するための引数をできるだけ減らしたかったので m は使わずにucb1関数内でlen(simple_slot_rate) とした.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up