From Pythonで学ぶ強化学習 to Kaggle ConnectX: A First Step

Reinforcement Learning

Posted at 2025-03-14, last updated at 2025-03-14

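Overview

There are several articles explaining ConnectX, but I could not find one that uses the framework introduced in the book Pythonで学ぶ強化学習 ("Learning Reinforcement Learning with Python"), which is organized into Agent, Trainer, Observer, and Logger modules. This article therefore walks through a ConnectX solution built on that framework.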

Pythonで学ぶ強化学習

ConnectX

Notebook

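Implementation overview

The implementation in this Notebook is based on Section 4.2 of Pythonで学ぶ強化学習, "Implementing value estimation with a parameterized function: Value Function Approximation", specifically the implementation that uses a neural network as the action-value function.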
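
As a rough illustration of what using a neural network as the action-value function involves, the sketch below computes Q-learning style regression targets for a batch of transitions. The helper name `td_targets` and its arguments are hypothetical and are not taken from the book or the Notebook. It assumes the usual target of reward plus discounted best next-state value, with no bootstrapping on terminal steps.

```python
import numpy as np


def td_targets(q_now, q_next, actions, rewards, dones, gamma=0.9):
    """Return regression targets for a Q-network (hypothetical helper).

    q_now, q_next: arrays of shape (batch, n_actions) holding the current
    estimates for s and s'. Only the entry of the action actually taken is
    overwritten; the other entries keep the current estimate.
    """
    targets = np.array(q_now, dtype=np.float64, copy=True)
    for i, (a, r, d) in enumerate(zip(actions, rewards, dones)):
        value = r
        if not d:
            # Bootstrap from the best next-state estimate.
            value += gamma * np.max(q_next[i])
        targets[i, a] = value
    return targets


q_now = np.zeros((2, 7))
q_next = np.array([[0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
targets = td_targets(q_now, q_next, actions=[3, 0], rewards=[0, -1],
                     dones=[False, True], gamma=0.9)
print(targets[0, 3], targets[1, 0])
```

The second transition is terminal, so its target is just the reward and the next-state estimate is ignored.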

Agent

class FNAgent()

No changes to the FNAgent class.

class ValueFunctionAgent(FNAgent)

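Override the policy method.

When exploring under $\epsilon$-greedy, or before the model has been initialized, choose at random among the currently valid actions.
When not exploring and after initialization, choose among all actions (so a full column can still be selected).
- The return value must be cast to int, because returning an np.int64 causes an error.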

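# Requires `import random` and `import numpy as np` at module level.
def policy(self, s):
    # s[0] is the flattened board, so s[0][0..6] is its top row.
    # A column is playable while its top cell is still empty (0).
    valid_actions = [c for c in range(7) if s[0][c] == 0]
    if np.random.random() < self.epsilon or not self.initialized:
        # Explore (or act before the model is fitted): random valid column.
        return random.choice(valid_actions)
    else:
        estimates = self.estimate(s)
        if self.estimate_probs:
            action = np.random.choice(self.actions,
                                      size=1, p=estimates)[0]
            return int(action)
        else:
            # Greedy over all columns; cast np.int64 to a plain int.
            return int(np.argmax(estimates))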
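
Because the greedy branch takes the argmax over all columns, it can return a column that is already full. One possible way to avoid this is to mask the estimates of full columns before taking the argmax. The sketch below is not part of the Notebook; `masked_greedy` is a hypothetical helper, and it assumes at least one column is still open.

```python
import numpy as np


def masked_greedy(estimates, top_row):
    """Pick the best-valued column whose top cell is still empty.

    estimates: 1-D array of action values, one per column.
    top_row:   the first `columns` cells of the flat board (0 = empty).
    Hypothetical helper; assumes at least one column is open.
    """
    estimates = np.asarray(estimates, dtype=np.float64)
    open_mask = np.asarray(top_row) == 0
    # Full columns get -inf so they can never win the argmax.
    masked = np.where(open_mask, estimates, -np.inf)
    return int(np.argmax(masked))  # plain int, not np.int64


estimates = [0.9, 0.1, 0.4, 0.0, 0.0, 0.0, 0.0]
top_row = [1, 0, 0, 0, 0, 0, 0]  # column 0 is full
action = masked_greedy(estimates, top_row)
print(action)
```

Column 0 has the highest estimate here but is full, so the next-best open column is chosen instead.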

Trainer

class Trainer()

Logging of game-screen frames is not needed, so it is commented out.

if self.training:
    # if len(frames) > 0:
    #     self.logger.write_image(self.training_count,
    #                             frames)
    #     frames = []
    self.training_count += 1

class ValueFunctionTrainer(Trainer)

No changes.

Observer

class Observer()

No changes.

class ConnectXObserver(Observer)

This section explains the following step method.

def step(self, action):
    n_state, reward, done, info = self._trainer.step(action)
    # Keep only the board from the observation dict.
    n_state = n_state["board"]
    if reward == None:
        # Illegal move: penalize it and mark the state with an all -1 board.
        reward = -1
        n_state = [-1 for _ in range(self.rows * self.columns)]
    return self.transform(n_state), reward, done, info

n_state = n_state["board"]
The n_state returned at each step contains more than the board (remainingOverageTime, step, and mark are also included), so only the board entry is assigned back to n_state.

trainer = env.train([None, "random"])
trainer.reset()
n_state, reward, done, info = trainer.step(0)
print("n_state: ", n_state)
print("reward: ", reward)
print("done: ", done)
print("info: ", info)
-------------------------------
n_state:  {'remainingOverageTime': 60, 'step': 2, 'board': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0], 'mark': 1}
reward:  0
done:  False
info:  {}
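
The board in the output above is a flat list of 42 cells stored row by row, with the top row first. The sketch below splits such a list into rows without extra dependencies; `board_rows` is a hypothetical helper, and the observation is rebuilt by hand to match the printed one (1 at index 35, 2 at index 36).

```python
def board_rows(board, rows=6, columns=7):
    """Split the flat, row-major board list into a list of rows."""
    return [board[r * columns:(r + 1) * columns] for r in range(rows)]


# Same board as in the printed observation: 1 at index 35, 2 at index 36.
observation = {"board": [0] * 35 + [1, 2] + [0] * 5, "mark": 1}
grid = board_rows(observation["board"])
for row in grid:
    print(row)
```

Action 0 drops the mark into column 0, and it lands in the bottom row, which is the last row of the grid.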

if reward == None:
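Making an illegal move (dropping a mark into a column that already holds six marks, or choosing an action outside 0 to 6) returns a reward of None and ends the episode. The step method therefore does the following:
- Set reward to -1.
- Set every cell of board to -1, so that the state reached by an illegal move can be distinguished from all other states.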

n_state, reward, done, info = trainer.step(0)
print("n_state: ", n_state)
print("board:")
print(np.array(n_state["board"]).reshape(6,7))
print("reward: ", reward)
print("done: ", done)
print("info: ", info)
-------------------------------
n_state:  {'remainingOverageTime': 60, 'step': 17, 'board': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 2, 1, 0, 0, 2, 1, 0, 2, 1, 0, 0, 2, 1, 2, 2], 'mark': 1}
reward:  None
done:  True
info:  {}
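
In the output above, column 0 already holds six marks, which is why step(0) is rejected. A column is full exactly when its cell in the top row is non-zero, so legality can be checked from the first seven cells of the board. The sketch below uses the printed board; `is_valid_action` is a hypothetical helper, not part of the Notebook.

```python
def is_valid_action(board, action, columns=7):
    """True if `action` is a column index whose top cell is empty."""
    return 0 <= action < columns and board[action] == 0


# Board copied from the printed observation, one row per line.
board = [1, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0,
         2, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 2, 1, 0, 2,
         1, 0, 0, 2, 1, 0, 2,
         1, 0, 0, 2, 1, 2, 2]
valid = [a for a in range(7) if is_valid_action(board, a)]
print(valid)
```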

Logger

class Logger()

Change the directory where logs are saved.

self.log_dir = os.path.join("/kaggle/working", "logs")

main

Train with main() and assign the trained model to trained.

def main(play):
    env = ConnectXObserver(make("connectx"))
    trainer = ValueFunctionTrainer(report_interval=100)
    trained = trainer.train(env, episode_count=10000)
    trainer.logger.plot("Rewards", trainer.reward_log,
                            trainer.report_interval)
    return trained

trained = main(False)

Output

Retrieve the weights and biases as follows.

mlp = trained.model.named_steps["estimator"]

weights = mlp.coefs_
biases = mlp.intercepts_

for i, w in enumerate(weights):
    print(f"Layer {i+1}: {w.shape[0]} → {w.shape[1]}")
-------------------------------
Layer 1: 42 → 10
Layer 2: 10 → 10
Layer 3: 10 → 7

Build the my_agent function for submission as a string.

# Create the agent
my_agent_str = '''def my_agent(observation, configuration):
    import numpy as np

'''

# Write hidden layers
for i, (w, b) in enumerate(zip(weights[:-1], biases[:-1])):
    my_agent_str += '    hl{}_w = np.array({}, dtype=np.float32)\n'.format(i+1, w.tolist())
    my_agent_str += '    hl{}_b = np.array({}, dtype=np.float32)\n'.format(i+1, b.tolist())
# Write output layer
my_agent_str += '    ol_w = np.array({}, dtype=np.float32)\n'.format(weights[-1].tolist())
my_agent_str += '    ol_b = np.array({}, dtype=np.float32)\n'.format(biases[-1].tolist())

my_agent_str += '''
    state = observation["board"]
    out = np.array(state, dtype=np.float32)
'''

# Calculate hidden layers
for i in range(len(weights)-1):
    my_agent_str += '    out = np.matmul(out, hl{0}_w) + hl{0}_b\n'.format(i+1)
    my_agent_str += '    out = np.maximum(0, out)\n'  # ReLU function

# Calculate output layer
my_agent_str += '    out = np.matmul(out, ol_w) + ol_b\n'

my_agent_str += '''
    for i in range(configuration["columns"]):
        if observation["board"][i] != 0:
            out[i] = -1e7
    return int(np.argmax(out))
    '''
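
The generated agent feeds the raw board directly into the first layer. If the trained pipeline also contains a scaling step such as scikit-learn's StandardScaler ahead of "estimator" (the article shows only the "estimator" step, so this is an assumption), the exported weights would see unscaled inputs. One possible remedy is to fold the scaling into the first layer before export. The sketch below checks the algebra with random stand-in values; `fold_scaler` is a hypothetical helper, and `mean` and `scale` stand for the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np


def fold_scaler(w1, b1, mean, scale):
    """Fold x -> (x - mean) / scale into the first dense layer.

    ((x - mean) / scale) @ w1 + b1 == x @ w1_folded + b1_folded
    """
    w1_folded = w1 / scale[:, None]
    b1_folded = b1 - (mean / scale) @ w1
    return w1_folded, b1_folded


# Stand-in values; in practice they would come from the fitted pipeline.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(42, 10)), rng.normal(size=10)
mean, scale = rng.normal(size=42), rng.uniform(0.5, 2.0, size=42)
x = rng.integers(0, 3, size=42).astype(np.float64)

expected = ((x - mean) / scale) @ w1 + b1
w1_f, b1_f = fold_scaler(w1, b1, mean, scale)
print(np.allclose(x @ w1_f + b1_f, expected))
```

With the folded weights, the submitted function can keep using the raw board as input and no scaler code has to be embedded in the string.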

The resulting string is as follows.

def my_agent(observation, configuration):
    import numpy as np

    hl1_w = (omitted)
    hl1_b = (omitted)
    hl2_w = (omitted)
    hl2_b = (omitted)
    ol_w = (omitted)
    ol_b = (omitted)

    state = observation["board"]
    out = np.array(state, dtype=np.float32)
    out = np.matmul(out, hl1_w) + hl1_b
    out = np.maximum(0, out)
    out = np.matmul(out, hl2_w) + hl2_b
    out = np.maximum(0, out)
    out = np.matmul(out, ol_w) + ol_b

    for i in range(configuration["columns"]):
        if observation["board"][i] != 0:
            out[i] = -1e7
    return int(np.argmax(out))

Submission

with open('submission.py', 'w') as f:
    f.write(my_agent_str)
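
Since the agent is assembled as a string, a syntax or indentation slip only shows up when the file is executed. A quick local check is to exec the source and call the function once with a dummy observation. The sketch below uses a small stand-in source string rather than the real my_agent_str; the observation and configuration dicts are hand-made assumptions that mimic the keys used in this article.

```python
# Stand-in for the generated source; a real check would read submission.py.
agent_source = '''def my_agent(observation, configuration):
    # Choose the first column whose top cell is empty.
    for c in range(configuration["columns"]):
        if observation["board"][c] == 0:
            return c
    return 0
'''

namespace = {}
exec(agent_source, namespace)  # defines my_agent inside `namespace`
my_agent = namespace["my_agent"]

# Column 0 is marked as full in the top row; all other cells are empty.
observation = {"board": [1] + [0] * 41, "mark": 1}
configuration = {"rows": 6, "columns": 7, "inarow": 4}
action = my_agent(observation, configuration)
print(type(action).__name__, action)
```

The same pattern also makes it easy to confirm that the returned action is a plain int within the column range.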

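Conclusion

This article presented code for submitting to ConnectX, based on the code from Pythonで学ぶ強化学習.
The agent's performance still leaves much room for improvement, so the following remain as future work.

A value function based on a deeper neural network (DNN)
Accounting for whether the agent plays first or second
Better reward design
Handling of illegal moves
An input encoding that accounts for the possibility that the network reads meaning into the numeric magnitude of the board values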
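
As one possible direction for the last item, the board values 0, 1, and 2 could be re-encoded relative to the agent's own mark, so that the input no longer suggests that 2 is "twice" 1 and looks the same whether the agent plays first or second. The sketch below is an assumption about how this might be done, not something the Notebook implements; `encode_board` is a hypothetical helper.

```python
import numpy as np


def encode_board(board, mark):
    """Map the flat board to +1 (own), -1 (opponent), 0 (empty)."""
    board = np.asarray(board)
    encoded = np.zeros(board.shape, dtype=np.float32)
    encoded[board == mark] = 1.0
    encoded[(board != 0) & (board != mark)] = -1.0
    return encoded


board = [0] * 35 + [1, 2] + [0] * 5
as_first = encode_board(board, mark=1)   # agent holds mark 1
as_second = encode_board(board, mark=2)  # agent holds mark 2
print(as_first[35], as_first[36], as_second[35], as_second[36])
```

If such an encoding were used during training, the submitted my_agent would need to apply the same transformation using observation["mark"].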

References
