Practicing Q-learning with scikit-learn's Iris dataset

Posted at 2025-02-23 (last updated at 2025-02-23)

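Introduction

This is the seventh installment of my "I had ChatGPT create a hands-on tutorial" series. This time I studied Q-learning.

The sixth installment is here ↓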

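Q-learning

A method that keeps a "Q-value" (action value) for every state-action pair in a table (the Q-table). The agent learns by repeatedly updating these values from the rewards it observes, while choosing the action with the highest Q-value in each state.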
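As a minimal sketch of what "updating the Q-table" means, the snippet below applies the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α (r + γ max Q(s', ·) − Q(s, a)), twice by hand. The function name `q_update` and the state labels `"s0"` and `"s1"` are hypothetical and exist only for this illustration; α = 0.1 and γ = 0.9 match the defaults used later in this article.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    n_actions = 2  # two classes, as in this article
    # Unseen states start with all-zero Q-values.
    q_table.setdefault(state, [0.0] * n_actions)
    q_table.setdefault(next_state, [0.0] * n_actions)
    # Bootstrapped target: immediate reward plus discounted best next value.
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


q = {}
first = q_update(q, "s0", action=1, reward=1, next_state="s1")
second = q_update(q, "s0", action=1, reward=1, next_state="s1")
print(first, second)  # roughly 0.1, then 0.19
```

Each update moves the stored value a fraction α of the way toward the target, so repeated +1 rewards for the same state-action pair push its Q-value up step by step.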

Data

This time I use the iris dataset from scikit-learn.
Note: to keep the task simple, the original three classes are reduced to two (only classes 0 and 1 are kept).

Variable	Description
Sepal Length	Length of the sepal (cm)
Sepal Width	Width of the sepal (cm)
Petal Length	Length of the petal (cm)
Petal Width	Width of the petal (cm)

What we will do

We solve a classification task using Q-learning.

Code and results

import numpy as np
import gym
import random
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Iris has three classes; for simplicity, keep only two (classes 0 and 1)
mask = y < 2
X = X[mask]
y = y[mask]

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

The reward is designed as +1 for a correct prediction and -1 for an incorrect one.

class ClassificationEnv(gym.Env):
    def __init__(self, X, y):
        super(ClassificationEnv, self).__init__()
        # Observation space: range of the features (assumed to be roughly -3 to 3 after standardization)
        self.observation_space = gym.spaces.Box(low=-3, high=3, shape=(X.shape[1],), dtype=np.float32)
        # Action space: two classes (0, 1)
        self.action_space = gym.spaces.Discrete(2)
        self.X = X
        self.y = y
        self.index = 0

    def reset(self):
        self.index = 0
        return self.X[self.index]

    def step(self, action):
        correct = action == self.y[self.index]
        reward = 1 if correct else -1
        self.index += 1
        done = self.index >= len(self.X)
        next_state = self.X[self.index] if not done else np.zeros_like(self.X[0])
        return next_state, reward, done, {}

class QLearningAgent:
    def __init__(self, state_size, action_size, alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.99):
        self.state_size = state_size
        self.action_size = action_size
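        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # exploration rate for the ε-greedy policy
        self.epsilon_decay = epsilon_decay  # multiplicative decay of the exploration rate
        self.q_table = {}  # Q-table kept as a dict (instead of a DataFrame)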

    def get_state(self, observation):
        """ Round the observation and convert it to a string (used as the dict key) """
        return str(tuple(np.round(observation, decimals=2)))

    def choose_action(self, state):
        """ Choose an action according to the ε-greedy policy """
        if np.random.rand() < self.epsilon:
            return np.random.choice(self.action_size)
        state_key = self.get_state(state)
        if state_key in self.q_table:
            return np.argmax(self.q_table[state_key])
        return np.random.choice(self.action_size)

    def update(self, state, action, reward, next_state):
        """ Update the Q-table """
        state_key = self.get_state(state)
        next_state_key = self.get_state(next_state)

        # Initialize the entry if the state is not yet in the Q-table
        if state_key not in self.q_table:
            self.q_table[state_key] = np.zeros(self.action_size)
        if next_state_key not in self.q_table:
            self.q_table[next_state_key] = np.zeros(self.action_size)

        # Update the Q-value
        best_next_action = np.argmax(self.q_table[next_state_key])
        target = reward + self.gamma * self.q_table[next_state_key][best_next_action]
        self.q_table[state_key][action] += self.alpha * (target - self.q_table[state_key][action])

np.random.seed(42)  # set the random seed

env = ClassificationEnv(X_train, y_train)
agent = QLearningAgent(state_size=X_train.shape[1], action_size=2)

# Number of training episodes
episodes = 1000
rewards_per_episode = []  # total reward of each episode

for episode in range(episodes):
    state = env.reset()
    total_reward = 0
    done = False
    i = 0
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        i += 1
    agent.epsilon *= agent.epsilon_decay  # decay the exploration rate
    rewards_per_episode.append(total_reward)  # record this episode's total reward
    print(f"Episode {episode+1}: Total Reward: {total_reward}")

# Plot the learning curve
plt.plot(rewards_per_episode)
plt.title('Learning Curve')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.grid()
plt.show()

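Judging from this plot, learning progresses steadily. From around episode 400 onward the total reward is almost always close to 80, the maximum possible for the 80 training samples, so the training data appears to be classified almost perfectly.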
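One likely contributor to the curve flattening out around episode 400 is the exploration schedule: the training loop multiplies ε by 0.99 after every episode, so random actions become rare fairly quickly. The short sketch below computes ε after a given number of completed episodes. The helper name `epsilon_after` is hypothetical; the starting value 1.0 and decay 0.99 are the defaults of `QLearningAgent`.

```python
def epsilon_after(episodes, epsilon=1.0, epsilon_decay=0.99):
    """Exploration rate after `episodes` multiplicative decay steps."""
    return epsilon * epsilon_decay ** episodes


for n in (100, 400, 1000):
    print(n, epsilon_after(n))
```

After 400 episodes ε is below 2%, meaning the agent picks a random action on fewer than 2 of the 80 training samples per episode on average, and after 1000 episodes it acts almost purely greedily.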

# Evaluate on the test data
np.random.seed(42)  # set the random seed
random.seed(42)

env = ClassificationEnv(X_test, y_test)
state = env.reset()
total_reward = 0
done = False
while not done:
    action = agent.choose_action(state)
    next_state, reward, done, _ = env.step(action)
    state = next_state
    total_reward += reward
print(f"Total Reward: {total_reward}")

Total Reward: -6

When I had the agent classify the test data, the score was poor: a total reward of -6 over the 20 test samples.
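Because every correct answer contributes +1 and every wrong answer -1, a total reward can be converted back into a number of correct predictions: correct = (total reward + number of samples) / 2. The sketch below does this conversion. The helper name `accuracy_from_reward` is hypothetical, and the sample counts (80 training, 20 test) follow from the 100 two-class samples and `test_size=0.2`.

```python
def accuracy_from_reward(total_reward, n_samples):
    """Convert a +1/-1 total reward into (n_correct, accuracy)."""
    # correct - wrong = total_reward and correct + wrong = n_samples
    n_correct = (total_reward + n_samples) // 2
    return n_correct, n_correct / n_samples


print(accuracy_from_reward(-6, 20))  # (7, 0.35): the test result above
print(accuracy_from_reward(80, 80))  # (80, 1.0): a perfect training episode
```

A total reward of -6 therefore corresponds to 7 correct predictions out of 20, which is worse than the roughly 50% one would expect from guessing on two balanced classes.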

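Conclusion

This time the agent simply seems to have overfit: the Q-table is keyed on feature values rounded to two decimals, so most test samples map to states never seen during training, and for those `choose_action` falls back to a random choice. I would like to improve on this next time. That said, reinforcement learning is not the right tool for this kind of task, so I also plan to try other tasks.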
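One possible direction for such an improvement, sketched here under assumptions, is to replace the two-decimal rounding in `get_state` with a much coarser binning so that similar samples share one Q-table entry. The helper `coarse_state` and the bin edges below are hypothetical and not part of the original code, and whether this actually raises the test score has not been verified here.

```python
from bisect import bisect_right

# Hypothetical bin edges for standardized features (an untuned assumption).
BIN_EDGES = [-1.0, 0.0, 1.0]


def coarse_state(observation, edges=BIN_EDGES):
    """Map each feature to a bin index and return a hashable tuple key."""
    # bisect_right returns how many edges are <= x, i.e. the bin index.
    return tuple(bisect_right(edges, float(x)) for x in observation)


a = coarse_state([-2.5, 0.1, 1.2, 3.5])
b = coarse_state([-1.7, 0.4, 1.9, 2.0])
print(a, a == b)  # (0, 2, 3, 3) True
```

With four bins per feature there are at most 4^4 = 256 distinct states, so a test sample is far more likely to land on a state the agent has already visited. The trade-off is that bins that are too coarse can merge samples from different classes into the same state.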
