
Learning Machine Learning with Pokémon

Last updated at 2019-12-27, posted at 2019-12-26

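Introduction

Pokémon Sword and Shield came out last month. Have you ever played Pokémon? Anyone who has will know that every Pokémon has six stats: HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed. The higher its stats, the stronger the Pokémon. Each stat is computed from three values: the base stat (種族値), the individual value (個体値), and the effort value (努力値); the formulas are given below. The base stat is assigned per species. The individual value is assigned per individual, which is why two Pokémon of the same species can differ in strength. The effort value is acquired later: individual values are fixed when the Pokémon is born, whereas effort values can be raised through battles. In this article, I use Python to predict a Pokémon's type from its base stats.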

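< Stat formulas >
・HP stat = (base stat × 2 + individual value + effort value ÷ 4) × level ÷ 100 + level + 10
・Non-HP stat = {(base stat × 2 + individual value + effort value ÷ 4) × level ÷ 100 + 5} × nature modifier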
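
As a quick check of the two formulas above, here is a minimal sketch that evaluates them exactly as written. The function names `hp_stat` and `other_stat` and the sample inputs are made up for illustration. The games are generally described as rounding down at intermediate steps; this sketch assumes no rounding, because the formulas above do not mention any.

```python
def hp_stat(base, iv, ev, level):
    """HP stat from the formula above (no rounding applied)."""
    return (base * 2 + iv + ev / 4) * level / 100 + level + 10


def other_stat(base, iv, ev, level, nature=1.0):
    """Any non-HP stat; `nature` is the nature modifier."""
    return ((base * 2 + iv + ev / 4) * level / 100 + 5) * nature


# Hypothetical inputs: base stat 100, individual value 31,
# effort value 252, level 50.
print(hp_stat(100, 31, 252, 50))                           # 207.0
print(round(other_stat(100, 31, 252, 50, nature=1.1), 1))  # 167.2
```

Only the base stat is fixed per species, which is why the rest of this article uses base stats alone as features.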

Development environment

CPU: 8th-generation 1.4 GHz quad-core Intel Core i5 processor
OS: macOS
Visual Studio Code
Python 3.7.3 64-bit (base: conda)

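What I did first

Searching for "ポケモン機械学習" ("Pokémon machine learning") turned up a site doing something similar, so I used it as a reference: https://www.hands-lab.com/tech/entry/3991.html
That site predicts from base stats whether a Pokémon is a Water type, so I started by copying and pasting its implementation. It reported an accuracy of **85.3%**, which looked like a success, but the only Pokémon it actually classified as Water types were Chansey (ラッキー) and Blissey (ハピナス), neither of which is a Water type.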

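Let me sort out the situation. There are 909 Pokémon in total, of which 123 are Water types, leaving 786 that are not. Now imagine a model that answers "not a Water type" no matter what base stats it is given. Its accuracy would be 786/909 × 100 = **86.5[%]**.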

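In other words, in a binary classification problem, accuracy alone is misleading when the two classes differ greatly in sample count: a model that always predicts the larger class already scores well. To get a meaningful result, the two classes should have roughly the same number of samples.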
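
The arithmetic behind this baseline can be sketched in a few lines. The snippet uses only the two counts quoted above (909 Pokémon, 123 Water types) and takes the non-Water count as the difference between them; the variable names are my own.

```python
# Counts quoted in the text above.
total = 909
water = 123
not_water = total - water

# A model that always answers "not Water" is right on every non-Water
# Pokémon and wrong on every Water Pokémon.
baseline_accuracy = not_water / total
water_recall = 0 / water  # it never finds a single Water type

print(f"baseline accuracy: {baseline_accuracy:.1%}")  # 86.5%
print(f"recall on Water types: {water_recall:.1%}")   # 0.0%
```

An accuracy of 85.3% is therefore slightly worse than answering "not Water" every time, and looking at how many Water types were actually found makes the failure obvious.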

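What I did next

Taking that lesson on board, I chose two classes with roughly the same number of samples. This time the model decides whether a Pokémon is a Steel type or an Electric type (Steel: 58, Electric: 60). Pokémon that are both Electric and Steel, such as Magneton (レアコイル), are counted as Steel types. I borrowed the Pokémon data from here.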
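
The "dual types count as Steel" rule can be sketched without any data. Each type slot is mapped to 0 for Steel (はがね) and 1 otherwise, and the two numbers are multiplied, so the product is 0 as soon as either slot is Steel. The helper `label` and the use of an empty string for a missing second type are assumptions for illustration; how the CSV stores a missing second type is not shown in this article.

```python
def type_to_num(p_type, target="はがね"):
    """Return 0 for the target type and 1 for anything else."""
    return 0 if p_type == target else 1


def label(type1, type2):
    """0 = Steel, 1 = Electric; the product is 0 if either slot is Steel."""
    return type_to_num(type1) * type_to_num(type2)


# (type 1, type 2) pairs; "" stands in for "no second type".
print(label("はがね", ""))        # 0: pure Steel
print(label("でんき", "はがね"))  # 0: Electric/Steel counts as Steel
print(label("でんき", ""))        # 1: pure Electric
print(label("でんき", "ひこう"))  # 1: Electric/Flying
```

Any value other than はがね maps to 1, so the rule only makes sense on rows already filtered down to Steel or Electric Pokémon.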

# %%
import codecs

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The CSV is encoded in Shift-JIS; bytes that cannot be decoded are skipped.
with codecs.open("data/pokemon_status.csv", "r", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file)

df.info()


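# %%
# Every Pokémon that has Steel (はがね) as its first or second type.
metal1 = df[df['タイプ１'] == "はがね"]
metal2 = df[df['タイプ２'] == "はがね"]
metal = pd.concat([metal1, metal2])
print("Steel-type Pokémon: %d" % len(metal))

# Every Pokémon that has Electric (でんき) as its first or second type.
elec1 = df[df['タイプ１'] == "でんき"]
elec2 = df[df['タイプ２'] == "でんき"]
elec = pd.concat([elec1, elec2])
print("Electric-type Pokémon: %d" % len(elec))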


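def type_to_num(p_type):
    """Map Steel (はがね) to 0 and everything else to 1."""
    if p_type == "はがね":
        return 0
    else:
        return 1


# Note: Electric/Steel Pokémon are in both `metal` and `elec`,
# so they appear twice in the concatenated frame.
pokemon_m_e = pd.concat([metal, elec], ignore_index=True)
type1 = pokemon_m_e["タイプ１"].apply(type_to_num)
type2 = pokemon_m_e["タイプ２"].apply(type_to_num)
# The product is 0 when either type is Steel: label 0 = Steel, 1 = Electric.
pokemon_m_e["type_num"] = type1 * type2
pokemon_m_e.head()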


# %%
# Columns 7-12 are the features (presumably the six base stats in this CSV).
X = pokemon_m_e.iloc[:, 7:13].values
y = pokemon_m_e["type_num"].values

# Hold out 30% of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
lr = LogisticRegression(C=1.0)
lr.fit(X_train, y_train)


# %%
print("Score on training data: %.3f" % lr.score(X_train, y_train))
print("Score on test data: %.3f" % lr.score(X_test, y_test))


# %%
# Predict every row (training and test rows alike) and tally the outcomes.
success1 = error1 = success2 = error2 = 0
print("[Pokémon classified as Steel types]")
print("----------------------------------------")
print("")
for i, y_pred in enumerate(lr.predict(X)):
    is_steel = pokemon_m_e.loc[i, "type_num"] == 0
    if y_pred == 0:
        print(pokemon_m_e.loc[i, "ポケモン名"])
        if is_steel:
            success1 += 1
            print("Is a Steel type")
        else:
            error1 += 1
            print("Is not a Steel type")
        print("")
    elif is_steel:
        error2 += 1
    else:
        success2 += 1
print("----------------------------------------")
print("Pokémon correctly classified as Steel: %d" % success1)
print("Pokémon correctly classified as Electric: %d" % success2)
print("Pokémon wrongly classified as Steel: %d" % error1)
print("Pokémon wrongly classified as Electric: %d" % error2)
print("")

Output

Score on training data: 0.732
Score on test data: 0.861

Pokémon correctly classified as Steel: 48
Pokémon correctly classified as Electric: 43
Pokémon classified as Steel that are not Steel: 13
Steel-type Pokémon not classified as Steel: 14

More Pokémon were classified correctly than I expected, so I would call this broadly a success. Rotom (ロトム) was classified as a Steel type, though (lol).
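
To put numbers on "broadly a success", the four counts in the output can be turned into the usual summary metrics. This is only arithmetic on the figures printed above; the variable names are my own.

```python
# Counts taken from the output above.
steel_as_steel = 48        # Steel, classified as Steel
electric_as_electric = 43  # Electric, classified as Electric
electric_as_steel = 13     # not Steel, but classified as Steel
steel_as_electric = 14     # Steel, but not classified as Steel

total = (steel_as_steel + electric_as_electric
         + electric_as_steel + steel_as_electric)
accuracy = (steel_as_steel + electric_as_electric) / total
steel_precision = steel_as_steel / (steel_as_steel + electric_as_steel)
steel_recall = steel_as_steel / (steel_as_steel + steel_as_electric)

print(f"rows evaluated: {total}")                 # 118
print(f"accuracy: {accuracy:.3f}")                # 0.771
print(f"Steel precision: {steel_precision:.3f}")  # 0.787
print(f"Steel recall: {steel_recall:.3f}")        # 0.774
```

Two caveats apply. The tally covers all 118 rows, training rows included, so 0.771 is not a held-out estimate. Also, 62 rows carry the Steel label although 58 Steel types were counted, which is consistent with Electric/Steel Pokémon appearing twice in the concatenated data.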

What I did after that

The example above compared Electric and Steel types. There are 18 Pokémon types in all, so next I try to find which pair of types gives the best classification accuracy.
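
Before running such a search, it helps to know how many models it trains. A minimal sketch with `itertools`, assuming the same 18 type names that appear in the data: looping over every ordered pair of distinct types gives 18 × 17 = 306 runs, while unordered pairs would give 153.

```python
from itertools import combinations, permutations

type_list = ["くさ", "ほのお", "みず", "むし", "ノーマル", "あく", "いわ", "はがね",
             "でんき", "ゴースト", "ドラゴン", "エスパー", "かくとう", "どく",
             "フェアリー", "じめん", "ひこう", "こおり"]

ordered_pairs = list(permutations(type_list, 2))    # (A, B) and (B, A) both kept
unordered_pairs = list(combinations(type_list, 2))  # each pair once

print(len(type_list))        # 18
print(len(ordered_pairs))    # 306
print(len(unordered_pairs))  # 153
```

Ordered pairs are not purely redundant when dual-type Pokémon are assigned to the first type of the pair: (A, B) and (B, A) then label Pokémon with both types differently. Picking the best test score out of several hundred runs also tends to flatter the winner, so the top score is best read as an optimistic figure.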

# %%
import codecs
import warnings

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Silence library warnings so the output of the search stays readable.
warnings.filterwarnings('ignore')

# The CSV is encoded in Shift-JIS; bytes that cannot be decoded are skipped.
with codecs.open("data/pokemon_status.csv", "r", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file)

df.info()


# %%
def lr_model_pokemon(type1, type2, test_size=0.3, random_state=0, C=1.0):
    """Fit a type1-vs-type2 classifier and return [train score, test score]."""
    # Pokémon that have type1 as their first or second type.
    df_type1_1 = df[df['タイプ１'] == type1]
    df_type2_1 = df[df['タイプ２'] == type1]
    df_type_1 = pd.concat([df_type1_1, df_type2_1])

    # Pokémon that have type2 as their first or second type.
    df_type1_2 = df[df['タイプ１'] == type2]
    df_type2_2 = df[df['タイプ２'] == type2]
    df_type_2 = pd.concat([df_type1_2, df_type2_2])

    def type_to_num(p_type):
        if p_type == type1:
            return 0
        else:
            return 1

    pokemon_concat = pd.concat([df_type_1, df_type_2], ignore_index=True)
    type_num1 = pokemon_concat["タイプ１"].apply(type_to_num)
    type_num2 = pokemon_concat["タイプ２"].apply(type_to_num)
    # Label 0 if either type is type1, otherwise 1 (type2).
    pokemon_concat["type_num"] = type_num1 * type_num2

    # Columns 7-12 are the features (presumably the six base stats in this CSV).
    X = pokemon_concat.iloc[:, 7:13].values
    y = pokemon_concat["type_num"].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    lr = LogisticRegression(C=C)
    lr.fit(X_train, y_train)

    return [lr.score(X_train, y_train), lr.score(X_test, y_test)]


# %%
max_score_train = 0
max_score_test = 0
train_type1 = ""
test_type1 = ""
train_type2 = ""
test_type2 = ""
type_list = ["くさ", "ほのお", "みず", "むし", "ノーマル", "あく", "いわ", "はがね",
             "でんき", "ゴースト", "ドラゴン", "エスパー", "かくとう", "どく", "フェアリー", "じめん", "ひこう", "こおり"]

# Try every ordered pair of distinct types and remember the best scores.
# With `>=`, a later pair replaces an earlier one when the scores tie.
for type1 in type_list:
    for type2 in type_list:
        if type1 == type2:
            continue
        score = lr_model_pokemon(type1=type1, type2=type2)
        if score[0] >= max_score_train:
            max_score_train = score[0]
            train_type1 = type1
            train_type2 = type2
        if score[1] >= max_score_test:
            max_score_test = score[1]
            test_type1 = type1
            test_type2 = type2

print("Highest score on training data: %s vs %s, score = %.3f" %
      (train_type1, train_type2, max_score_train))
print("Highest score on test data: %s vs %s, score = %.3f" %
      (test_type1, test_type2, max_score_test))

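Output

Highest score on training data: はがね vs ノーマル, score = 0.942
Highest score on test data: はがね vs ノーマル, score = 0.962

The model that separates Steel (はがね) types from Normal (ノーマル) types comes out with the highest score. Next, let's look at how it actually classifies each Pokémon.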

# %%
def poke_predict(type1, type2):
    """Fit a type1-vs-type2 classifier and list each Pokémon classified as type1."""
    type1_1 = df[df['タイプ１'] == type1]
    type2_1 = df[df['タイプ２'] == type1]
    type_1 = pd.concat([type1_1, type2_1])
    print("%s-type Pokémon: %d" % (type1, len(type_1)))

    type1_2 = df[df['タイプ１'] == type2]
    type2_2 = df[df['タイプ２'] == type2]
    type_2 = pd.concat([type1_2, type2_2])
    print("%s-type Pokémon: %d" % (type2, len(type_2)))

    def type_to_num(p_type):
        if p_type == type1:
            return 0
        else:
            return 1

    poke_concat = pd.concat([type_1, type_2], ignore_index=True)
    type1_c = poke_concat["タイプ１"].apply(type_to_num)
    type2_c = poke_concat["タイプ２"].apply(type_to_num)
    # Label 0 if either type is type1, otherwise 1 (type2).
    poke_concat["type_num"] = type1_c * type2_c

    X = poke_concat.iloc[:, 7:13].values
    y = poke_concat["type_num"].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    lr = LogisticRegression(C=1.0)
    lr.fit(X_train, y_train)

    # Predict every row (training and test rows alike) and tally the outcomes.
    success1 = error1 = success2 = error2 = 0
    print("")
    print("[Pokémon classified as %s types]" % type1)
    print("----------------------------------------")
    print("")
    for i, y_pred in enumerate(lr.predict(X)):
        is_type1 = poke_concat.loc[i, "type_num"] == 0
        if y_pred == 0:
            print(poke_concat.loc[i, "ポケモン名"])
            if is_type1:
                success1 += 1
                print("Is a %s type" % type1)
            else:
                error1 += 1
                print("Is not a %s type" % type1)
            print("")
        elif is_type1:
            error2 += 1
        else:
            success2 += 1
    print("----------------------------------------")
    print("Pokémon correctly classified as %s: %d" % (type1, success1))
    print("Pokémon correctly classified as %s: %d" % (type2, success2))
    print("Pokémon wrongly classified as %s: %d" % (type1, error1))
    print("Pokémon wrongly classified as %s: %d" % (type2, error2))
    print("")


# %%
poke_predict("はがね", "ノーマル")

Output

はがね-type Pokémon: 58
ノーマル-type Pokémon: 116

Pokémon correctly classified as はがね: 50
Pokémon correctly classified as ノーマル: 115
Pokémon wrongly classified as はがね: 1
Pokémon wrongly classified as ノーマル: 8

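The sample sizes differ, but an accuracy of 94.8% (165 of 174 classified correctly) still seems quite good. This result suggests that Normal types and Steel types have distinct base-stat profiles.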
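
Because Normal types outnumber Steel types two to one here, it is worth checking the result against the "always answer the larger class" baseline and looking at each class separately. The sketch below is plain arithmetic on the counts printed above; the mean of the two per-class recalls is often called balanced accuracy.

```python
# Counts taken from the output above.
steel_total, normal_total = 58, 116
steel_correct, normal_correct = 50, 115

total = steel_total + normal_total
accuracy = (steel_correct + normal_correct) / total
majority_baseline = max(steel_total, normal_total) / total

# Recall per class, and their mean (balanced accuracy).
steel_recall = steel_correct / steel_total
normal_recall = normal_correct / normal_total
balanced_accuracy = (steel_recall + normal_recall) / 2

print(f"accuracy: {accuracy:.3f}")                    # 0.948
print(f"majority baseline: {majority_baseline:.3f}")  # 0.667
print(f"Steel recall: {steel_recall:.3f}")            # 0.862
print(f"Normal recall: {normal_recall:.3f}")          # 0.991
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.927
```

Unlike the Water-type attempt, the model clears the baseline by a wide margin and finds most Pokémon in the smaller class, so the imbalance does not explain the score away. As before, the counts include training rows.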

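Closing thoughts

I am a beginner who started studying machine learning less than a week ago, but I feel I managed a fairly deep analysis. If anything in this article reflects a mistaken understanding, I would appreciate it if you pointed it out.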
