
Impressions from My First Kaggle Challenge and How I Reached 81% Accuracy on Titanic

Last updated at 2021-09-06 / Posted at 2021-08-15

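I took on Kaggle's introductory "Titanic competition" and more or less reached the commonly cited target of over 80% accuracy. I had planned to finish in one or two months, but studying along the way stretched it to more than half a year. The things I tried rarely translated into better accuracy, and it was much harder than I had imagined.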

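Summary

I expected accuracy to climb steadily as I tried different models and tuned hyperparameters, but that was not the case at all. What helped most was feature engineering. Preprocessing really matters.

Period: About 7 months, including stretches when I was busy or studying something else.
Total time: Probably around 150 hours. I did not track it this time, so the estimate may be far off.
Approach: First, get used to Kaggle while exploring the data thoroughly. Then do various feature engineering and try a range of models. After exhausting my own ideas, read other Kernels and adopt whatever looked promising. Whenever I hit something I did not understand, I stopped to study it and organize it in a blog post.
Prior knowledge: I use Python occasionally at work. I had previously studied these topics.
Knowledge gained: How to use DataFrame, and preprocessing in general.

The Notebook version below is the one I finally cleaned up. I did not set random_state during training, so its accuracy is slightly below 81%, but a few reruns should reach 81%.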

Data

Data overview

There are two datasets: training data and test data. A third file, "gender_submission.csv", is a sample submission. Apart from practicing a submission and checking the submission format, it does not seem to serve any purpose.

Training data (train.csv)
Test data (test.csv)

No.	Variable	Description	Key information
1	PassengerId	Passenger ID	Unique key
2	Survived	Survival (target variable)	0 = No (died), 1 = Yes (survived). Not present in the test data
3	Pclass	Ticket class	1 = 1st(Upper), 2 = 2nd(Middle), 3 = 3rd(Lower)
4	Name	Full name
5	Sex	Sex	male/female
6	Age	Age	Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
7	SibSp	Number of siblings and spouses aboard	Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
8	Parch	Number of parents and children aboard	Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
9	Ticket	Ticket number
10	Fare	Fare	Seems to be the combined fare when, for example, a parent and child bought two seats
11	Cabin	Cabin number	Many missing values
12	Embarked	Port of embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Program

0. Loading packages

Load the packages. Scikit-Learn does most of the work.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, learning_curve
from sklearn.metrics import classification_report, roc_curve, auc, precision_recall_curve, plot_roc_curve, plot_confusion_matrix

1. Loading and exploring the data

1.1. Loading and checking the data

Read the files and check the data overview with info.

train_csv = pd.read_csv("/kaggle/input/titanic/train.csv")
test_csv = pd.read_csv("/kaggle/input/titanic/test.csv")
print(train_csv.info())
print(test_csv.info())

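The training data has 891 records. That is not many.
The test data has 418 records. Being test data, it lacks the Survived column (the answer) that the training data has.
Both contain some missing values.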

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
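
The info output above requires reading the non-null counts column by column. As a small sketch of a more direct check, the helper below lists only the columns that have missing values. The name missing_summary and the toy frame are assumptions for illustration, not part of the original notebook or the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

def missing_summary(df):
    """Return count and ratio of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(train_csv) and missing_summary(test_csv) in the notebook should give the same information as info, limited to the columns that need attention.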

1.2. Checking Survived (survival)

Plot Survived (survival) to check it.

DICT_SURVIVED = {0: '0: Dead', 1: '1: Survived'}

def arrange_bar(ax, sr):
    ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=30, horizontalalignment="center")
    ax.grid(axis='y', linestyle='dotted')
    [ax.text(i, count, count, horizontalalignment='center') for i, count in enumerate(sr)]

sr_survived = train_csv['Survived'].value_counts().rename(DICT_SURVIVED)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
fig.subplots_adjust(wspace=0.5, hspace=0.5)
sr_survived.plot.pie(autopct="%1.1f%%", ax=axes[0])
sr_survived.plot.bar(ax=axes[1])

arrange_bar(axes[1], sr_survived)

plt.show()

The survival rate is 38.4%.

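1.3. Defining a plotting function

Define a plotting function in advance. It plots survival by a given feature as a 2-by-2 grid of charts.
I wrote about it in detail in a separate article, "[For Beginners] Understanding Data Through Exploration and Visualization Before Machine Learning (pandas and matplotlib)".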

def arrange_stack_bar(ax):
    ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=30, horizontalalignment="center")
    ax.grid(axis='y', linestyle='dotted')

def output_bars(df, column, index={}):    
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
    fig.subplots_adjust(wspace=0.5, hspace=0.5)    

    # Case without key-value labels
    if len(index) == 0:
        df_vc = df.groupby([column])["Survived"].value_counts(
            sort=False).unstack().rename(columns=DICT_SURVIVED)
        df[column].value_counts().plot.pie(ax=axes[0, 0], autopct="%1.1f%%")
        df.groupby([column])["Survived"].value_counts(
            sort=False, normalize=True).unstack().rename(columns=DICT_SURVIVED).plot.bar(ax=axes[1, 1], stacked=True)
    
    # Case with key-value labels
    else:
        df_vc = df.groupby([column])["Survived"].value_counts(
            sort=False).unstack().rename(index=index, columns=DICT_SURVIVED)
        df[column].value_counts().rename(index).plot.pie(ax=axes[0, 0], autopct="%1.1f%%")
        df.groupby([column])["Survived"].value_counts(
            sort=False, normalize=True).unstack().rename(index=index, columns=DICT_SURVIVED).plot.bar(ax=axes[1, 1], stacked=True)   

    df_vc.plot.bar(ax=axes[1, 0])

    for rect in axes[1, 0].patches:
        height = rect.get_height()
    
        # https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py
        axes[1, 0].annotate('{:.0f}'.format(height),
                        xy=(rect.get_x() + rect.get_width() / 2, height),
                        xytext=(0, 3),  # 3 points vertical offset
                        textcoords="offset points",
                        ha='center', va='bottom')

    df_vc.plot.bar(ax=axes[0, 1], stacked=True)

    arrange_stack_bar(axes[0, 1])
    arrange_stack_bar(axes[1, 0])
    arrange_stack_bar(axes[1, 1])

    # Add data labels
    [axes[0, 1].text(i, item.sum(), item.sum(), horizontalalignment='center') 
     for i, (_, item) in enumerate(df_vc.iterrows())]

    plt.show()

1.4. Plotting Pclass (ticket class)

Plot Pclass (ticket class) using the output_bars function defined above.

DICT_PCLASS = {1: '1: 1st(Upper)', 2: '2: 2nd(Middle)', 3: '3: 3rd(Lower)'}
output_bars(train_csv, 'Pclass', DICT_PCLASS)

The higher the class, the higher the survival rate. It seems wealthy passengers were more likely to be saved.

1.5. Plotting Sex

Plot Sex.

output_bars(train_csv, 'Sex')

Men outnumber women overall, but more of the survivors are women.

1.6. Plotting Embarked (port of embarkation)

Plot Embarked (port of embarkation).

DICT_EMBARK = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}
output_bars(train_csv, 'Embarked', DICT_EMBARK)

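I have not looked into the reason, but the survival rate varies slightly by port of embarkation. The difference is not large, though. When I checked Feature Importance after training, this feature carried little weight. It may be collinear with other features.

1.7. Plotting Age

Plot Age. This section also includes a function for histograms.
Histogram output is also covered in detail in the separate article "[For Beginners] Understanding Data Through Exploration and Visualization Before Machine Learning (pandas and matplotlib)".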

# Missing values: excluded from the plots
def output_box_hist(column, bins=20, query=None):
    if query is None:
        fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
    else:
        fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
        train_csv.query(query)[column].hist(ax=axes[2, 0], bins=bins)
        train_csv.query(query).groupby('Survived')[column].plot.hist(
        ax=axes[2, 1], bins=bins, alpha=0.5, legend=True, grid=True)
        axes[2, 1].legend(labels=[DICT_SURVIVED[int(float((text.get_text())))] for text in axes[2, 1].get_legend().get_texts()])

    fig.subplots_adjust(wspace=0.5, hspace=0.5)

    train_csv.boxplot(ax=axes[0, 0], column=[column])
    train_csv.boxplot(ax=axes[0, 1], column=[column], by='Survived')
    axes[0, 1].set_xticklabels([DICT_SURVIVED[int(float(xticklabel.get_text()))] for xticklabel in axes[0, 1].get_xticklabels()])
    train_csv[column].hist(ax=axes[1, 0], bins=bins)
    train_csv.groupby('Survived')[column].plot.hist(ax=axes[1, 1], bins=bins, alpha=0.5, grid=True, legend=True)
    axes[1, 1].legend(labels=[DICT_SURVIVED[int(float((text.get_text())))] for text in axes[1, 1].get_legend().get_texts()])

    plt.show()

output_box_hist('Age')

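Survival is high up to the early teens. At first glance Age looks like a useful feature, but in the end I did not use it because accuracy dropped when I did. Some kernels bin Age, but I did not go that far in this version.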
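
For readers unfamiliar with binning, here is a minimal sketch of what binning Age could look like with pd.cut. The bin edges, the labels, and the "unknown" category for missing ages are arbitrary assumptions for illustration, not values taken from any particular kernel.

```python
import numpy as np
import pandas as pd

# Toy ages (made up), including one missing value.
ages = pd.Series([0.8, 5, 14, 22, 35, 61, np.nan], name="Age")

# Bin edges and labels are arbitrary choices for illustration.
bins = [0, 12, 18, 60, np.inf]
labels = ["child", "teen", "adult", "senior"]

# Intervals are right-closed by default: (0, 12], (12, 18], (18, 60], (60, inf].
age_bin = pd.cut(ages, bins=bins, labels=labels)

# Missing ages stay NaN after pd.cut; give them their own category here.
age_bin = age_bin.cat.add_categories("unknown").fillna("unknown")
print(age_bin.tolist())
```

Keeping missing ages as a separate category avoids having to impute Age before binning, which may or may not help accuracy.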

1.8. Plotting SibSp (number of siblings and spouses aboard)

Plot SibSp (number of siblings and spouses aboard). Siblings and Spouse.

output_bars(train_csv, 'SibSp')

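Values of 2 or more have few records to begin with. The survival rate falls as the number grows.

1.9. Plotting Parch (number of parents and children aboard)

Plot Parch (number of parents and children aboard). Parents and Children.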

output_bars(train_csv, 'Parch')

Values of 3 or more have few records to begin with, and the data is sparse overall.

1.10. Plotting Fare

Plot Fare. The third row shows only fares below 200.

output_box_hist('Fare', 20, 'Fare < 200')

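The death rate is high for fares of 25 or less. Using this feature lowered accuracy, so I ended up not using it.

1.11. Correlation matrix

Output the correlation matrix with a one-liner. Only numeric features are included. It is a little unfortunate that features such as Sex are missing.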

train_csv.loc[:, ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]].corr().style.background_gradient(axis=None)

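The feature most strongly correlated with Survived is Pclass (in absolute value, since the correlation is negative). Surprisingly, Fare also has a fairly high correlation coefficient.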
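
One way to bring Sex into the correlation matrix is to map its two categories to 0 and 1 first. The sketch below uses a made-up toy frame and a hypothetical SexCode column, so the resulting coefficient says nothing about the real data.

```python
import pandas as pd

# Toy frame standing in for train_csv (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
})

# Map the two categories to 0/1 so that corr() can include the column.
toy["SexCode"] = toy["Sex"].map({"male": 0, "female": 1})

corr = toy[["Survived", "SexCode", "Pclass"]].corr()
print(corr["Survived"])
```

In the notebook, the same mapping applied to a copy of train_csv would let the one-liner above include Sex alongside the numeric columns.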

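2. Feature generation

In "1. Loading and exploring the data" we looked at the given features as they are. In "2. Feature generation" we create new features.
It is not good practice, but I concatenate the training and test data because it is quick.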

both = pd.concat([train_csv, test_csv], ignore_index=True)

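2.1. Grouping

Split passengers into groups on the premise that families and travel companions share the same fate. I used the linked Kernel as a reference.
Grouping uses the following two criteria.

Last name + Fare
Ticket (I read that the last digit should be left out of the grouping, but I did not pursue that.)

The grouping is described in the article "An Example of Improving Accuracy with Target Encoding (Leave One Out)".

2.1.1. Creating "last name + Fare"

Create the "last name + Fare" column used as a grouping key with the following steps.

Extract the last name from the Name feature to create the new feature "Last_Name"
Concatenate "Last_Name" and "Fare" as strings to create "Name_Fare"

# Create Last_Name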
both['Last_Name'] = both['Name'].apply(lambda x: str.split(x, ",")[0])

# Records with a null Fare get the string "nan"
both['Name_Fare'] = both['Last_Name'] + both['Fare'].astype('str')

2.1.2. Grouping

Group recursively by the following two criteria. I rarely write recursion, so the code took a while.

Last name + Fare
Ticket

# Group by last name and Fare
def process_name(i, name_fare):
    tickets = both.loc[(both['Name_Fare'] == name_fare) & (both['Group'].isnull()),'Ticket'].unique().tolist()
    both.loc[(both['Name_Fare'] == name_fare) & (both['Group'].isnull()), 'Group'] = i
    for ticket in tickets:
        process_ticket(i, ticket)

# Group by ticket
def process_ticket(i, ticket):
    name_fares = both.loc[(both['Ticket'] == ticket) & (both['Group'].isnull()),'Name_Fare'].unique().tolist()
    both.loc[(both['Ticket'] == ticket) & (both['Group'].isnull()), 'Group'] = i
    for name_fare in name_fares:
        process_name(i, name_fare)

both['Group'] = None

# Group by ticket (the recursion alternates between last name + Fare and ticket)
[process_ticket(i, ticket) for i, ticket in enumerate(both['Ticket'].unique().tolist())]

print('Ticket Count', both['Ticket'].nunique())
print('Name & Fare Count', both['Name_Fare'].nunique())
print('Ticket & Name Count', both['Group'].nunique())

The grouping works well.

print output

Ticket Count 929
Name & Fare Count 982
Ticket & Name Count 887
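
The recursion above effectively merges records that are connected through a shared Ticket or a shared Name_Fare. As an alternative sketch (not the method used in this notebook), the same kind of grouping can be written iteratively with a union-find structure. The function name assign_groups and the toy data are assumptions for illustration.

```python
import pandas as pd

def assign_groups(df, keys):
    """Label rows so that rows sharing a value in any key column get the same group id."""
    parent = list(range(len(df)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key in keys:
        first_seen = {}
        for pos, value in enumerate(df[key]):
            if value in first_seen:
                # Merge this row's set with the set of the first row having the same value.
                parent[find(pos)] = find(first_seen[value])
            else:
                first_seen[value] = pos

    roots = [find(i) for i in range(len(df))]
    # Renumber roots as consecutive integers in order of first appearance.
    return pd.Series(pd.factorize(roots)[0], index=df.index)

# Toy data (made up): rows 0-1 share a ticket, rows 1-2 share Name_Fare.
toy = pd.DataFrame({
    "Ticket":    ["A1", "A1", "B2", "C3", "D4"],
    "Name_Fare": ["Smith7.0", "Jones7.0", "Jones7.0", "Brown9.0", "Green5.0"],
})
toy["Group"] = assign_groups(toy, ["Ticket", "Name_Fare"])
print(toy)
```

Rows 0, 1, and 2 end up in one group even though rows 0 and 2 share neither key directly, which is the transitive behavior the recursive version is after. The iterative form also avoids any recursion depth concerns.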

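2.2. Title

Extract titles such as Mr and Miss from Name. Used as they are, the titles are far too fine-grained, so I consolidate them into four: Master, Mr, Miss, and Mrs.
Plots of the unconsolidated titles are shown in the separate article "[For Beginners] Understanding Data Through Exploration and Visualization Before Machine Learning (pandas and matplotlib)".

I looked up each title. For reference.

Title	Sex	Marital status	Description
Mr.	Male	-	Mister
Mrs.	Female	Married	Missus
Miss.	Female	Unmarried	Miss
Ms.	Female	-	Ms. (pronounced "miz")
Master.	Male	Unmarried	A boy or young man. I had pictured something like Yoda from Star Wars, but it is different
Dr.	Male	-	Doctor (assumed male because of the era)
Rev.	Male		Short for Reverend. Clergy, such as a pastor (assumed married because of the era)
Col.	Male		Short for colonel (assumed male because of the era)
Mme.	Female	Married	Madame (can apparently be unmarried too, but assumed married because of the era)
Capt.	Male	-	Captain
Countess.	Female	Married	Countess. The female counterpart of Count, as in Count Dooku from Star Wars
Jonkheer.	Male	-	Nobleman
Major.	Male	-	Major (assumed male because of the era)
Mlle.	Female	Unmarried	Short for mademoiselle, a young lady. Being young, presumably unmarried
Don.	Male	-	An honorific for nobles and high-ranking clergy
Dona.	Female	-	Female counterpart of Don.
Sir.	Male	-	A titled nobleman (I did not look into it closely)
Lady.	Female	Married	Wife of a Sir. (I did not look into it closely)

Madame (Wikipedia)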

2.2.1. Extracting Title from the Name feature

Extract the title from the Name feature and store it in the Title column.

titles = ["Mr.", "Miss.", "Mrs.", "Master.", "Dr.", "Rev.", "Col.", "Ms.", 
          "Mlle.", "Mme.", "Capt.", "Countess.", "Major.", "Jonkheer.", "Don.", 
         "Dona.", "Sir.", "Lady."]

# Extract Title
for title in titles:
    both.loc[both.Name.str.contains(title, regex=False), 'Title'] = title
    
print(both.loc[:,['Name', 'Title']])

The extraction works well.

print output

                                                   Name    Title
0                               Braund, Mr. Owen Harris      Mr.
1     Cumings, Mrs. John Bradley (Florence Briggs Th...     Mrs.
2                                Heikkinen, Miss. Laina    Miss.
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)     Mrs.
4                              Allen, Mr. William Henry      Mr.
...                                                 ...      ...
1304                                 Spector, Mr. Woolf      Mr.
1305                       Oliva y Ocana, Dona. Fermina    Dona.
1306                       Saether, Mr. Simon Sivertsen      Mr.
1307                                Ware, Mr. Frederick      Mr.
1308                           Peter, Master. Michael J  Master.
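
The loop above needs the full list of titles up front. As an alternative sketch, a regular expression can pull out whatever sits between the comma after the surname and the next period, so no list is required. This is not what the notebook does, and unlike the output above, the extracted titles here have no trailing period. The names are taken from the print output above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
    "Peter, Master. Michael J",
])

# Capture the text between the comma after the surname and the next period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A regex like this picks up unexpected titles automatically, but it also returns multi-word strings when a name has extra words before the period, so the result still needs checking with value_counts.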

2.2.2. Consolidating Title

Classify the Title column into four values (Master, Mr, Miss, Mrs) and store the result in the NewTitle column.

both.loc[both['Title']=='Master.', 'NewTitle'] = 'Master'
both.loc[(both['Sex']=='male')&(both['NewTitle']!='Master'), 'NewTitle'] = 'Mr'
both.loc[(both['Title']=='Mlle.')|((both['Title']=='Ms.')|(both['Title']=='Miss.')), 'NewTitle'] = 'Miss'
both.loc[(both['Sex']=='female')&(both['NewTitle']!='Miss'), 'NewTitle'] = 'Mrs'

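2.3. Family Size

Compute the total number of family members aboard.
Store the sum of SibSp (number of siblings and spouses) and Parch (number of parents and children) in the FamilySize column. The passenger is not counted.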

both['FamilySize'] = both['SibSp'] + both['Parch']

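3. Feature transformation and missing value imputation

Transform features with various encodings and impute the missing values of Embarked (port of embarkation). I use a tree-based model this time, so no Feature Scaling is applied.
First, copy only the features to be used into new DataFrames.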

COPIED = ['PassengerId', 'Survived', 'Pclass', 'Sex', 'FamilySize', 'Embarked', 'Group', 'NewTitle']
train = both.loc[:890, COPIED].copy()
test = both.loc[891:, COPIED].copy()

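3.1. Imputing missing values in Embarked

Impute the missing values of Embarked (port of embarkation) with the most frequent value using SimpleImputer. Only 2 records are missing, so I use the most frequent value, which is simple and easy to apply.
* Details on imputation are in the article "Missing Value Imputation in Python with scikit-learn and pandas".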

imputer = SimpleImputer(strategy='most_frequent')
train['Embarked'] = imputer.fit_transform(train.loc[:,['Embarked']])
test['Embarked'] = imputer.transform(test.loc[:,['Embarked']])

3.2. One-hot encoding Embarked and NewTitle

One-hot encode Embarked (port of embarkation) and NewTitle (consolidated title) with OneHotEncoder. drop is set to 'first' to avoid multicollinearity.
* For One-hot Encoding, see the article "Preprocessing Categorical Features (scikit-learn and category_encoders)".

oh_encoder = OneHotEncoder(sparse=False, drop='first')
oh_encoder.fit(train.loc[:,['Embarked', 'NewTitle']])
onehot = pd.DataFrame(oh_encoder.transform(train.loc[:,['Embarked', 'NewTitle']]), 
                        columns=oh_encoder.get_feature_names(),
                        index=train.index,
                        dtype=np.int8)
train = pd.concat([train, onehot], axis = 1)
onehot = pd.DataFrame(oh_encoder.transform(test.loc[:,['Embarked', 'NewTitle']]), 
                        columns=oh_encoder.get_feature_names(), 
                        index=test.index,
                        dtype=np.int8)
test = pd.concat([test, onehot], axis = 1)
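
To see what dropping the first category does, here is a small sketch using pd.get_dummies on made-up values instead of OneHotEncoder. It is only an illustration of the idea: unlike the encoder above, get_dummies learns its categories from whatever frame it is given, so it is less suited to applying the same encoding to train and test.

```python
import pandas as pd

# Toy values (made up) covering the three ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# drop_first=True drops the first category in sorted order ("C"),
# so a row with all zeros means "C".
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", drop_first=True, dtype="int8")
print(dummies)
```

With three categories, two columns are enough: the dropped category is implied when both are 0, which is why keeping all three would make the columns linearly dependent.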

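3.3. Count encoding Group

Count-encode Group.
For the test dataset, 1 is added for the record itself (I wanted the count to cover test data as well as training data). The true count may be higher, but this is a rough calculation.
Scikit-Learn had no count encoding function, so I use CountEncoder from category_encoders.
* (link to a separate article)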

count_encoder = ce.CountEncoder(cols=['Group'], handle_unknown=0, return_df=True)
train['Group_Count'] = count_encoder.fit_transform(train['Group'])
test['Group_Count'] = count_encoder.transform(test['Group']).astype('int') + 1  # add 1 for the test record itself (there may be more, but this is a rough count)
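
For reference, the same counting logic can be sketched in plain pandas, which makes the "+1 for the test record itself" step easy to follow. The group ids below are made up, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy group ids (made up): group 3 appears only in the test data.
train_group = pd.Series([0, 0, 1, 2, 2, 2], name="Group")
test_group = pd.Series([0, 3, 2], name="Group")

# Count how often each group id occurs in the training data.
counts = train_group.value_counts()

train_count = train_group.map(counts)
# Groups unseen in training get 0, then each test row counts itself (+1).
test_count = test_group.map(counts).fillna(0).astype(int) + 1
print(train_count.tolist(), test_count.tolist())
```

Note the consequence of the +1: in the test data, a count of 1 means the group has no members in the training data, and a count of 2 means exactly one training member.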

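3.4. Target encoding Group

Target-encode Group with Leave One Out. For test records whose group has no members in the training data, set the mean survival rate of 0.384.
* See the article "An Example of Improving Accuracy with Target Encoding (Leave One Out)"
* (link to a separate article)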

te = ce.LeaveOneOutEncoder(cols=['Group'])
train['Group_Target'] = te.fit_transform(train['Group'], train['Survived'])
test['Group_Target'] = te.transform(test['Group'])

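# Group_Count == 2 in test means exactly one training member, so use that member's outcome.
for index, row in test.query('Group_Count == 2').iterrows():
    test.at[index, 'Group_Target'] = train.loc[train['Group'] == row['Group'], 'Survived'].iloc[0]
# Test counts include +1 for the record itself, so 1 means the group is absent from train.
test.loc[test['Group_Count'] == 1, 'Group_Target'] = 0.384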
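
To make the Leave One Out idea concrete, here is a hand-rolled sketch in pandas: each training row is encoded with the mean target of the other members of its group, so a row never sees its own label. The toy data is made up, and using the overall mean for single-member groups is an assumption of this sketch rather than a statement about how LeaveOneOutEncoder handles them.

```python
import pandas as pd

# Toy training data (made up): group 2 has a single member.
toy = pd.DataFrame({
    "Group":    [0, 0, 0, 1, 1, 2],
    "Survived": [1, 0, 1, 1, 1, 0],
})

grp = toy.groupby("Group")["Survived"]
total = grp.transform("sum")    # sum of targets within each row's group
size = grp.transform("count")   # number of members in each row's group

# Leave-one-out mean: average of the other members' targets.
loo = (total - toy["Survived"]) / (size - 1)
# A group of one has no other members; fall back to the overall mean.
loo = loo.where(size > 1, toy["Survived"].mean())

toy["Group_Target"] = loo
print(toy)
```

In a group of two, the encoded value is simply the other member's outcome, which is why it can only be 0 or 1.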

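3.5. Label encoding Pclass and Sex

Label-encode Pclass (ticket class) and Sex.
For Pclass I was torn between label encoding and One-Hot Encoding, and chose label encoding so that the numeric order carries meaning.
For Sex, LabelBinarizer would give the same result, but I wanted to process several features at once, so I use OrdinalEncoder.
* (link to a separate article)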

oe = OrdinalEncoder()
train.loc[:,['PclassEncoded', 'SexEncoded']] = oe.fit_transform(train[['Pclass', 'Sex']])
test.loc[:,['PclassEncoded', 'SexEncoded']] = oe.transform(test[['Pclass', 'Sex']])

4. Checking the final features

Check the final features.

4.1. Feature overview

Check the feature overview with info.

print(train.info())
print(test.info())

No missing values remain, apart from Survived in the test data, which is empty by design.

info output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    891 non-null    int64  
 1   Survived       891 non-null    float64
 2   Pclass         891 non-null    int64  
 3   Sex            891 non-null    object 
 4   FamilySize     891 non-null    int64  
 5   Embarked       891 non-null    object 
 6   Group          891 non-null    object 
 7   NewTitle       891 non-null    object 
 8   x0_Q           891 non-null    int8   
 9   x0_S           891 non-null    int8   
 10  x1_Miss        891 non-null    int8   
 11  x1_Mr          891 non-null    int8   
 12  x1_Mrs         891 non-null    int8   
 13  Group_Count    891 non-null    int64  
 14  Group_Target   891 non-null    float64
 15  PclassEncoded  891 non-null    float64
 16  SexEncoded     891 non-null    float64
dtypes: float64(4), int64(4), int8(5), object(4)
memory usage: 88.0+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 891 to 1308
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    418 non-null    int64  
 1   Survived       0 non-null      float64
 2   Pclass         418 non-null    int64  
 3   Sex            418 non-null    object 
 4   FamilySize     418 non-null    int64  
 5   Embarked       418 non-null    object 
 6   Group          418 non-null    object 
 7   NewTitle       418 non-null    object 
 8   x0_Q           418 non-null    int8   
 9   x0_S           418 non-null    int8   
 10  x1_Miss        418 non-null    int8   
 11  x1_Mr          418 non-null    int8   
 12  x1_Mrs         418 non-null    int8   
 13  Group_Count    418 non-null    int64  
 14  Group_Target   418 non-null    float64
 15  PclassEncoded  418 non-null    float64
 16  SexEncoded     418 non-null    float64
dtypes: float64(4), int64(4), int8(5), object(4)
memory usage: 41.4+ KB
None

4.2. Correlation matrix

Output the correlation matrix for everything except PassengerId.

train.iloc[:, 1:].select_dtypes('number').corr().style.background_gradient(axis=None)

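The feature with the highest correlation with the target variable is Sex.

4.3. Plotting the Grouping results

Plot the generated Grouping feature to check it. From left to right:

Pie chart of group sizes
Histogram of group sizes
Histogram of survival rate for groups of 3 or more (groups of 2 are excluded because Leave One Out makes the value either 1 or 0)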

_, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 4))
train['Group_Count'].value_counts().plot.pie(ax=axes[0], title='Group Count Pie Chart', autopct="%1.1f%%")
train['Group_Count'].plot.hist(ax=axes[1], title='Group Count Histogram')
train.query('Group_Count > 2')['Group_Target'].plot.hist(ax=axes[2], title='Survival Rate Histogram')
plt.show()

The survival-rate histogram (rightmost) shows a reasonably skewed distribution.

Histograms by group size.

_, ax = plt.subplots(figsize=(12, 10))
train.query('Group_Count > 2').Group_Target.hist(ax=ax, by=train['Group_Count'], range=(0, 1))
plt.show()

The values lean reasonably toward 0 or 1, so this makes a decent feature.

4.4. Plotting the NewTitle results

Plot the generated NewTitle feature to check it.

output_bars(train, 'NewTitle')

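You can see that the death rate for Mr is high. Age is not used this time, so NewTitle is what separates adult males from boys.

4.5. Plotting the FamilySize results

Plot the generated FamilySize feature to check it.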

output_bars(train, 'FamilySize')

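Sizes 1 to 3 have a high survival rate, and although samples are few, there are no survivors at 7 or more.

5. Training

Train with grid search. In the end I used RandomForest.

5.0. Plotting functions for training

Define functions that plot training-related charts.

5.0.1. Learning curve

Plot the learning curve. Using mlxtend's Plotting Learning Curves might also have been fine.

# Plot the learning curve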
def output_learning_curve(ax, x_all, y_all, gscv):
    training_sizes, train_scores, test_scores = learning_curve(gscv.best_estimator_,
                                                               x_all, y_all, cv=5,
                                                               train_sizes=[0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    ax.plot(training_sizes, train_scores.mean(axis=1), label="training scores")
    ax.plot(training_sizes, test_scores.mean(axis=1), label="test scores")
    ax.legend(loc="best")
    ax.set_title('Learning Curve')

5.0.2. Precision-recall curve

Plot the precision-recall curve. Using the plot_precision_recall_curve function might also have been fine.

# Plot the precision-recall curve
def output_pr_curve(ax, y_test, y_pred):
    # Get precision, recall, and the threshold value at each threshold
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_pred[:,1])

    # Plot a circle at every 0.05 step from 0 to 1
    for i in range(21):
        close_point = np.argmin(np.abs(thresholds - (i * 0.05)))
        ax.plot(precisions[close_point], recalls[close_point], 'o')

    # Precision-recall curve
    ax.plot(precisions, recalls)
    ax.set_xlabel('Precision')
    ax.set_ylabel('Recall')
    ax.set_title('Precision Recall Curve')

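5.0.3. Main plotting function

This function calls the learning-curve and precision-recall functions. It plots other charts too, in a 2-by-3 grid:

Column 1	Column 2	Column 3
Confusion Matrix	Grid Search results	Feature Importance/Coef (not shown for some classifiers)
ROC curve	Precision-recall curve	Learning curve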

def output_graphs(gscv, X, X_test):

    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 8))
    fig.subplots_adjust(wspace=0.5, hspace=0.5)
    
    # Plot the Confusion Matrix
    plot_confusion_matrix(gscv.best_estimator_, X, y, ax=axes[0, 0], display_labels=['0: Dead', '1: Survived'])
    
    # Plot the Grid Search results
    result = pd.DataFrame(gscv.cv_results_).set_index('params')
    # Show the 10 best parameter sets (rank 1 = best), with the best at the top of the chart
    result.sort_values('rank_test_score').head(10).iloc[::-1]['mean_test_score'].plot.barh(ax=axes[0, 1], grid=True)
    axes[0, 1].set_xlim(result['mean_test_score'].min() * 0.9, result['mean_test_score'].max() * 1.1)
    
    # Plot Feature Importance (if the classifier has it)
    axes[0, 2].set_title('Feature Importance/ Coef Top 10')
    if hasattr(gscv.best_estimator_, 'feature_importances_'):
        importances = pd.DataFrame({'Importance':gscv.best_estimator_.feature_importances_}, index=X_test.columns)
        importances.sort_values('Importance', ascending=False).head(10).sort_values('Importance', ascending=True).plot.barh(ax=axes[0, 2], grid=True)
    if hasattr(gscv.best_estimator_, 'coef_'):
        importances = pd.DataFrame({'Coef':gscv.best_estimator_.coef_[0]}, index=X_test.columns)
        importances.sort_values('Coef', ascending=False).head(10).sort_values('Coef', ascending=True).plot.barh(ax=axes[0, 2], grid=True)
    
    # Plot the ROC curve
    plot_roc_curve(gscv.best_estimator_, X, y, ax=axes[1, 0]) 
    axes[1, 0].set_title('ROC(Receiver Operating Characteristic) Curve')

    # Plot the precision-recall curve
    y_pred = gscv.best_estimator_.predict_proba(X)
    output_pr_curve(axes[1, 1], y, y_pred)

    # Plot the learning curve
    output_learning_curve(axes[1, 2], X, y, gscv)
    
    plt.show()

5.1. Training

A function that trains with grid search. It also calls the plotting function defined above.

def fit(classifier, parameters):
    gscv = GridSearchCV(classifier, parameters, cv=5)
    gscv.fit(X, y)
    
    print('Grid Search Best parameters:', gscv.best_params_)
    print('Grid Search Best validation score:', gscv.best_score_)
    print('Grid Search Best training score:', gscv.best_estimator_.score(X, y))
    
    predictions = gscv.best_estimator_.predict(X_test)
    
    X_pred = gscv.best_estimator_.predict(X)
    print('\n', classification_report(y, X_pred))
    
    output_graphs(gscv, X, X_test)
    
    return gscv

Put the data into the usual variables X, X_test, and y.

features = ["Group_Count", "Group_Target", 'PclassEncoded', 'SexEncoded', "FamilySize", 'x0_Q', 'x0_S', 'x1_Miss', 'x1_Mr', 'x1_Mrs']
y = train["Survived"].astype('int')
tmp_sr = test['PassengerId']
X = train[features]
X_test = test[features]

Pass RandomForestClassifier to the function together with the parameters to grid-search.

%%time
parameters = [{'max_depth': np.arange(4, 15, 1), 'n_estimators': np.arange(60, 141, 20), 'criterion': ['gini', 'entropy']}]
rf_gscv = fit(RandomForestClassifier(), parameters)

The result is an accuracy of about 86%. It changes slightly on every training run.
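
The run-to-run variation comes from the randomness inside the forest (bootstrap samples and feature subsets). As a minimal sketch of how fixing random_state removes it, the example below trains twice with the same seed on synthetic data. The data, the seed value 42, and the helper name fit_predict_proba are assumptions for illustration, not the settings of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for X and y (not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_predict_proba(seed):
    """Train a forest with the given seed and return its class probabilities."""
    model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)

# The same seed reproduces the same forest, so the outputs match exactly.
same = (fit_predict_proba(42) == fit_predict_proba(42)).all()
print(same)
```

In the notebook this would amount to passing RandomForestClassifier(random_state=...) to fit, at the cost of the score depending on the chosen seed.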

Result

Grid Search Best parameters: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 120}
Grid Search Best validation score: 0.8607996987006465
Grid Search Best training score: 0.8630751964085297

               precision    recall  f1-score   support

           0       0.86      0.94      0.89       549
           1       0.88      0.75      0.81       342

    accuracy                           0.86       891
   macro avg       0.87      0.84      0.85       891
weighted avg       0.86      0.86      0.86       891

5.2. Prediction and output

Finally, predict and write the result to csv.

predictions = rf_gscv.best_estimator_.predict(X_test)
output = pd.DataFrame({'PassengerId': tmp_sr, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)

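Ideas that did not make the cut

Classification algorithms

I tried many, but surprisingly Random Forest turned out best. They are in this Version. Naturally, Feature Scaling is applied for the non-tree models.
The features and other conditions differ slightly, so treat these as a rough reference.

Algorithm	Training accuracy	Validation accuracy	Test accuracy	Notes
RandomForest	86.08%	86.42%	81.10%	Adopted model
KNN	85.19%	81.26%	73.21%	Worse than expected. Probably due to how I used it
Logistic regression	85.19%	84.97%	78.47%
Multilayer perceptron	84.96%	84.63%	79.19%
SVM	81.48%	80.58%	78.71%
ν-SVM	82.71%	82.94%	77.51%	Shows less overfitting than SVM
Naive Bayes	79.01%	79.01%	Not measured	Very fast
Gradient Boosting	92.37%	86.30%	79.90%	Best on training, but tends to overfit
HGB	89.34%	85.75%	80.38%	Second best after RandomForest
Voting	88.44%	86.31%	78.95%
Stacking	90.80%	Not measured	78.47%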

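Imputing missing Age values

I tried imputing missing values with IterativeImputer, which I learned about in "Missing Value Imputation in Python with scikit-learn and pandas", but accuracy did not improve.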

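from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor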

ite_imputer = IterativeImputer(RandomForestRegressor())
train_imputed = pd.DataFrame(ite_imputer.fit_transform(train[features]))
train_imputed.columns = features
train = train_imputed
test_imputed = pd.DataFrame(ite_imputer.transform(test[features]))
test_imputed.columns = features
test = test_imputed

Feature selection

I ran feature selection with a Wrapper Method. The result barely changed.
* See "[For Beginners] Feature Selection Basics (scikit-learn, Occasionally mlxtend)"

EFS (Exhaustive Feature Selector)

The Version that uses EFS (Exhaustive Feature Selector).

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

efs = EFS(RandomForestClassifier(), min_features=8, max_features=11)
efs = efs.fit(X, y)

print('Best accuracy score: %.2f' % efs.best_score_)
print('Best subset:', efs.best_feature_names_)

efs_result = pd.DataFrame.from_dict(efs.get_metric_dict()).T
efs_result.sort_values('avg_score', inplace=True, ascending=False)

RFE (Recursive Feature Elimination)

The Version that uses RFE (Recursive Feature Elimination).

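from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingClassifier as GBC  # assumption: GBC refers to GradientBoostingClassifier

selector = RFECV(GBC(), min_features_to_select=7, cv=5)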

X_selected = pd.DataFrame(selector.fit_transform(X, y), 
                 columns=X.columns.values[selector.get_support()])

result = pd.DataFrame(selector.get_support(), index=X.columns.values, columns=['False: dropped'])
result['ranking'] = selector.ranking_

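Other

I tried these in this Version. That said, the ideas below may not be bad in themselves; my approach may have been the problem. Some people get high accuracy by binning Age.

I split Ticket into its first character and its numeric part and tried various things, but accuracy did not improve at all.
I did the same with Cabin, splitting it into its first character and numeric part, but accuracy did not improve at all. Cabin has many missing values and may simply be uninformative.
I also binned Age, but accuracy did not improve at all.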
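
As a rough sketch of what splitting Cabin into its first character and numeric part could look like (the exact code of that Version is not shown here), the example below uses made-up cabin values. The "U" placeholder for missing cabins and taking only the first number of a multi-cabin entry are assumptions of this sketch.

```python
import pandas as pd

# Toy cabin values (made up), including a missing one and a multi-cabin entry.
cabin = pd.Series(["C85", None, "E46", "B57 B59", "T"], name="Cabin")

# First character = deck letter; missing cabins get the placeholder "U" (an arbitrary choice).
deck = cabin.str[0].fillna("U")

# First run of digits = room number; stays NaN when there is none.
number = cabin.str.extract(r"(\d+)", expand=False).astype(float)
print(deck.tolist(), number.tolist())
```

The same pattern could be applied to Ticket, although ticket prefixes are far less regular than cabin letters.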

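Kernels used as references

Link collection

I did not read the code itself, but it contains many links and was very helpful.

KNN with careful feature engineering

83% with KNN. I was surprised that KNN can reach this accuracy. The careful feature engineering is impressive. This is the reference for my Grouping.

85% with woman-child-group and more

A kernel that hits 85%. It uses a concept called WCG (woman-child-group), which seems to be my grouping with adult males removed.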
