0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

titanic挑戦3

Last updated at Posted at 2019-08-02

機械学習の精度を上げることを目的としたい
前書いた記事から導かれた仮説が
・PaseengerIdを除くと良いかも
・FamilySizeを入れると良いかも
・Parch,SibSpが0の人はSurvived率が低いかも
・0~10歳は生存者が多いかも
の4つだった。この仮説を検証する

1,処理を関数化
今回は処理があまり大きくないため、
・トレーニングデータとテストデータの前処理用の関数
・トレーニングデータの評価用の関数
・テストデータ予測用の関数
・特徴量の重要度を出力
の4つを関数化する

ソースコードが以下

%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pylab as plt

from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

warnings.filterwarnings('ignore')


def validate(train_x, train_y):
    accuracies = []
    feature_importances = []

    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(train_x, train_y):
        trn_x = train_x.iloc[train_idx, :]
        val_x = train_x.iloc[test_idx, :]

        trn_y = train_y.iloc[train_idx]
        val_y = train_y.iloc[test_idx]

        clf = xgb.XGBClassifier()
        clf.fit(trn_x, trn_y)

        pred_y = clf.predict(val_x)
        feature_importances.append(clf.feature_importances_)
        accuracies.append(accuracy_score(val_y, pred_y))
    print(np.mean(accuracies))
    return accuracies, feature_importances


def plot_feature_importances(feature_importances, cols):
    df_fimp = pd.DataFrame(feature_importances, columns=cols)
    df_fimp.plot(kind="box", rot=90)


def preprocess_df(df):
    # CabinはこのあとDropするので、コードから削除
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode())
    
    # 列の削除
    df.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
    
    # Sexの01化とEmbarkedのダミー化 
    df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})
    df = pd.get_dummies(df)

    return df


# test dataのpredict
def predict_df(train_x, train_y, test_x, df_test_raw, path_output="result.csv"):
    clf = xgb.XGBClassifier()
    clf.fit(train_x, train_y)
    preds = clf.predict(test_x)
    
    _df = pd.DataFrame()
    _df["PassengerId"] = df_test_raw["PassengerId"]
    _df["Survived"] = preds
    _df.to_csv(path_output, index=False)


# デバッグするときはmain関数から外して、直で叩く方が楽です。
def main():
    df_train = pd.read_csv("train.csv")

    # ここは前処理
    train_y = df_train["Survived"]
    train_x = df_train.drop("Survived", axis=1)

    train_x = preprocess_df(train_x)
    accuracies, feature_importances = validate(train_x, train_y)
    plot_feature_importances(feature_importances, train_x.columns)

    flag_product = True
    if flag_product:
        df_test = pd.read_csv("test.csv")
        df_test_raw = df_test.copy()
        test_x = preprocess_df(df_test)
        predict_df(train_x, train_y, test_x, df_test_raw, "result.csv")


#  `if __name__ == '__main__':` はおまじないのようなモノと思ってください。
if __name__ == '__main__':
    main()

結果

0.8103254769921436

image.png
仮説検証1、PassengerIdを除く
列を除くにはdf.drop(列名,axis=1,inplace=True)でOK

def preprocess_df(df):
    # CabinはこのあとDropするので、コードから削除
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode())
    
    # 列の削除
    df.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)
    
    # Sexの01化とEmbarkedのダミー化 
    df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})
    df = pd.get_dummies(df)

    return df

if __name__ == '__main__':
    main()

結果

0.8226711560044894

image.png
cvの値が上がることを確認できる

2、仮説検証-FamilySize列を作成
次にSibSpとParchの数を足してFamilySizeという列をつくってみる
preprocess_dfの中身だけ書き換えればok

def preprocess_df(df):
    # CabinはこのあとDropするので、コードから削除
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode())
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
   
    # 列の削除
    df.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)
    
    # Sexの01化とEmbarkedのダミー化 
    df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})
    df = pd.get_dummies(df)

    return df

if __name__ == '__main__':
    main()

結果

0.8237934904601572

image.png
cvの平均値が上がることがわかる

3,Parch,SibSpが0の人のフラグを立てる
Parch,SibSpが0の人のフラグを立てる
先ほどFamilySizeという変数も作ったのでこちらの値が1の人も別途フラグを立ててみる

def preprocess_df(df):
    # CabinはこのあとDropするので、コードから削除
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode())
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
   
    # 列の削除
    df.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)

    # Sexの01化とEmbarkedのダミー化 
    df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})
    df = pd.get_dummies(df)
    
    # Parch, SibSp, FamilySize関連のFlag
    df["None_Parch"] = [1 if val == 0 else 0 for val in df["Parch"]]
    df["None_SibSp"] = [1 if val == 0 else 0 for val in df["SibSp"]]
    df["None_Family"] = [1 if val == 1 else 0 for val in df["FamilySize"]]

    return df

if __name__ == '__main__':
    main()

結果

0.819304152637486

image.png
特徴量の重要度としても0なので全く使われていないことがわかる。今回これらは特徴量に入れなくてよさそう

4、0~10歳は生存者が多いかも

def preprocess_df(df):
    # CabinはこのあとDropするので、コードから削除
    df["Age"] = df["Age"].fillna(df["Age"].mean())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode())
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
   
    # 列の削除
    df.drop(["Name", "Ticket", "Cabin", "PassengerId"], axis=1, inplace=True)

    # Sexの01化とEmbarkedのダミー化 
    df["Sex"] = df["Sex"].replace({"male": 0, "female": 1})
    df = pd.get_dummies(df)
    
    # Parch, SibSp, FamilySize関連のFlag
    df["Flag_Children"] = [1 if val < 11 else 0 for val in df["Age"]]

    return df

if __name__ == '__main__':
    main()

結果

0.819304152637486

image.png
今回はこれも特徴量にいれなくてよさそう

よって
PassengerIdを除くと良いかも、とFamily_sizeを入れると良いかもを採用する
(実際はクロスバリデーションで評価した値と実際のスコアは一致するとは限らない。今回の場合だと、Family_sizeを入れて計算した値をSubmitすると少しスコアが下がる)

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?