
kaggle/titanic: Imputing Missing Values and Feature Engineering

Posted at 2018-12-08

Theme: feature engineering and missing-value imputation for kaggle/titanic

For this challenge, I developed a prediction model that focuses on the features and on the missing values.

Feature engineering means constructing additional predictors, also called features, and adding them to the dataset in order to improve the performance of a machine learning model ... In other words, the next frontier for raising a model's predictive accuracy is improving the dataset itself.

-

Features used this time

kaggle/titanic

  • PassengerId: passenger ID
  • Survived: survived (1) or died (0)
  • Pclass: passenger class
  • Name: passenger name
  • Sex: sex
  • Age: age
  • SibSp: number of siblings, step-siblings, and spouses (husband or wife) aboard the Titanic, excluding the passenger
  • Parch: number of parents and children (mother, father, sons, daughters) aboard the Titanic
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation

Prediction approach

After studying the Titanic features I created new ones, split the training data with a decision tree, fit a logistic regression on each leaf of that tree, and finally passed the result to a gradient boosting model, which produces the final predictions.
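As a rough preview of that pipeline, the idea looks something like the sketch below. This is only a minimal sketch, assuming already-prepared pandas frames X_train, y_train and X_test (names of my own choosing); the actual code, including the feature engineering and the per-leaf details, follows step by step in the rest of this post.

# Minimal sketch of the stacking idea (X_train, y_train, X_test are assumed to be
# prepared pandas data; the real code below differs in its details)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

tree_clf = DecisionTreeClassifier(max_depth = 4).fit(X_train, y_train)
train_leaf = tree_clf.apply(X_train)   # leaf index of each training sample
test_leaf = tree_clf.apply(X_test)

train_proba = np.zeros(len(X_train))
test_proba = np.zeros(len(X_test))
for leaf in np.unique(train_leaf):
    tr = train_leaf == leaf
    te = test_leaf == leaf
    y_leaf = y_train[tr]
    if y_leaf.nunique() > 1:
        # Per-leaf logistic regression on the samples in this leaf
        lr = LogisticRegression().fit(X_train[tr], y_leaf)
        train_proba[tr] = lr.predict_proba(X_train[tr])[:, 1]
        if te.any():
            test_proba[te] = lr.predict_proba(X_test[te])[:, 1]
    else:
        # Pure leaf: every sample gets the leaf's constant label
        train_proba[tr] = y_leaf.iloc[0]
        test_proba[te] = y_leaf.iloc[0]

# Gradient boosting on the original features plus the per-leaf probability
gbm = GradientBoostingClassifier().fit(np.column_stack([X_train, train_proba]), y_train)
predictions = gbm.predict(np.column_stack([X_test, test_proba]))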

-

Getting started

First, import the libraries, load the train/test data, and check their contents.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Load the data
train = pd.read_csv("~/train.csv")
test = pd.read_csv("~/test.csv")

train.head(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C

Creating new features

After checking the data with head(), I thought about which new features could be created, and realised that the family size and whether a traveller is part of a group or travelling alone could be derived, so I created those features right away.

Source columns:
SibSp: number of siblings, step-siblings, and spouses (husband or wife) aboard the Titanic, excluding the passenger
Parch: number of parents and children (mother, father, sons, daughters) aboard the Titanic

I treat SibSp + Parch + 1 as the family size (names are not taken into account here),
and use it to separate solo travellers (1 person) from group travellers (2 or more people).

# Create a new feature distinguishing solo travellers from group travellers
train['Family_group'] = train.SibSp + train.Parch
test['Family_group'] = test.SibSp + test.Parch

# Create a new family-size feature
train['Family'] = train.SibSp + train.Parch + 1
test['Family'] = test.SibSp + test.Parch + 1

# Convert Family_group into a dummy (0/1) variable
train['Family_group'] = np.where(train['Family_group'] >= 1, 1, 0)
test['Family_group'] = np.where(test['Family_group'] >= 1, 1, 0)

train.head(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_group Family
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1 2
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1 2
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 2
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 1
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 0 1
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 0 1
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S 1 5
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S 1 3
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C 1 2

Probability of boarding a lifeboat

After splitting the travellers into groups, I use those features to build a new feature: the probability that a given person, having reached the place where the lifeboats are, actually manages to board one.

# Solo travellers can always board, so their probability is 1.

# Probability that a group traveller without an honorific can board a lifeboat
escape_boarding_probability_average_train = 2 / ((sum(train['Family']) - sum(train['Family_group'] == 0))/sum(train['Family_group'] == 1))
escape_boarding_probability_average_test = 2 / ((sum(test['Family']) - sum(test['Family_group'] == 0))/sum(test['Family_group'] == 1))

print("Probability that a group traveller without an honorific can board a lifeboat (1 / average family size among group travellers) : ")
print(str(escape_boarding_probability_average_train))
Probability that a group traveller without an honorific can board a lifeboat (1 / average family size among group travellers) : 
0.6103448275862069

When a solo traveller's turn to board a lifeboat comes, the probability of boarding is 100%, so assign 1 for them; everyone else is set to NaN for now.

# Start from NaN, then set the probability to 1 for solo travellers
train['escape_boarding_probability_train'] = np.nan
test['escape_boarding_probability_test'] = np.nan

train.loc[train['Family_group'] == 0, 'escape_boarding_probability_train'] = 1
test.loc[test['Family_group'] == 0, 'escape_boarding_probability_test'] = 1

train.head(100)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Family_group Family escape_boarding_probability_train
0 1 0 3 Braund, Mr. Owen Harris male 22.00 1 0 A/5 21171 7.2500 NaN S 1 2 NaN
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.00 1 0 PC 17599 71.2833 C85 C 1 2 NaN
2 3 1 3 Heikkinen, Miss. Laina female 26.00 0 0 STON/O2. 3101282 7.9250 NaN S 0 1 1.0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00 1 0 113803 53.1000 C123 S 1 2 NaN
4 5 0 3 Allen, Mr. William Henry male 35.00 0 0 373450 8.0500 NaN S 0 1 1.0
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q 0 1 1.0
6 7 0 1 McCarthy, Mr. Timothy J male 54.00 0 0 17463 51.8625 E46 S 0 1 1.0
7 8 0 3 Palsson, Master. Gosta Leonard male 2.00 3 1 349909 21.0750 NaN S 1 5 NaN
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.00 0 2 347742 11.1333 NaN S 1 3 NaN
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.00 1 0 237736 30.0708 NaN C 1 2 NaN
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.00 1 1 PP 9549 16.7000 G6 S 1 3 NaN
11 12 1 1 Bonnell, Miss. Elizabeth female 58.00 0 0 113783 26.5500 C103 S 0 1 1.0
12 13 0 3 Saundercock, Mr. William Henry male 20.00 0 0 A/5. 2151 8.0500 NaN S 0 1 1.0
13 14 0 3 Andersson, Mr. Anders Johan male 39.00 1 5 347082 31.2750 NaN S 1 7 NaN
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.00 0 0 350406 7.8542 NaN S 0 1 1.0
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.00 0 0 248706 16.0000 NaN S 0 1 1.0
16 17 0 3 Rice, Master. Eugene male 2.00 4 1 382652 29.1250 NaN Q 1 6 NaN
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S 0 1 1.0
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.00 1 0 345763 18.0000 NaN S 1 2 NaN
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C 0 1 1.0
20 21 0 2 Fynney, Mr. Joseph J male 35.00 0 0 239865 26.0000 NaN S 0 1 1.0
21 22 1 2 Beesley, Mr. Lawrence male 34.00 0 0 248698 13.0000 D56 S 0 1 1.0
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.00 0 0 330923 8.0292 NaN Q 0 1 1.0
23 24 1 1 Sloper, Mr. William Thompson male 28.00 0 0 113788 35.5000 A6 S 0 1 1.0
24 25 0 3 Palsson, Miss. Torborg Danira female 8.00 3 1 349909 21.0750 NaN S 1 5 NaN
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.00 1 5 347077 31.3875 NaN S 1 7 NaN
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C 0 1 1.0
27 28 0 1 Fortune, Mr. Charles Alexander male 19.00 3 2 19950 263.0000 C23 C25 C27 S 1 6 NaN
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q 0 1 1.0
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S 0 1 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
70 71 0 2 Jenkin, Mr. Stephen Curnow male 32.00 0 0 C.A. 33111 10.5000 NaN S 0 1 1.0
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.00 5 2 CA 2144 46.9000 NaN S 1 8 NaN
72 73 0 2 Hood, Mr. Ambrose Jr male 21.00 0 0 S.O.C. 14879 73.5000 NaN S 0 1 1.0
73 74 0 3 Chronopoulos, Mr. Apostolos male 26.00 1 0 2680 14.4542 NaN C 1 2 NaN
74 75 1 3 Bing, Mr. Lee male 32.00 0 0 1601 56.4958 NaN S 0 1 1.0
75 76 0 3 Moen, Mr. Sigurd Hansen male 25.00 0 0 348123 7.6500 F G73 S 0 1 1.0
76 77 0 3 Staneff, Mr. Ivan male NaN 0 0 349208 7.8958 NaN S 0 1 1.0
77 78 0 3 Moutal, Mr. Rahamin Haim male NaN 0 0 374746 8.0500 NaN S 0 1 1.0
78 79 1 2 Caldwell, Master. Alden Gates male 0.83 0 2 248738 29.0000 NaN S 1 3 NaN
79 80 1 3 Dowdell, Miss. Elizabeth female 30.00 0 0 364516 12.4750 NaN S 0 1 1.0
80 81 0 3 Waelens, Mr. Achille male 22.00 0 0 345767 9.0000 NaN S 0 1 1.0
81 82 1 3 Sheerlinck, Mr. Jan Baptist male 29.00 0 0 345779 9.5000 NaN S 0 1 1.0
82 83 1 3 McDermott, Miss. Brigdet Delia female NaN 0 0 330932 7.7875 NaN Q 0 1 1.0
83 84 0 1 Carrau, Mr. Francisco M male 28.00 0 0 113059 47.1000 NaN S 0 1 1.0
84 85 1 2 Ilett, Miss. Bertha female 17.00 0 0 SO/C 14885 10.5000 NaN S 0 1 1.0
85 86 1 3 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... female 33.00 3 0 3101278 15.8500 NaN S 1 4 NaN
86 87 0 3 Ford, Mr. William Neal male 16.00 1 3 W./C. 6608 34.3750 NaN S 1 5 NaN
87 88 0 3 Slocovski, Mr. Selman Francis male NaN 0 0 SOTON/OQ 392086 8.0500 NaN S 0 1 1.0
88 89 1 1 Fortune, Miss. Mabel Helen female 23.00 3 2 19950 263.0000 C23 C25 C27 S 1 6 NaN
89 90 0 3 Celotti, Mr. Francesco male 24.00 0 0 343275 8.0500 NaN S 0 1 1.0
90 91 0 3 Christmann, Mr. Emil male 29.00 0 0 343276 8.0500 NaN S 0 1 1.0
91 92 0 3 Andreasson, Mr. Paul Edvin male 20.00 0 0 347466 7.8542 NaN S 0 1 1.0
92 93 0 1 Chaffee, Mr. Herbert Fuller male 46.00 1 0 W.E.P. 5734 61.1750 E31 S 1 2 NaN
93 94 0 3 Dean, Mr. Bertram Frank male 26.00 1 2 C.A. 2315 20.5750 NaN S 1 4 NaN
94 95 0 3 Coxon, Mr. Daniel male 59.00 0 0 364500 7.2500 NaN S 0 1 1.0
95 96 0 3 Shorney, Mr. Charles Joseph male NaN 0 0 374910 8.0500 NaN S 0 1 1.0
96 97 0 1 Goldschmidt, Mr. George B male 71.00 0 0 PC 17754 34.6542 A5 C 0 1 1.0
97 98 1 1 Greenfield, Mr. William Bertram male 23.00 0 1 PC 17759 63.3583 D10 D12 C 1 2 NaN
98 99 1 2 Doling, Mrs. John T (Ada Julia Bone) female 34.00 0 1 231919 23.0000 NaN S 1 2 NaN
99 100 0 2 Kantor, Mr. Sinai male 34.00 1 0 244367 26.0000 NaN S 1 2 NaN

Next, fill the remaining NaN entries with the probability of getting a turn to board a lifeboat within the family (group travellers).

# Boolean masks for each honorific
train_mr_index = train['Name'].str.contains(' Mr. ')
train_miss_index = train['Name'].str.contains(' Miss. ')
train_mrs_index = train['Name'].str.contains(' Mrs. ')
train_master_index = train['Name'].str.contains(' Master. ')
test_mr_index = test['Name'].str.contains(' Mr. ')
test_miss_index = test['Name'].str.contains(' Miss. ')
test_mrs_index = test['Name'].str.contains(' Mrs. ')
test_master_index = test['Name'].str.contains(' Master. ')

# Assign the boarding probability per honorific; remaining NaNs get the group-traveller average
train.loc[train_mr_index, 'escape_boarding_probability_train'] = 1 / train.loc[train_mr_index, 'Family']
train.loc[train_miss_index, 'escape_boarding_probability_train'] = train.loc[train_miss_index, 'Family'] - 1 / train.loc[train_miss_index, 'Family']
train.loc[train_mrs_index, 'escape_boarding_probability_train'] = 1 / train.loc[train_mrs_index, 'Family']
train.loc[train_master_index, 'escape_boarding_probability_train'] = train.loc[train_master_index, 'Family'] - 1 / train.loc[train_master_index, 'Family']
train['escape_boarding_probability_train'] = train['escape_boarding_probability_train'].fillna(escape_boarding_probability_average_train)

test.loc[test_mr_index, 'escape_boarding_probability_test'] = 1 / test.loc[test_mr_index, 'Family']
test.loc[test_miss_index, 'escape_boarding_probability_test'] = test.loc[test_miss_index, 'Family'] - 1 / test.loc[test_miss_index, 'Family']
test.loc[test_mrs_index, 'escape_boarding_probability_test'] = 1 / test.loc[test_mrs_index, 'Family']
test.loc[test_master_index, 'escape_boarding_probability_test'] = test.loc[test_master_index, 'Family'] - 1 / test.loc[test_master_index, 'Family']
test['escape_boarding_probability_test'] = test['escape_boarding_probability_test'].fillna(escape_boarding_probability_average_test)

~ End of feature creation ~

Imputing missing values

For the missing values, this time I imputed the missing "Age" values in detail, per honorific, so I will walk through that part.

Compute the mean age for each honorific and impute with that value per honorific (passengers without one of these honorifics are imputed with the overall median).
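As a side note, this kind of per-honorific imputation can also be written more compactly by extracting the title with a regular expression and using groupby/transform. The snippet below is only an illustrative sketch of that alternative (the regex and variable names are my own); the steps actually used in this post follow.

# Compact alternative sketch (not the code used in this post): extract the title
# with a regex and impute Age with the per-title mean, then the overall median
for df in (train, test):
    title = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand = False)   # 'Mr', 'Miss', ...
    df['Age'] = df['Age'].fillna(df.groupby(title)['Age'].transform('mean'))
    df['Age'] = df['Age'].fillna(df['Age'].median())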


# Mean age per honorific (plus the overall median)
train_mr = train[train['Name'].str.contains(' Mr. ')]
train_miss = train[train['Name'].str.contains(' Miss. ')]
train_mrs = train[train['Name'].str.contains(' Mrs. ')]
train_master = train[train['Name'].str.contains(' Master. ')]
test_mr = test[test['Name'].str.contains(' Mr. ')]
test_miss = test[test['Name'].str.contains(' Miss. ')]
test_mrs = test[test['Name'].str.contains(' Mrs. ')]
test_master = test[test['Name'].str.contains(' Master. ')]

train_mr_num = train_mr['Age'].dropna().mean()
train_miss_num = train_miss['Age'].dropna().mean()
train_mrs_num = train_mrs['Age'].dropna().mean()
train_master_num = train_master['Age'].dropna().mean()
train_all_num = train['Age'].dropna().median()

test_mr_num = test_mr['Age'].dropna().mean()
test_miss_num = test_miss['Age'].dropna().mean()
test_mrs_num = test_mrs['Age'].dropna().mean()
test_master_num = test_master['Age'].dropna().mean()
test_all_num = test['Age'].dropna().median()

print("trainデータの敬称'Mr'の平均値 = " + str(train_mr_num))
print("trainデータの敬称'Miss'の平均値 = " + str(train_miss_num))
print("trainデータの敬称'Mrs'の平均値 = " + str(train_mrs_num))
print("trainデータの敬称'Master'の平均値 = " + str(train_master_num))
print("trainデータの中央値 = " + str(train_all_num), '\n')

print("testデータの敬称'Mr'の平均値 = " + str(test_mr_num))
print("testデータの敬称'Miss'の平均値 = " + str(test_miss_num))
print("testデータの敬称'Mrs'の平均値 = " + str(test_mrs_num))
print("testデータの敬称'Master'の平均値 = " + str(test_master_num))
print("testデータの中央値 = " + str(test_all_num))
trainデータの敬称'Mr'の平均値 = 32.368090452261306
trainデータの敬称'Miss'の平均値 = 21.773972602739725
trainデータの敬称'Mrs'の平均値 = 35.898148148148145
trainデータの敬称'Master'の平均値 = 4.574166666666667
trainデータの中央値 = 28.0 

testデータの敬称'Mr'の平均値 = 32.0
testデータの敬称'Miss'の平均値 = 21.774843750000002
testデータの敬称'Mrs'の平均値 = 38.903225806451616
testデータの敬称'Master'の平均値 = 7.406470588235294
testデータの中央値 = 27.0

# Impute the missing Age values with the (rounded) per-honorific means
train.loc[train_mr_index, 'Age'] = train_mr['Age'].fillna(32)
train.loc[train_miss_index, 'Age'] = train_miss['Age'].fillna(22)
train.loc[train_mrs_index, 'Age'] = train_mrs['Age'].fillna(36)
train.loc[train_master_index, 'Age'] = train_master['Age'].fillna(5)
train['Age'] = train['Age'].fillna(28)

test.loc[test_mr_index, 'Age'] = test_mr['Age'].fillna(32)
test.loc[test_miss_index, 'Age'] = test_miss['Age'].fillna(22)
test.loc[test_mrs_index, 'Age'] = test_mrs['Age'].fillna(39)
test.loc[test_master_index, 'Age'] = test_master['Age'].fillna(7)
test['Age'] = test['Age'].fillna(27)

train.isnull().sum()
PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Embarked          2
Family_group      0
Family            0
dtype: int64

After imputing the remaining missing values, convert the categorical columns into dummy variables, i.e. a form the classifiers can handle, and store them as features again.



# Impute the remaining missing values
train['Embarked'] = train['Embarked'].fillna('S')
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())

# Convert Sex and Embarked into dummy variables
dummy_train = pd.get_dummies(train[['Sex', 'Embarked']])
dummy_test = pd.get_dummies(test[['Sex', 'Embarked']])

train_two = pd.concat([train.drop(["Sex", "Embarked"], axis = 1),dummy_train], axis = 1)
test_two = pd.concat([test.drop(["Sex", "Embarked"], axis = 1),dummy_test], axis = 1)

train_two.isnull().sum()
PassengerId       0
Survived          0
Pclass            0
Name              0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Family_group      0
Family            0
Sex_female        0
Sex_male          0
Embarked_C        0
Embarked_Q        0
Embarked_S        0
dtype: int64
train_two.head(10)
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Family_group Family Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN 1 2 0 1 0 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 1 2 1 0 1 0 0
2 3 1 3 Heikkinen, Miss. Laina 28.0 0 0 STON/O2. 3101282 7.9250 NaN 0 1 1 0 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 1 2 1 0 0 0 1
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN 0 1 0 1 0 0 1
5 6 0 3 Moran, Mr. James 32.0 0 0 330877 8.4583 NaN 0 1 0 1 0 1 0
6 7 0 1 McCarthy, Mr. Timothy J 54.0 0 0 17463 51.8625 E46 0 1 0 1 0 0 1
7 8 0 3 Palsson, Master. Gosta Leonard 2.0 3 1 349909 21.0750 NaN 1 5 0 1 0 0 1
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 27.0 0 2 347742 11.1333 NaN 1 3 1 0 0 0 1
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 14.0 1 0 237736 30.0708 NaN 1 2 1 0 1 0 0
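One caveat about get_dummies: with this data, Sex and Embarked happen to contain the same categories in both train and test, but if the categories ever differed, the test dummy columns would need to be aligned to the train columns, for example like this (just a precautionary sketch, not part of the original code):

# Align the test dummy columns to the train dummy columns (fill missing ones with 0)
dummy_test = dummy_test.reindex(columns = dummy_train.columns, fill_value = 0)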

Finally, drop the data we no longer need to complete the data cleansing.

# Drop the features we no longer need
train_three = train_two.drop(['PassengerId', 'Name', 'Ticket', 'Cabin','Parch', 'SibSp'], axis = 1)
x_test = test_two.drop(['PassengerId', 'Name', 'Ticket', 'Cabin','Parch', 'SibSp'], axis = 1)

train_three.isnull().sum()
Survived        0
Pclass          0
Age             0
Fare            0
Family_group    0
Family          0
Sex_female      0
Sex_male        0
Embarked_C      0
Embarked_Q      0
Embarked_S      0
dtype: int64

From here, I use the libraries to classify on these features and build the model.

# Explanatory variables (drop the target column)
x_train_df = train_three.drop(['Survived'], axis = 1)
x_train = x_train_df

# Target variable
y_train = train_three.Survived

# Train the decision tree
depth = 4
clf = tree.DecisionTreeClassifier(max_depth = depth)
clf.fit(x_train_df, y_train)


# apply() returns the index of the leaf that each sample falls into
x_train_leaf_no = clf.apply(x_train_df)
x_test_leaf_no = clf.apply(x_test)


# Fit a logistic regression on each leaf

# Arrays initialised to zero to hold the predicted probabilities
x_train_proba = np.zeros(x_train.shape[0])
x_test_proba = np.zeros(x_test.shape[0])

# Unique leaf indices
unique_leaf_no = list(set(x_train_leaf_no))

# Hyperparameter grid for the logistic regression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Iterate over the leaf indices
for i in unique_leaf_no :
    # Show which leaf is being processed
    print('leaf no:', i)

    # Rows of the train data that fall into this leaf
    leaf_data_train_x = x_train[x_train_leaf_no == i]
    leaf_data_train_y = y_train[x_train_leaf_no == i]
    # Rows of the test data that fall into this leaf
    leaf_data_test_x = x_test[x_test_leaf_no == i]


    # Drop the dummy-variable columns (and the probability feature) for the per-leaf regression
    leaf_data_train_x_drop = leaf_data_train_x.drop(['Family_group', 'Pclass', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S', 'Embarked_Q', 'escape_boarding_probability_train'], axis = 1)
    leaf_data_test_x = leaf_data_test_x.drop(['Family_group', 'Pclass', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S', 'Embarked_Q', 'escape_boarding_probability_test'], axis = 1)



    # If this leaf contains both survivors and non-survivors
    if len(set(leaf_data_train_y)) > 1:

        # Run a grid search
        try:
            grid_search = GridSearchCV(LogisticRegression(), param_grid, cv = 5, scoring = 'roc_auc')
            grid_search.fit(leaf_data_train_x_drop, leaf_data_train_y)
            clf = LogisticRegression(C=grid_search.best_params_['C'],class_weight="balanced")
        except (ValueError, TypeError, NameError, SyntaxError):
            clf = LogisticRegression()

        # Fit the logistic regression
        clf.fit(leaf_data_train_x_drop, leaf_data_train_y)

        # Predicted class-membership probabilities
        a = clf.predict_proba(leaf_data_train_x_drop)

        # Keep only the probability of survival
        x_train_proba[x_train_leaf_no == i] = a[:,1]

        if len(leaf_data_test_x) > 0:
            b = clf.predict_proba(leaf_data_test_x)
            x_test_proba[x_test_leaf_no == i] = b[:,1]


    # If the leaf contains only survivors or only non-survivors
    else:
        x_train_proba[x_train_leaf_no == i] = leaf_data_train_y.head(1)
        if len(leaf_data_test_x) > 0:
            x_test_proba[x_test_leaf_no == i] = leaf_data_train_y.head(1)



# Confirm the loop finished
print("for loop end")

# Append the per-leaf survival probabilities to the feature DataFrames
train_data = pd.concat([x_train, pd.DataFrame(x_train_proba)], axis =1)
test_data = pd.concat([x_test, pd.DataFrame(x_test_proba)], axis =1)

# Hyperparameter grid for the gradient boosting model
param_grid = {'max_depth': [3,5,8,13,21,34]}

# Run the grid search
grid_search = GridSearchCV(GradientBoostingClassifier(n_estimators=100), param_grid, cv = 5, scoring = 'roc_auc')
grid_search.fit(train_data, y_train)

# Train the gradient boosting model and predict
model = GradientBoostingClassifier(max_depth=grid_search.best_params_['max_depth'], n_estimators=100)
model.fit(train_data, y_train)
output = model.predict(test_data).astype(int)


# Write the results to a CSV file
leaf_data_test = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": output
})
leaf_data_test.to_csv('training_camp06.csv', index = False)

↑ Finally, write the predictions out as a CSV file and submit it to Kaggle to finish!


Link to the kernel above ↓

Prediction results

Predicting with this model returned the following result:
Score: 0.82296, top 5% (459th out of 10,499 teams).

(Screenshot: Kaggle leaderboard result)

Summary

I approached this challenge with only a limited understanding of the analysis tools, but that in turn led me to a policy of making the most of the existing features, and I was able to tackle the Titanic problem head-on.
Because I did not yet understand the internals of the libraries very well, some parts are probably not optimal, so I want to deepen my understanding of how the libraries work.
I plan to take what I learned from Titanic and use it to take on other challenges.
