#Theme: Feature engineering and missing-value imputation on kaggle/titanic
For this challenge, I developed a prediction model that focuses on feature engineering and missing-value imputation.
Feature engineering means constructing additional predictors, also called features, and adding them to the dataset in order to improve a machine-learning model's performance ... in other words, the next frontier for raising a model's predictive accuracy is improving the dataset itself.
##The features
kaggle/titanic
- PassengerId: passenger ID
- Survived: survived (1) or died (0)
- Pclass: passenger class
- Name: passenger name
- Sex: sex
- Age: age
- SibSp: number of siblings, step-siblings, husbands, and wives aboard the Titanic (excluding the passenger)
- Parch: number of mothers, fathers, sons, and daughters aboard the Titanic
- Ticket: ticket number
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation
##Prediction approach
After studying the titanic features I created new ones, classified the train data with a decision tree, fit a logistic regression within each leaf, and made the final prediction with a gradient-boosting model trained on the result.
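The three-stage pipeline can be sketched end to end on synthetic data; everything below (the dataset, `make_classification`, the variable names) is illustrative only, not the actual Titanic code, which follows later:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Stage 1: a shallow decision tree partitions the samples into leaves
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf_ids = dt.apply(X)  # leaf index for every sample

# Stage 2: fit a separate logistic regression inside each leaf
proba = np.zeros(len(X))
for leaf in np.unique(leaf_ids):
    mask = leaf_ids == leaf
    if len(np.unique(y[mask])) > 1:
        lr = LogisticRegression().fit(X[mask], y[mask])
        proba[mask] = lr.predict_proba(X[mask])[:, 1]
    else:
        # A pure leaf: every sample shares the same label
        proba[mask] = y[mask][0]

# Stage 3: gradient boosting on the features plus the leaf-wise probability
X_stacked = np.hstack([X, proba.reshape(-1, 1)])
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3).fit(X_stacked, y)
pred = gbc.predict(X_stacked)
```

The leaf-wise probability acts as one extra stacked feature, so the booster can exploit both the raw columns and the per-leaf logistic fit.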
##Getting started
First, import the libraries, load the train/test data, and check the contents.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
#Load the data
train = pd.read_csv("~/train.csv")
test = pd.read_csv("~/test.csv")
train.head(10)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
##Creating features
After checking the data with head(), I considered what new features could be derived, and realised I could build the number of family members and a flag for whether a passenger travelled in a group or alone, so I created them straight away.
Source columns:
SibSp: number of siblings, step-siblings, husbands, and wives aboard the Titanic (excluding the passenger)
Parch: number of mothers, fathers, sons, and daughters aboard the Titanic
Treat SibSp + Parch + 1 as the family size (names are not taken into account at this point)
and use it to distinguish solo travellers (1 person) from group travellers (2 or more).
#Create a new solo-vs-group traveller feature
train['Family_group'] = train.SibSp + train.Parch
test['Family_group'] = test.SibSp + test.Parch
#Create a new family-size feature
train['Family'] = train.SibSp + train.Parch + 1
test['Family'] = test.SibSp + test.Parch + 1
#Convert the Family_group feature into a dummy variable (0 = solo, 1 = group)
train['Family_group'] = np.where(train['Family_group'] >= 1, 1, 0)
test['Family_group'] = np.where(test['Family_group'] >= 1, 1, 0)
train.head(10)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_group | Family | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 2 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 2 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | 2 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 0 | 1 |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 | 1 |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | 1 | 5 |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S | 1 | 3 |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C | 1 | 2 |
###Probability of boarding a lifeboat
After grouping the travellers, I use those features to create a new one: the probability that a given person can board a lifeboat once they reach the boarding point.
#A solo traveller can always board, so their probability is 1.
#Probability of boarding for group travellers whose name carries none of the matched titles
escape_boarding_probability_average_train = 2 / ((sum(train['Family']) - sum(train['Family_group'] == 0))/sum(train['Family_group'] == 1))
escape_boarding_probability_average_test = 2 / ((sum(test['Family']) - sum(test['Family_group'] == 0))/sum(test['Family_group'] == 1))
print("Probability that a group traveller with no matched title can board (1 / average family size among group travellers): ")
print(str(escape_boarding_probability_average_train))
Probability that a group traveller with no matched title can board (1 / average family size among group travellers): 
0.6103448275862069
When a solo traveller's turn to board comes they can board with certainty, so assign '1'; everyone else gets 'NaN' for now.
#1 for solo travellers, NaN for everyone else
train['escape_boarding_probability_train'] = np.where(train['Family_group'] == 0, 1.0, np.nan)
test['escape_boarding_probability_test'] = np.where(test['Family_group'] == 0, 1.0, np.nan)
train.head(100)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_group | Family | escape_boarding_probability_train | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.00 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 2 | NaN |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.00 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 2 | NaN |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.00 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 | 1.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.00 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | 2 | NaN |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.00 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 0 | 1 | 1.0 |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.00 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 | 1 | 1.0 |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.00 | 3 | 1 | 349909 | 21.0750 | NaN | S | 1 | 5 | NaN |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.00 | 0 | 2 | 347742 | 11.1333 | NaN | S | 1 | 3 | NaN |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.00 | 1 | 0 | 237736 | 30.0708 | NaN | C | 1 | 2 | NaN |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.00 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 1 | 3 | NaN |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.00 | 0 | 0 | 113783 | 26.5500 | C103 | S | 0 | 1 | 1.0 |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.00 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.00 | 1 | 5 | 347082 | 31.2750 | NaN | S | 1 | 7 | NaN |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.00 | 0 | 0 | 350406 | 7.8542 | NaN | S | 0 | 1 | 1.0 |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.00 | 0 | 0 | 248706 | 16.0000 | NaN | S | 0 | 1 | 1.0 |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.00 | 4 | 1 | 382652 | 29.1250 | NaN | Q | 1 | 6 | NaN |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S | 0 | 1 | 1.0 |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.00 | 1 | 0 | 345763 | 18.0000 | NaN | S | 1 | 2 | NaN |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C | 0 | 1 | 1.0 |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.00 | 0 | 0 | 239865 | 26.0000 | NaN | S | 0 | 1 | 1.0 |
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.00 | 0 | 0 | 248698 | 13.0000 | D56 | S | 0 | 1 | 1.0 |
22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.00 | 0 | 0 | 330923 | 8.0292 | NaN | Q | 0 | 1 | 1.0 |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.00 | 0 | 0 | 113788 | 35.5000 | A6 | S | 0 | 1 | 1.0 |
24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.00 | 3 | 1 | 349909 | 21.0750 | NaN | S | 1 | 5 | NaN |
25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.00 | 1 | 5 | 347077 | 31.3875 | NaN | S | 1 | 7 | NaN |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C | 0 | 1 | 1.0 |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 1 | 6 | NaN |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q | 0 | 1 | 1.0 |
29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S | 0 | 1 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
70 | 71 | 0 | 2 | Jenkin, Mr. Stephen Curnow | male | 32.00 | 0 | 0 | C.A. 33111 | 10.5000 | NaN | S | 0 | 1 | 1.0 |
71 | 72 | 0 | 3 | Goodwin, Miss. Lillian Amy | female | 16.00 | 5 | 2 | CA 2144 | 46.9000 | NaN | S | 1 | 8 | NaN |
72 | 73 | 0 | 2 | Hood, Mr. Ambrose Jr | male | 21.00 | 0 | 0 | S.O.C. 14879 | 73.5000 | NaN | S | 0 | 1 | 1.0 |
73 | 74 | 0 | 3 | Chronopoulos, Mr. Apostolos | male | 26.00 | 1 | 0 | 2680 | 14.4542 | NaN | C | 1 | 2 | NaN |
74 | 75 | 1 | 3 | Bing, Mr. Lee | male | 32.00 | 0 | 0 | 1601 | 56.4958 | NaN | S | 0 | 1 | 1.0 |
75 | 76 | 0 | 3 | Moen, Mr. Sigurd Hansen | male | 25.00 | 0 | 0 | 348123 | 7.6500 | F G73 | S | 0 | 1 | 1.0 |
76 | 77 | 0 | 3 | Staneff, Mr. Ivan | male | NaN | 0 | 0 | 349208 | 7.8958 | NaN | S | 0 | 1 | 1.0 |
77 | 78 | 0 | 3 | Moutal, Mr. Rahamin Haim | male | NaN | 0 | 0 | 374746 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | S | 1 | 3 | NaN |
79 | 80 | 1 | 3 | Dowdell, Miss. Elizabeth | female | 30.00 | 0 | 0 | 364516 | 12.4750 | NaN | S | 0 | 1 | 1.0 |
80 | 81 | 0 | 3 | Waelens, Mr. Achille | male | 22.00 | 0 | 0 | 345767 | 9.0000 | NaN | S | 0 | 1 | 1.0 |
81 | 82 | 1 | 3 | Sheerlinck, Mr. Jan Baptist | male | 29.00 | 0 | 0 | 345779 | 9.5000 | NaN | S | 0 | 1 | 1.0 |
82 | 83 | 1 | 3 | McDermott, Miss. Brigdet Delia | female | NaN | 0 | 0 | 330932 | 7.7875 | NaN | Q | 0 | 1 | 1.0 |
83 | 84 | 0 | 1 | Carrau, Mr. Francisco M | male | 28.00 | 0 | 0 | 113059 | 47.1000 | NaN | S | 0 | 1 | 1.0 |
84 | 85 | 1 | 2 | Ilett, Miss. Bertha | female | 17.00 | 0 | 0 | SO/C 14885 | 10.5000 | NaN | S | 0 | 1 | 1.0 |
85 | 86 | 1 | 3 | Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... | female | 33.00 | 3 | 0 | 3101278 | 15.8500 | NaN | S | 1 | 4 | NaN |
86 | 87 | 0 | 3 | Ford, Mr. William Neal | male | 16.00 | 1 | 3 | W./C. 6608 | 34.3750 | NaN | S | 1 | 5 | NaN |
87 | 88 | 0 | 3 | Slocovski, Mr. Selman Francis | male | NaN | 0 | 0 | SOTON/OQ 392086 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 1 | 6 | NaN |
89 | 90 | 0 | 3 | Celotti, Mr. Francesco | male | 24.00 | 0 | 0 | 343275 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
90 | 91 | 0 | 3 | Christmann, Mr. Emil | male | 29.00 | 0 | 0 | 343276 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
91 | 92 | 0 | 3 | Andreasson, Mr. Paul Edvin | male | 20.00 | 0 | 0 | 347466 | 7.8542 | NaN | S | 0 | 1 | 1.0 |
92 | 93 | 0 | 1 | Chaffee, Mr. Herbert Fuller | male | 46.00 | 1 | 0 | W.E.P. 5734 | 61.1750 | E31 | S | 1 | 2 | NaN |
93 | 94 | 0 | 3 | Dean, Mr. Bertram Frank | male | 26.00 | 1 | 2 | C.A. 2315 | 20.5750 | NaN | S | 1 | 4 | NaN |
94 | 95 | 0 | 3 | Coxon, Mr. Daniel | male | 59.00 | 0 | 0 | 364500 | 7.2500 | NaN | S | 0 | 1 | 1.0 |
95 | 96 | 0 | 3 | Shorney, Mr. Charles Joseph | male | NaN | 0 | 0 | 374910 | 8.0500 | NaN | S | 0 | 1 | 1.0 |
96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.00 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | 0 | 1 | 1.0 |
97 | 98 | 1 | 1 | Greenfield, Mr. William Bertram | male | 23.00 | 0 | 1 | PC 17759 | 63.3583 | D10 D12 | C | 1 | 2 | NaN |
98 | 99 | 1 | 2 | Doling, Mrs. John T (Ada Julia Bone) | female | 34.00 | 0 | 1 | 231919 | 23.0000 | NaN | S | 1 | 2 | NaN |
99 | 100 | 0 | 2 | Kantor, Mr. Sinai | male | 34.00 | 1 | 0 | 244367 | 26.0000 | NaN | S | 1 | 2 | NaN |
Now fill each 'NaN' slot with the probability that the person gets a boarding turn within their family (the group travellers).
#Store a boolean index for each title
train_mr_index = train['Name'].str.contains(' Mr. ', regex=False)
train_miss_index = train['Name'].str.contains(' Miss. ', regex=False)
train_mrs_index = train['Name'].str.contains(' Mrs. ', regex=False)
train_master_index = train['Name'].str.contains(' Master. ', regex=False)
test_mr_index = test['Name'].str.contains(' Mr. ', regex=False)
test_miss_index = test['Name'].str.contains(' Miss. ', regex=False)
test_mrs_index = test['Name'].str.contains(' Mrs. ', regex=False)
test_master_index = test['Name'].str.contains(' Master. ', regex=False)
#Per-title boarding probability within a family: 1 / Family for Mr and Mrs, (Family - 1) / Family for Miss and Master
train.loc[train_mr_index, 'escape_boarding_probability_train'] = 1 / train.loc[train_mr_index, 'Family']
train.loc[train_miss_index, 'escape_boarding_probability_train'] = (train.loc[train_miss_index, 'Family'] - 1) / train.loc[train_miss_index, 'Family']
train.loc[train_mrs_index, 'escape_boarding_probability_train'] = 1 / train.loc[train_mrs_index, 'Family']
train.loc[train_master_index, 'escape_boarding_probability_train'] = (train.loc[train_master_index, 'Family'] - 1) / train.loc[train_master_index, 'Family']
train['escape_boarding_probability_train'] = train['escape_boarding_probability_train'].fillna(escape_boarding_probability_average_train)
test.loc[test_mr_index, 'escape_boarding_probability_test'] = 1 / test.loc[test_mr_index, 'Family']
test.loc[test_miss_index, 'escape_boarding_probability_test'] = (test.loc[test_miss_index, 'Family'] - 1) / test.loc[test_miss_index, 'Family']
test.loc[test_mrs_index, 'escape_boarding_probability_test'] = 1 / test.loc[test_mrs_index, 'Family']
test.loc[test_master_index, 'escape_boarding_probability_test'] = (test.loc[test_master_index, 'Family'] - 1) / test.loc[test_master_index, 'Family']
test['escape_boarding_probability_test'] = test['escape_boarding_probability_test'].fillna(escape_boarding_probability_average_test)
~ Feature creation complete ~
##Missing-value imputation
For the missing values, this time I want to highlight how **"Age"** was imputed in a fine-grained way per title.
Compute the mean age for each title and impute it title by title (passengers whose name carries no matched title are imputed with the overall median).
#Compute the mean per title
train_mr = train[train['Name'].str.contains(' Mr. ', regex=False)]
train_miss = train[train['Name'].str.contains(' Miss. ', regex=False)]
train_mrs = train[train['Name'].str.contains(' Mrs. ', regex=False)]
train_master = train[train['Name'].str.contains(' Master. ', regex=False)]
test_mr = test[test['Name'].str.contains(' Mr. ', regex=False)]
test_miss = test[test['Name'].str.contains(' Miss. ', regex=False)]
test_mrs = test[test['Name'].str.contains(' Mrs. ', regex=False)]
test_master = test[test['Name'].str.contains(' Master. ', regex=False)]
train_mr_num = train_mr['Age'].dropna().mean()
train_miss_num = train_miss['Age'].dropna().mean()
train_mrs_num = train_mrs['Age'].dropna().mean()
train_master_num = train_master['Age'].dropna().mean()
train_all_num = train['Age'].dropna().median()
test_mr_num = test_mr['Age'].dropna().mean()
test_miss_num = test_miss['Age'].dropna().mean()
test_mrs_num = test_mrs['Age'].dropna().mean()
test_master_num = test_master['Age'].dropna().mean()
test_all_num = test['Age'].dropna().median()
print("Mean age for title 'Mr' in train = " + str(train_mr_num))
print("Mean age for title 'Miss' in train = " + str(train_miss_num))
print("Mean age for title 'Mrs' in train = " + str(train_mrs_num))
print("Mean age for title 'Master' in train = " + str(train_master_num))
print("Median age in train = " + str(train_all_num), '\n')
print("Mean age for title 'Mr' in test = " + str(test_mr_num))
print("Mean age for title 'Miss' in test = " + str(test_miss_num))
print("Mean age for title 'Mrs' in test = " + str(test_mrs_num))
print("Mean age for title 'Master' in test = " + str(test_master_num))
print("Median age in test = " + str(test_all_num))
Mean age for title 'Mr' in train = 32.368090452261306
Mean age for title 'Miss' in train = 21.773972602739725
Mean age for title 'Mrs' in train = 35.898148148148145
Mean age for title 'Master' in train = 4.574166666666667
Median age in train = 28.0
Mean age for title 'Mr' in test = 32.0
Mean age for title 'Miss' in test = 21.774843750000002
Mean age for title 'Mrs' in test = 38.903225806451616
Mean age for title 'Master' in test = 7.406470588235294
Median age in test = 27.0
#Impute the missing "Age" values with the (rounded) per-title means
train.loc[train_mr_index, 'Age'] = train.loc[train_mr_index, 'Age'].fillna(32)
train.loc[train_miss_index, 'Age'] = train.loc[train_miss_index, 'Age'].fillna(22)
train.loc[train_mrs_index, 'Age'] = train.loc[train_mrs_index, 'Age'].fillna(36)
train.loc[train_master_index, 'Age'] = train.loc[train_master_index, 'Age'].fillna(5)
train['Age'] = train['Age'].fillna(28)
test.loc[test_mr_index, 'Age'] = test.loc[test_mr_index, 'Age'].fillna(32)
test.loc[test_miss_index, 'Age'] = test.loc[test_miss_index, 'Age'].fillna(22)
test.loc[test_mrs_index, 'Age'] = test.loc[test_mrs_index, 'Age'].fillna(39)
test.loc[test_master_index, 'Age'] = test.loc[test_master_index, 'Age'].fillna(7)
test['Age'] = test['Age'].fillna(27)
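As a side note, the same per-title imputation can be expressed more compactly with `str.extract` and a grouped transform. This is an alternative sketch on toy data, not the code used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Moran, Mr. James", "Palsson, Master. Gosta Leonard"],
    "Age": [22.0, 26.0, np.nan, 2.0],
})

# Pull the title ("Mr", "Miss", ...) out of the Name column
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Fill each missing Age with the mean age of the same title,
# then fall back to the overall median for titles with no known age
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("mean"))
df["Age"] = df["Age"].fillna(df["Age"].median())
# Moran (Mr) is filled with the mean of the other Mr ages: 22.0
```

The grouped transform scales automatically to any set of titles, so no per-title constants need to be hard-coded.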
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
Family_group 0
Family 0
dtype: int64
After filling the remaining missing values, convert the categorical columns into dummy variables so they are in a form the classifiers can use, and store them as features again.
#Fill the remaining missing values
train['Embarked'] = train['Embarked'].fillna('S')
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
#Convert Sex and Embarked into dummy variables
dummy_train = pd.get_dummies(train[['Sex', 'Embarked']])
dummy_test = pd.get_dummies(test[['Sex', 'Embarked']])
train_two = pd.concat([train.drop(["Sex", "Embarked"], axis = 1),dummy_train], axis = 1)
test_two = pd.concat([test.drop(["Sex", "Embarked"], axis = 1),dummy_test], axis = 1)
train_two.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Family_group 0
Family 0
Sex_female 0
Sex_male 0
Embarked_C 0
Embarked_Q 0
Embarked_S 0
dtype: int64
train_two.head(10)
PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Family_group | Family | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 1 | 2 | 0 | 1 | 0 | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 2 | 1 | 0 | 1 | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 28.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 2 | 1 | 0 | 0 | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
5 | 6 | 0 | 3 | Moran, Mr. James | 32.0 | 0 | 0 | 330877 | 8.4583 | NaN | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | 1 | 5 | 0 | 1 | 0 | 0 | 1 |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | 1 | 3 | 1 | 0 | 0 | 0 | 1 |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | 1 | 2 | 1 | 0 | 1 | 0 | 0 |
Finally, drop the unneeded columns to complete the data cleansing.
#Drop unneeded features
train_three = train_two.drop(['PassengerId', 'Name', 'Ticket', 'Cabin','Parch', 'SibSp'], axis = 1)
x_test = test_two.drop(['PassengerId', 'Name', 'Ticket', 'Cabin','Parch', 'SibSp'], axis = 1)
train_three.isnull().sum()
Survived 0
Pclass 0
Age 0
Fare 0
Family_group 0
Family 0
Sex_female 0
Sex_male 0
Embarked_C 0
Embarked_Q 0
Embarked_S 0
dtype: int64
From here, the features are run through the library classifiers stage by stage to build the model.
#Store the explanatory variables as a DataFrame
x_train_df = train_three.drop(['Survived'], axis = 1)
x_train = x_train_df
#Store the target variable
y_train = train_three.Survived
#Train the decision tree
depth = 4
clf = tree.DecisionTreeClassifier(max_depth = depth)
clf.fit(x_train_df, y_train)
#apply() returns the leaf number each sample falls into
x_train_leaf_no = clf.apply(x_train_df)
x_test_leaf_no = clf.apply(x_test)
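For reference, `apply` maps every sample to the index of the leaf it lands in, so rows that end up in the same leaf share a number. A minimal sketch with made-up data:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [10], [11]]
y = [0, 0, 1, 1]

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
leaf_no = stump.apply(X)  # one leaf index per sample
# The two small values share one leaf, the two large values the other
```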
#Run a logistic regression on each leaf
#Prepare arrays with every index initialised to 0
x_train_proba = np.zeros(x_train.shape[0])
x_test_proba = np.zeros(x_test.shape[0])
#Store the unique leaf numbers in a list
unique_leaf_no = list(set(x_train_leaf_no))
#Hyperparameter tuning for the logistic regression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
#Iterate over the stored leaf numbers
for i in unique_leaf_no:
    #Confirm which leaf is being processed
    print('leaf no:', i)
    #Slice out the train rows that fall into this leaf
    leaf_data_train_x = x_train[x_train_leaf_no == i]
    leaf_data_train_y = y_train[x_train_leaf_no == i]
    #Slice out the test rows that fall into this leaf
    leaf_data_test_x = x_test[x_test_leaf_no == i]
    #Temporarily drop the dummy-variable columns
    leaf_data_train_x_drop = leaf_data_train_x.drop(['Family_group', 'Pclass', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S', 'Embarked_Q', 'escape_boarding_probability_train'], axis = 1)
    leaf_data_test_x = leaf_data_test_x.drop(['Family_group', 'Pclass', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S', 'Embarked_Q', 'escape_boarding_probability_test'], axis = 1)
    #If this leaf contains both survivors and non-survivors
    if len(set(leaf_data_train_y)) > 1:
        #Run a grid search
        try:
            grid_search = GridSearchCV(LogisticRegression(), param_grid, cv = 5, scoring = 'roc_auc')
            grid_search.fit(leaf_data_train_x_drop, leaf_data_train_y)
            clf = LogisticRegression(C = grid_search.best_params_['C'], class_weight = "balanced")
        except (ValueError, TypeError, NameError, SyntaxError):
            clf = LogisticRegression()
        #Fit the logistic regression
        clf.fit(leaf_data_train_x_drop, leaf_data_train_y)
        #predict_proba returns the class-membership probabilities
        a = clf.predict_proba(leaf_data_train_x_drop)
        #Keep only the probability of survival
        x_train_proba[x_train_leaf_no == i] = a[:, 1]
        if len(leaf_data_test_x) > 0:
            b = clf.predict_proba(leaf_data_test_x)
            x_test_proba[x_test_leaf_no == i] = b[:, 1]
    #If the leaf contains only survivors or only non-survivors
    else:
        x_train_proba[x_train_leaf_no == i] = leaf_data_train_y.iloc[0]
        if len(leaf_data_test_x) > 0:
            x_test_proba[x_test_leaf_no == i] = leaf_data_train_y.iloc[0]
#Confirm the loop has finished
print("for loop end")
#Join the survival-probability column onto the feature DataFrames
train_data = pd.concat([x_train, pd.DataFrame(x_train_proba)], axis =1)
test_data = pd.concat([x_test, pd.DataFrame(x_test_proba)], axis =1)
#Hyperparameter tuning for the gradient boosting
param_grid = {'max_depth': [3,5,8,13,21,34]}
#Run the grid search
grid_search = GridSearchCV(GradientBoostingClassifier(n_estimators=100), param_grid, cv = 5, scoring = 'roc_auc')
grid_search.fit(train_data, y_train)
#Train and predict with gradient boosting
model = GradientBoostingClassifier(max_depth=grid_search.best_params_['max_depth'], n_estimators=100)
model.fit(train_data, y_train)
output = model.predict(test_data).astype(int)
#Write the results to a CSV
leaf_data_test = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": output
})
leaf_data_test.to_csv('training_camp06.csv', index = False)
↑ The last step writes the predictions out as a csv; submit that file to kaggle and you're done!
Link to the kernel above ↓
##Prediction results
Predicting with this model returned the following result:
Score: 0.82296, top 5% (459th of 10499 teams).
##Summary
I came to this challenge with only a shallow understanding of the analysis tools, but that limitation pushed me toward a policy of making the most of the existing features, which let me engage with the titanic problem directly.
Because I did not fully understand the internals of the libraries I used, some parts were probably not optimal, so I want to deepen my understanding of how those libraries work.
I intend to build on what I learned from titanic and take on other challenges.