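1. Introduction
I have been writing up what I learn about machine learning bit by bit, but I had not yet written up a full end-to-end run. This time I walk through the whole flow using Kaggle's well-known Titanic survival prediction task.
Feel free to refer to my earlier articles as well.
Previous articles
・"Four warnings" to myself as a machine learning beginner
https://qiita.com/takuya_tsurumi/items/38f2858221599d5f93bd
・Notes on data preprocessing for machine learning
https://qiita.com/takuya_tsurumi/items/53b9e3f7427b631b17cf
・[Machine learning] A summary of the differences between decision tree models
https://qiita.com/takuya_tsurumi/items/23fdc43ee0e54ec7c87e
2. Data contents
The data for Kaggle's Titanic survival prediction can be downloaded from the following site.
https://www.kaggle.com/c/titanic/overview
Download it and look at what it contains.
The files contain the following columns.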
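No | Column | Description | Notes |
---|---|---|---|
1 | PassengerId | Passenger ID | |
2 | Survived | Whether the passenger survived | The target variable in this article |
3 | Pclass | Ticket class | |
4 | Name | Name | |
5 | Sex | Sex | |
6 | Age | Age | Fractional if under 1; estimated ages take the form xx.5 |
7 | SibSp | Number of siblings/spouses aboard the Titanic | |
8 | Parch | Number of parents/children aboard the Titanic | |
9 | Ticket | Ticket number | |
10 | Fare | Passenger fare | |
11 | Cabin | Cabin number | |
12 | Embarked | Port of embarkation | |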
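Let's put this data to use.
3. Hands-on
The overall flow is as follows.
1. Load the data
2. Inspect the data
3. Process the data (convert to numeric values and handle missing values)
4. Build the model
5. Evaluate the model
First, import the libraries and load the data.
# Import libraries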
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Load the data
train = pd.read_csv("./data/train.csv", index_col='PassengerId')
test = pd.read_csv('./data/test.csv', index_col='PassengerId')
print(train.head())
print(test.head())
Output:
Survived Pclass \
PassengerId
1 0 3
2 1 1
3 1 3
4 1 1
5 0 3
Name Sex Age \
PassengerId
1 Braund, Mr. Owen Harris male 22.0
2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
3 Heikkinen, Miss. Laina female 26.0
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
5 Allen, Mr. William Henry male 35.0
SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 1 0 A/5 21171 7.2500 NaN S
2 1 0 PC 17599 71.2833 C85 C
3 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 0 113803 53.1000 C123 S
5 0 0 373450 8.0500 NaN S
Pclass Name Sex \
PassengerId
892 3 Kelly, Mr. James male
893 3 Wilkes, Mrs. James (Ellen Needs) female
894 2 Myles, Mr. Thomas Francis male
895 3 Wirz, Mr. Albert male
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
892 34.5 0 0 330911 7.8292 NaN Q
893 47.0 1 0 363272 7.0000 NaN S
894 62.0 0 0 240276 9.6875 NaN Q
895 27.0 0 0 315154 8.6625 NaN S
896 22.0 1 1 3101298 12.2875 NaN S
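Next, prepare to process all of the data so that the model can learn from it.
Concretely, the training data and the test data are temporarily combined into a single DataFrame.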
df = pd.concat([train, test], sort=False, axis='rows')
df
Output
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 0.0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 0.0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 0.0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
9 1.0 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
10 1.0 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
11 1.0 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
12 1.0 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 0.0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
14 0.0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 0.0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
16 1.0 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
17 0.0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
18 1.0 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 0.0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
20 1.0 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
21 0.0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
22 1.0 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
23 1.0 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
24 1.0 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
25 0.0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
26 1.0 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
27 0.0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 0.0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
29 1.0 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
30 0.0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
1280 NaN 3 Canavan, Mr. Patrick male 21.0 0 0 364858 7.7500 NaN Q
1281 NaN 3 Palsson, Master. Paul Folke male 6.0 3 1 349909 21.0750 NaN S
1282 NaN 1 Payne, Mr. Vivian Ponsonby male 23.0 0 0 12749 93.5000 B24 S
1283 NaN 1 Lines, Mrs. Ernest H (Elizabeth Lindsey James) female 51.0 0 1 PC 17592 39.4000 D28 S
1284 NaN 3 Abbott, Master. Eugene Joseph male 13.0 0 2 C.A. 2673 20.2500 NaN S
1285 NaN 2 Gilbert, Mr. William male 47.0 0 0 C.A. 30769 10.5000 NaN S
1286 NaN 3 Kink-Heilmann, Mr. Anton male 29.0 3 1 315153 22.0250 NaN S
1287 NaN 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S
1288 NaN 3 Colbert, Mr. Patrick male 24.0 0 0 371109 7.2500 NaN Q
1289 NaN 1 Frolicher-Stehli, Mrs. Maxmillian (Margaretha ... female 48.0 1 1 13567 79.2000 B41 C
1290 NaN 3 Larsson-Rondberg, Mr. Edvard A male 22.0 0 0 347065 7.7750 NaN S
1291 NaN 3 Conlon, Mr. Thomas Henry male 31.0 0 0 21332 7.7333 NaN Q
1292 NaN 1 Bonnell, Miss. Caroline female 30.0 0 0 36928 164.8667 C7 S
1293 NaN 2 Gale, Mr. Harry male 38.0 1 0 28664 21.0000 NaN S
1294 NaN 1 Gibson, Miss. Dorothy Winifred female 22.0 0 1 112378 59.4000 NaN C
1295 NaN 1 Carrau, Mr. Jose Pedro male 17.0 0 0 113059 47.1000 NaN S
1296 NaN 1 Frauenthal, Mr. Isaac Gerald male 43.0 1 0 17765 27.7208 D40 C
1297 NaN 2 Nourney, Mr. Alfred (Baron von Drachstedt")" male 20.0 0 0 SC/PARIS 2166 13.8625 D38 C
1298 NaN 2 Ware, Mr. William Jeffery male 23.0 1 0 28666 10.5000 NaN S
1299 NaN 1 Widener, Mr. George Dunton male 50.0 1 1 113503 211.5000 C80 C
1300 NaN 3 Riordan, Miss. Johanna Hannah"" female NaN 0 0 334915 7.7208 NaN Q
1301 NaN 3 Peacock, Miss. Treasteall female 3.0 1 1 SOTON/O.Q. 3101315 13.7750 NaN S
1302 NaN 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q
1303 NaN 1 Minahan, Mrs. William Edward (Lillian E Thorpe) female 37.0 1 0 19928 90.0000 C78 Q
1304 NaN 3 Henriksson, Miss. Jenny Lovisa female 28.0 0 0 347086 7.7750 NaN S
1305 NaN 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
1306 NaN 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
1307 NaN 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
1308 NaN 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
1309 NaN 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
1309 rows × 11 columns
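The effect of combining the two files can be sketched on a tiny example. The frames `toy_train` and `toy_test` and their values are made up for illustration; the point is only that rows coming from the test side receive `NaN` in `Survived`, which also turns that column into `float64`.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are made up for illustration).
toy_train = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)
toy_test = pd.DataFrame(
    {"Pclass": [3, 2], "Sex": ["male", "female"]},
    index=pd.Index([892, 893], name="PassengerId"),
)

# Stack the two frames; rows from toy_test get NaN in the Survived column.
toy_df = pd.concat([toy_train, toy_test], sort=False, axis="rows")

# A missing Survived value is enough to tell the two parts apart again later.
is_test = toy_df["Survived"].isnull()
print(toy_df)
print("train rows:", (~is_test).sum(), "test rows:", is_test.sum())
```

Processing both parts together keeps the encodings identical for training and test rows, at the cost of having to split them again before fitting.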
Check the characteristics of the data with describe().
# Summary statistics (a quick check for outliers)
df.describe()
Output
Survived Pclass Age SibSp Parch Fare
count 891.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 0.383838 2.294882 29.881138 0.498854 0.385027 33.295479
std 0.486592 0.837836 14.413493 1.041658 0.865560 51.758668
min 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1.000000 3.000000 80.000000 8.000000 9.000000 512.329200
Check whether there are missing values.
# Check for missing values
df.isnull().sum()
Output
Survived 418
Pclass 0
Name 0
Sex 0
Age 263
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 1014
Embarked 2
dtype: int64
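Four feature columns, {Age, Fare, Cabin, Embarked}, contain missing values. (The 418 missing values in Survived are simply the test rows, which have no label.)
We need to decide how to handle these missing values.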
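To decide how to treat each column, it can help to look at the share of missing values and not only the raw count. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration); `Survived` is dropped first because its gaps only mark unlabeled test rows.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kinds of gaps as the combined Titanic data.
toy_df = pd.DataFrame({
    "Survived": [0.0, 1.0, np.nan, np.nan],
    "Age": [22.0, np.nan, 34.5, np.nan],
    "Fare": [7.25, 71.2833, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Look at feature columns only; missing Survived just means "test row".
features = toy_df.drop(columns=["Survived"])

# Count and ratio of missing values per column, largest first.
missing = pd.DataFrame({
    "count": features.isnull().sum(),
    "ratio": features.isnull().mean(),
})
missing = missing[missing["count"] > 0].sort_values("count", ascending=False)
print(missing)
```

A column that is mostly empty (as Cabin is in the real data) usually calls for a different treatment than one with a single gap.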
Check whether every column already has a type the model can read (numeric).
# Check the dtypes
df.dtypes
Output
Survived float64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
The five object-type columns, {Name, Sex, Ticket, Cabin, Embarked}, are candidates for conversion to numeric values.
Start with Name.
# Convert Name
## Check the values
df['Name'].value_counts()
Output
Connolly, Miss. Kate 2
Kelly, Mr. James 2
Saad, Mr. Amin 1
Lievens, Mr. Rene Aime 1
Turja, Miss. Anna Sofia 1
Collett, Mr. Sidney C Stuart 1
Hassan, Mr. Houssein G N 1
Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo) 1
Pavlovic, Mr. Stefo 1
Swift, Mrs. Frederick Joel (Margaret Welles Barron) 1
Lindahl, Miss. Agda Thorilda Viktoria 1
Carter, Mr. William Ernest 1
Ponesell, Mr. Martin 1
Mellinger, Miss. Madeleine Violet 1
Ryerson, Mr. Arthur Larned 1
Kenyon, Mr. Frederick R 1
O'Connor, Mr. Patrick 1
Daly, Mr. Peter Denis 1
Gracie, Col. Archibald IV 1
Samaan, Mr. Hanna 1
Johnson, Miss. Eleanor Ileen 1
Walker, Mr. William Anderson 1
Torber, Mr. Ernst William 1
Vander Planke, Mr. Leo Edmondus 1
Sage, Mr. John George 1
Mullens, Miss. Katherine "Katie" 1
Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") 1
Olsen, Mr. Henry Margido 1
Behr, Mr. Karl Howell 1
Braund, Mr. Owen Harris 1
..
Mack, Mrs. (Mary) 1
Slayter, Miss. Hilda Mary 1
McCrae, Mr. Arthur Gordon 1
Goldsmith, Mrs. Frank John (Emily Alice Brown) 1
Blank, Mr. Henry 1
Dean, Miss. Elizabeth Gladys Millvina"" 1
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) 1
Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood) 1
O'Dwyer, Miss. Ellen "Nellie" 1
Sage, Miss. Stella Anna 1
Kallio, Mr. Nikolai Erland 1
Demetri, Mr. Marinko 1
Boulos, Mrs. Joseph (Sultana) 1
Landergren, Miss. Aurora Adelia 1
Braund, Mr. Lewis Richard 1
Wilkes, Mrs. James (Ellen Needs) 1
Clark, Mrs. Walter Miller (Virginia McDowell) 1
Lundstrom, Mr. Thure Edvin 1
Dorking, Mr. Edward Arthur 1
Oliva y Ocana, Dona. Fermina 1
Mernagh, Mr. Robert 1
Shutes, Miss. Elizabeth W 1
Anderson, Mr. Harry 1
McGowan, Miss. Katherine 1
Eitemiller, Mr. George Floyd 1
Van Impe, Mr. Jean Baptiste 1
Palsson, Miss. Stina Viola 1
Smith, Mrs. Lucien Philip (Mary Eloise Hughes) 1
Bourke, Miss. Mary 1
Wick, Miss. Mary Natalie 1
Name: Name, Length: 1307, dtype: int64
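Almost every value is unique (1,307 distinct names across 1,309 rows; two names appear twice), so the raw name cannot be used as a category.
Instead, count the titles {Mr., Miss., Mrs., Master.}.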
print('Mr.:{}'.format(df['Name'].str.contains('Mr. ').sum()))
print('Miss.:{}'.format(df['Name'].str.contains('Miss. ').sum()))
print('Mrs.:{}'.format(df['Name'].str.contains('Mrs. ').sum()))
print('Master.:{}'.format(df['Name'].str.contains('Master. ').sum()))
Output
Mr.:761
Miss.:260
Mrs.:197
Master.:61
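The counts for {Mr., Miss., Mrs., Master.} add up to 1,279, short of the 1,309 rows,
so replace these first and then look at the names that remain. Note that str.contains treats its pattern as a regular expression by default (the "." matches any character), and that the replacements run in order: once a row has been replaced by an earlier pattern it is no longer a string, which is why Miss. shows 258 below and not 260.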
df.loc[df['Name'].str.contains('Mr. ') == True, 'Name'] = 0
df.loc[df['Name'].str.contains('Miss. ') == True, 'Name'] = 1
df.loc[df['Name'].str.contains('Mrs. ') == True, 'Name'] = 2
df.loc[df['Name'].str.contains('Master. ') == True, 'Name'] = 3
df['Name'].value_counts()
Output
0 761
1 258
2 197
3 61
Lahtinen, Rev. William 1
Bateman, Rev. Robert James 1
Uruchurtu, Don. Manuel E 1
Weir, Col. John 1
Moraweck, Dr. Ernest 1
Leader, Dr. Alice (Farnham) 1
Dodge, Dr. Washington 1
Stahelin-Maeglin, Dr. Max 1
Minahan, Dr. William Edward 1
Reynaldo, Ms. Encarnacion 1
Harper, Rev. John 1
Kirkland, Rev. Charles Leonard 1
Montvila, Rev. Juozas 1
Sagesser, Mlle. Emma 1
Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 1
Oliva y Ocana, Dona. Fermina 1
Carter, Rev. Ernest Courtenay 1
Peruschitz, Rev. Joseph Maria 1
Brewe, Dr. Arthur Jackson 1
Reuchlin, Jonkheer. John George 1
Aubart, Mme. Leontine Pauline 1
Pain, Dr. Alfred 1
Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") 1
Astor, Col. John Jacob 1
Peuchen, Major. Arthur Godfrey 1
Crosby, Capt. Edward Gifford 1
Gracie, Col. Archibald IV 1
Butt, Major. Archibald Willingham 1
O'Donoghue, Ms. Bridget 1
Frauenthal, Dr. Henry William 1
Byles, Rev. Thomas Roussel Davids 1
Simonius-Blumer, Col. Oberst Alfons 1
Name: Name, dtype: int64
Count the remaining titles before converting them.
print('Dr.:{}'.format(df['Name'].str.contains('Dr. ').sum()))
print('Rev.:{}'.format(df['Name'].str.contains('Rev. ').sum()))
print('Col.:{}'.format(df['Name'].str.contains('Col. ').sum()))
print('Major.:{}'.format(df['Name'].str.contains('Major. ').sum()))
print('Jonkheer.:{}'.format(df['Name'].str.contains('Jonkheer. ').sum()))
print('Mme.:{}'.format(df['Name'].str.contains('Mme. ').sum()))
print('Capt.:{}'.format(df['Name'].str.contains('Capt. ').sum()))
print('Ms.:{}'.format(df['Name'].str.contains('Ms. ').sum()))
print('Mlle.:{}'.format(df['Name'].str.contains('Mlle. ').sum()))
print('Don.:{}'.format(df['Name'].str.contains('Don. ').sum()))
print('Countess.:{}'.format(df['Name'].str.contains('Countess. ').sum()))
print('Sir.:{}'.format(df['Name'].str.contains('Sir. ').sum()))
Output
Dr.:8
Rev.:8
Col.:4
Major.:2
Jonkheer.:1
Mme.:1
Capt.:1
Ms.:2
Mlle.:1
Don.:1
Countess.:1
Sir.:1
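Convert each remaining title to a number. Note that the code below assigns 3 to Dr., the same code already used for Master., so the two groups end up merged (code 3 has 69 rows in the output); give Dr. its own code if the groups should stay separate.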
df.loc[df['Name'].str.contains('Dr. ') == True, 'Name'] = 3
df.loc[df['Name'].str.contains('Rev. ') == True, 'Name'] = 4
df.loc[df['Name'].str.contains('Col. ') == True, 'Name'] = 5
df.loc[df['Name'].str.contains('Major. ') == True, 'Name'] = 6
df.loc[df['Name'].str.contains('Jonkheer. ') == True, 'Name'] = 7
df.loc[df['Name'].str.contains('Mme. ') == True, 'Name'] = 8
df.loc[df['Name'].str.contains('Capt. ') == True, 'Name'] = 9
df.loc[df['Name'].str.contains('Ms. ') == True, 'Name'] = 10
df.loc[df['Name'].str.contains('Mlle. ') == True, 'Name'] = 11
df.loc[df['Name'].str.contains('Don. ') == True, 'Name'] = 12
df.loc[df['Name'].str.contains('Countess. ') == True, 'Name'] = 13
df.loc[df['Name'].str.contains('Sir. ') == True, 'Name'] = 14
df.loc[df['Name'].str.contains('Dona. ') == True, 'Name'] = 15
df['Name'].value_counts()
Output
0 761
1 258
2 197
3 69
4 8
5 4
10 2
6 2
15 1
14 1
13 1
12 1
11 1
9 1
8 1
7 1
Name: Name, dtype: int64
The conversion of Name is complete.
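As an alternative to one `str.contains` call per title, the title can be pulled out with a single regular expression. This is a sketch under assumptions, not the method used in this article: the pattern, the `title_codes` mapping, and the catch-all code 4 are all hypothetical choices. The last two lines also show why an unescaped "." in `str.contains` can over-match.

```python
import pandas as pd

# A few names taken from the listings in this article.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    'Mayne, Mlle. Berthe Antonine ("Mrs de Villiers")',
])

# Capture the text between the comma and the first period after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical encoding: four common titles, everything else in bucket 4.
title_codes = titles.map({"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}).fillna(4).astype(int)

# With the default regex=True, "Mr. " also matches "Mrs " inside the last name.
regex_hits = names.str.contains("Mr. ")
literal_hits = names.str.contains("Mr. ", regex=False)

print(pd.DataFrame({"title": titles, "code": title_codes}))
print("regex hits:", regex_hits.sum(), "literal hits:", literal_hits.sum())
```

Extracting the title once keeps each row in exactly one category regardless of the order in which titles are handled.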
Next, convert Sex to a number.
# Convert Sex
## Check the values
df['Sex'].value_counts()
Output
male 843
female 466
Name: Sex, dtype: int64
There are only two values, male and female, so convert them with {male:0, female:1}.
df['Sex'] = df['Sex'].replace({'male':0, 'female':1})
df['Sex'].value_counts()
Output
0 843
1 466
Name: Sex, dtype: int64
The conversion of Sex is complete.
Next, look at converting Ticket to a number.
# Convert Ticket
## Check the values
df['Ticket'].value_counts()
Output
CA. 2343 11
1601 8
CA 2144 8
347077 7
3101295 7
347082 7
PC 17608 7
S.O.C. 14879 7
113781 6
382652 6
19950 6
347088 6
W./C. 6608 5
16966 5
349909 5
113503 5
220845 5
4133 5
PC 17757 5
230136 4
113760 4
24160 4
C.A. 33112 4
W./C. 6607 4
36928 4
PC 17760 4
LINE 4
C.A. 34651 4
2666 4
C.A. 2315 4
..
248744 1
365235 1
349228 1
345765 1
C 7077 1
315090 1
364511 1
PC 17609 1
C.A. 33111 1
345775 1
350409 1
SOTON/O.Q. 3101308 1
367229 1
PC 17596 1
35851 1
W/C 14208 1
347464 1
SOTON/O.Q. 3101306 1
SC/PARIS 2148 1
363272 1
347064 1
14311 1
312992 1
349215 1
250650 1
343120 1
SC/A.3 2861 1
237393 1
349236 1
C.A. 15185 1
Name: Ticket, Length: 929, dtype: int64
The format is too irregular to encode directly, so for now the plan is to drop this column.
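If dropping the column feels wasteful, one possible direction (not pursued in this article) is to use only how many passengers share each ticket, since the value counts show that tickets are often shared. A minimal sketch, assuming a hypothetical feature name `ticket_group_size`:

```python
import pandas as pd

# A few ticket strings as they appear in the data; the repetition is made up.
ticket = pd.Series(["349909", "PC 17599", "349909", "CA. 2343", "349909"])

# Map every ticket to the number of rows that carry the same ticket string.
ticket_group_size = ticket.map(ticket.value_counts())
print(ticket_group_size.tolist())
```

Whether such a feature helps would have to be checked against a validation score; it is shown here only as an option.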
Next, handle the missing values in Cabin and convert it to a number.
# Process Cabin
## Check the values
df['Cabin'].value_counts()
Output
C23 C25 C27 6
B57 B59 B63 B66 5
G6 5
C22 C26 4
F33 4
C78 4
F2 4
F4 4
D 4
B96 B98 4
A34 3
B58 B60 3
B51 B53 B55 3
C101 3
E34 3
E101 3
E67 2
D17 2
C65 2
C80 2
C46 2
D37 2
C92 2
E25 2
D10 D12 2
D26 2
B77 2
C106 2
E46 2
C68 2
..
C111 1
F E69 1
E52 1
B52 B54 B56 1
E58 1
B79 1
B37 1
C95 1
B102 1
C87 1
E38 1
C50 1
C148 1
B10 1
C49 1
B61 1
A36 1
D34 1
B39 1
A32 1
E60 1
B101 1
C70 1
C82 1
E39 E41 1
E10 1
A21 1
E63 1
A18 1
D56 1
Name: Cabin, Length: 186, dtype: int64
Count the rows by the deck letter that appears in each Cabin value.
print('A:{}'.format(df['Cabin'].str.contains('A').sum()))
print('B:{}'.format(df['Cabin'].str.contains('B').sum()))
print('C:{}'.format(df['Cabin'].str.contains('C').sum()))
print('D:{}'.format(df['Cabin'].str.contains('D').sum()))
print('E:{}'.format(df['Cabin'].str.contains('E').sum()))
print('F:{}'.format(df['Cabin'].str.contains('F').sum()))
print('G:{}'.format(df['Cabin'].str.contains('G').sum()))
print('T:{}'.format(df['Cabin'].str.contains('T').sum()))
Output
A:22
B:65
C:94
D:46
E:44
F:21
G:9
T:1
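For now, fill the missing values with 0 and number the remaining values from 1 upward in descending order of count. Because the replacements run in order, a value that contains several letters (for example "F E69") takes the code of the first pattern it matches, which is why F and G show fewer rows below than in the counts above.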
df['Cabin'] = df['Cabin'].fillna(0)
df.loc[df['Cabin'].str.contains('C') == True, 'Cabin'] = 1
df.loc[df['Cabin'].str.contains('B') == True, 'Cabin'] = 2
df.loc[df['Cabin'].str.contains('D') == True, 'Cabin'] = 3
df.loc[df['Cabin'].str.contains('E') == True, 'Cabin'] = 4
df.loc[df['Cabin'].str.contains('A') == True, 'Cabin'] = 5
df.loc[df['Cabin'].str.contains('F') == True, 'Cabin'] = 6
df.loc[df['Cabin'].str.contains('G') == True, 'Cabin'] = 7
df.loc[df['Cabin'].str.contains('T') == True, 'Cabin'] = 8
df['Cabin'].value_counts()
Output
0 1014
1 94
2 65
3 46
4 44
5 22
6 18
7 5
8 1
Name: Cabin, dtype: int64
The conversion of Cabin is complete.
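A more compact variant takes the first character of each Cabin value as the deck. This is a sketch under assumptions: the placeholder "U" for unknown cabins is made up, and the result differs from the code above for values such as "F E69", which become F here.

```python
import numpy as np
import pandas as pd

# Cabin values of the kinds seen in the data, including missing ones.
cabin = pd.Series(["C85", np.nan, "C23 C25 C27", "F E69", "G6", np.nan])

# First character = deck letter; "U" is a made-up placeholder for missing values.
deck = cabin.fillna("U").str[0]

# Same numbering as in the article: 0 for missing, then by descending frequency.
deck_codes = deck.map(
    {"U": 0, "C": 1, "B": 2, "D": 3, "E": 4, "A": 5, "F": 6, "G": 7, "T": 8}
)
print(pd.DataFrame({"Cabin": cabin, "deck": deck, "code": deck_codes}))
```

Since only the first character is read, each row lands in exactly one deck and the order of the rules no longer matters.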
Next, handle the missing values in Embarked and convert it to a number.
# Process Embarked
## Check the values
df['Embarked'].value_counts()
Output
S 914
C 270
Q 123
Name: Embarked, dtype: int64
Only 2 values are missing, so fill them with S, the most frequent value.
df['Embarked'] = df['Embarked'].fillna('S')
df['Embarked'].value_counts()
Output
S 916
C 270
Q 123
Name: Embarked, dtype: int64
Convert the values with {S:0, C:1, Q:2}.
df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2
df['Embarked'].value_counts()
Output
0 916
1 270
2 123
Name: Embarked, dtype: int64
The processing of Embarked is complete.
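Integer codes such as {S:0, C:1, Q:2} imply an order that the ports do not really have. Tree-based models usually cope with that, but for other models one-hot encoding is a common alternative. A minimal sketch with `pd.get_dummies` on made-up values (the variable names are assumptions):

```python
import pandas as pd

# Made-up Embarked values with one gap.
embarked = pd.Series(["S", "C", None, "Q", "S"], name="Embarked")

# Fill the gap with the most frequent port, as done above.
embarked = embarked.fillna(embarked.mode()[0])

# One indicator column per port instead of a single integer code.
dummies = pd.get_dummies(embarked, prefix="Embarked")
print(dummies)
```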
Next, handle the missing values in Age.
For now, fill them with the mean.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.isnull().sum()
Output
Survived 418
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 0
Embarked 0
dtype: int64
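Filling with the overall mean is the simplest choice. For comparison, here is a sketch of two other common options, the overall median and the median within each Pclass, on made-up values (the frame `toy` and the choice of Pclass as the grouping key are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up ages with one gap per ticket class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [38.0, 54.0, np.nan, 22.0, 2.0, np.nan],
})

# Overall mean (what the article uses) and overall median.
age_mean = toy["Age"].fillna(toy["Age"].mean())
age_median = toy["Age"].fillna(toy["Age"].median())

# Median within each ticket class; transform keeps the original row order.
age_by_class = toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("median"))

print(pd.DataFrame({"mean": age_mean, "median": age_median, "by_class": age_by_class}))
```

Group-wise filling keeps the imputed ages closer to passengers of a similar background, which may or may not improve the model; that is something to verify on validation data.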
Fare has only one missing value, so fill it with 0 for now, then split the data back into training and test sets.
df['Fare'] = df['Fare'].fillna(0)
train = df.loc[:891]
train
Output
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0.0 3 0 0 22.000000 1 0 A/5 21171 7.2500 0 0
2 1.0 1 2 1 38.000000 1 0 PC 17599 71.2833 1 1
3 1.0 3 1 1 26.000000 0 0 STON/O2. 3101282 7.9250 0 0
4 1.0 1 2 1 35.000000 1 0 113803 53.1000 1 0
5 0.0 3 0 0 35.000000 0 0 373450 8.0500 0 0
6 0.0 3 0 0 29.881138 0 0 330877 8.4583 0 2
7 0.0 1 0 0 54.000000 0 0 17463 51.8625 4 0
8 0.0 3 3 0 2.000000 3 1 349909 21.0750 0 0
9 1.0 3 2 1 27.000000 0 2 347742 11.1333 0 0
10 1.0 2 2 1 14.000000 1 0 237736 30.0708 0 1
11 1.0 3 1 1 4.000000 1 1 PP 9549 16.7000 7 0
12 1.0 1 1 1 58.000000 0 0 113783 26.5500 1 0
13 0.0 3 0 0 20.000000 0 0 A/5. 2151 8.0500 0 0
14 0.0 3 0 0 39.000000 1 5 347082 31.2750 0 0
15 0.0 3 1 1 14.000000 0 0 350406 7.8542 0 0
16 1.0 2 2 1 55.000000 0 0 248706 16.0000 0 0
17 0.0 3 3 0 2.000000 4 1 382652 29.1250 0 2
18 1.0 2 0 0 29.881138 0 0 244373 13.0000 0 0
19 0.0 3 2 1 31.000000 1 0 345763 18.0000 0 0
20 1.0 3 2 1 29.881138 0 0 2649 7.2250 0 1
21 0.0 2 0 0 35.000000 0 0 239865 26.0000 0 0
22 1.0 2 0 0 34.000000 0 0 248698 13.0000 3 0
23 1.0 3 1 1 15.000000 0 0 330923 8.0292 0 2
24 1.0 1 0 0 28.000000 0 0 113788 35.5000 5 0
25 0.0 3 1 1 8.000000 3 1 349909 21.0750 0 0
26 1.0 3 2 1 38.000000 1 5 347077 31.3875 0 0
27 0.0 3 0 0 29.881138 0 0 2631 7.2250 0 1
28 0.0 1 0 0 19.000000 3 2 19950 263.0000 1 0
29 1.0 3 1 1 29.881138 0 0 330959 7.8792 0 2
30 0.0 3 0 0 29.881138 0 0 349216 7.8958 0 0
... ... ... ... ... ... ... ... ... ... ... ...
862 0.0 2 0 0 21.000000 1 0 28134 11.5000 0 0
863 1.0 1 2 1 48.000000 0 0 17466 25.9292 3 0
864 0.0 3 1 1 29.881138 8 2 CA. 2343 69.5500 0 0
865 0.0 2 0 0 24.000000 0 0 233866 13.0000 0 0
866 1.0 2 2 1 42.000000 0 0 236852 13.0000 0 0
867 1.0 2 1 1 27.000000 1 0 SC/PARIS 2149 13.8583 0 1
868 0.0 1 0 0 31.000000 0 0 PC 17590 50.4958 5 0
869 0.0 3 0 0 29.881138 0 0 345777 9.5000 0 0
870 1.0 3 3 0 4.000000 1 1 347742 11.1333 0 0
871 0.0 3 0 0 26.000000 0 0 349248 7.8958 0 0
872 1.0 1 2 1 47.000000 1 1 11751 52.5542 3 0
873 0.0 1 0 0 33.000000 0 0 695 5.0000 2 0
874 0.0 3 0 0 47.000000 0 0 345765 9.0000 0 0
875 1.0 2 2 1 28.000000 1 0 P/PP 3381 24.0000 0 1
876 1.0 3 1 1 15.000000 0 0 2667 7.2250 0 1
877 0.0 3 0 0 20.000000 0 0 7534 9.8458 0 0
878 0.0 3 0 0 19.000000 0 0 349212 7.8958 0 0
879 0.0 3 0 0 29.881138 0 0 349217 7.8958 0 0
880 1.0 1 2 1 56.000000 0 1 11767 83.1583 1 1
881 1.0 2 2 1 25.000000 0 1 230433 26.0000 0 0
882 0.0 3 0 0 33.000000 0 0 349257 7.8958 0 0
883 0.0 3 1 1 22.000000 0 0 7552 10.5167 0 0
884 0.0 2 0 0 28.000000 0 0 C.A./SOTON 34068 10.5000 0 0
885 0.0 3 0 0 25.000000 0 0 SOTON/OQ 392076 7.0500 0 0
886 0.0 3 2 1 39.000000 0 5 382652 29.1250 0 2
887 0.0 2 4 0 27.000000 0 0 211536 13.0000 0 0
888 1.0 1 1 1 19.000000 0 0 112053 30.0000 2 0
889 0.0 3 1 1 29.881138 1 2 W./C. 6607 23.4500 0 0
890 1.0 1 0 0 26.000000 0 0 111369 30.0000 1 1
891 0.0 3 0 0 32.000000 0 0 370376 7.7500 0 2
X_test = df.loc[892:]
X_test = X_test.drop(['Survived', 'Ticket'], axis='columns')
X_test
Output
Pclass Name Sex Age SibSp Parch Fare Cabin Embarked
PassengerId
892 3 0 0 34.500000 0 0 7.8292 0 2
893 3 2 1 47.000000 1 0 7.0000 0 0
894 2 0 0 62.000000 0 0 9.6875 0 2
895 3 0 0 27.000000 0 0 8.6625 0 0
896 3 2 1 22.000000 1 1 12.2875 0 0
897 3 0 0 14.000000 0 0 9.2250 0 0
898 3 1 1 30.000000 0 0 7.6292 0 2
899 2 0 0 26.000000 1 1 29.0000 0 0
900 3 2 1 18.000000 0 0 7.2292 0 1
901 3 0 0 21.000000 2 0 24.1500 0 0
902 3 0 0 29.881138 0 0 7.8958 0 0
903 1 0 0 46.000000 0 0 26.0000 0 0
904 1 2 1 23.000000 1 0 82.2667 2 0
905 2 0 0 63.000000 1 0 26.0000 0 0
906 1 2 1 47.000000 1 0 61.1750 4 0
907 2 2 1 24.000000 1 0 27.7208 0 1
908 2 0 0 35.000000 0 0 12.3500 0 2
909 3 0 0 21.000000 0 0 7.2250 0 1
910 3 1 1 27.000000 1 0 7.9250 0 0
911 3 2 1 45.000000 0 0 7.2250 0 1
912 1 0 0 55.000000 1 0 59.4000 0 1
913 3 3 0 9.000000 0 1 3.1708 0 0
914 1 2 1 29.881138 0 0 31.6833 0 0
915 1 0 0 21.000000 0 1 61.3792 0 1
916 1 2 1 48.000000 1 3 262.3750 2 1
917 3 0 0 50.000000 1 0 14.5000 0 0
918 1 1 1 22.000000 0 1 61.9792 2 1
919 3 0 0 22.500000 0 0 7.2250 0 1
920 1 0 0 41.000000 0 0 30.5000 5 0
921 3 0 0 29.881138 2 0 21.6792 0 1
... ... ... ... ... ... ... ... ... ...
1280 3 0 0 21.000000 0 0 7.7500 0 2
1281 3 3 0 6.000000 3 1 21.0750 0 0
1282 1 0 0 23.000000 0 0 93.5000 2 0
1283 1 2 1 51.000000 0 1 39.4000 3 0
1284 3 3 0 13.000000 0 2 20.2500 0 0
1285 2 0 0 47.000000 0 0 10.5000 0 0
1286 3 0 0 29.000000 3 1 22.0250 0 0
1287 1 2 1 18.000000 1 0 60.0000 1 0
1288 3 0 0 24.000000 0 0 7.2500 0 2
1289 1 2 1 48.000000 1 1 79.2000 2 1
1290 3 0 0 22.000000 0 0 7.7750 0 0
1291 3 0 0 31.000000 0 0 7.7333 0 2
1292 1 1 1 30.000000 0 0 164.8667 1 0
1293 2 0 0 38.000000 1 0 21.0000 0 0
1294 1 1 1 22.000000 0 1 59.4000 0 1
1295 1 0 0 17.000000 0 0 47.1000 0 0
1296 1 0 0 43.000000 1 0 27.7208 3 1
1297 2 0 0 20.000000 0 0 13.8625 3 1
1298 2 0 0 23.000000 1 0 10.5000 0 0
1299 1 0 0 50.000000 1 1 211.5000 1 1
1300 3 1 1 29.881138 0 0 7.7208 0 2
1301 3 1 1 3.000000 1 1 13.7750 0 0
1302 3 1 1 29.881138 0 0 7.7500 0 2
1303 1 2 1 37.000000 1 0 90.0000 1 2
1304 3 1 1 28.000000 0 0 7.7750 0 0
1305 3 0 0 29.881138 0 0 8.0500 0 0
1306 1 15 1 39.000000 0 0 108.9000 1 1
1307 3 0 0 38.500000 0 0 7.2500 0 0
1308 3 0 0 29.881138 0 0 8.0500 0 0
1309 3 3 0 29.881138 1 1 22.3583 0 1
418 rows × 9 columns
Split the training data into explanatory variables and the target variable.
X_train = train.drop(['Survived', 'Ticket'], axis='columns')
y_train = train['Survived']
print(X_train.head())
print(y_train.head())
Output
Pclass Name Sex Age SibSp Parch Fare Cabin Embarked
PassengerId
1 3 0 0 22.0 1 0 7.2500 0 0
2 1 2 1 38.0 1 0 71.2833 1 1
3 3 1 1 26.0 0 0 7.9250 0 0
4 1 2 1 35.0 1 0 53.1000 1 0
5 3 0 0 35.0 0 0 8.0500 0 0
PassengerId
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
Name: Survived, dtype: float64
Define the model and fit it to the training data.
This time the model is RandomForestClassifier.
model = RandomForestClassifier(n_estimators=200, random_state=71)
model.fit(X_train, y_train)
Output
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
oob_score=False, random_state=71, verbose=0, warm_start=False)
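Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which columns it relied on. A minimal sketch on synthetic data (`X_toy`, `y_toy`, and the `Noise` column are made up for illustration and are not the Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on "Sex"; "Noise" is irrelevant.
rng = np.random.RandomState(71)
X_toy = pd.DataFrame({
    "Sex": rng.randint(0, 2, size=200),
    "Noise": rng.rand(200),
})
y_toy = X_toy["Sex"]

toy_model = RandomForestClassifier(n_estimators=50, random_state=71)
toy_model.fit(X_toy, y_toy)

# Impurity-based importances, one value per column, summing to 1.
importances = pd.Series(
    toy_model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, the same two lines applied to `model` and `X_train.columns` would show how much each engineered column contributes.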
Predict with the trained model.
y_pred = model.predict(X_test)
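Load gender_submission.csv to compare against. Note that this file is Kaggle's sample submission, which predicts that all female passengers survive; it is not the true labels of the test set.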
y_true = pd.read_csv('./data/gender_submission.csv', index_col='PassengerId')
y_true
Output
Survived
PassengerId
892 0
893 1
894 0
895 0
896 1
897 0
898 1
899 0
900 1
901 0
902 0
903 0
904 1
905 0
906 1
907 1
908 0
909 0
910 1
911 1
912 0
913 0
914 1
915 0
916 1
917 0
918 1
919 0
920 0
921 0
... ...
1280 0
1281 0
1282 0
1283 1
1284 0
1285 0
1286 0
1287 1
1288 0
1289 1
1290 0
1291 0
1292 1
1293 0
1294 1
1295 0
1296 0
1297 0
1298 0
1299 0
1300 1
1301 1
1302 1
1303 1
1304 1
1305 0
1306 1
1307 0
1308 0
1309 0
418 rows × 1 columns
Check the result with accuracy_score, confusion_matrix, and classification_report.
print('Accuracy: {}'.format(accuracy_score(y_true=y_true , y_pred=y_pred))+ '\n')
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_true , y_pred=y_pred))+ '\n')
print('Classification report:\n{}'.format(classification_report(y_true=y_true , y_pred=y_pred)))
Output
Accuracy: 0.8229665071770335
Confusion matrix:
[[225 41]
[ 33 119]]
Classification report:
precision recall f1-score support
0 0.87 0.85 0.86 266
1 0.74 0.78 0.76 152
micro avg 0.82 0.82 0.82 418
macro avg 0.81 0.81 0.81 418
weighted avg 0.83 0.82 0.82 418
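The predictions agree with gender_submission.csv on just over 80% of the rows. Because that file is a sample submission and not ground truth, this figure is the agreement rate with a gender-based baseline, not the model's real accuracy.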
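Because test.csv carries no labels, a local accuracy estimate has to come from the labeled training data. One common approach is to hold out part of it with `train_test_split` (imported at the top of this article). The sketch below runs on synthetic data: `X_toy`, `y_toy`, the labeling rule, and the 25% split are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed training data.
rng = np.random.RandomState(71)
n = 300
X_toy = pd.DataFrame({
    "Sex": rng.randint(0, 2, size=n),
    "Pclass": rng.randint(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
})
# Made-up rule: the label follows Sex, with about 10% of labels flipped.
y_toy = ((X_toy["Sex"] == 1) ^ (rng.rand(n) < 0.1)).astype(int)

# Hold out 25% of the labeled rows, keeping the class ratio similar.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=71, stratify=y_toy
)

val_model = RandomForestClassifier(n_estimators=200, random_state=71)
val_model.fit(X_tr, y_tr)

# Accuracy on rows the model never saw during fitting.
val_acc = accuracy_score(y_val, val_model.predict(X_val))
print("validation accuracy:", val_acc)
```

With the real data, `X_train` and `y_train` from this article would take the place of `X_toy` and `y_toy`.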
To have Kaggle score the predictions, create a CSV file that includes PassengerId and submit it.
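Building the submission file can be sketched as follows. The IDs and predictions here are made up; in the article's flow they would be `X_test.index` and `y_pred`. The cast to `int` is there because the predictions come back as floats (Survived became float64 after the concat), while gender_submission.csv uses 0/1 integers. The sketch writes to an in-memory buffer; a file path such as a hypothetical "submission.csv" would work the same way.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for X_test.index and y_pred.
passenger_ids = pd.Index([892, 893, 894], name="PassengerId")
toy_pred = np.array([0.0, 1.0, 0.0])

# Two columns in the end: PassengerId (from the index) and Survived as 0/1.
submission = pd.DataFrame({"Survived": toy_pred.astype(int)}, index=passenger_ids)

# Write to a buffer here; pass a file path to to_csv to create a real file.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```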
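4. Summary
I implemented the full machine learning workflow from start to finish.
The result still leaves plenty of room for improvement.
As the code shows, most of the work is preprocessing.
Kaggle datasets tend to be reasonably tidy, so preprocessing may not take that long there. In real-world work, though, you search for the variables that matter through repeated trial and error. (It was far more hands-on grunt work than I had imagined.)
I also often hear that Kaggle Kernels are a good reference, so I plan to keep improving the accuracy while learning from them.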