[For Machine Learning Beginners] Trying Out Kaggle's Titanic Survival Prediction

Posted on 2019-07-31

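1. Introduction

I have been writing up what I learn about machine learning bit by bit, but I had not yet written up a full run through the whole workflow. In this article I walk through the overall picture using Kaggle's well-known Titanic survival prediction model.

My earlier articles may also be useful for reference.

Previous articles
・"Four Warnings" to Myself as a Machine Learning Beginner
https://qiita.com/takuya_tsurumi/items/38f2858221599d5f93bd

・Notes on Data Preprocessing for Machine Learning
https://qiita.com/takuya_tsurumi/items/53b9e3f7427b631b17cf

・[Machine Learning] Summary of the Differences Between Decision Tree Models
https://qiita.com/takuya_tsurumi/items/23fdc43ee0e54ec7c87e

2. About the data

The data for Kaggle's Titanic survival prediction can be downloaded from the following page.
https://www.kaggle.com/c/titanic/overview

Download it and look at what it contains.
The data has the following columns.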

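| No | Column | Description | Notes |
|----|--------|-------------|-------|
| 1 | PassengerId | Passenger ID | |
| 2 | Survived | Whether the passenger survived | Target variable for this task |
| 3 | Pclass | Ticket class | |
| 4 | Name | Name | |
| 5 | Sex | Sex | |
| 6 | Age | Age | Fractional for ages under 1; estimated ages take the form xx.5 |
| 7 | SibSp | Number of siblings/spouses aboard the Titanic | |
| 8 | Parch | Number of parents/children aboard the Titanic | |
| 9 | Ticket | Ticket number | |
| 10 | Fare | Passenger fare | |
| 11 | Cabin | Cabin number | |
| 12 | Embarked | Port of embarkation | |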

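The rest of this article works through this data.

3. Walkthrough

The overall flow is as follows.

1. Load the data
2. Inspect the data
3. Process the data (convert to numeric values and handle missing values)
4. Build the model
5. Evaluate the model

First, import the libraries and load the data.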

# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Load the data, using PassengerId as the index
train = pd.read_csv('./data/train.csv', index_col='PassengerId')
test = pd.read_csv('./data/test.csv', index_col='PassengerId')
print(train.head())
print(test.head())

The output is shown below.

            Survived  Pclass  \
PassengerId                     
1                   0       3   
2                   1       1   
3                   1       3   
4                   1       1   
5                   0       3   

                                                          Name     Sex   Age  \
PassengerId                                                                    
1                                      Braund, Mr. Owen Harris    male  22.0   
2            Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0   
3                                       Heikkinen, Miss. Laina  female  26.0   
4                 Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0   
5                                     Allen, Mr. William Henry    male  35.0   

             SibSp  Parch            Ticket     Fare Cabin Embarked  
PassengerId                                                          
1                1      0         A/5 21171   7.2500   NaN        S  
2                1      0          PC 17599  71.2833   C85        C  
3                0      0  STON/O2. 3101282   7.9250   NaN        S  
4                1      0            113803  53.1000  C123        S  
5                0      0            373450   8.0500   NaN        S  
             Pclass                                          Name     Sex  \
PassengerId                                                                 
892               3                              Kelly, Mr. James    male   
893               3              Wilkes, Mrs. James (Ellen Needs)  female   
894               2                     Myles, Mr. Thomas Francis    male   
895               3                              Wirz, Mr. Albert    male   
896               3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

              Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
PassengerId                                                       
892          34.5      0      0   330911   7.8292   NaN        Q  
893          47.0      1      0   363272   7.0000   NaN        S  
894          62.0      0      0   240276   9.6875   NaN        Q  
895          27.0      0      0   315154   8.6625   NaN        S  
896          22.0      1      1  3101298  12.2875   NaN        S  

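Next, prepare the data so that the model can learn from it.
To do this, the training data and test data are temporarily combined into a single frame, so that the same conversions are applied to both.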

df = pd.concat([train, test], sort=False, axis='rows')
df

Output


    Survived    Pclass  Name    Sex Age SibSp   Parch   Ticket  Fare    Cabin   Embarked
PassengerId                                         
1   0.0 3   Braund, Mr. Owen Harris male    22.0    1   0   A/5 21171   7.2500  NaN S
2   1.0 1   Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0    1   0   PC 17599    71.2833 C85 C
3   1.0 3   Heikkinen, Miss. Laina  female  26.0    0   0   STON/O2. 3101282    7.9250  NaN S
4   1.0 1   Futrelle, Mrs. Jacques Heath (Lily May Peel)    female  35.0    1   0   113803  53.1000 C123    S
5   0.0 3   Allen, Mr. William Henry    male    35.0    0   0   373450  8.0500  NaN S
6   0.0 3   Moran, Mr. James    male    NaN 0   0   330877  8.4583  NaN Q
7   0.0 1   McCarthy, Mr. Timothy J male    54.0    0   0   17463   51.8625 E46 S
8   0.0 3   Palsson, Master. Gosta Leonard  male    2.0 3   1   349909  21.0750 NaN S
9   1.0 3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   female  27.0    0   2   347742  11.1333 NaN S
10  1.0 2   Nasser, Mrs. Nicholas (Adele Achem) female  14.0    1   0   237736  30.0708 NaN C
11  1.0 3   Sandstrom, Miss. Marguerite Rut female  4.0 1   1   PP 9549 16.7000 G6  S
12  1.0 1   Bonnell, Miss. Elizabeth    female  58.0    0   0   113783  26.5500 C103    S
13  0.0 3   Saundercock, Mr. William Henry  male    20.0    0   0   A/5. 2151   8.0500  NaN S
14  0.0 3   Andersson, Mr. Anders Johan male    39.0    1   5   347082  31.2750 NaN S
15  0.0 3   Vestrom, Miss. Hulda Amanda Adolfina    female  14.0    0   0   350406  7.8542  NaN S
16  1.0 2   Hewlett, Mrs. (Mary D Kingcome) female  55.0    0   0   248706  16.0000 NaN S
17  0.0 3   Rice, Master. Eugene    male    2.0 4   1   382652  29.1250 NaN Q
18  1.0 2   Williams, Mr. Charles Eugene    male    NaN 0   0   244373  13.0000 NaN S
19  0.0 3   Vander Planke, Mrs. Julius (Emelia Maria Vande...   female  31.0    1   0   345763  18.0000 NaN S
20  1.0 3   Masselmani, Mrs. Fatima female  NaN 0   0   2649    7.2250  NaN C
21  0.0 2   Fynney, Mr. Joseph J    male    35.0    0   0   239865  26.0000 NaN S
22  1.0 2   Beesley, Mr. Lawrence   male    34.0    0   0   248698  13.0000 D56 S
23  1.0 3   McGowan, Miss. Anna "Annie" female  15.0    0   0   330923  8.0292  NaN Q
24  1.0 1   Sloper, Mr. William Thompson    male    28.0    0   0   113788  35.5000 A6  S
25  0.0 3   Palsson, Miss. Torborg Danira   female  8.0 3   1   349909  21.0750 NaN S
26  1.0 3   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...   female  38.0    1   5   347077  31.3875 NaN S
27  0.0 3   Emir, Mr. Farred Chehab male    NaN 0   0   2631    7.2250  NaN C
28  0.0 1   Fortune, Mr. Charles Alexander  male    19.0    3   2   19950   263.0000    C23 C25 C27 S
29  1.0 3   O'Dwyer, Miss. Ellen "Nellie"   female  NaN 0   0   330959  7.8792  NaN Q
30  0.0 3   Todoroff, Mr. Lalio male    NaN 0   0   349216  7.8958  NaN S
... ... ... ... ... ... ... ... ... ... ... ...
1280    NaN 3   Canavan, Mr. Patrick    male    21.0    0   0   364858  7.7500  NaN Q
1281    NaN 3   Palsson, Master. Paul Folke male    6.0 3   1   349909  21.0750 NaN S
1282    NaN 1   Payne, Mr. Vivian Ponsonby  male    23.0    0   0   12749   93.5000 B24 S
1283    NaN 1   Lines, Mrs. Ernest H (Elizabeth Lindsey James)  female  51.0    0   1   PC 17592    39.4000 D28 S
1284    NaN 3   Abbott, Master. Eugene Joseph   male    13.0    0   2   C.A. 2673   20.2500 NaN S
1285    NaN 2   Gilbert, Mr. William    male    47.0    0   0   C.A. 30769  10.5000 NaN S
1286    NaN 3   Kink-Heilmann, Mr. Anton    male    29.0    3   1   315153  22.0250 NaN S
1287    NaN 1   Smith, Mrs. Lucien Philip (Mary Eloise Hughes)  female  18.0    1   0   13695   60.0000 C31 S
1288    NaN 3   Colbert, Mr. Patrick    male    24.0    0   0   371109  7.2500  NaN Q
1289    NaN 1   Frolicher-Stehli, Mrs. Maxmillian (Margaretha ...   female  48.0    1   1   13567   79.2000 B41 C
1290    NaN 3   Larsson-Rondberg, Mr. Edvard A  male    22.0    0   0   347065  7.7750  NaN S
1291    NaN 3   Conlon, Mr. Thomas Henry    male    31.0    0   0   21332   7.7333  NaN Q
1292    NaN 1   Bonnell, Miss. Caroline female  30.0    0   0   36928   164.8667    C7  S
1293    NaN 2   Gale, Mr. Harry male    38.0    1   0   28664   21.0000 NaN S
1294    NaN 1   Gibson, Miss. Dorothy Winifred  female  22.0    0   1   112378  59.4000 NaN C
1295    NaN 1   Carrau, Mr. Jose Pedro  male    17.0    0   0   113059  47.1000 NaN S
1296    NaN 1   Frauenthal, Mr. Isaac Gerald    male    43.0    1   0   17765   27.7208 D40 C
1297    NaN 2   Nourney, Mr. Alfred (Baron von Drachstedt")"    male    20.0    0   0   SC/PARIS 2166   13.8625 D38 C
1298    NaN 2   Ware, Mr. William Jeffery   male    23.0    1   0   28666   10.5000 NaN S
1299    NaN 1   Widener, Mr. George Dunton  male    50.0    1   1   113503  211.5000    C80 C
1300    NaN 3   Riordan, Miss. Johanna Hannah"" female  NaN 0   0   334915  7.7208  NaN Q
1301    NaN 3   Peacock, Miss. Treasteall   female  3.0 1   1   SOTON/O.Q. 3101315  13.7750 NaN S
1302    NaN 3   Naughton, Miss. Hannah  female  NaN 0   0   365237  7.7500  NaN Q
1303    NaN 1   Minahan, Mrs. William Edward (Lillian E Thorpe) female  37.0    1   0   19928   90.0000 C78 Q
1304    NaN 3   Henriksson, Miss. Jenny Lovisa  female  28.0    0   0   347086  7.7750  NaN S
1305    NaN 3   Spector, Mr. Woolf  male    NaN 0   0   A.5. 3236   8.0500  NaN S
1306    NaN 1   Oliva y Ocana, Dona. Fermina    female  39.0    0   0   PC 17758    108.9000    C105    C
1307    NaN 3   Saether, Mr. Simon Sivertsen    male    38.5    0   0   SOTON/O.Q. 3101262  7.2500  NaN S
1308    NaN 3   Ware, Mr. Frederick male    NaN 0   0   359309  8.0500  NaN S
1309    NaN 3   Peter, Master. Michael J    male    NaN 1   1   2668    22.3583 NaN C
1309 rows × 11 columns
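
The effect of the concatenation on the Survived column can be seen on a small made-up example. This is only a sketch: the two toy frames and their values are assumptions standing in for train.csv and test.csv.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are made up for illustration)
toy_train = pd.DataFrame(
    {"Survived": [0, 1], "Age": [22.0, None]},
    index=pd.Index([1, 2], name="PassengerId"),
)
toy_test = pd.DataFrame(
    {"Age": [34.5, 47.0]},
    index=pd.Index([3, 4], name="PassengerId"),
)

# Stack the rows; a column missing from one frame is filled with NaN
toy_df = pd.concat([toy_train, toy_test], sort=False, axis="rows")

# The test rows have no label, so Survived holds NaN there
# and the integer labels are upcast to float
n_unlabeled = int(toy_df["Survived"].isnull().sum())
print(toy_df)
print(n_unlabeled)
```

This is why Survived is displayed as 0.0 / 1.0 in the combined frame and is NaN for the test passengers.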

Check the characteristics of the data with the describe() function.

# Check for outliers
df.describe()

Output


    Survived    Pclass  Age SibSp   Parch   Fare
count   891.000000  1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean    0.383838    2.294882    29.881138   0.498854    0.385027    33.295479
std 0.486592    0.837836    14.413493   1.041658    0.865560    51.758668
min 0.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25% 0.000000    2.000000    21.000000   0.000000    0.000000    7.895800
50% 0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75% 1.000000    3.000000    39.000000   1.000000    0.000000    31.275000
max 1.000000    3.000000    80.000000   8.000000    9.000000    512.329200
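
describe() only hints at outliers (for example, the Fare maximum of 512.33 against a 75% quantile of 31.28). One common rule of thumb, which the article itself does not apply, is to flag values more than 1.5 times the interquartile range above the third quartile. A minimal sketch on made-up fares:

```python
import pandas as pd

# Made-up fares for illustration; only the rough scale mirrors the table above
fare = pd.Series([7.25, 8.05, 14.45, 31.0, 512.33], name="Fare")

q1 = fare.quantile(0.25)
q3 = fare.quantile(0.75)
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr
outliers = fare[fare > upper_fence]
print(outliers)
```

Whether such values should be clipped, transformed, or left alone depends on the model; tree-based models are usually not very sensitive to them.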

Check for missing values.


# Check for missing values
df.isnull().sum()

Output


Survived     418
Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

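Four feature columns contain missing values: Age (263), Fare (1), Cabin (1,014), and Embarked (2). The 418 missing Survived values are simply the test rows, which have no label.
We need to decide how to handle each of these missing values.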
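
When deciding how to treat each column, the share of missing rows is often easier to reason about than the raw count. A small sketch on made-up data (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; sum() counts True, mean() gives the ratio
summary = pd.DataFrame({
    "count": toy.isnull().sum(),
    "ratio": toy.isnull().mean(),
})
print(summary)
```

A column that is mostly empty, like Cabin in the real data, may call for a different treatment than one with a handful of gaps.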

Next, check whether every column already has a (numeric) type that the model can read.


# Check the dtypes
df.dtypes

Output


Survived    float64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object
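
The non-numeric columns in a listing like the one above can also be picked out programmatically. A minimal sketch, assuming a toy frame in place of `df`:

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.2833],
})

# Keep only the columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```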

The five object-typed columns {Name, Sex, Ticket, Cabin, Embarked} are candidates for conversion to numeric values.
Start with the conversion of Name.


# Convert Name
## Check the values
df['Name'].value_counts()

Output


Connolly, Miss. Kate                                                                  2
Kelly, Mr. James                                                                      2
Saad, Mr. Amin                                                                        1
Lievens, Mr. Rene Aime                                                                1
Turja, Miss. Anna Sofia                                                               1
Collett, Mr. Sidney C Stuart                                                          1
Hassan, Mr. Houssein G N                                                              1
Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)    1
Pavlovic, Mr. Stefo                                                                   1
Swift, Mrs. Frederick Joel (Margaret Welles Barron)                                   1
Lindahl, Miss. Agda Thorilda Viktoria                                                 1
Carter, Mr. William Ernest                                                            1
Ponesell, Mr. Martin                                                                  1
Mellinger, Miss. Madeleine Violet                                                     1
Ryerson, Mr. Arthur Larned                                                            1
Kenyon, Mr. Frederick R                                                               1
O'Connor, Mr. Patrick                                                                 1
Daly, Mr. Peter Denis                                                                 1
Gracie, Col. Archibald IV                                                             1
Samaan, Mr. Hanna                                                                     1
Johnson, Miss. Eleanor Ileen                                                          1
Walker, Mr. William Anderson                                                          1
Torber, Mr. Ernst William                                                             1
Vander Planke, Mr. Leo Edmondus                                                       1
Sage, Mr. John George                                                                 1
Mullens, Miss. Katherine "Katie"                                                      1
Mayne, Mlle. Berthe Antonine ("Mrs de Villiers")                                      1
Olsen, Mr. Henry Margido                                                              1
Behr, Mr. Karl Howell                                                                 1
Braund, Mr. Owen Harris                                                               1
                                                                                     ..
Mack, Mrs. (Mary)                                                                     1
Slayter, Miss. Hilda Mary                                                             1
McCrae, Mr. Arthur Gordon                                                             1
Goldsmith, Mrs. Frank John (Emily Alice Brown)                                        1
Blank, Mr. Henry                                                                      1
Dean, Miss. Elizabeth Gladys Millvina""                                               1
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)                                         1
Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)                               1
O'Dwyer, Miss. Ellen "Nellie"                                                         1
Sage, Miss. Stella Anna                                                               1
Kallio, Mr. Nikolai Erland                                                            1
Demetri, Mr. Marinko                                                                  1
Boulos, Mrs. Joseph (Sultana)                                                         1
Landergren, Miss. Aurora Adelia                                                       1
Braund, Mr. Lewis Richard                                                             1
Wilkes, Mrs. James (Ellen Needs)                                                      1
Clark, Mrs. Walter Miller (Virginia McDowell)                                         1
Lundstrom, Mr. Thure Edvin                                                            1
Dorking, Mr. Edward Arthur                                                            1
Oliva y Ocana, Dona. Fermina                                                          1
Mernagh, Mr. Robert                                                                   1
Shutes, Miss. Elizabeth W                                                             1
Anderson, Mr. Harry                                                                   1
McGowan, Miss. Katherine                                                              1
Eitemiller, Mr. George Floyd                                                          1
Van Impe, Mr. Jean Baptiste                                                           1
Palsson, Miss. Stina Viola                                                            1
Smith, Mrs. Lucien Philip (Mary Eloise Hughes)                                        1
Bourke, Miss. Mary                                                                    1
Wick, Miss. Mary Natalie                                                              1
Name: Name, Length: 1307, dtype: int64

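Almost every value is unique (1,307 distinct names across 1,309 rows), so the raw names are not usable as a feature.
Instead, group them by title, starting with the common ones: Mr., Miss., Mrs., and Master. (Ms. is handled later together with the rarer titles.)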


print('Mr.:{}'.format(df['Name'].str.contains('Mr. ').sum()))
print('Miss.:{}'.format(df['Name'].str.contains('Miss. ').sum()))
print('Mrs.:{}'.format(df['Name'].str.contains('Mrs. ').sum()))
print('Master.:{}'.format(df['Name'].str.contains('Master. ').sum()))

Output


Mr.:761
Miss.:260
Mrs.:197
Master.:61

These titles do not add up to the 1,309 rows of the combined data,
so apply the replacement first and then look at the names that remain.


df.loc[df['Name'].str.contains('Mr. ') == True, 'Name'] = 0
df.loc[df['Name'].str.contains('Miss. ') == True, 'Name'] = 1
df.loc[df['Name'].str.contains('Mrs. ') == True, 'Name'] = 2
df.loc[df['Name'].str.contains('Master. ') == True, 'Name'] = 3

df['Name'].value_counts()

Output


0                                                           761
1                                                           258
2                                                           197
3                                                            61
Lahtinen, Rev. William                                        1
Bateman, Rev. Robert James                                    1
Uruchurtu, Don. Manuel E                                      1
Weir, Col. John                                               1
Moraweck, Dr. Ernest                                          1
Leader, Dr. Alice (Farnham)                                   1
Dodge, Dr. Washington                                         1
Stahelin-Maeglin, Dr. Max                                     1
Minahan, Dr. William Edward                                   1
Reynaldo, Ms. Encarnacion                                     1
Harper, Rev. John                                             1
Kirkland, Rev. Charles Leonard                                1
Montvila, Rev. Juozas                                         1
Sagesser, Mlle. Emma                                          1
Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)      1
Oliva y Ocana, Dona. Fermina                                  1
Carter, Rev. Ernest Courtenay                                 1
Peruschitz, Rev. Joseph Maria                                 1
Brewe, Dr. Arthur Jackson                                     1
Reuchlin, Jonkheer. John George                               1
Aubart, Mme. Leontine Pauline                                 1
Pain, Dr. Alfred                                              1
Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")                  1
Astor, Col. John Jacob                                        1
Peuchen, Major. Arthur Godfrey                                1
Crosby, Capt. Edward Gifford                                  1
Gracie, Col. Archibald IV                                     1
Butt, Major. Archibald Willingham                             1
O'Donoghue, Ms. Bridget                                       1
Frauenthal, Dr. Henry William                                 1
Byles, Rev. Thomas Roussel Davids                             1
Simonius-Blumer, Col. Oberst Alfons                           1
Name: Name, dtype: int64

Count the remaining titles and then convert them.

print('Dr.:{}'.format(df['Name'].str.contains('Dr. ').sum()))
print('Rev.:{}'.format(df['Name'].str.contains('Rev. ').sum()))
print('Col.:{}'.format(df['Name'].str.contains('Col. ').sum()))
print('Major.:{}'.format(df['Name'].str.contains('Major. ').sum()))
print('Jonkheer.:{}'.format(df['Name'].str.contains('Jonkheer. ').sum()))
print('Mme.:{}'.format(df['Name'].str.contains('Mme. ').sum()))
print('Capt.:{}'.format(df['Name'].str.contains('Capt. ').sum()))
print('Ms.:{}'.format(df['Name'].str.contains('Ms. ').sum()))
print('Mlle.:{}'.format(df['Name'].str.contains('Mlle. ').sum()))
print('Don.:{}'.format(df['Name'].str.contains('Don. ').sum()))
print('Countess.:{}'.format(df['Name'].str.contains('Countess. ').sum()))
print('Sir.:{}'.format(df['Name'].str.contains('Sir. ').sum()))

Output


Dr.:8
Rev.:8
Col.:4
Major.:2
Jonkheer.:1
Mme.:1
Capt.:1
Ms.:2
Mlle.:1
Don.:1
Countess.:1
Sir.:1

Convert each of these titles to a number.

df.loc[df['Name'].str.contains('Dr. ') == True, 'Name'] = 3
df.loc[df['Name'].str.contains('Rev. ') == True, 'Name'] = 4
df.loc[df['Name'].str.contains('Col. ') == True, 'Name'] = 5
df.loc[df['Name'].str.contains('Major. ') == True, 'Name'] = 6
df.loc[df['Name'].str.contains('Jonkheer. ') == True, 'Name'] = 7
df.loc[df['Name'].str.contains('Mme. ') == True, 'Name'] = 8
df.loc[df['Name'].str.contains('Capt. ') == True, 'Name'] = 9
df.loc[df['Name'].str.contains('Ms. ') == True, 'Name'] = 10
df.loc[df['Name'].str.contains('Mlle. ') == True, 'Name'] = 11
df.loc[df['Name'].str.contains('Don. ') == True, 'Name'] = 12
df.loc[df['Name'].str.contains('Countess. ') == True, 'Name'] = 13
df.loc[df['Name'].str.contains('Sir. ') == True, 'Name'] = 14
df.loc[df['Name'].str.contains('Dona. ') == True, 'Name'] = 15

df['Name'].value_counts()

Output


0     761
1     258
2     197
3      69
4       8
5       4
10      2
6       2
15      1
14      1
13      1
12      1
11      1
9       1
8       1
7       1
Name: Name, dtype: int64
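
As an alternative to one `str.contains` call per title, the title can be pulled out with a single regular expression and then mapped to codes. This is a sketch, not the article's method: the regex, the `OTHER` code, and the decision to lump rare titles together are all assumptions made for illustration. The codes 0 to 3 follow the ones used above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Title = the word just before the first period after the comma
# (an optional leading "the" is skipped, as in "the Countess.")
titles = names.str.extract(r",\s*(?:the\s+)?([A-Za-z]+)\.", expand=False)

title_codes = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
OTHER = 99  # hypothetical code for every title not listed above

# Unmapped titles become NaN in map(), then fall back to OTHER
encoded = titles.map(title_codes).fillna(OTHER).astype(int)
print(titles.tolist())
print(encoded.tolist())
```

Because each name yields exactly one title, this avoids both the wildcard matches and the dependence on the order of the replacement lines.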

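The Name conversion is complete. Note that the code above assigns 3 to both Master. and Dr., which is why code 3 has 69 rows (61 + 8); give Dr. its own code if the two should be kept apart.
Next, convert Sex to a number.

# Convert Sex
## Check the values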
df['Sex'].value_counts()

Output

male      843
female    466
Name: Sex, dtype: int64

There are only two values, male and female, so convert them to {male:0, female:1}.

df['Sex'] = df['Sex'].replace({'male':0, 'female':1})

df['Sex'].value_counts()

Output

0    843
1    466
Name: Sex, dtype: int64
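
The same encoding can be written with `Series.map`, shown here as an alternative sketch on made-up values. Depending on the pandas version, `replace` on an object column may emit a deprecation warning about downcasting, which `map` avoids; `map` also turns any value missing from the dictionary into NaN, so unexpected categories are easy to spot.

```python
import pandas as pd

# Made-up values for illustration
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Each value is looked up in the dictionary; unknown values would become NaN
sex_codes = sex.map({"male": 0, "female": 1})
print(sex_codes.tolist())
```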

The Sex conversion is complete.
Next, look at converting Ticket to a number.

# Convert Ticket
## Check the values
df['Ticket'].value_counts()

Output

CA. 2343              11
1601                   8
CA 2144                8
347077                 7
3101295                7
347082                 7
PC 17608               7
S.O.C. 14879           7
113781                 6
382652                 6
19950                  6
347088                 6
W./C. 6608             5
16966                  5
349909                 5
113503                 5
220845                 5
4133                   5
PC 17757               5
230136                 4
113760                 4
24160                  4
C.A. 33112             4
W./C. 6607             4
36928                  4
PC 17760               4
LINE                   4
C.A. 34651             4
2666                   4
C.A. 2315              4
                      ..
248744                 1
365235                 1
349228                 1
345765                 1
C 7077                 1
315090                 1
364511                 1
PC 17609               1
C.A. 33111             1
345775                 1
350409                 1
SOTON/O.Q. 3101308     1
367229                 1
PC 17596               1
35851                  1
W/C 14208              1
347464                 1
SOTON/O.Q. 3101306     1
SC/PARIS 2148          1
363272                 1
347064                 1
14311                  1
312992                 1
349215                 1
250650                 1
343120                 1
SC/A.3 2861            1
237393                 1
349236                 1
C.A. 15185             1
Name: Ticket, Length: 929, dtype: int64

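The values have a complicated, inconsistent format, so for now the plan is to drop this column.
Next, handle the missing values in Cabin and convert it to a number.

# Process Cabin
## Check the values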
df['Cabin'].value_counts()

Output

C23 C25 C27        6
B57 B59 B63 B66    5
G6                 5
C22 C26            4
F33                4
C78                4
F2                 4
F4                 4
D                  4
B96 B98            4
A34                3
B58 B60            3
B51 B53 B55        3
C101               3
E34                3
E101               3
E67                2
D17                2
C65                2
C80                2
C46                2
D37                2
C92                2
E25                2
D10 D12            2
D26                2
B77                2
C106               2
E46                2
C68                2
                  ..
C111               1
F E69              1
E52                1
B52 B54 B56        1
E58                1
B79                1
B37                1
C95                1
B102               1
C87                1
E38                1
C50                1
C148               1
B10                1
C49                1
B61                1
A36                1
D34                1
B39                1
A32                1
E60                1
B101               1
C70                1
C82                1
E39 E41            1
E10                1
A21                1
E63                1
A18                1
D56                1
Name: Cabin, Length: 186, dtype: int64

Group the values by the letter that appears in the Cabin value.

print('A:{}'.format(df['Cabin'].str.contains('A').sum()))
print('B:{}'.format(df['Cabin'].str.contains('B').sum()))
print('C:{}'.format(df['Cabin'].str.contains('C').sum()))
print('D:{}'.format(df['Cabin'].str.contains('D').sum()))
print('E:{}'.format(df['Cabin'].str.contains('E').sum()))
print('F:{}'.format(df['Cabin'].str.contains('F').sum()))
print('G:{}'.format(df['Cabin'].str.contains('G').sum()))
print('T:{}'.format(df['Cabin'].str.contains('T').sum()))

Output

A:22
B:65
C:94
D:46
E:44
F:21
G:9
T:1

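For now, fill the missing values with 0 and number the remaining letters from 1 in descending order of frequency. Because the replacements run one after another, a value that contains several letters (for example `F E69`) takes the code of whichever line matches it first, so the counts for F and G below are lower than the counts above.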

df['Cabin'] = df['Cabin'].fillna(0)
df.loc[df['Cabin'].str.contains('C') == True, 'Cabin'] = 1
df.loc[df['Cabin'].str.contains('B') == True, 'Cabin'] = 2
df.loc[df['Cabin'].str.contains('D') == True, 'Cabin'] = 3
df.loc[df['Cabin'].str.contains('E') == True, 'Cabin'] = 4
df.loc[df['Cabin'].str.contains('A') == True, 'Cabin'] = 5
df.loc[df['Cabin'].str.contains('F') == True, 'Cabin'] = 6
df.loc[df['Cabin'].str.contains('G') == True, 'Cabin'] = 7
df.loc[df['Cabin'].str.contains('T') == True, 'Cabin'] = 8

df['Cabin'].value_counts()

Output

0    1014
1      94
2      65
3      46
4      44
5      22
6      18
7       5
8       1
Name: Cabin, dtype: int64
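
A variation that avoids the multi-letter ambiguity is to take only the first character of each Cabin value as the deck. This is a sketch under assumptions, not the article's code: the placeholder `"U"` for unknown cabins is hypothetical, and the numeric codes simply reuse the ordering above.

```python
import pandas as pd

# A few Cabin values of the kinds seen above, plus one missing value
cabin = pd.Series(["C85", None, "F E69", "B57 B59 B63 B66", "T"], name="Cabin")

# First character = deck letter; missing cabins get the placeholder "U"
deck = cabin.str[0].fillna("U")

deck_codes = {"U": 0, "C": 1, "B": 2, "D": 3, "E": 4, "A": 5, "F": 6, "G": 7, "T": 8}
encoded = deck.map(deck_codes)
print(deck.tolist())
print(encoded.tolist())
```

With this rule `F E69` is assigned to deck F, whereas the sequential `str.contains` replacement above assigns it to E because the E line runs first.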

The Cabin conversion is complete.
Next, handle the missing values in Embarked and convert it to a number.

# Process Embarked
## Check the values
df['Embarked'].value_counts()

Output

S    914
C    270
Q    123
Name: Embarked, dtype: int64

Only 2 values are missing, so fill them with S, the most frequent value.

df['Embarked'] = df['Embarked'].fillna('S')

df['Embarked'].value_counts()

Output


S    916
C    270
Q    123
Name: Embarked, dtype: int64

Convert the values to {S:0, C:1, Q:2}.

df.loc[df['Embarked'] == 'S', 'Embarked'] = 0
df.loc[df['Embarked'] == 'C', 'Embarked'] = 1
df.loc[df['Embarked'] == 'Q', 'Embarked'] = 2

df['Embarked'].value_counts()

Output

0    916
1    270
2    123
Name: Embarked, dtype: int64
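
The two Embarked steps (fill with the most frequent value, then encode) can be condensed as follows. This is a sketch on made-up values; looking up the most frequent value with `mode()` instead of hard-coding `'S'` is a choice made here for illustration.

```python
import pandas as pd

# Made-up values for illustration, with one missing entry
embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() returns the most frequent value(s); take the first one
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

# Encode with the same mapping as in the text
codes = filled.map({"S": 0, "C": 1, "Q": 2})
print(most_common)
print(codes.tolist())
```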

The Embarked processing is complete.
Next, handle the missing values in Age.
For now, fill them with the mean.

df['Age'] = df['Age'].fillna(df['Age'].mean())

df.isnull().sum()

Output

Survived    418
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin         0
Embarked      0
dtype: int64
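
Mean and median imputation differ only in the statistic passed to `fillna`. A short sketch on made-up ages shows how the two fills can diverge when the distribution is skewed:

```python
import pandas as pd

# Made-up ages for illustration, with one missing entry
age = pd.Series([22.0, 38.0, None, 2.0, 80.0], name="Age")

# Both statistics ignore NaN by default
mean_filled = age.fillna(age.mean())
median_filled = age.fillna(age.median())

print(mean_filled[2], median_filled[2])
```

In the real data the two are close (mean 29.88, median 28.0 according to the describe() output), so the choice matters little here.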

Fare has only 1 missing value, so fill it with 0 for now, then split the data back into training and test sets.

df['Fare'] = df['Fare'].fillna(0)
train = df.loc[:891]
train

Output


    Survived    Pclass  Name    Sex Age SibSp   Parch   Ticket  Fare    Cabin   Embarked
PassengerId                                         
1   0.0 3   0   0   22.000000   1   0   A/5 21171   7.2500  0   0
2   1.0 1   2   1   38.000000   1   0   PC 17599    71.2833 1   1
3   1.0 3   1   1   26.000000   0   0   STON/O2. 3101282    7.9250  0   0
4   1.0 1   2   1   35.000000   1   0   113803  53.1000 1   0
5   0.0 3   0   0   35.000000   0   0   373450  8.0500  0   0
6   0.0 3   0   0   29.881138   0   0   330877  8.4583  0   2
7   0.0 1   0   0   54.000000   0   0   17463   51.8625 4   0
8   0.0 3   3   0   2.000000    3   1   349909  21.0750 0   0
9   1.0 3   2   1   27.000000   0   2   347742  11.1333 0   0
10  1.0 2   2   1   14.000000   1   0   237736  30.0708 0   1
11  1.0 3   1   1   4.000000    1   1   PP 9549 16.7000 7   0
12  1.0 1   1   1   58.000000   0   0   113783  26.5500 1   0
13  0.0 3   0   0   20.000000   0   0   A/5. 2151   8.0500  0   0
14  0.0 3   0   0   39.000000   1   5   347082  31.2750 0   0
15  0.0 3   1   1   14.000000   0   0   350406  7.8542  0   0
16  1.0 2   2   1   55.000000   0   0   248706  16.0000 0   0
17  0.0 3   3   0   2.000000    4   1   382652  29.1250 0   2
18  1.0 2   0   0   29.881138   0   0   244373  13.0000 0   0
19  0.0 3   2   1   31.000000   1   0   345763  18.0000 0   0
20  1.0 3   2   1   29.881138   0   0   2649    7.2250  0   1
21  0.0 2   0   0   35.000000   0   0   239865  26.0000 0   0
22  1.0 2   0   0   34.000000   0   0   248698  13.0000 3   0
23  1.0 3   1   1   15.000000   0   0   330923  8.0292  0   2
24  1.0 1   0   0   28.000000   0   0   113788  35.5000 5   0
25  0.0 3   1   1   8.000000    3   1   349909  21.0750 0   0
26  1.0 3   2   1   38.000000   1   5   347077  31.3875 0   0
27  0.0 3   0   0   29.881138   0   0   2631    7.2250  0   1
28  0.0 1   0   0   19.000000   3   2   19950   263.0000    1   0
29  1.0 3   1   1   29.881138   0   0   330959  7.8792  0   2
30  0.0 3   0   0   29.881138   0   0   349216  7.8958  0   0
... ... ... ... ... ... ... ... ... ... ... ...
862 0.0 2   0   0   21.000000   1   0   28134   11.5000 0   0
863 1.0 1   2   1   48.000000   0   0   17466   25.9292 3   0
864 0.0 3   1   1   29.881138   8   2   CA. 2343    69.5500 0   0
865 0.0 2   0   0   24.000000   0   0   233866  13.0000 0   0
866 1.0 2   2   1   42.000000   0   0   236852  13.0000 0   0
867 1.0 2   1   1   27.000000   1   0   SC/PARIS 2149   13.8583 0   1
868 0.0 1   0   0   31.000000   0   0   PC 17590    50.4958 5   0
869 0.0 3   0   0   29.881138   0   0   345777  9.5000  0   0
870 1.0 3   3   0   4.000000    1   1   347742  11.1333 0   0
871 0.0 3   0   0   26.000000   0   0   349248  7.8958  0   0
872 1.0 1   2   1   47.000000   1   1   11751   52.5542 3   0
873 0.0 1   0   0   33.000000   0   0   695 5.0000  2   0
874 0.0 3   0   0   47.000000   0   0   345765  9.0000  0   0
875 1.0 2   2   1   28.000000   1   0   P/PP 3381   24.0000 0   1
876 1.0 3   1   1   15.000000   0   0   2667    7.2250  0   1
877 0.0 3   0   0   20.000000   0   0   7534    9.8458  0   0
878 0.0 3   0   0   19.000000   0   0   349212  7.8958  0   0
879 0.0 3   0   0   29.881138   0   0   349217  7.8958  0   0
880 1.0 1   2   1   56.000000   0   1   11767   83.1583 1   1
881 1.0 2   2   1   25.000000   0   1   230433  26.0000 0   0
882 0.0 3   0   0   33.000000   0   0   349257  7.8958  0   0
883 0.0 3   1   1   22.000000   0   0   7552    10.5167 0   0
884 0.0 2   0   0   28.000000   0   0   C.A./SOTON 34068    10.5000 0   0
885 0.0 3   0   0   25.000000   0   0   SOTON/OQ 392076 7.0500  0   0
886 0.0 3   2   1   39.000000   0   5   382652  29.1250 0   2
887 0.0 2   4   0   27.000000   0   0   211536  13.0000 0   0
888 1.0 1   1   1   19.000000   0   0   112053  30.0000 2   0
889 0.0 3   1   1   29.881138   1   2   W./C. 6607  23.4500 0   0
890 1.0 1   0   0   26.000000   0   0   111369  30.0000 1   1
891 0.0 3   0   0   32.000000   0   0   370376  7.7500  0   2
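
The split relies on `.loc` slicing by index label, which includes the end label (unlike positional slicing). A sketch on a made-up frame, with an alternative that splits on whether the label is present instead of on a hard-coded PassengerId:

```python
import pandas as pd

# Made-up rows around the train/test boundary for illustration
toy_df = pd.DataFrame(
    {"Survived": [0.0, 1.0, None, None], "Fare": [7.25, 71.28, 7.83, 7.0]},
    index=pd.Index([890, 891, 892, 893], name="PassengerId"),
)

# Label-based slices: 891 is included in the first, 892 starts the second
train_part = toy_df.loc[:891]
test_part = toy_df.loc[892:].drop(columns=["Survived"])

# Alternative: rows with a label are training rows
train_by_label = toy_df[toy_df["Survived"].notnull()]

print(train_part.index.tolist(), test_part.index.tolist())
```

The label-based variant keeps working if the index is ever reset or reordered.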

X_test = df.loc[892:]
X_test = X_test.drop(['Survived', 'Ticket'], axis='columns')
X_test

Output


    Pclass  Name    Sex Age SibSp   Parch   Fare    Cabin   Embarked
PassengerId                                 
892 3   0   0   34.500000   0   0   7.8292  0   2
893 3   2   1   47.000000   1   0   7.0000  0   0
894 2   0   0   62.000000   0   0   9.6875  0   2
895 3   0   0   27.000000   0   0   8.6625  0   0
896 3   2   1   22.000000   1   1   12.2875 0   0
897 3   0   0   14.000000   0   0   9.2250  0   0
898 3   1   1   30.000000   0   0   7.6292  0   2
899 2   0   0   26.000000   1   1   29.0000 0   0
900 3   2   1   18.000000   0   0   7.2292  0   1
901 3   0   0   21.000000   2   0   24.1500 0   0
902 3   0   0   29.881138   0   0   7.8958  0   0
903 1   0   0   46.000000   0   0   26.0000 0   0
904 1   2   1   23.000000   1   0   82.2667 2   0
905 2   0   0   63.000000   1   0   26.0000 0   0
906 1   2   1   47.000000   1   0   61.1750 4   0
907 2   2   1   24.000000   1   0   27.7208 0   1
908 2   0   0   35.000000   0   0   12.3500 0   2
909 3   0   0   21.000000   0   0   7.2250  0   1
910 3   1   1   27.000000   1   0   7.9250  0   0
911 3   2   1   45.000000   0   0   7.2250  0   1
912 1   0   0   55.000000   1   0   59.4000 0   1
913 3   3   0   9.000000    0   1   3.1708  0   0
914 1   2   1   29.881138   0   0   31.6833 0   0
915 1   0   0   21.000000   0   1   61.3792 0   1
916 1   2   1   48.000000   1   3   262.3750    2   1
917 3   0   0   50.000000   1   0   14.5000 0   0
918 1   1   1   22.000000   0   1   61.9792 2   1
919 3   0   0   22.500000   0   0   7.2250  0   1
920 1   0   0   41.000000   0   0   30.5000 5   0
921 3   0   0   29.881138   2   0   21.6792 0   1
... ... ... ... ... ... ... ... ... ...
1280    3   0   0   21.000000   0   0   7.7500  0   2
1281    3   3   0   6.000000    3   1   21.0750 0   0
1282    1   0   0   23.000000   0   0   93.5000 2   0
1283    1   2   1   51.000000   0   1   39.4000 3   0
1284    3   3   0   13.000000   0   2   20.2500 0   0
1285    2   0   0   47.000000   0   0   10.5000 0   0
1286    3   0   0   29.000000   3   1   22.0250 0   0
1287    1   2   1   18.000000   1   0   60.0000 1   0
1288    3   0   0   24.000000   0   0   7.2500  0   2
1289    1   2   1   48.000000   1   1   79.2000 2   1
1290    3   0   0   22.000000   0   0   7.7750  0   0
1291    3   0   0   31.000000   0   0   7.7333  0   2
1292    1   1   1   30.000000   0   0   164.8667    1   0
1293    2   0   0   38.000000   1   0   21.0000 0   0
1294    1   1   1   22.000000   0   1   59.4000 0   1
1295    1   0   0   17.000000   0   0   47.1000 0   0
1296    1   0   0   43.000000   1   0   27.7208 3   1
1297    2   0   0   20.000000   0   0   13.8625 3   1
1298    2   0   0   23.000000   1   0   10.5000 0   0
1299    1   0   0   50.000000   1   1   211.5000    1   1
1300    3   1   1   29.881138   0   0   7.7208  0   2
1301    3   1   1   3.000000    1   1   13.7750 0   0
1302    3   1   1   29.881138   0   0   7.7500  0   2
1303    1   2   1   37.000000   1   0   90.0000 1   2
1304    3   1   1   28.000000   0   0   7.7750  0   0
1305    3   0   0   29.881138   0   0   8.0500  0   0
1306    1   15  1   39.000000   0   0   108.9000    1   1
1307    3   0   0   38.500000   0   0   7.2500  0   0
1308    3   0   0   29.881138   0   0   8.0500  0   0
1309    3   3   0   29.881138   1   1   22.3583 0   1
418 rows × 9 columns

Split the training data into explanatory variables (features) and the target variable.

X_train = train.drop(['Survived', 'Ticket'], axis='columns')
y_train = train['Survived']

print(X_train.head())
print(y_train.head())

Output

            Pclass  Name  Sex   Age  SibSp  Parch     Fare  Cabin  Embarked
PassengerId                                                                 
1                 3     0    0  22.0      1      0   7.2500      0         0
2                 1     2    1  38.0      1      0  71.2833      1         1
3                 3     1    1  26.0      0      0   7.9250      0         0
4                 1     2    1  35.0      1      0  53.1000      1         0
5                 3     0    0  35.0      0      0   8.0500      0         0
PassengerId
1    0.0
2    1.0
3    1.0
4    1.0
5    0.0
Name: Survived, dtype: float64

Define the model and fit it on the training data.
This time we use RandomForestClassifier.

model = RandomForestClassifier(n_estimators=200, random_state=71)
model.fit(X_train, y_train)

Output

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=71, verbose=0, warm_start=False)
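
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which columns it relied on. The following is a minimal sketch on a small synthetic table, not the article's data: the column names mirror some of those in `X_train`, but the values and the names `toy_X`, `toy_y`, and `toy_model` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train (all values are invented).
rng = np.random.RandomState(71)
toy_X = pd.DataFrame({
    "Pclass": rng.randint(1, 4, size=200),
    "Sex": rng.randint(0, 2, size=200),
    "Age": rng.uniform(1, 70, size=200),
    "Fare": rng.uniform(5, 100, size=200),
})
# The toy target depends on Sex only, so Sex should rank first.
toy_y = toy_X["Sex"]

toy_model = RandomForestClassifier(n_estimators=50, random_state=71)
toy_model.fit(toy_X, toy_y)

# Pair each importance with its column name and sort in descending order.
importances = pd.Series(toy_model.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the real `model` and `X_train`, the same two lines at the end would show which of the preprocessed columns carry the most weight, although impurity-based importances are only a rough guide.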

Have the trained model predict on the test data.

y_pred = model.predict(X_test)

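Load the labels to compare against. Note that gender_submission.csv is Kaggle's sample submission, which predicts survival for every female passenger and death for every male passenger; it is not the true outcome for the test set.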

y_true = pd.read_csv('./data/gender_submission.csv', index_col='PassengerId')
y_true

Output


    Survived
PassengerId 
892 0
893 1
894 0
895 0
896 1
897 0
898 1
899 0
900 1
901 0
902 0
903 0
904 1
905 0
906 1
907 1
908 0
909 0
910 1
911 1
912 0
913 0
914 1
915 0
916 1
917 0
918 1
919 0
920 0
921 0
... ...
1280    0
1281    0
1282    0
1283    1
1284    0
1285    0
1286    0
1287    1
1288    0
1289    1
1290    0
1291    0
1292    1
1293    0
1294    1
1295    0
1296    0
1297    0
1298    0
1299    0
1300    1
1301    1
1302    1
1303    1
1304    1
1305    0
1306    1
1307    0
1308    0
1309    0
418 rows × 1 columns

Use accuracy_score, confusion_matrix, and classification_report to check the evaluation.

print('Accuracy: {}'.format(accuracy_score(y_true=y_true , y_pred=y_pred))+ '\n')

print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_true , y_pred=y_pred))+ '\n')

print('Classification report:\n{}'.format(classification_report(y_true=y_true , y_pred=y_pred)))

Output


Accuracy: 0.8229665071770335

Confusion matrix:
[[225  41]
 [ 33 119]]

Classification report:
              precision    recall  f1-score   support

           0       0.87      0.85      0.86       266
           1       0.74      0.78      0.76       152

   micro avg       0.82      0.82      0.82       418
   macro avg       0.81      0.81      0.81       418
weighted avg       0.83      0.82      0.82       418
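
The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps when reading the two outputs together. The sketch below uses only the four counts from that matrix; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above:
# rows are the reference labels (0, 1), columns are the predictions (0, 1).
cm = [[225, 41],
      [33, 119]]

tn, fp = cm[0]   # reference 0: predicted 0, predicted 1
fn, tp = cm[1]   # reference 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total        # share of rows on the diagonal
precision_1 = tp / (tp + fp)        # of those predicted 1, how many are 1
recall_1 = tp / (tp + fn)           # of those labelled 1, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy    {accuracy:.2f}")
print(f"class 0     precision {precision_0:.2f}  recall {recall_0:.2f}")
print(f"class 1     precision {precision_1:.2f}  recall {recall_1:.2f}")
```

The row sums, 266 and 152, are the `support` values in the report.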

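The accuracy came out above 80%, measured against the labels in gender_submission.csv.
To have Kaggle score the predictions, create a CSV file that includes PassengerId and submit it.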
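
A minimal sketch of building such a file with pandas follows. The three IDs and predictions are invented stand-ins for `X_test.index` and `y_pred`, and the file name `submission.csv` is an assumption. Because `y_train` is float64, `predict` returns floats, so the sketch casts to integers to match the 0/1 format of the sample submission.

```python
import numpy as np
import pandas as pd

# Stand-ins for X_test.index and y_pred (values are invented).
passenger_ids = pd.Index([892, 893, 894], name="PassengerId")
y_pred_demo = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame(
    {"Survived": y_pred_demo.astype(int)},  # cast 0.0/1.0 to 0/1
    index=passenger_ids,
)

# The named index is written as the first column,
# so the header row becomes "PassengerId,Survived".
submission.to_csv("submission.csv")
```

With the article's variables, the same pattern would use `X_test.index` and `y_pred` in place of the stand-ins.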

4. Summary

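I implemented the end-to-end machine learning workflow.
The result still leaves plenty of room for improvement.
As the code shows, most of it is preprocessing.
Kaggle datasets tend to be reasonably well organized, so preprocessing may not take that long, but in real-world work you search for which variables are effective through repeated trial and error. (It was far grittier than I expected.)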
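
One common way to structure that trial and error is to score each candidate set of columns with cross-validation on the labelled training data. The sketch below is an illustration under stated assumptions, not the procedure used in this article: the data is synthetic, and `toy_X`, `toy_y`, `candidates`, and the `Noise` column are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (all values are invented).
rng = np.random.RandomState(71)
n = 300
toy_X = pd.DataFrame({
    "Sex": rng.randint(0, 2, size=n),
    "Pclass": rng.randint(1, 4, size=n),
    "Noise": rng.uniform(0, 1, size=n),
})
# Toy target: equal to Sex, with roughly 10% of the labels flipped.
flip = rng.uniform(0, 1, size=n) < 0.1
toy_y = np.where(flip, 1 - toy_X["Sex"], toy_X["Sex"])

# Candidate feature sets to compare.
candidates = {
    "all columns": ["Sex", "Pclass", "Noise"],
    "without Sex": ["Pclass", "Noise"],
}

scores = {}
for label, cols in candidates.items():
    candidate_model = RandomForestClassifier(n_estimators=50, random_state=71)
    # Mean accuracy over 5 folds of the labelled data.
    scores[label] = cross_val_score(candidate_model, toy_X[cols], toy_y, cv=5).mean()
    print(label, round(scores[label], 3))
```

Because every fold is scored against known labels, this comparison does not depend on the test set at all.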

I also often hear that Kaggle Kernels are a useful reference, so I plan to keep improving the accuracy while learning from them.
