
Taking on Kaggle: Titanic 🚢 (Notebook 🥉)

Posted at 2023-02-01

Introduction

This post summarizes my attempt at Titanic - Machine Learning from Disaster, Kaggle's tutorial-style competition.
Reaching 80% accuracy puts you roughly in the top 2%, so that was my goal, but I could not get past 77-78% and eventually gave up :sob:
I'm sure there are still plenty of other ways to process tabular data, so I'll try again when I come up with a better idea 💪

To my surprise, the Notebook earned a Bronze medal 🥉 😃 (2023/02/01)
The accuracy itself is nothing special, but apparently it was code worth sharing with everyone. I'm so happy...!
Everything is written up in this article, but the code is also public on Kaggle, so please take a look 🙇
Kaggle page

Import

First, import the required libraries. That's... a lot.
This is because I wanted nicer plots, cross validation, and so on; for a truly simple model you would not need nearly this many.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import TensorDataset
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
# Validation
from torchvision import datasets
from torch.utils.data.dataset import Subset

import csv
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sn 

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

# For saving models
import pickle
# Set the plot style
plt.style.use('seaborn-darkgrid')

Also, if a GPU is available in your environment you should use it, so set the device here.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

Loading the data

I load the data with pandas rather than the csv module.
You can read the files with csv too, but pandas makes the processing far easier.
It's seriously amazing.

TRAIN_DIR  = "/kaggle/input/titanic/train.csv"
TEST_DIR   = "/kaggle/input/titanic/test.csv"

TRAIN_DATA = pd.read_csv(TRAIN_DIR)
TEST_DATA  = pd.read_csv(TEST_DIR) 

You can also check the contents of the data.

print(f"{TRAIN_DATA.info()}\n")
print(f"{TEST_DATA.info()}\n")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None

Data preprocessing

Missing value imputation

The tabular data we are working with contains missing values. When feeding it into a deep model, it is better to fill them in.

# Show missing value num
print(f"**Train\n\n{TRAIN_DATA.isnull().sum()}\n")
print(f"**Test\n\n{TEST_DATA.isnull().sum()}")
**Train

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

**Test

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In the training data, Age, Cabin, and Embarked are missing; in the test data, Age, Cabin, and Fare are.
Let's fill these in.

For Age and Fare, I fill the gaps with the overall mean. With pandas this is a one-liner.

# missing-value interpolation(Overall average)
TRAIN_DATA['Age'].fillna(TRAIN_DATA['Age'].mean(), inplace=True)
TEST_DATA['Age'].fillna(TEST_DATA['Age'].mean(), inplace=True)

TEST_DATA['Fare'].fillna(TEST_DATA['Fare'].mean(), inplace=True)

Embarked is not numeric, so the mean cannot be used. Here I use the mode instead.

# missing-value interpolation(Mode)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')

TRAIN_DATA['Embarked'] = imputer.fit_transform(TRAIN_DATA.loc[:,['Embarked']])
TEST_DATA['Embarked'] = imputer.transform(TEST_DATA.loc[:,['Embarked']])

One-hot encoding of categorical data

Categorical data should not be fed in as-is; convert it to one-hot vectors first.
I explain why in my "Kaggleで使える技術まとめ" (techniques you can use on Kaggle) article, so take a look if you are curious.
This is also trivial with pandas. Amazing.

TRAIN_DATA = pd.get_dummies(TRAIN_DATA, columns=["Sex"])
TEST_DATA = pd.get_dummies(TEST_DATA, columns=["Sex"])

TRAIN_DATA = pd.get_dummies(TRAIN_DATA, columns=['Embarked'])
TEST_DATA = pd.get_dummies(TEST_DATA, columns=['Embarked'])
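
If you want to see what get_dummies actually produced, a quick peek at the new columns looks something like this (the column names below assume the usual Sex/Embarked categories; depending on your pandas version the values show as 0/1 or True/False):

# Quick check of the one-hot columns created above (inspection only)
print(TRAIN_DATA[["Sex_female", "Sex_male", "Embarked_C", "Embarked_Q", "Embarked_S"]].head())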

Name is text data. Rather than using the whole name, I only use the honorific, grouped into five classes: Mr, Miss, Mrs, Master, and everything else.
Finally, convert it to one-hot vectors and we are done.

# Extract honorifics from names
def Extract_Honorifics_from_Names(DATA):
    title_list = []

    for name in DATA["Name"]:
        for name_split in name.split():
            # A token ending with "." is treated as the honorific ("L." is an initial, not a title)
            if name_split.endswith("."):
                if name_split != "L.":
                    title_list.append(name_split)

    # Map the honorifics to 5 classes: Mr / Miss / Mrs / Master / other
    title_num = []
    for title in title_list:
        if title == "Mr.":
            title_num.append(0)
        elif title == "Miss.":
            title_num.append(1)
        elif title == "Mrs.":
            title_num.append(2)
        elif title == "Master.":
            title_num.append(3)
        else:
            title_num.append(4)

    DATA["Name"] = title_num

Extract_Honorifics_from_Names(TRAIN_DATA)
Extract_Honorifics_from_Names(TEST_DATA)

TRAIN_DATA = pd.get_dummies(TRAIN_DATA, columns=["Name"])
TEST_DATA = pd.get_dummies(TEST_DATA, columns=["Name"])

Dropping unneeded columns

This time I drop Cabin and Ticket.
Cabin simply has too many missing values. Ticket is text data, but I could not figure out a good way to extract features from it :sob:
So for now I leave it out.

category_list = ['Cabin','Ticket']
TRAIN_DATA.drop(category_list, axis=1, inplace=True)
TEST_DATA.drop(category_list, axis=1, inplace=True)

Correlations

Pandas can output this immediately as well.
For tabular data it really is the strongest tool...

TRAIN_DATA.corr()

(Correlation matrix of TRAIN_DATA — image omitted)
It prints a matrix like the one above. This time we want the features that correlate strongly with Survived.
The correlations rank in the order below.
Let's use everything down to around Embarked_S.
Name_0 > Sex_male > Sex_female > Name_2 > Pclass > Name_1 > Fare > Embarked_C > Embarked_S > Name_3 > Parch > Age > SibSp > Embarked_Q > Name_4
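
If you would rather not read the ranking off the full matrix by eye, a small sketch like the following (not in the original notebook) sorts the absolute correlations with Survived directly:

# Sort features by the strength of their correlation with Survived (sketch)
corr_with_target = TRAIN_DATA.corr()['Survived'].drop('Survived')
print(corr_with_target.abs().sort_values(ascending=False))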

Feature selection

Based on the analysis so far, pick out the features we need.
Pull out the ground truth (Survived) separately as well.

FEATURE_TRAIN_DATA = TRAIN_DATA[["Pclass","Fare","Sex_female","Sex_male","Embarked_C","Embarked_S","Name_0","Name_1","Name_2"]].values
GROUND_TRUTH = TRAIN_DATA['Survived'].values
FEATURE_TEST_DATA = TEST_DATA[["Pclass","Fare","Sex_female","Sex_male","Embarked_C","Embarked_S","Name_0","Name_1","Name_2"]].values

Data normalization

To keep any single feature from dominating the others, normalize all values to the range [0, 1].

# Normalization
from sklearn import preprocessing
mm = preprocessing.MinMaxScaler()

FEATURE_TRAIN_DATA = mm.fit_transform(FEATURE_TRAIN_DATA)
# Scale the test data with the statistics fitted on the training data
FEATURE_TEST_DATA = mm.transform(FEATURE_TEST_DATA)

Decision tree

Now, let's feed the data into a model.
First, a decision tree. This is done in seconds.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth = 3)
result = model.fit(FEATURE_TRAIN_DATA, GROUND_TRUTH)

model.score(FEATURE_TRAIN_DATA, GROUND_TRUTH)

predict = model.predict(FEATURE_TEST_DATA)

submit_csv = pd.concat([TEST_DATA['PassengerId'], pd.Series(predict)], axis=1)
submit_csv.columns = ['PassengerId', 'Survived']
submit_csv.to_csv('/kaggle/working/submition_tree.csv', index=False)
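
The training-set score above can be optimistic, so as a rough sanity check (a sketch, not part of the submission pipeline) you could also look at a 5-fold cross-validated accuracy for the same tree settings:

from sklearn.model_selection import cross_val_score

# 5-fold CV accuracy for the same max_depth=3 tree (sanity check only)
cv_scores = cross_val_score(DecisionTreeClassifier(max_depth=3),
                            FEATURE_TRAIN_DATA, GROUND_TRUTH, cv=5)
print(cv_scores.mean(), cv_scores.std())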

DeepLearning

For the deep learning part I will just post everything...
I have been on the receiving end of "please don't skip that part :sob:" far too many times myself...

It is not exactly code fit to show other people, though. Some Test/Validation variable names never got renamed, and so on...
Sorry :innocent:

First, turn the data into a Dataset. The batch size is 4.

X_tensor = torch.Tensor(FEATURE_TRAIN_DATA)
y_tensor = torch.Tensor(GROUND_TRUTH)
dataset = TensorDataset(X_tensor, y_tensor)
X_tensor = torch.Tensor(FEATURE_TEST_DATA)
test_dataset = TensorDataset(X_tensor)

batch_size = 4
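
For reference, wrapping the training TensorDataset in a DataLoader with this batch size gives batches shaped as below; this is just a shape check, since the real train/validation loaders are built per fold later on:

# Shape check only: the actual loaders are created inside the cross-validation loop below
tmp_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
xb, yb = next(iter(tmp_loader))
print(xb.shape, yb.shape)  # torch.Size([4, 9]) torch.Size([4])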

The training step.

def train_epoch(model, optimizer, criterion, dataloader, device):
    train_loss = 0
    model.train()
    
    for i, (images, labels) in enumerate(dataloader):
        labels = labels.type(torch.LongTensor) 
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss = train_loss / len(dataloader.dataset)
    
    return train_loss

The validation step.

def validation(model, optimizer, criterion, dataloader, device):
    model.eval()
    val_loss=0
    
    with torch.no_grad():
        for i, (images, labels) in enumerate(dataloader):
            labels = labels.type(torch.LongTensor) 
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
        val_loss = val_loss / len(dataloader.dataset)
    return val_loss

Next, EarlyStopping.
When the validation loss fails to improve for patience consecutive epochs, training stops.

class EarlyStopping:

    def __init__(self, patience=30, verbose=False, path='checkpoint_model.pth'):
        self.patience = patience
        self.verbose = verbose
        self.counter = 0  
        self.best_score = None 
        self.early_stop = False   
        self.val_loss_min = np.Inf   
        self.path = path            

    def __call__(self, val_loss, model):
        score = -val_loss

        if self.best_score is None:  
            self.best_score = score 
            self.checkpoint(val_loss, model)  
        elif score < self.best_score:  
            self.counter += 1  
            if self.verbose:  
                print(f'EarlyStopping counter: {self.counter} out of {self.patience}')   
            if self.counter >= self.patience:  
                self.early_stop = True
        else:  
            self.best_score = score  
            self.checkpoint(val_loss, model)  
            self.counter = 0  

    def checkpoint(self, val_loss, model):
        if self.verbose:  
            print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}).  Saving model ...')
        torch.save(model.state_dict(), self.path)  
        self.val_loss_min = val_loss  

The run part.
It trains and validates for the specified number of epochs.
Because of EarlyStopping, it may stop before reaching the maximum number of epochs.
In practice you usually set a large epoch count and let EarlyStopping cut training off.

def run(num_epochs, optimizer, criterion, device, train_loader, val_loader,model):
    train_loss_list = []
    val_loss_list = []
    
    earlystopping = EarlyStopping(verbose=True)

    for epoch in range(num_epochs):
        train_loss = train_epoch(model, optimizer, criterion, train_loader, device)
        val_loss = validation(model, optimizer, criterion, val_loader, device)

        print(f'Epoch [{epoch+1}], train_Loss : {train_loss:.4f}, val_Loss : {val_loss:.4f}')
        train_loss_list.append(train_loss)
        val_loss_list.append(val_loss)

        earlystopping(val_loss_list[-1], model)
        if earlystopping.early_stop:
            print("Early Stopping!")
            break

    return train_loss_list, val_loss_list

A plot of the loss curves.

def graph(train_loss_list, val_loss_list):
    num_epochs = len(train_loss_list)
    fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
    ax.plot(range(num_epochs), train_loss_list, c='b', label='train loss')
    ax.plot(range(num_epochs), val_loss_list, c='r', label='val loss')
    ax.set_xlabel('epoch', fontsize='10')
    ax.set_ylabel('loss', fontsize='10')
    ax.set_title('training and validation loss', fontsize='10')
    ax.grid()
    ax.legend(fontsize='10')
    plt.show()

## CV ALL CONFUSION MATRIX
cv_y_true, cv_y_pred = [], []

A function that plots the confusion matrix.

def print_confusion_matrix(test_loader,model):
    
    model.eval()
    y_true,y_pred = [],[]
    
    with torch.no_grad():
        for i, (images, labels) in enumerate(test_loader):
            labels = labels.type(torch.LongTensor) 
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            for nval in range(len(labels)):
                #y_true.append(torch.argmax(labels[nval]))
                y_true.append(labels[nval])
                y_pred.append(torch.argmax(outputs[nval]))
                

    for leny in range(len(y_true)):
        y_true[leny] = y_true[leny].item()
        y_pred[leny] = y_pred[leny].item()
    
    ## CV ALL CONFUSION MATRIX
    cv_y_true.append(y_true)
    cv_y_pred.append(y_pred)
    
    target_names = ['0', '1']
    cmx = confusion_matrix(y_true, y_pred)
    df_cmx = pd.DataFrame(cmx, index=target_names, columns=target_names)
    plt.figure(figsize = (6,3))
    sn.heatmap(df_cmx, annot=True, annot_kws={"size": 18}, fmt="d", cmap='Blues')
    plt.show()   
    
    print(classification_report(y_true, y_pred, target_names=target_names))
    print("accuracy: ", accuracy_score(y_true, y_pred))

Preparation for cross validation.
Here I use stratified k-fold cross validation rather than a plain k-fold split.
For classification tasks, this is the better choice.

cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
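
To see what the stratification buys you, a small check (not in the original notebook) is to confirm that each fold keeps roughly the same survival rate as the full training set:

# Each validation fold's survival rate should stay close to the overall rate
print("overall survival rate:", GROUND_TRUTH.mean())
for fold, (tr_idx, va_idx) in enumerate(cv.split(FEATURE_TRAIN_DATA, GROUND_TRUTH)):
    print(f"fold {fold}: val survival rate = {GROUND_TRUTH[va_idx].mean():.3f}")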

The model. It is built from Linear layers and ReLU activations.

class DNN(nn.Module):

    def __init__(self):
        super(DNN, self).__init__()
        self.fc1 = nn.Linear(9, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 2)
        self.leakyrelu = nn.PReLU(init=0.01)  # defined but not used in forward()

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = torch.sigmoid(x)  # note: CrossEntropyLoss already applies softmax internally, so this sigmoid is redundant
        return x
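
A quick way to confirm the wiring is to push a dummy batch through the model (a sketch; the output is a (batch, 2) tensor of sigmoid-squashed scores):

# Sanity check with a dummy batch of 4 samples x 9 features
_model = DNN()
print(_model(torch.randn(4, 9)).shape)  # torch.Size([4, 2])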

The main part. It calls the functions implemented above.

fold_train_list = []
fold_val_list = []
fold_test_list = []

for i,(train_index, test_index) in enumerate(cv.split(FEATURE_TRAIN_DATA,GROUND_TRUTH)):    

    # Set up the model, loss, and optimizer
    model = DNN()
    model = model.to(device)
    weights = torch.tensor([1.0, 1.0]).to(device)  # class weights must live on the same device as the model
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = optim.Adam(model.parameters(), lr=0.0001)

    # train/validation split for this fold
    cv_train_dataset = Subset(dataset, train_index)
    cv_val_dataset  = Subset(dataset, test_index)

    train_loader = DataLoader(cv_train_dataset, batch_size=batch_size, shuffle=True)
    val_loader   = DataLoader(cv_val_dataset, batch_size=batch_size, shuffle=True)

    # run
    print(f"***FOLD {i}")
    train_loss_list, val_loss_list = run(300, optimizer, criterion, device, train_loader, val_loader,model)
    model.load_state_dict(torch.load('checkpoint_model.pth'))
    
    # Model Save
    ModelSavePath='model'+str(i)+'.pth'
    torch.save(model.state_dict(), ModelSavePath)
    
    # PLOT
    graph(train_loss_list, val_loss_list)
    print_confusion_matrix(val_loader,model)
    
    # Save the final loss of each fold
    fold_train_list.append(train_loss_list[-1])
    fold_val_list.append(val_loss_list[-1])
    print("-----------------\n")

From here we use the trained models to predict on the test data.
Each of the 5 models produces its own predictions, and then we vote.
Since we take a majority over 5 sets of results, it should do a bit better than using a single model (although in this case it hardly made a difference).
The predictions are converted to the format Kaggle expects and saved as a csv in the output folder.

test_loader  = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
y_pred = []

for i in range(5):
    
    # Model Load
    ModelSavePath='model'+str(i)+'.pth'
    model.load_state_dict(torch.load(ModelSavePath))    
    model.eval()
    
    y_pred_tmp = []
    
    with torch.no_grad():
        for table_data in test_loader:
            table_data = table_data[0].clone().detach()
            table_data = table_data.to(device).detach()
            outputs = model(table_data)
            for nval in range(len(outputs)):
                y_pred_tmp.append(outputs[nval])
        
    y_pred.append(y_pred_tmp)

y_pred_bote = []

for i in range(len(y_pred[0])):
    # Sum the two class scores over all 5 models (soft voting)
    tmp0, tmp1 = 0, 0
    for j in range(5):
        tmp0 += y_pred[j][i][0]
        tmp1 += y_pred[j][i][1]
    if tmp0 > tmp1:
        y_pred_bote.append(0)
    else:
        y_pred_bote.append(1)
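
An alternative that matches the "majority vote" wording more literally would be hard voting: take each model's argmax first and then pick the class most models chose (a sketch; this is not what gets submitted below):

# Hard-voting sketch: argmax per model, then majority vote per sample
y_pred_hard = []
for i in range(len(y_pred[0])):
    votes = [int(torch.argmax(y_pred[j][i])) for j in range(5)]
    y_pred_hard.append(1 if sum(votes) >= 3 else 0)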

output = [["PassengerId","Survived"]]
for i,Test in enumerate(list(TEST_DATA["PassengerId"])):
    tmp = []
    tmp.append(Test)
    tmp.append(y_pred_bote[i])
    output.append(tmp)

output_path = "/kaggle/working/submission_deep.csv"

with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(output)

Results

And so, one way or another, submit the csv files we made to Kaggle!
Decision tree: 0.76555
Deep: 0.78708
The deep model was just slightly more accurate.

Closing thoughts

Tabular data is harder than it looks.
I clearly need to put more thought into the preprocessing.
But seriously, pandas is amazing...

Next I would like to try one of Kaggle's image competitions 📷
