Getting to a First Submission in a Kaggle Tabular Competition as a Beginner

Posted at 2020-12-13

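First, just get a submission in

As a Kaggle beginner myself, I have summarized what I do first when I join a tabular competition.
This post uses the Titanic competition as the example.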

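I start by getting the following three things out of the way.

Handling missing values
Encoding categorical variables
Training

Handling missing values

First, check for missing values.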

import pandas as pd
import numpy as np

train = pd.read_csv('../input/titanic/train.csv')
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
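
Raw counts are easier to judge next to the share of rows they represent. As a small sketch, the counts and ratios can be put side by side. `missing_summary` is a hypothetical helper name, and the toy DataFrame below is made up for illustration (it is not the Titanic file).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the Titanic file).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

def missing_summary(df):
    """Return missing counts and ratios per column, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),   # number of missing rows per column
        "ratio": df.isna().mean(),      # fraction of missing rows per column
    })
    return summary.sort_values("n_missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

A column that is mostly missing, like Cabin in the real data, is a candidate for dropping rather than filling.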

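To handle missing values, I fill categorical columns with the most frequent category and numeric columns with the mean or median.
I define functions like the following and reuse them.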

def process_missing_category(df, columns):
    # Fill each categorical column with its most frequent category.
    for column in columns:
        most_category = df[column].value_counts().index[0]
        # Assign back instead of calling fillna(inplace=True) on a column selection.
        df[column] = df[column].fillna(most_category)
    return df

def process_missing_value(df, columns, operations):
    # operations[i] ('mean' or 'median') is applied to columns[i];
    # any other value leaves that column untouched.
    for column, ope in zip(columns, operations):
        if ope == 'mean':
            df[column] = df[column].fillna(df[column].mean())
        elif ope == 'median':
            df[column] = df[column].fillna(df[column].median())
    return df

I use them like this.

train = process_missing_category(train, ['Embarked'])
train = process_missing_value(train, ['Fare'], ['mean'])
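
The functions above compute the fill value from whichever DataFrame they receive. When the same preprocessing is applied to test.csv later, one option is to compute the fill values once on the training data and reuse them, so both sets are filled the same way. This is a minimal sketch under that assumption. `fit_fill_values` and `apply_fill_values` are hypothetical helper names, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df, category_columns, numeric_operations):
    """Compute one fill value per column from df (typically the training data).

    numeric_operations maps a column name to 'mean' or 'median'.
    """
    fill_values = {}
    for column in category_columns:
        fill_values[column] = df[column].mode().iloc[0]  # most frequent category
    for column, ope in numeric_operations.items():
        if ope == "mean":
            fill_values[column] = df[column].mean()
        elif ope == "median":
            fill_values[column] = df[column].median()
        else:
            raise ValueError(f"unknown operation: {ope}")
    return fill_values

def apply_fill_values(df, fill_values):
    """Return a copy of df with missing values replaced column by column."""
    return df.fillna(value=fill_values)

# Toy data for illustration only.
toy_train = pd.DataFrame({"Embarked": ["S", "S", "C", np.nan],
                          "Fare": [10.0, 20.0, 30.0, np.nan]})
toy_test = pd.DataFrame({"Embarked": [np.nan, "Q"],
                         "Fare": [np.nan, 5.0]})

fill_values = fit_fill_values(toy_train, ["Embarked"], {"Fare": "mean"})
toy_train_filled = apply_fill_values(toy_train, fill_values)
toy_test_filled = apply_fill_values(toy_test, fill_values)
print(toy_test_filled)
```

Keeping the fill values in a dictionary also makes it easy to see exactly what was substituted.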

Encoding categorical variables

Look for columns that could be categorical variables.

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Columns with dtype object are often candidates for categorical variables.
Here I encode Sex and Embarked.
I define a function like the following and reuse it.

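from sklearn.preprocessing import LabelEncoder

def encode_columns(df, columns=None):
    # Label-encode each column into '<column>_enc' and drop the original.
    # Default to None rather than a mutable [] default argument.
    columns = list(columns) if columns is not None else []
    categorical_features = []
    for column in columns:
        enc_column = f'{column}_enc'
        # A fresh encoder per column avoids sharing state through a module-level object.
        df[enc_column] = LabelEncoder().fit_transform(df[column])
        categorical_features.append(enc_column)
    df.drop(columns, axis=1, inplace=True)
    return df, categorical_features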

I use it like this.

train, categorical_features = encode_columns(train, ["Sex", "Embarked"])
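
Fitting the encoder on the training data alone means the test data has to be encoded with the same category-to-integer mapping, or the codes will not line up. The sketch below shows one way to do that with plain pandas. `fit_category_mapping` and `apply_category_mapping` are hypothetical helper names, the `-1` code for unseen values is an assumption, and the toy data is made up.

```python
import pandas as pd

def fit_category_mapping(series):
    """Map each category seen in the series to an integer code (sorted order)."""
    categories = sorted(series.dropna().unique())
    return {category: code for code, category in enumerate(categories)}

def apply_category_mapping(series, mapping, unseen_code=-1):
    """Encode a series with a fitted mapping; unseen or missing values get unseen_code."""
    return series.map(mapping).fillna(unseen_code).astype(int)

# Toy data for illustration only.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S", "X"]})  # "X" never appears in train

mapping = fit_category_mapping(toy_train["Embarked"])
toy_train["Embarked_enc"] = apply_category_mapping(toy_train["Embarked"], mapping)
toy_test["Embarked_enc"] = apply_category_mapping(toy_test["Embarked"], mapping)
print(mapping)
print(toy_test)
```

Sorting the categories gives the same ordering LabelEncoder uses. As far as I know, LightGBM treats negative values in a categorical feature as missing, which is why `-1` is a convenient placeholder here.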

Training

Before training, reduce the training data to only the columns needed. (For now I drop the ones that do not look easy to convert to numbers.)

delete_columns = ['Name', 'PassengerId','Ticket', 'Cabin']
train.drop(delete_columns, axis=1, inplace=True)

After that, train with LightGBM.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

y = train['Survived']
X = train.drop('Survived', axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

lgb_train = lgb.Dataset(X_train, y_train, categorical_feature=categorical_features)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train, categorical_feature=categorical_features)

params = {
    'objective': 'binary'
}

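# LightGBM 4.x takes early stopping and logging as callbacks
# instead of the early_stopping_rounds / verbose_eval arguments.
model = lgb.train(
    params, lgb_train,
    valid_sets=[lgb_train, lgb_eval],
    num_boost_round=1000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # stop after 10 rounds without validation improvement
        lgb.log_evaluation(period=10),           # print metrics every 10 rounds
    ],
)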
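
With a trained model, what remains for a first submission is to run the same preprocessing on test.csv, predict, and write a CSV. The sketch below covers only the last step and assumes the predicted probabilities are already available (for example from `model.predict(X_test)`, which returns probabilities for the `binary` objective). `make_submission` is a hypothetical helper, the 0.5 threshold is an assumption, and the IDs and probabilities are made up. The Titanic competition expects the columns `PassengerId` and `Survived`.

```python
import numpy as np
import pandas as pd

def make_submission(passenger_ids, probabilities, threshold=0.5):
    """Turn predicted survival probabilities into a 0/1 submission table."""
    survived = (np.asarray(probabilities) > threshold).astype(int)
    return pd.DataFrame({"PassengerId": list(passenger_ids), "Survived": survived})

# Made-up values standing in for the test IDs and the model output.
example_ids = [892, 893, 894]
example_probabilities = [0.12, 0.81, 0.50]

submission = make_submission(example_ids, example_probabilities)
submission.to_csv("submission.csv", index=False)  # file to upload
print(submission)
```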

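Closing

Once a first submission is in, I usually move on to the following.

Hyperparameter tuning (I often look at what someone else has tuned and use that)
Cross-validation
Feature exploration

That is roughly how I tend to proceed.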
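
As an illustration of the cross-validation step, here is a minimal sketch of stratified 5-fold evaluation. To keep it self-contained it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM; both are assumptions for the example and not what was used above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_model = GradientBoostingClassifier(random_state=0)  # stand-in for LightGBM

scores = cross_val_score(cv_model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"accuracy per fold: {scores}")
print(f"mean accuracy: {scores.mean():.3f}")
```

Averaging over folds gives a steadier estimate than the single 70/30 split used for the first submission.
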
Next year I will do my best to win at least one medal!
