
[Practice Problem] Trying Out Bank Customer Targeting

Posted at 2020-07-14

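#Introduction
I worked through SIGNATE's "[Practice Problem] Bank Customer Targeting" to learn what machine learning is all about.
[Practice Problem] Bank Customer Targeting: https://signate.jp/competitions/1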

#Environment setup

#Importing libraries

# Import the libraries used in this article
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import KFold

#Loading and preprocessing the data

# Load the training data and the test data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Split the training data into features and the target variable
train_x = train.drop(['y'], axis=1)
train_y = train['y']

# The test data contains only features, so it is used as is
test_x = test.copy()

# Drop the id column
train_x = train_x.drop(['id'], axis=1)
test_x = test_x.drop(['id'], axis=1)

# Define the dictionaries used to encode each feature
marital_mapping = {'married': 3, 'single': 2, 'divorced': 1}
education_mapping = {'secondary': 4, 'tertiary': 3, 'primary': 2, 'unknown': 1}
default_mapping = {'no': 0, 'yes': 1}
housing_mapping = {'no': 0, 'yes': 1}
loan_mapping = {'no': 0, 'yes': 1}
month_mapping = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Encode each feature of the training data
train_x['marital'] = train_x['marital'].map(marital_mapping)
train_x['education'] = train_x['education'].map(education_mapping)
train_x['default'] = train_x['default'].map(default_mapping)
train_x['housing'] = train_x['housing'].map(housing_mapping)
train_x['loan'] = train_x['loan'].map(loan_mapping)
train_x['month'] = train_x['month'].map(month_mapping)

# Encode each feature of the test data
test_x['marital'] = test_x['marital'].map(marital_mapping)
test_x['education'] = test_x['education'].map(education_mapping)
test_x['default'] = test_x['default'].map(default_mapping)
test_x['housing'] = test_x['housing'].map(housing_mapping)
test_x['loan'] = test_x['loan'].map(loan_mapping)
test_x['month'] = test_x['month'].map(month_mapping)

# Encode poutcome as 1 for 'success' and 0 for every other value ('non-success')
train_x['poutcome'] = (train_x['poutcome'] == 'success').astype(int)

# Apply the same encoding to the test data
test_x['poutcome'] = (test_x['poutcome'] == 'success').astype(int)
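One thing worth checking after dictionary-based encoding like the above: `Series.map` returns NaN for any value that has no key in the dict, so a misspelled or missing key silently creates missing values. Below is a minimal sketch of such a check. The toy column and the deliberately incomplete mapping are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy column; the real values come from train.csv.
marital = pd.Series(['married', 'single', 'divorced', 'single'])

# Hypothetical mapping that accidentally omits one category.
incomplete_mapping = {'married': 3, 'single': 2}

mapped = marital.map(incomplete_mapping)

# Values without a matching key became NaN, so list the categories that were missed.
unmapped = sorted(marital[mapped.isna()].unique())
print(unmapped)
```

Running the same check on each encoded column of `train_x` and `test_x` would show whether any category was left out of its dictionary.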

#Selecting the features

# parameters = ['age','marital','education','default','balance','housing','loan','day','month','duration','campaign','pdays','previous','poutcome']
parameters = ['age','balance','month','day','duration','pdays','poutcome']

train_x1 = train_x.loc[:,parameters]
test_x1 = test_x.loc[:,parameters]

#Training and validation


# Lists that store the score of each fold
scores_accuracy = []
scores_logloss = []

# Define the model wrapper
class Model:

    def __init__(self, params=None):
        self.model = None
        if params is None:
            self.params = {}
        else:
            self.params = params

    def fit(self, tr_x, tr_y, va_x, va_y):
        # Baseline parameters
        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eta': 0.2,
            'gamma': 0.0,
            'alpha': 0.0,
            'lambda': 1.0,
            'min_child_weight': 1,
            'max_depth': 8,
            'subsample': 0.8,
            'colsample_bytree': 0.8,
            'random_state': 71,
        }
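        # Let parameters passed to the constructor override the baseline
        params.update(self.params)
        # Upper bound on boosting rounds; early stopping normally ends training sooner
        num_round = 10000
        dtrain = xgb.DMatrix(tr_x, label=tr_y)
        dvalid = xgb.DMatrix(va_x, label=va_y)
        # Evaluation sets; early stopping monitors the last entry ('eval')
        watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
        self.model = xgb.train(params, dtrain, num_round, evals=watchlist, early_stopping_rounds=100)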

    def predict(self, x):
        data = xgb.DMatrix(x)
        pred = self.model.predict(data)
        return pred

# Run cross-validation
# Split the training data into 4 folds and use each fold in turn as the validation data
kf = KFold(n_splits=4, shuffle=True, random_state=71)
for tr_idx, va_idx in kf.split(train_x1):
    # Split the training data into a training part and a validation part
    tr_x, va_x = train_x1.iloc[tr_idx], train_x1.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

    # Train the model
    model = Model()
    model.fit(tr_x, tr_y, va_x, va_y)

    # Predict probabilities for the validation data
    va_pred = model.predict(va_x)

    # Compute the scores on the validation data
    logloss = log_loss(va_y, va_pred)
    accuracy = accuracy_score(va_y, va_pred > 0.5)

    # Store the scores of this fold
    scores_logloss.append(logloss)
    scores_accuracy.append(accuracy)
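To make the 4-fold split above more concrete, here is a small sketch that prints which rows land in the training and validation parts of each fold. The 8-row array is a hypothetical stand-in for `train_x1`. The `KFold` settings are the same as in the article.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 8 rows standing in for the training set.
X = np.arange(8).reshape(-1, 1)

kf = KFold(n_splits=4, shuffle=True, random_state=71)

validation_rows = []
for fold, (tr_idx, va_idx) in enumerate(kf.split(X)):
    # Each fold holds out a different quarter of the rows for validation.
    print(f"fold {fold}: train={tr_idx.tolist()} valid={va_idx.tolist()}")
    validation_rows.extend(va_idx.tolist())
```

Every row is used for validation exactly once across the four folds, so each fold's score is measured on data that fold's model has not seen.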

#Prediction results

logloss

0.23702760471558257

accuracy

0.897817752875258
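The article does not show how these two figures were derived from `scores_logloss` and `scores_accuracy`. Assuming they are averages over the four folds, the sketch below shows one way to compute them. It also computes log loss and accuracy by hand on a tiny example to show what each metric measures. All numbers in it are hypothetical and are not the article's actual fold scores.

```python
import numpy as np

# Hypothetical per-fold scores; the real lists are filled by the cross-validation loop.
scores_logloss = [0.24, 0.23, 0.25, 0.22]
scores_accuracy = [0.90, 0.89, 0.91, 0.90]

# Summarise cross-validation with the mean over folds.
mean_logloss = np.mean(scores_logloss)
mean_accuracy = np.mean(scores_accuracy)
print(f'logloss: {mean_logloss:.4f}, accuracy: {mean_accuracy:.4f}')

# Hypothetical labels and predicted probabilities for four customers.
y_true = np.array([1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

# Log loss penalises confident wrong probabilities; lower is better.
manual_logloss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Accuracy only looks at which side of the 0.5 threshold each prediction falls on.
manual_accuracy = np.mean((y_prob > 0.5) == (y_true == 1))
print(manual_logloss, manual_accuracy)
```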

#Creating the submission file

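# Predict on the test data with the model trained in the last fold
pred = model.predict(test_x1)
# Build the submission from the ids of the test data loaded earlier
submission = pd.DataFrame({'id': test['id'], 'y': pred})
# Write id and predicted probability without an index or header row
# (the output/ directory must already exist)
submission.to_csv('output/submit_org.csv', index=False, header=False)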
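As a quick check of the file layout produced by this step, the sketch below writes a tiny submission to an in-memory buffer instead of a file and prints the result. The ids and probabilities are hypothetical placeholders for `test['id']` and `pred`.

```python
import io

import pandas as pd

# Hypothetical ids and predicted probabilities.
submission = pd.DataFrame({'id': [1, 2, 3], 'y': [0.12, 0.87, 0.05]})

# Write to an in-memory buffer so the layout can be inspected without touching disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each line holds one id and its predicted probability, separated by a comma, with no header row and no index column.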

#Submitting to SIGNATE

The result was 91%. (806th place as of August 17, 2020)
As a next step, I plan to work on improving the quality of the data preprocessing.