Code snippets for machine learning

Posted at 2019-09-14

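Light analysis and preprocessing

Cramér's V

This measures how strongly the target variable and an explanatory variable are associated when both are categorical. The value ranges from 0 (no association) to 1 (perfect association).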

import numpy as np
import pandas as pd

# Cramér's V
def cramersV(x, y):
    # Contingency table of observed counts
    table = np.array(pd.crosstab(x, y)).astype(np.float32)
    n = table.sum()
    colsum = table.sum(axis=0)
    rowsum = table.sum(axis=1)
    # Expected counts under independence
    expect = np.outer(rowsum, colsum) / n
    chisq = np.sum((table - expect) ** 2 / expect)
    return np.sqrt(chisq / (n * (np.min(table.shape) - 1)))
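
As a quick sanity check of the formula, the sketch below recomputes the same statistic on two toy datasets: one where the second variable is fully determined by the first, and one where the two are unrelated. The name `cramers_v` and the toy Series are assumptions made for illustration, and float64 is used here instead of float32.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of the contingency table."""
    table = pd.crosstab(x, y).to_numpy(dtype=np.float64)
    n = table.sum()
    # Expected counts if x and y were independent
    expect = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chisq = np.sum((table - expect) ** 2 / expect)
    return np.sqrt(chisq / (n * (min(table.shape) - 1)))


# Toy data (hypothetical): y_same is determined by x, y_indep is unrelated to x
x = pd.Series(["a", "a", "b", "b"])
y_same = pd.Series(["u", "u", "v", "v"])
y_indep = pd.Series(["u", "v", "u", "v"])

v_same = cramers_v(x, y_same)    # perfectly associated
v_indep = cramers_v(x, y_indep)  # independent

print(v_same, v_indep)
```

Values near 0 suggest little association between the two columns, and values near 1 suggest a strong one.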

Oversampling

For imbalanced data, this supplements the minority class with synthetic samples.

from imblearn.over_sampling import SMOTE

# Oversampling
sm = SMOTE(random_state=42)
# fit_resample replaces fit_sample, which was removed from newer imbalanced-learn releases
X_res, Y_res = sm.fit_resample(X, Y)

One-Hot Encoding

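This expands each categorical column into one column per unique value (one fewer when drop_first=True is set, since the first level is dropped).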

# One-hot encoding
new_df = pd.get_dummies(df, drop_first=True)
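
The effect of `drop_first` is easiest to see on a tiny frame. The sketch below uses a made-up DataFrame (`toy`, with hypothetical `size` and `color` columns) and compares the resulting column names.

```python
import pandas as pd

# Toy frame (hypothetical column names)
toy = pd.DataFrame({"size": [1, 2, 3], "color": ["red", "green", "blue"]})

full = pd.get_dummies(toy)                      # one column per unique value
reduced = pd.get_dummies(toy, drop_first=True)  # first level ("blue") dropped

print(list(full.columns))
print(list(reduced.columns))
```

Numeric columns pass through unchanged. Dropping the first level avoids a redundant column, which matters mostly for linear models.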

Ordinal Encoding

This converts categorical data to numeric values without increasing the number of columns.

# Ordinal encoding
import category_encoders as ce
categorical_col = list(df.columns)
# In category_encoders 2.x, 'value' takes the place of the older 'impute' option
ce_oe = ce.OrdinalEncoder(cols=categorical_col, handle_unknown='value')
new_df = ce_oe.fit_transform(df)
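
The point of the unknown-value handling is what happens to categories that appear only at prediction time. As a rough illustration, the sketch below uses scikit-learn's `OrdinalEncoder` rather than category_encoders, so the details differ (codes start at 0 and follow sorted order), but the idea is the same. The toy frames and the choice of -1 for unseen categories are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): "grade" is the only categorical column
train = pd.DataFrame({"grade": ["low", "mid", "high", "mid"]})
test = pd.DataFrame({"grade": ["mid", "unseen"]})

# Categories not seen during fit are encoded as -1
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train)
test_codes = enc.transform(test)

print(enc.categories_)
print(test_codes.ravel())
```

Fitting the encoder on the training data and reusing it on the test data keeps the mapping consistent between the two.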

Modeling

Decision Tree (binary classification)

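from sklearn.tree import DecisionTreeClassifier

# Decision tree
# Placeholder column names: replace with your own feature and target columns
X = df[['explanatory_variable']]
Y = df['target_variable']

clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, Y)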
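# Predict and create the submission file
# (X_test is the separately loaded test data, assumed to include an "id" column)
dt_pred = clf.predict(X_test)
y_submit = pd.DataFrame({"id": X_test["id"], "y": dt_pred})
y_submit.to_csv("../output/results/submit.csv", index=False)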
# Visualizing the decision tree
# sklearn.externals.six is gone from current scikit-learn; io.StringIO does the same job
from io import StringIO

import pydotplus
from IPython.display import Image
from sklearn import tree

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=list(X.columns), max_depth=3)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# graph.write_pdf("graph.pdf")
# graph.write_png("graph.png")
Image(graph.create_png())
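
If Graphviz is not available, a text rendering can serve as a fallback. The following is a minimal sketch using scikit-learn's `export_text`. The iris dataset and the name `clf_demo` are stand-ins chosen for illustration, not the data used in this article.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is used only as stand-in data for this sketch
iris = load_iris()
clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0)
clf_demo.fit(iris.data, iris.target)

# Plain-text rendering of the fitted tree
tree_text = export_text(clf_demo, feature_names=list(iris.feature_names))
print(tree_text)
```

Each indented line is one split or leaf, so the output reads top-down like the diagram.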

Random Forest

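from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Random forest
# Placeholder column names: replace with your own feature and target columns
X = df[['explanatory_variable']]
Y = df['target_variable']

clf = RandomForestClassifier()
clf.fit(X, Y)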
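# Predict and create the submission file
# (X_test is the separately loaded test data, assumed to include an "id" column)
rf_pred = clf.predict(X_test)
y_submit = pd.DataFrame({"id": X_test["id"], "y": rf_pred})
y_submit.to_csv("../output/results/submit.csv", index=False)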
# Visualizing feature importances
features = X.columns
importances = clf.feature_importances_
indices = np.argsort(importances)  # ascending, so the most important bar is drawn at the top

plt.figure(figsize=(30, 30))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])

plt.show()

# Use this when there are too many variables to read the plot
# features_show = features[indices][::-1]
# # for i in range(len(features_show)):
# #     print(features_show[i])
# important_show = importances[indices][::-1]
# # for i in range(len(important_show)):
# #     print(important_show[i])
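
When the plot is too crowded, a sorted table of the top features is often easier to read than printing every value in a loop. Here is one possible sketch using a pandas Series. The breast-cancer dataset and the names `rf`, `ranking` and `top10` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in data for this sketch; replace with your own X and Y
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# Rank features by importance and keep the ten largest
ranking = (
    pd.Series(rf.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
top10 = ranking.head(10)
print(top10)
```

The same Series can be plotted with `top10.plot.barh()` if a chart of only the leading features is enough.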

LightGBM

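import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Function that further splits the training data into training and test sets
def data_split(X: pd.DataFrame, y: pd.Series, test_size: float=0.2):
    return train_test_split(X, y, test_size=test_size, random_state=1, stratify=y)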
# LightGBM
# Placeholder column names: replace with your own feature and target columns
X = df[['explanatory_variable']]
Y = df['target_variable']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# LightGBM parameters
lgbm_params = {
        # Binary classification problem
        'objective': 'binary',
        # Aim to maximize AUC
        'metric': 'auc',
        'learning_rate': 0.1,
        'num_iterations': 100,
        'num_leaves': 31
    }

# train
gbm = lgb.train(lgbm_params, lgb_train, valid_sets=[lgb_eval])
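# Cross-validation
from sklearn.model_selection import cross_val_score

# clf must be a scikit-learn estimator (for example, the classifier defined earlier);
# the Booster returned by lgb.train cannot be passed to cross_val_score
scores = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=5, n_jobs=1)
print(np.mean(scores))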
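
Without a `scoring` argument, `cross_val_score` reports the estimator's default score, which is accuracy for classifiers. Since the LightGBM parameters above target AUC, a comparable figure may be more useful. The following is a sketch under assumed settings: synthetic data, a random forest as the estimator, and stratified 5-fold splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data for this sketch
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One AUC value per fold
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(auc_scores.mean(), auc_scores.std())
```

Looking at the spread across folds as well as the mean gives a rough idea of how stable the score is.
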
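# Predict and create the submission file
# Note: after the split above, X_test refers to the hold-out part of the training data;
# point it at the real test data (with its "id" column) before running this
# With the binary objective, predict returns probabilities rather than class labels
gbm_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_submit = pd.DataFrame({"id": X_test["id"], "y": gbm_pred})
y_submit.to_csv("../output/results/submit.csv", index=False)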

XGBoost

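import xgboost as xgb

# Convert to NumPy arrays before fitting
X = X.values
Y = Y.values

# Use a separate name so the xgboost module (xgb) is not overwritten
xgb_model = xgb.XGBRegressor()
# When parameters are specified
# xgb_model = xgb.XGBRegressor(**params)
xgb_model.fit(X, Y)

# Predict and create the submission file
xgb_pred = xgb_model.predict(X_test)
y_submit = pd.DataFrame({"id": X_test["id"], "y": xgb_pred})
y_submit.to_csv("../output/results/submit.csv", index=False)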