
Kaggle: Manual Feature Engineering (part two)


Last updated at 2020-02-10. Posted at 2020-02-10.

I'll give this a try for now, but I need to figure out how far it is actually necessary to go.

Function to Aggregate Numeric Data

This is the same function as last time, but here it is again.

This groups data by the group_var and calculates mean, max, min, and sum. It will only be applied to numeric data by default in pandas.

    # Only want the numeric variables
    parent_ids = df[parent_var].copy()
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[parent_var] = parent_ids

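In the end, only numeric values get aggregated (mean, max, and so on), so returning just the numeric columns is a clean choice. It is easy to understand and, I think, easy to work with.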
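To see the excerpt above in context, here is a minimal sketch of what the whole aggregation function might look like. The name agg_numeric, the df_name prefix, and the toy columns (client_id, amount, status) are my own assumptions for illustration, not taken from the notebook.

```python
import pandas as pd

def agg_numeric(df, parent_var, df_name):
    """Group numeric columns by parent_var and compute mean, max, min, sum.

    Hypothetical reconstruction for illustration only.
    """
    # Only want the numeric variables (keep the grouping key either way)
    parent_ids = df[parent_var].copy()
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[parent_var] = parent_ids

    # One row per parent id, one column per (variable, statistic) pair
    agg = numeric_df.groupby(parent_var).agg(['mean', 'max', 'min', 'sum'])

    # Flatten the two-level column index into plain names
    agg.columns = [f'{df_name}_{var}_{stat}' for var, stat in agg.columns]
    return agg.reset_index()

# Toy data: the non-numeric 'status' column is silently dropped
toy = pd.DataFrame({
    'client_id': [1, 1, 2],
    'amount': [100.0, 300.0, 50.0],
    'status': ['a', 'b', 'a'],
})
result = agg_numeric(toy, 'client_id', 'toy')
print(result)
```

The string column never reaches the groupby, which is exactly why the output is easy to merge back into the main table.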

Function to Convert Data Types

This will help reduce memory usage by using more efficient types for the variables. For example category is often a better type than object (unless the number of unique categories is close to the number of rows in the dataframe).

In other words, the category type uses less memory than the object type. If a column holds almost entirely unique values, though, object looks like the better choice.
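A quick way to check this claim is to measure both cases directly. The sketch below uses made-up data (a two-value column and an all-unique column); the exact byte counts depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Few distinct values repeated many times: category should be much smaller
low_card = pd.Series(['cash', 'revolving'] * 5000, dtype=object)

# Every value distinct: category must store every label plus the integer codes
high_card = pd.Series([f'id_{i}' for i in range(10000)], dtype=object)

def mem(s):
    """Memory in bytes, counting the Python string objects as well."""
    return s.memory_usage(deep=True)

low_obj, low_cat = mem(low_card), mem(low_card.astype('category'))
high_obj, high_cat = mem(high_card), mem(high_card.astype('category'))

print(f'low cardinality : object={low_obj}, category={low_cat}')
print(f'high cardinality: object={high_obj}, category={high_cat}')
```

With only two categories the codes fit in one byte per row, while the all-unique column gains nothing because the categories themselves are as large as the original data.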

        # Float64 to float32
        elif df[c].dtype == float:
            df[c] = df[c].astype(np.float32)
            
        # Int64 to int32
        elif df[c].dtype == int:
            df[c] = df[c].astype(np.int32)

I wondered why this converts an int to an int, but I found the following explanation:

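NumPy is basically implemented in C internally so that large-scale data operations can run fast. Python itself is not a particularly fast language, so matrix operations and data handling are carried out in C.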

In other words, by specifying the data type of a NumPy array correctly, you can write code that is both memory-efficient and fast to execute, even from Python.

That seems to be the reason. The point is not int to int, but 64-bit to 32-bit.
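The effect of the downcast is easy to confirm on a small frame. This is a minimal sketch with invented column names (count, ratio); it compares against np.int64 and np.float64 explicitly rather than the built-in int and float used in the excerpt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'count': np.arange(1000, dtype=np.int64),
    'ratio': np.linspace(0.0, 1.0, 1000),  # float64 by default
})
before = df.memory_usage(index=False)

small = df.copy()
for c in small.columns:
    # Float64 to float32
    if small[c].dtype == np.float64:
        small[c] = small[c].astype(np.float32)
    # Int64 to int32
    elif small[c].dtype == np.int64:
        small[c] = small[c].astype(np.int32)
after = small.memory_usage(index=False)

print(before)
print(after)
```

Each column drops from 8 bytes to 4 bytes per value. The trade-off is range and precision: int32 overflows beyond roughly ±2.1 billion, and float32 keeps only about 7 significant digits.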

The final dataset ended up over 2 GB.

print(f'Final training size: {return_size(train)}')
print(f'Final testing size: {return_size(test)}')

Final training size: 2.12
Final testing size: 0.34
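The body of return_size is not shown in this post. Judging from the output, it reports the size in gigabytes rounded to two decimals, so a stand-in could look like the sketch below. Using memory_usage(deep=True) is my assumption; the notebook may measure the size differently.

```python
import numpy as np
import pandas as pd

def return_size(df):
    """Approximate in-memory size of a dataframe in gigabytes (assumed definition)."""
    return round(df.memory_usage(deep=True).sum() / 1e9, 2)

# One million float64 values is about 8 MB, i.e. 0.01 GB after rounding
demo = pd.DataFrame({'x': np.zeros(1_000_000)})
print(f'Demo size: {return_size(demo)}')
```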

Unfortunately, saving all the created features does not work in a Kaggle notebook. You will have to run the code on your personal machine. I have run the code and uploaded the entire datasets here. I plan on doing some feature selection and uploading reduced versions of the datasets. Right now, they are slightly too big to handle in Kaggle notebooks or scripts.

# train.to_csv('train_previous_raw.csv', index = False, chunksize = 500)
# test.to_csv('test_previous_raw.csv', index = False)

How large a dataset can a Kaggle notebook actually handle? What exactly is the problem in the first place? Can it not write a large dataset with to_csv?

Then again, to_csv takes a very long time even on a local machine, so maybe it simply times out in a notebook.
Written out to a file, the data came to about 1.4 GB.
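To get a feel for how chunksize behaves and how the file size relates to the row count, here is a small experiment on random data written to a temporary directory. The frame shape and file name are arbitrary choices of mine.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((2000, 5)), columns=list('abcde'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.csv')

    # chunksize is the number of rows written at a time
    df.to_csv(path, index=False, chunksize=500)

    size_mb = os.path.getsize(path) / 1e6
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1  # subtract the header line

print(f'{n_rows} rows, {size_mb:.3f} MB on disk')
```

The size on disk differs from the size in memory because a CSV stores every number as text, so the two figures are not directly comparable.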

Modeling

Out of nowhere, this call appears:

submission, fi, metrics = model(features, test_features)

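It mysteriously passes features and test_features, which have not been defined anywhere. The notebook naturally errors out at this point too. Maybe the author ran out of steam? It happens.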

The libraries are not imported either, so naturally you get errors such as:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-31876704d418> in <module>
----> 1 submission, fi, metrics = model(train , test)

<ipython-input-33-92c6955023ff> in model(features, test_features, encoding, n_folds)
     85 
     86     # Create the kfold object
---> 87     k_fold = KFold(n_splits = n_folds, shuffle = False, random_state = 50)
     88 
     89     # Empty array for feature importances

NameError: name 'KFold' is not defined

Copy the library imports from the previous notebook as they are and place them before the model function.

import lightgbm as lgb

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt
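Once KFold is imported, the fold setup from the traceback can be tried on its own. The sketch below uses a toy array and n_splits=5, both my assumptions. Note that it sets shuffle=True: as far as I know, recent scikit-learn versions raise a ValueError when random_state is given together with shuffle=False, as in the traceback's call, so on a newer environment that line may need adjusting as well.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy feature matrix: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)

# random_state only has an effect when shuffle=True
k_fold = KFold(n_splits=5, shuffle=True, random_state=50)

valid_sizes = []
seen = []
for train_idx, valid_idx in k_fold.split(X):
    # Each sample lands in the validation set of exactly one fold
    valid_sizes.append(len(valid_idx))
    seen.extend(valid_idx.tolist())

print(valid_sizes)
```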

Come to think of it, the notebook said it would also look at KDE plots, but that seems to have been skipped?

I submitted the results and got a score of 0.77186, a big jump from last time.

The feature importances looked like this.

The client_installments features created this time rank near the top.

Let's look at my beloved KDE plots as well.

Hmm... these don't really look like good features, do they?

That's it for now.

English I learned

occurrences: 発生 (instances of something happening)
