Kaggle: Introduction to Manual Feature Engineering, Part 3

Posted at 2020-02-06, last updated at 2020-02-07

Calculate Information for Testing Data

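Here the notebook merges bureau_counts and the other features computed so far into the test data, but I had assumed the test data would come with its own separate bureau-style data. Looking at SK_ID_CURR, does the bureau data contain IDs from both train and test?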
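One way to answer that question is to compare the SK_ID_CURR values directly. The following is a minimal sketch on toy frames (the values are made up, and the real files would be loaded with `pd.read_csv`); as far as I understand, the real bureau data does cover clients from both train and test, which is why the same aggregates can be merged into either one.

```python
import pandas as pd

# Toy stand-ins for the competition files (hypothetical values, not real data)
train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
test = pd.DataFrame({'SK_ID_CURR': [4, 5]})
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 3, 4, 4, 5],
    'SK_ID_BUREAU': [10, 11, 12, 13, 14, 15],
})

# Which client IDs in bureau belong to train, and which to test?
bureau_ids = set(bureau['SK_ID_CURR'])
ids_in_train = bureau_ids & set(train['SK_ID_CURR'])
ids_in_test = bureau_ids & set(test['SK_ID_CURR'])

# One row per client, in the spirit of the counts computed earlier
bureau_counts = (
    bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU']
    .count()
    .rename(columns={'SK_ID_BUREAU': 'previous_loan_counts'})
)

# The same aggregate table can be left-merged into test by SK_ID_CURR
test_merged = test.merge(bureau_counts, on='SK_ID_CURR', how='left')
```

If `ids_in_test` is non-empty on the real data, the merge into test makes sense without a separate bureau file.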

train_labels = train['TARGET']

# Align the dataframes, this will remove the 'TARGET' column
train, test = train.align(test, join = 'inner', axis = 1)

train['TARGET'] = train_labels

I've gotten used to this step of lining up the columns of train and test.
Mismatched columns used to give me a lot of errors at predict time...
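To see what the align step does, here is a small sketch on toy frames. The column names `only_in_train` and `only_in_test` are hypothetical, chosen just to show which columns get dropped.

```python
import pandas as pd

# Toy frames: each has one column the other lacks (hypothetical names)
train = pd.DataFrame({
    'SK_ID_CURR': [1, 2],
    'a': [0.1, 0.2],
    'only_in_train': [5, 6],
    'TARGET': [0, 1],
})
test = pd.DataFrame({'SK_ID_CURR': [3], 'a': [0.3], 'only_in_test': [7]})

# Save the labels first, because TARGET is not in test and will be dropped
train_labels = train['TARGET']

# join='inner' with axis=1 keeps only the columns present in both frames
train, test = train.align(test, join='inner', axis=1)
shared_columns = list(train.columns)

# Put the labels back on train
train['TARGET'] = train_labels
```

After this, train and test share the same feature columns, so a model fitted on train can predict on test without a column mismatch.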

HTML (CSS) has declarations like text-align: center too, so "align" simply means "to line things up".

kde_target(var_name='client_bureau_balance_counts_mean', df=train)

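Huh, the call above raises an error. It turns out the original notebook errors out at the same place (lol).

Presumably the intended call is:

kde_target(var_name='client_bureau_balance_MONTHS_BALANCE_count_mean', df=train)

I think this is the plot that was meant. I really like KDE plots.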
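When a column name turns out to be wrong like this, searching the columns for a substring is a quick way to find the right one. This is a minimal sketch, assuming the error comes from the column not existing in train; the toy frame below only mimics the column names.

```python
import pandas as pd

# Toy frame with names shaped like the notebook's aggregated columns (made-up values)
train = pd.DataFrame({
    'client_bureau_balance_MONTHS_BALANCE_count_mean': [1.0, 2.0],
    'client_bureau_balance_MONTHS_BALANCE_mean_mean': [-3.0, -4.0],
    'TARGET': [0, 1],
})

wanted = 'client_bureau_balance_counts_mean'

# Check whether the requested column exists at all
exists = wanted in train.columns

# List likely candidates by substring match
candidates = [c for c in train.columns if 'bureau_balance' in c and 'count' in c]
```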

Collinear Variables

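What exactly are collinear variables?
Searching suggests the term means the same thing as multicollinearity (多重共線性), but is that really the case?
Well, judging from what I read, they are close enough.
The point is to look not only at each variable's correlation with the target but also at the correlations among the variables themselves, and to drop variables that are highly correlated with another one.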
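The idea can be sketched as follows. This is one common way to do it, not necessarily the notebook's exact code: the helper name `find_collinear_columns`, the toy data, and the 0.8 threshold are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

def find_collinear_columns(df, threshold=0.8):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle (above the diagonal) so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy features: x_doubled is perfectly correlated with x, noise is not
features = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'x_doubled': [2.0, 4.0, 6.0, 8.0, 10.0],
    'noise': [3.0, -1.0, 2.0, -2.0, 1.0],
})

to_drop = find_collinear_columns(features)
reduced = features.drop(columns=to_drop)
```

Of each highly correlated pair, this keeps the column that appears first and drops the later one. Which member of the pair to keep is a judgment call.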

Inspecting variables in a Notebook

When you want to do some so-called print debugging, put this line just before the spot you want to look at:

from IPython.core.debugger import Pdb; Pdb().set_trace()

A console appears at that point, and you can print-debug from there. Handy.

Python I learned

set()

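Pass it a list and the duplicates are removed. That is how I have been using it, but apparently its real purpose is to build a set in the mathematical sense...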
https://qiita.com/Tocyuki/items/0bc783daab382ef7a0ec
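A small sketch of both uses, deduplication and set operations, with made-up values. A set has no guaranteed order, so `sorted()` is used when a stable order is needed.

```python
# Removing duplicates from a list
values = ['a', 'b', 'a', 'c', 'b']
unique_values = set(values)
sorted_unique = sorted(unique_values)  # back to a list, in a stable order

# Using sets as mathematical sets
left = {1, 2, 3}
right = {3, 4}
common = left & right      # intersection
either = left | right      # union
only_left = left - right   # difference
```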

English I learned

skeptically: doubtfully (懐疑的)

Next time I'll pick up from Modeling.
https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering#Modeling
