
Kaggle: Introduction to Manual Feature Engineering, Part 4


Posted at 2020-02-07

Picking up from the Modeling section

KFold

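    from sklearn.model_selection import KFold

    # Create the kfold object (random_state is only used when shuffle = True; recent scikit-learn raises an error otherwise)
    k_fold = KFold(n_splits = n_folds, shuffle = False)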

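Apparently this is one form of cross-validation. If you split the data into 10 parts, 1 part is used for testing and the other 9 for training. This is repeated 10 times so that each of the 10 parts is used as the test set exactly once.
k_fold.split() does not return the data itself. For each fold it yields a pair of index arrays (training indices, validation indices) into the array you passed in.

I still don't quite get it.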
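To make the return value of k_fold.split() more concrete, here is a minimal sketch on toy data. The array X and the choice of 5 folds are made up for illustration and are not from the kernel.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (hypothetical): 10 rows, 2 columns
X = np.arange(20).reshape(10, 2)

# 5 folds without shuffling, so the folds are consecutive blocks of rows
k_fold = KFold(n_splits=5, shuffle=False)

folds = []
for train_idx, valid_idx in k_fold.split(X):
    # Both values are arrays of row positions, not the rows themselves
    folds.append((train_idx, valid_idx))
    print("train:", train_idx, "valid:", valid_idx)

# The indices are then used to slice the data for that fold
first_train_idx, first_valid_idx = folds[0]
first_valid_rows = X[first_valid_idx]
```

Each row lands in a validation fold exactly once, and the remaining rows form the training set for that fold.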

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])

Why do feature_importance_values and the other arrays here need to be initialized to zero? Could NA values end up in them otherwise?
Later in the function there is

        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits

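and so on, so is the initialization really needed?
After following the function to the end, I realized that this is element-wise addition of NumPy arrays. The array has to exist, filled with zeros, before the first += can run. Since each fold adds its importances divided by k_fold.n_splits, the final array is the average over all folds.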
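The accumulation pattern can be sketched on its own. The importance values below are made up (per_fold_importances and n_splits are hypothetical names, not taken from the kernel); the point is only that starting from zeros and adding importances / n_splits per fold gives the mean.

```python
import numpy as np

n_splits = 3

# Hypothetical feature importances from 3 folds, 4 features each
per_fold_importances = [
    np.array([4.0, 0.0, 2.0, 6.0]),
    np.array([2.0, 2.0, 2.0, 6.0]),
    np.array([0.0, 4.0, 2.0, 6.0]),
]

# Start from zeros so the first += has something to add to
feature_importance_values = np.zeros(4)

for importances in per_fold_importances:
    # Element-wise addition; each fold contributes 1 / n_splits of its values
    feature_importance_values += importances / n_splits

print(feature_importance_values)
```

The result matches np.mean(per_fold_importances, axis=0) up to floating point rounding, so the zero array is just the starting point of a running average.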

overfit

The control slightly overfits because the training score is higher than the validation score.

That makes sense. Still, in the end, the score actually came out lower after removing the highly correlated features.
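The "training score higher than validation score" check can be reproduced in a small sketch. This is not the kernel's model or data: a scikit-learn decision tree on synthetic data stands in for it, and all names and parameters here are assumptions chosen to make the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (hypothetical stand-in)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=50
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=50
)

# A fully grown tree memorizes the training set, which makes overfitting obvious
model = DecisionTreeClassifier(random_state=50)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
valid_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

print("train AUC:", train_auc)
print("valid AUC:", valid_auc)
print("gap:", train_auc - valid_auc)
```

A large gap between the two scores suggests the model is fitting noise in the training data. A small gap, like the "slightly overfits" case quoted above, is common and usually acceptable.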

And with that, Introduction to Manual Feature Engineering is done.

Next up is

Introduction: Manual Feature Engineering (part two)
https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering-p2

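I think. It doesn't look like there will be anything especially new in it, but I couldn't redo Part 1 on my own right now if asked, so I'm hoping it will be a good review.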

English I learned

worthwhile: rewarding, worth the time or effort
to be had: should be there, can be obtained (I'm not entirely sure about this one)
