
[Week 1 Summary] How to Win a Data Science Competition: Learn from Top Kagglers

Posted at 2018-10-28, last updated at 2018-10-29

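What is this article?

Study notes summarizing How to Win a Data Science Competition: Learn from Top Kagglers, offered on Coursera. Only the key points are covered, so some material is missing.

To readers

My background is limited to having studied the well-known Machine Learning course on Coursera, so apologies in advance if anything here is wrong (lol).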

Welcome to How to win a data science competition

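The introductory part
This course is taught by top Kagglers from a Russian university.
Rather than teaching general data science, it focuses on how to achieve a good rank in competitions such as those on Kaggle.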
Course Overview
- Week1
  - コンペの概要紹介
  - Feature Preprocessing & Extraction
- Week2
  - EDA
  - Validation
  - Data leakages
- Week3
  - Metrics
  - Mean encodings (handling categorical variables)
    - Reference article on encoding
- Week4
  - Advanced features (matrix factorization, feature interactions, t-SNE)
  - Hyperparameter optimization
  - Ensembles(KazAnova, bagging, boosting, stacking)
    - Reference article on ensembles
- Week5
  - Final Project

Competition Mechanics

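Basic concepts of a competition

Data
Use the data provided by the competition organizer.
For image recognition tasks, using publicly available external data is sometimes allowed, but check the competition rules.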
Model
Submission
Evaluation
- Accuracy, Logistic Loss, AUC, RMSE, MAE
Leaderboard
- Public Test: used during competition
- Private Test: used for the final ranking, which is revealed after the deadline

Recap of main ML algorithms

Scikit learn
Vowpal Wabbit: designed for large datasets
Tree-based method
Non-tree-based method
No free lunch theorem
- There is no method which outperforms all others for all tasks
- For every method, we can construct a task for which this particular method will not be the best
There is no silver bullet algorithm
Linear models split space into 2 subspaces
Tree-based methods split space into boxes
k-NN methods rely heavily on how the "closeness" of points is measured

The most powerful methods are GBDT and NN. But you should not underestimate other methods.

Quiz: Recap

Notes on only the points that were new to me
Q2
Suppose we've trained a RandomForest model with 100 trees. Consider two cases:

Case 1: We drop the first tree in the model
Case 2: We drop the last tree in the model
We then compare the models' performance on the train set. Select the right answer.

Correct answer:
In case 1, performance will be roughly the same as in case 2. A RandomForest model averages 100 similarly performing trees that were trained independently, so the order of the trees does not matter and the performance drop will be very similar on average.

Q3
Suppose we've trained a GBDT model with 100 trees with a fairly high learning rate. Consider two cases:

Case 1: We drop the first tree in the model
Case 2: We drop the last tree in the model
We then compare the models' performance on the train set. Select the right answer.

Correct answer:
In case 1, performance will drop more than in case 2. A GBDT model is a sequence of trees, each of which improves the predictions of all the previous ones. So if we drop the first tree, the sum of the remaining trees will be biased and overall performance should drop. If we drop the last tree, the sum of the previous trees is not affected, so performance will change insignificantly (provided we have enough trees).

Software/Hardware requirements

Recommended hardware specs
Laptop

16+ GB RAM
4+ cores

Tower PC (this is the spec the instructors use)

32+ GB RAM
6+ cores
Note: an SSD is important if you work with image data.

Programming Assignment: Pandas Basics

There are four exercises on data processing with Pandas. For me, this level was already difficult enough.

Feature preprocessing and generation with respect to models

Overview

In competitions, it is important to preprocess the given data and to create new features from it.

Preprocessing

Pclass	1	2	3
Target	1	0	1

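Assume Target depends non-linearly on Pclass. To improve a linear model, Pclass can be one-hot encoded, so that the Pclass of each row is expressed as a set of binary columns. (The table is slightly modified from the one in the video.)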

ID	Target	Pclass==1	Pclass==2	Pclass==3
1	1	1	0	0
2	0	0	1	0
3	1	0	0	1

Random forest does not need this preprocessing.
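
The one-hot table above can be reproduced with pandas. This is only a minimal sketch on the three toy rows; the column names `Pclass_1` to `Pclass_3` come from the `prefix` argument I chose, not from the course.

```python
import pandas as pd

# Toy data mirroring the table above: Target is not monotonic in Pclass.
df = pd.DataFrame({"ID": [1, 2, 3], "Target": [1, 0, 1], "Pclass": [1, 2, 3]})

# One indicator column per Pclass value; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=int)

# Put the indicator columns next to ID and Target, as in the table.
encoded = pd.concat([df[["ID", "Target"]], one_hot], axis=1)
print(encoded)
```

With these columns a linear model can give each class its own weight instead of assuming that Target rises or falls steadily with Pclass.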

Feature Generation

For data such as weekly sales forecasts, adding the week number as a feature lets a linear model capture a linear trend.
(One way to help the model utilize a linear trend is to add a feature indicating the number of weeks passed. With this feature, a linear model can successfully find the existing linear dependency.)
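
A rough sketch of the week-number idea, using made-up sales figures (the trend of 5 per week and the noise level are my assumptions, purely for illustration) and a plain least-squares fit in NumPy:

```python
import numpy as np

# Hypothetical weekly sales: a steady upward trend plus a little noise.
rng = np.random.default_rng(0)
week_number = np.arange(1, 21)  # the generated feature: number of weeks passed
sales = 100.0 + 5.0 * week_number + rng.normal(0.0, 1.0, size=week_number.size)

# Ordinary least squares on the columns [1, week_number].
X = np.column_stack([np.ones(week_number.size), week_number])
intercept, slope = np.linalg.lstsq(X, sales, rcond=None)[0]

# The model is linear in week_number, so it can extend the trend to week 21.
next_week_forecast = intercept + slope * 21
print(round(slope, 2), round(next_week_forecast, 1))
```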

There was also an explanation for the GBDT case, but I did not really understand it...

Numeric Features

Preprocessing: Scaling

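Non-tree-based models (tree-based models do not depend on feature scale)

MinMaxScaler: rescales values to the range 0 to 1. The shape of the distribution does not change.
StandardScaler: subtracts the mean, then divides by the standard deviation.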
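
The two scalings above, written out by hand in NumPy as a small sketch. The sample values are made up; in practice one would probably use the scikit-learn classes named above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Min-max scaling: (x - min) / (max - min), which maps the values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```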

Preprocessing: Outliers

How to handle the case where a few values in the data are abnormally large (or small)

Clip the data to the range between the 1st and 99th percentiles
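
A sketch of the percentile clipping described above. The data is synthetic (normal values plus two extreme outliers that I added for illustration).

```python
import numpy as np

# Synthetic feature: 1000 ordinary values plus two extreme outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=1000), [1e6, -1e6]])

# Bounds at the 1st and 99th percentiles of the observed values.
lower, upper = np.percentile(x, [1, 99])

# Values outside the bounds are replaced by the nearest bound.
x_clipped = np.clip(x, lower, upper)

print(x.min(), x.max())
print(x_clipped.min(), x_clipped.max())
```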

Preprocessing: Rank

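For handling outliers, this works better than MinMaxScaler.

rank([1000, 1, 10]) = [2, 0, 1]
Effective for linear models, KNN, and NN. Quicker than removing outliers by hand.
The train data and test data need to be concatenated before the rank transform is applied.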
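
A minimal sketch of the rank transform, including the train/test concatenation mentioned above. `rank_transform` is a helper I wrote for illustration (0-based ranks, ties broken by position), and the test values are made up.

```python
import numpy as np

def rank_transform(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")  # indices that sort the values
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))      # position in sorted order = rank
    return ranks

print(rank_transform([1000, 1, 10]))  # [2 0 1]

# Rank train and test together so both share one ranking, then split again.
train = np.array([1000, 1, 10])
test = np.array([5, 2000])
combined = rank_transform(np.concatenate([train, test]))
train_ranks, test_ranks = combined[: len(train)], combined[len(train):]
```

The outlier 1000 ends up only one rank step away from its neighbor, which is why this transform tames outliers.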

Other transformations

Log transform
Raising to a power < 1 (for example, the square root)
Especially effective for NN.
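
Both transformations in NumPy, as a small sketch on made-up values. I use `np.log1p` (that is, log(1 + x)) so that zeros are handled; that choice is mine, not something stated in the course notes above.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

x_log = np.log1p(x)   # log(1 + x), defined at x = 0
x_sqrt = np.sqrt(x)   # raising to the power 0.5

# Both pull large values toward the bulk of the data.
print(x_log)
print(x_sqrt)
```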

Feature Generation

Feature generation requires:

- Prior knowledge
- EDA

A simple example:

- Area of plot A in square meters
- Price of plot A
→ Price per square meter

Ordinal features

Features with a meaningful order
- Ticket class
- Type of driver's license
- Highest level of education

Label encoding

Alphabetical order
Order of appearance
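
The two label-encoding orders above can be sketched with `pd.factorize`. The category values are made up for illustration.

```python
import pandas as pd

values = pd.Series(["S", "C", "Q", "S", "C"])

# Alphabetical order: codes follow the sorted unique values (C=0, Q=1, S=2).
codes_alpha, uniques_alpha = pd.factorize(values, sort=True)

# Order of appearance: codes follow first occurrence (S=0, C=1, Q=2).
codes_seen, uniques_seen = pd.factorize(values)

print(codes_alpha)  # [2 0 1 2 0]
print(codes_seen)   # [0 1 2 0 1]
```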

Frequency encoding

Encode each category by how often it occurs
Effective for both linear models and tree models.
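
A sketch of frequency encoding with pandas, again on made-up category values: each category is replaced by its share of the rows.

```python
import pandas as pd

values = pd.Series(["S", "C", "Q", "S", "S", "C"])

# Relative frequency of each category: S = 3/6, C = 2/6, Q = 1/6.
freq = values.value_counts(normalize=True)

# Replace every category by its frequency.
encoded = values.map(freq)
print(encoded.tolist())
```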

Datetime and coordinates

Periodicity

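Day number

Time since

A specific period or date
Number of days elapsed since a specific date

Difference between dates

The difference between two specific dates

For example, from the date column of sales data (dates of the form 20XX-MM-DD), features such as the following can be created:

- Day of the week
- Number of days elapsed since a specific date
- Whether the day is a holiday (binary)
- Number of days until the next holiday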
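
The date features listed above, sketched with pandas. The sales dates, the reference date, and the holiday list are all hypothetical values that I picked for illustration.

```python
import pandas as pd

sales = pd.DataFrame(
    {"date": pd.to_datetime(["2018-12-22", "2018-12-24", "2018-12-25"])}
)
reference_date = pd.Timestamp("2018-01-01")              # hypothetical
holidays = pd.to_datetime(["2018-12-25", "2019-01-01"])  # hypothetical

# Day of the week (Monday = 0, Sunday = 6).
sales["day_of_week"] = sales["date"].dt.dayofweek

# Days elapsed since the reference date.
sales["days_since_ref"] = (sales["date"] - reference_date).dt.days

# Binary holiday flag.
sales["is_holiday"] = sales["date"].isin(holidays).astype(int)

# Days until the next holiday on or after each date.
sales["days_to_holiday"] = sales["date"].apply(
    lambda d: (holidays[holidays >= d].min() - d).days
)
print(sales)
```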

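Coordinates
Examples of features that can be derived from coordinates:

- Distance from a building to another point of interest (a nearby shop, hospital, well-known school, etc.)
- Distance to the grid cell with the highest land price
- Distance to the center of a cluster
- Distance to an old building in the area
- Aggregate statistics over the buildings surrounding an area
- Average land price of an area
- When using decision trees, rotating the coordinates by 22.5 or 45 degrees can sometimes improve the model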
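
A sketch of the rotation trick for tree models. `rotate` is a helper I wrote for illustration and the points are made up. The rotated coordinates are added as extra features next to the original ones.

```python
import numpy as np

def rotate(coords, degrees):
    """Rotate an (n, 2) array of (x, y) points counter-clockwise about the origin."""
    theta = np.deg2rad(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return coords @ rotation.T

coords = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rot45 = rotate(coords, 45)

# Original x, y plus the 45-degree and 22.5-degree rotated versions.
features = np.hstack([coords, rot45, rotate(coords, 22.5)])
print(features.shape)  # (3, 6)
```

A tree splits on one feature at a time, so a boundary that runs diagonally in the original coordinates can become a single split on a rotated coordinate.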

Missing Values

Fillna approaches

Fill in a value such as -999 or -1
mean, median
Reconstruct value
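
The fillna approaches above, sketched with pandas on a made-up column. Linear interpolation is only my guess at one way to "reconstruct" a value when the rows are ordered; the notes above do not say which method the course meant.

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0])

age_sentinel = age.fillna(-999)        # value outside the normal range
age_mean = age.fillna(age.mean())      # mean of the observed values
age_median = age.fillna(age.median())  # median of the observed values

# Assumption: for ordered rows, interpolate between neighboring values.
age_linear = age.interpolate()

print(age_sentinel.tolist())
print(age_linear.tolist())
```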

Feature extraction from texts and images

The summary of this part is a work in progress.

Final Project

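The course organizers appear to run an actual competition on Kaggle.
The result in that competition becomes the grade for the Final Project.
The data to analyze is sales data provided by a Russian company.
The task is to use it to forecast one month of sales.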
