[Survey] Kaggle - Quora: Summary of the 12th-Place Solution

Posted at 2017-09-17

This article surveys the 12th-place solution² to the Kaggle Quora Question Pairs¹ competition.

Author: CPMP
Title: Solution #12 overview
Kaggle Discussion: https://www.kaggle.com/c/quora-question-pairs/discussion/34342

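Features

Graph structure

Build a graph with questions as nodes and each (question 1, question 2) pair as an edge³

Node features

Node degree (Jared's frequency)
Number of connected components
Number of biconnected components. ???
Average over adjacent nodes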
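
The node features above can be sketched in plain Python. This is an illustration under assumptions, not the solution's actual code: the function names are hypothetical, the connected-component feature is interpreted as the size of the component containing the node, the averaged quantity is assumed to be the neighbors' degree, and the biconnected-component feature is left out because its definition is unclear here.

```python
from collections import defaultdict


def build_graph(pairs):
    """Undirected adjacency sets: one node per question, one edge per pair."""
    adj = defaultdict(set)
    for q1, q2 in pairs:
        adj[q1].add(q2)
        adj[q2].add(q1)
    return adj


def component_sizes(adj):
    """Map each node to the size of its connected component (iterative DFS)."""
    size = {}
    for start in adj:
        if start in size:
            continue
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        for node in seen:
            size[node] = len(seen)
    return size


def node_features(adj):
    """Per-node degree, component size (assumed reading), and mean neighbor degree."""
    comp = component_sizes(adj)
    feats = {}
    for node, nbs in adj.items():
        degree = len(nbs)  # always >= 1, since nodes only exist through edges
        feats[node] = {
            "degree": degree,
            "component_size": comp[node],
            "avg_neighbor_degree": sum(len(adj[nb]) for nb in nbs) / degree,
        }
    return feats


# Toy data: question ids stand in for question texts.
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
features = node_features(build_graph(pairs))
```

Each pair row would then receive the features of both of its questions, for example `features[q1]["degree"]` and `features[q2]["degree"]`.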

Edge features

Features of the graph edges
Number of common neighbors
Size of the common neighborhood
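
As a rough sketch of the common-neighbor count, the snippet below intersects the adjacency sets of the two questions in each pair. The function name is hypothetical, and excluding the two endpoints from the count is an assumption; the source does not say how they are handled.

```python
from collections import defaultdict


def common_neighbor_counts(pairs):
    """For each pair, count questions that are linked to both of its questions."""
    adj = defaultdict(set)
    for q1, q2 in pairs:
        adj[q1].add(q2)
        adj[q2].add(q1)
    # Assumption: the endpoints themselves are not counted as common neighbors.
    return [len((adj[q1] & adj[q2]) - {q1, q2}) for q1, q2 in pairs]


# Toy data: a triangle a-b-c plus a pendant edge c-d.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
counts = common_neighbor_counts(pairs)
```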

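Generation pattern

Absolute difference between the positions of question 1 and question 2 when questions are ordered by first appearance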
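
One possible reading of this feature, sketched below: give every distinct question an index in order of first appearance, then take the absolute index difference within each pair. The function name is hypothetical, and the scan order (rows in file order, question 1 before question 2) is an assumption.

```python
def appearance_gap(pairs):
    """Absolute difference of first-appearance indices for each question pair."""
    first_seen = {}
    gaps = []
    for q1, q2 in pairs:
        for q in (q1, q2):
            # Assign the next free index the first time a question shows up.
            first_seen.setdefault(q, len(first_seen))
        gaps.append(abs(first_seen[q1] - first_seen[q2]))
    return gaps


pairs = [("a", "b"), ("c", "a"), ("b", "d")]
gaps = appearance_gap(pairs)
```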

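Transitivity

If (q1 and q2 are duplicates) and (q2 and q3 are duplicates), then (q1 and q3 are duplicates)
If (q1 and q2 are duplicates) and (q2 and q3 are not duplicates), then (q1 and q3 are not duplicates)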
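
The two rules can be sketched with a union-find structure: duplicate labels merge questions into equivalence classes, and non-duplicate labels mark pairs of classes as distinct. This is an assumed implementation for illustration; `DisjointSet` and `build_inference` are hypothetical names, and the source does not describe how the rules were actually applied.

```python
class DisjointSet:
    """Minimal union-find over hashable items."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def build_inference(labeled_pairs):
    """Return infer(q1, q2) -> 1, 0, or None when neither rule applies."""
    ds = DisjointSet()
    # Rule 1: duplicates form equivalence classes.
    for q1, q2, is_dup in labeled_pairs:
        if is_dup:
            ds.union(q1, q2)
    # Rule 2: a non-duplicate label separates two whole classes.
    separated = set()
    for q1, q2, is_dup in labeled_pairs:
        if not is_dup:
            separated.add(frozenset((ds.find(q1), ds.find(q2))))

    def infer(q1, q2):
        r1, r2 = ds.find(q1), ds.find(q2)
        if r1 == r2:
            return 1
        if frozenset((r1, r2)) in separated:
            return 0
        return None

    return infer


labeled = [("q1", "q2", 1), ("q2", "q3", 1), ("q3", "q4", 0)]
infer = build_inference(labeled)
```

With real labels, which can be noisy, the two rules may contradict each other; in this sketch the duplicate rule simply takes precedence.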

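NLP

Distributed representations

Features from public kernels (by Anokas⁴, Abhishek⁵, ChenglongChen⁶, and others)
Raw questions, text-cleaned questions, stemmed questions, lemmatized questions
Distributed representations (Word2Vec, GloVe, fastText, dbow2vec, LSI)
10-dimensional PCA of the difference between question vectors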
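
A minimal sketch of the last item, assuming each question already has a fixed-size vector (for example, an averaged word embedding). PCA is done here with a plain NumPy SVD; the function name, the 300-dimensional random vectors, and the sample count are placeholders, not values from the source.

```python
import numpy as np


def pca_of_differences(vec_q1, vec_q2, n_components=10):
    """Project the per-pair vector differences onto their top principal axes."""
    diff = np.asarray(vec_q1, dtype=float) - np.asarray(vec_q2, dtype=float)
    centered = diff - diff.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Placeholder data: 50 pairs of random 300-dimensional question vectors.
rng = np.random.default_rng(0)
vec_q1 = rng.normal(size=(50, 300))
vec_q2 = rng.normal(size=(50, 300))
reduced = pca_of_differences(vec_q1, vec_q2)  # one 10-dimensional row per pair
```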

NLP-related features

With l1 and l2 the lengths of question 1 and question 2: min(l1, l2), max(l1, l2)
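
For illustration, the length features might be computed as follows. Whether length means characters or tokens is not stated, so character length is an assumption here, and the function and key names are hypothetical.

```python
def length_features(question1, question2):
    """Symmetric length features: unaffected by swapping the two questions."""
    l1, l2 = len(question1), len(question2)  # assumption: length in characters
    return {"len_min": min(l1, l2), "len_max": max(l1, l2)}


feats = length_features("How do I learn Python?", "What is Python?")
```

Using min and max instead of l1 and l2 directly makes the features independent of which question is listed first.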

Stacking

Layer 1: 100 models (XGBoost, LightGBM, NN, gblinear, H2O, sklearn, extra trees, logistic regression)
Single model: XGBoost with more than 600 features, LB = 0.131
Layer 3: logistic regression
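
The general stacking idea can be sketched with scikit-learn. This is a simplified two-level illustration on synthetic data, not the three-layer setup described above: the three base models, their settings, and the dataset are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the pair-level feature matrix.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A few base models standing in for the layer-1 models.
base_models = [
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold probabilities: each row is predicted by a model that never saw it,
# so the next level is not trained on leaked labels.
oof = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# Final level: logistic regression on the base-model predictions.
meta = LogisticRegression()
meta.fit(oof, y)
stacked = meta.predict_proba(oof)[:, 1]
```

For test data, each base model would be refit on the full training set (or its fold models averaged) to produce the inputs to the final level.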

References

1. Kaggle, Quora Question Pairs, 2017.
2. CPMP, Solution #12 overview, 2017.
3. airxiechao, What are the magic features of winners?, 2017.
4. anokas
5. radder, Abhishek's features
6. ChenglongChen, Kaggle_CrowdFlower
