[Survey] Kaggle - Quora: Summary of the 8th-Place Solution

Posted at 2017-09-15

This article surveys the 8th-place solution² to Kaggle - Quora Question Pairs¹.

Title: 8th solution with part of source code(under construction)
Names: qianqian, Fengari, hhy et al.
Discussion URL: https://www.kaggle.com/c/quora-question-pairs/discussion/34371
Code: https://github.com/qqgeogor/kaggle-quora-solution-8th

Preprocessing

Remove punctuation
Apply the Porter stemmer
Generate unigrams and bigrams on the stemmed corpus
Generate distinct unigrams and bigrams on the stemmed corpus
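
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the team's code: the function names are hypothetical, lowercasing is an added assumption, and reading "distinct" n-grams as those that appear in one question but not the other is only one possible interpretation. To keep the sketch dependency-free, the stemmer is passed in as a callable. Passing `nltk.stem.PorterStemmer().stem` would give Porter stemming, while the default leaves tokens unchanged.

```python
import string
from typing import Callable, List

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str, stem: Callable[[str], str] = lambda w: w) -> List[str]:
    """Lowercase (assumption), strip punctuation, split on whitespace, stem.

    `stem` is a placeholder hook: pass e.g. nltk.stem.PorterStemmer().stem
    for Porter stemming. The default is the identity function.
    """
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return [stem(tok) for tok in cleaned.split()]


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return contiguous n-grams, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngrams(a: List[str], b: List[str], n: int) -> List[str]:
    """n-grams of `a` that do not occur in `b` (one reading of 'distinct')."""
    other = set(ngrams(b, n))
    return [g for g in ngrams(a, n) if g not in other]


q1 = preprocess("How do I learn Python?")
q2 = preprocess("How do I learn Java?")
print(ngrams(q1, 2))
print(distinct_ngrams(q1, q2, 1), distinct_ngrams(q1, q2, 2))
```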

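Features

Features (qian)

Word and character counts of each question, and their ratios
Number and ratio of shared words
Jaccard and Dice distances
Counts and ratios of digits and punctuation marks in each question
TF-IDF of the corpus with n-gram range (1, 2)
TF-IDF of unigrams and bigrams
TF-IDF of distinct unigrams and bigrams
TF-IDF of co-occurrences of distinct-word unigrams and bigrams
Similarity of TF-IDF values computed with gensim
Averaged Word2Vec embeddings, both pretrained and self-trained (weighted by IDF)
Averaged GloVe embeddings, both pretrained and self-trained (weighted by IDF)
Dimensionality reduction of TF-IDF with NMF, SVD, and LDA (using sklearn)
Similarity of distinct word pairs under pretrained and self-trained Word2Vec and GloVe
Number of nodes in the clique
TF-IDF similarity (using sklearn)
DeepWalk embeddings with questions as nodes
Labels encoding word co-occurrence, aggregated by mean, max, min, and standard deviation
fuzz features
NER with spaCy
simhash of unigrams and bigrams
Decomposition of the adjacency matrix
Average of pretrained GloVe embeddings weighted by IDF
Average degree of neighbors
Aggregated distinct words using WordNet
Entropy of the question representation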
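
A few of the simpler set-based features in this list (length counts, shared-word counts, Jaccard and Dice distances) might look like the sketch below. The function and key names are hypothetical, and the distances are taken here as one minus the corresponding coefficient over the sets of tokens. The original solution may define them differently.

```python
from typing import Dict, List


def overlap_features(q1: List[str], q2: List[str]) -> Dict[str, float]:
    """Basic length and word-overlap features for a pair of token lists."""
    s1, s2 = set(q1), set(q2)
    common = s1 & s2
    union = s1 | s2
    total = len(s1) + len(s2)

    # Coefficients are 0.0 when both questions are empty (a convention
    # chosen for this sketch, not taken from the original solution).
    jaccard = len(common) / len(union) if union else 0.0
    dice = 2 * len(common) / total if total else 0.0

    return {
        "len_q1": float(len(q1)),
        "len_q2": float(len(q2)),
        "len_ratio": len(q1) / len(q2) if q2 else 0.0,
        "common_words": float(len(common)),
        "common_ratio": len(common) / len(union) if union else 0.0,
        "jaccard_dist": 1.0 - jaccard,
        "dice_dist": 1.0 - dice,
    }


features = overlap_features(
    ["how", "do", "i", "learn", "python"],
    ["how", "do", "i", "learn", "java"],
)
print(features)
```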

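Features (fengari)

Decomposition of n-grams (NMF, SVD, LSI, LDA)
Decomposition of diff n-grams (NMF, SVD, LSI, LDA)
Similarities and distances between the decompositions above
Max clique of edges
Max clique of nodes
BFS count on the graph (depth=2)
Duplicate features based on ranking
Difference in numbers between the questions
PageRank (directed and undirected)
t-SNE over all leaky features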
doc2vec and doc2vec sim features ???
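
The graph features treat each question as a node and each question pair as an edge. As a rough illustration of the BFS (depth=2) count, the sketch below builds an undirected adjacency map and counts how many other questions are reachable within two hops. The function names and the exact definition of the count are assumptions. The clique and PageRank features would normally come from a graph library and are not shown.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(pairs: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Undirected graph: one node per question, one edge per question pair."""
    adj: Graph = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def bfs_count(adj: Graph, start: Hashable, max_depth: int = 2) -> int:
    """Number of nodes other than `start` reachable within `max_depth` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return len(seen) - 1


# Toy question IDs, purely for illustration.
graph = build_graph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])
print(bfs_count(graph, "a"), bfs_count(graph, "d"))
```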

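Features (hhy)

Similarity and distance of averaged pretrained GloVe embeddings
Decomposition of Word2Vec
Common features
NLP features from spaCy (token log-probability, Brown cluster, POS tag, dependency, part of speech)
Parse-tree features using Stanford NLP
WordNet-based similarity
Stop-word-related features and character distribution
Word Mover's Distance
N-gram-related features (BLEU metric, indicator, pos link, position change, pos tag compare, etc.)
N-gram-related decomposition (NMF, SVD)
Neighbor features (features on the graph structure?)
Semantic neighbor features
Neighbor distance comparisons (long match, edit distance, Jaccard, Dice, Word Mover's Distance)
NLP similarity
Deep learning (Siamese, Siamese Match, BiMPM)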
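
To illustrate the first item, similarity and distance between averaged word embeddings, here is a small sketch using only the standard library. The embedding table `emb` is a hypothetical two-dimensional toy. In practice the vectors would be loaded from pretrained GloVe files. Skipping out-of-vocabulary tokens and returning a zero vector for empty input are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(tokens: Sequence[str], emb: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of known tokens; zero vector if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy embeddings; real ones would come from pretrained GloVe.
emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "pet": [1.0, 1.0]}
v1 = mean_vector(["cat", "dog"], emb, dim=2)
v2 = mean_vector(["pet"], emb, dim=2)
print(cosine_similarity(v1, v2), euclidean_distance(v1, v2))
```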

qian's models and features

Approach 1: basic features and decomposition features with lgb, xgb, et, rf, mlp
Approach 2: basic features weighted by clique + TF-IDF features
Approach 3: LSTM, LSTM with Attention, Siamese

Model stacking

5-fold
layer 1: lgb, xgb, MLP on dense features, MLP on sparse TF-IDF, et, rf, etc.
layer 2: average of lgb, xgb, mlp, rt, et
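
The two-layer scheme can be sketched as follows, assuming the usual out-of-fold approach: each layer-1 model predicts every training row from a fold it was not trained on, those predictions become the input features for layer 2, and the layer-2 predictions are averaged. Everything here is illustrative. The `Model` callable interface, the contiguous unshuffled folds, and the two toy models are hypothetical stand-ins for the real lgb, xgb, and MLP models.

```python
from typing import Callable, List

Rows = List[List[float]]
# A model is any callable (train_X, train_y, test_X) -> predictions for test_X.
Model = Callable[[Rows, List[float], Rows], List[float]]


def kfold_indices(n: int, k: int = 5) -> List[List[int]]:
    """Split range(n) into k contiguous folds (no shuffling, for brevity)."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold(model: Model, X: Rows, y: List[float], k: int = 5) -> List[float]:
    """Predict each row with a model trained only on the other folds."""
    oof = [0.0] * len(X)
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        preds = model([X[i] for i in train], [y[i] for i in train], [X[i] for i in fold])
        for i, p in zip(fold, preds):
            oof[i] = p
    return oof


def stack(layer1: List[Model], layer2: List[Model], X: Rows, y: List[float], k: int = 5) -> List[float]:
    """Layer-1 OOF predictions feed layer 2; layer-2 outputs are averaged."""
    columns = [out_of_fold(m, X, y, k) for m in layer1]
    meta_X = [list(row) for row in zip(*columns)]
    layer2_preds = [out_of_fold(m, meta_X, y, k) for m in layer2]
    return [sum(col) / len(col) for col in zip(*layer2_preds)]


# Toy stand-ins for real learners (hypothetical, for illustration only).
def mean_model(train_X: Rows, train_y: List[float], test_X: Rows) -> List[float]:
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_X)


def first_feature_model(train_X: Rows, train_y: List[float], test_X: Rows) -> List[float]:
    return [row[0] for row in test_X]


y = [0.0, 1.0] * 5
X = [[label] for label in y]
final = stack([mean_model, first_feature_model], [mean_model, first_feature_model], X, y)
print(final)
```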

References

1. Kaggle, Quora Question Pairs, 2017.
2. qianqian, 8th solution with part of source code(under construction), 2017.