
A roundup of Japanese pretrained models for converting text into feature vectors

Last updated at 2020-03-19. Posted at 2020-03-19.

Overview

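Input: Japanese text. Output: a real value, which I want to predict by regression.
To do that, I first want to convert each text into a feature vector.
To improve performance through ensemble learning, I want many pretrained models, built with different architectures and trained on different datasets.
So I went looking for pretrained models.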
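As a rough illustration of the "text to feature vector" step, here is a minimal sketch that mean-pools word vectors into one fixed-length vector per text. Mean pooling is my assumption, not something the models below require, and `TOY_VECTORS` is a hypothetical stand-in for a real pretrained embedding table. Tokens are assumed to be already segmented (for example by MeCab).

```python
from typing import Dict, List, Sequence

# Hypothetical toy embedding table. A real one would be loaded from a
# pretrained model such as those listed in this article.
TOY_VECTORS: Dict[str, List[float]] = {
    "猫": [1.0, 0.0, 0.0],
    "が": [0.0, 1.0, 0.0],
    "好き": [0.0, 0.0, 1.0],
}


def sentence_vector(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Mean-pool the vectors of known tokens into one feature vector.

    Unknown tokens are skipped. If no token is known, a zero vector
    of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the embedding dimensions.
    return [sum(column) / len(known) for column in zip(*known)]


def concat_features(*parts: Sequence[float]) -> List[float]:
    """Concatenate feature vectors from several sources into one."""
    combined: List[float] = []
    for part in parts:
        combined.extend(part)
    return combined
```

The resulting vector can be fed to any regressor. `concat_features` shows one simple way to combine vectors from several pretrained models before fitting a single model.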

Sentences

Universal Sentence Encoder (multilingual)

BERT (multilingual)

nnlm

doc2vec

Words

Word2Vec

Trained on Wikipedia

FastText

Wikipedia + Common Crawl (mecab)

Wikipedia (mecab NEologd)

Byte-Pair Encoding

Wikipedia

Wikipedia2Vec

Wikipedia

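Other text features

Counts or ratios of parts of speech, hiragana, katakana, and alphanumeric characters

Entropy

Word length

Text difficulty

For 帯 (obi), the Ruby file was no longer available for download.

Negative/positive polarity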
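A few of the hand-crafted features above can be computed with the standard library alone. The following is a minimal sketch under my own assumptions: "entropy" is read as character-level Shannon entropy, alphanumerics are limited to ASCII (full-width forms are not counted), and word length is taken over tokens that were already segmented elsewhere. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def char_class_ratios(text: str) -> Dict[str, float]:
    """Return the share of hiragana, katakana, and ASCII alphanumerics."""
    counts = {"hiragana": 0, "katakana": 0, "alnum": 0}
    for ch in text:
        if "\u3041" <= ch <= "\u309f":      # Hiragana block
            counts["hiragana"] += 1
        elif "\u30a0" <= ch <= "\u30ff":    # Katakana block (includes ー)
            counts["katakana"] += 1
        elif ch.isascii() and ch.isalnum():
            counts["alnum"] += 1
    total = len(text)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    freq = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in freq.values())


def mean_token_length(tokens: Sequence[str]) -> float:
    """Average token length in characters; 0.0 for an empty sequence."""
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)
```

Part-of-speech counts are left out because they need a morphological analyzer.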

Techniques found on Kaggle

Kaggle: Toxic Comment Classification Challenge summary

As expected, ensembling over different embedding vectors seems to be important.

Data augmentation via translation also seems to help.
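One common way to ensemble over embeddings is to train one model per embedding and then average the predictions. The sketch below shows only the averaging step, and it is my reading of the technique rather than the exact method used in the competition. `average_predictions` and its `weights` parameter are hypothetical names.

```python
from typing import List, Optional, Sequence


def average_predictions(
    predictions: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Blend per-model predictions with a (weighted) average.

    `predictions[m][i]` is the prediction of model m for sample i.
    Each model is assumed to have been trained on a different embedding.
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_models = len(predictions)
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]
```

With equal weights this is a plain mean. The weights could instead be tuned on a validation set.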

Reference links
