
Word embeddings with fastText

Posted at 2019-06-14 (last updated at 2019-06-14)

Personal notes on how to use fastText.

Documentation

https://fasttext.cc/docs/en/support.html

Parameter reference

https://fasttext.cc/docs/en/options.html

Installation

Clone the repository and run make.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make

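The fastText command ends up at the path shown below. Like word2vec, fastText can apparently also be trained through gensim, but the accuracy reportedly differs, and training with the command itself is said to be the recommended route...
Reference: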
https://github.com/mitsuharu/ml_tutorials/blob/master/nlp/nlp_fasttext.md

$ ls fastText/fasttext
fastText/fasttext

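Creating word embeddings

That said, I want to use fastText from Python, so I train the model by running the external command from Python.
The corpus is the livedoor news corpus.

Of the model files that are created, the "~.bin" one is quite large (1.5 GB in this example).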
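Alongside the .bin file, fastText also writes a plain-text .vec file: a header line with the vocabulary size and the dimension, followed by one line per word with its vector. The following is a minimal sketch of reading that format with the standard library only. The function name read_vec and the sample data are made up for illustration, and the .vec file holds word vectors only, so no subword information is available this way.

```python
import io
from typing import Dict, List, TextIO


def read_vec(stream: TextIO) -> Dict[str, List[float]]:
    """Parse fastText's text .vec format into a {word: vector} dict."""
    # Header line: "<vocabulary size> <dimension>"
    _n_words, dim = map(int, stream.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        # Each line is "<word> <v1> ... <vdim>", possibly with a trailing space
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for a real .vec file (hypothetical values)
sample = io.StringIO("2 3\n男性 0.1 0.2 0.3 \n女性 0.4 0.5 0.6 \n")
vectors = read_vec(sample)
print(len(vectors), vectors["男性"])
```

For a real file, the same function could be called as read_vec(open("fasttext.model.vec", encoding="utf-8")).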

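# Import what we need
# Note: gensim.models.wrappers was removed in gensim 4.0, so this import needs gensim 3.x
import re, os, MeCab
from glob import glob
from gensim.models.wrappers.fasttext import FastText

# Tokenize with MeCab, nothing fancy
tagger = MeCab.Tagger("-Owakati")
def make_wakati(sentence):
    # Split into space-separated tokens with MeCab
    sentence = tagger.parse(sentence)
    # Remove half-width and full-width alphanumerics
    sentence = re.sub(r'[0-9０-９a-zA-Zａ-ｚＡ-Ｚ]+', " ", sentence)
    # Remove assorted symbols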
    sentence = re.sub(r'[\．_－―─！＠＃＄％＾＆\-‐|\\＊\“（）＿■×+α※÷⇒—●★☆〇◎◆▼◇△□(：〜～＋=)／*&^%$#@!~`){}［］…\[\]\"\'\”\’:;<>?＜＞〔〕〈〉？、。・,\./『』【】「」→←○《》≪≫\n\u3000]+', "", sentence)
    # Split on spaces into a list of morphemes
    wakati = sentence.split(" ")
    # Drop empty elements
    wakati = [w for w in wakati if w]
    return wakati


# Tokenize the whole livedoor news corpus and write it to a single file
# Collect the category names in a list
categories = [name for name in os.listdir('text') if os.path.isdir("text/" +name)]
with open("corpus_for_fasttext.txt", "w", encoding="utf-8") as w:
    for cat in categories:
        path = "text/" + cat + "/*.txt"
        files = glob(path)
        for text_name in files:
            with open(text_name, "r", encoding="utf-8") as f:
                data = f.read()
                wakati = make_wakati(data)
                w.write(" ".join(wakati) + "\n")

# Path to the fastText command
fasttext = "/Users/〜〜/fastText/fasttext"
# Algorithm: use skipgram
algorithm = "skipgram"
# Tokenized corpus file for fastText to train on
input_file = "-input corpus_for_fasttext.txt"
# Name of the embedding model
# This creates two files in the current directory: "fasttext.model.bin" and "fasttext.model.vec"
output_model = "-output fasttext.model"

# Various parameters
feature = "-dim 200"
negative = "-neg 10"
window_size = "-ws 5"
epoch = "-epoch 25"
# Join everything into one string so the fastText command can be called from Python
command_list = [fasttext, algorithm, input_file, output_model, feature, negative, window_size, epoch]
command = " ".join(command_list)

print("Learning for fasttext ...")
os.system(command)
print("Done.")

# Load the model
# When loading a fastText model this way, pass the name without the .bin / .vec extension
model_name = "fasttext.model"
model = FastText.load_fasttext_format(model_name)

# Quick check. Setting accuracy aside for now; the embeddings were created, so that is good enough.
model.most_similar("男性")
# [('女性', 0.7897492051124573),
# ('男', 0.5839448571205139),
# ('女', 0.5330237150192261),
# ('人', 0.5243622064590454),
# ('代', 0.5157191157341003),
# ('歳', 0.515695333480835),
# ('彼', 0.4998660385608673),
# ('彼氏', 0.4992670714855194),
# ('既婚', 0.49713438749313354),
# ('佳苗', 0.48813962936401367)]
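The os.system call above passes one joined string to the shell and ignores the exit status. As an alternative, here is a minimal sketch that builds the same skipgram command as a list of arguments and runs it with subprocess.run, so a failed training run raises an exception. The helper names build_fasttext_command and train, and the placeholder binary path, are assumptions for illustration; the flags are the ones already used above.

```python
import subprocess
from typing import List


def build_fasttext_command(binary: str, corpus: str, output: str,
                           dim: int = 200, neg: int = 10,
                           ws: int = 5, epoch: int = 25) -> List[str]:
    """Build the skipgram training command as an argument list."""
    return [
        binary, "skipgram",
        "-input", corpus,
        "-output", output,
        "-dim", str(dim),
        "-neg", str(neg),
        "-ws", str(ws),
        "-epoch", str(epoch),
    ]


def train(command: List[str]) -> None:
    """Run fastText; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# "path/to/fastText/fasttext" is a placeholder; point it at your own build
command = build_fasttext_command("path/to/fastText/fasttext",
                                 "corpus_for_fasttext.txt", "fasttext.model")
print(" ".join(command))
# train(command)  # uncomment to actually start training
```

Passing a list instead of a string also avoids shell quoting problems when a path contains spaces.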

Miscellaneous

progress

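When the code is run from Jupyter Notebook, progress like the following is printed to the terminal (the one running the notebook server, not the notebook cell).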

Read 3M words
Number of words:  24996
Number of labels: 0
Progress:  14.5% words/sec/thread:    2450 lr:  0.042752 loss:  2.506431 ETA:   0h 9m

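A drawback of using subwords?

When I looked up the words closest to サッカー (soccer), the nearest neighbors made little semantic sense, as shown below. Most of them simply share the ending カー with the query word...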

model.most_similar("サッカー")
# [('一体化', 0.7122113108634949),
# ('マイカー', 0.5638254880905151),
# ('パトカー', 0.5627922415733337),
# ('レンタカー', 0.5429742336273193),
# ('ウォーカー', 0.541737973690033),
# ('ミニカー', 0.5158187747001648),
# ('スピーカー', 0.49844545125961304),
# ('ニューヨーカー', 0.49109819531440735),
# ('ドラッカー', 0.4784988462924957),
# ('コインロッカー', 0.475729763507843)]

Someone has written about this in the following article:

fastTextのsubword(部分語)の弊害 (The downsides of fastText's subwords)
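To see why words ending in カー might end up close to each other, it helps to look at the character n-grams they share. The following is a simplified sketch of subword extraction, assuming the defaults listed on the options page (-minn 3, -maxn 6) and the word-boundary markers < and >. It does not reproduce fastText's hashing of n-grams into buckets, and the function name char_ngrams is made up for illustration.

```python
from typing import Set


def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> Set[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    token = f"<{word}>"
    return {
        token[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(token) - n + 1)
    }


# N-grams that the query word has in common with one of its neighbors
shared = char_ngrams("サッカー") & char_ngrams("マイカー")
print(shared)  # {'カー>'}
```

Because a word's vector is built from its n-gram vectors, a shared n-gram such as カー> pulls otherwise unrelated words together, which is one plausible contributing factor to the result above. The -minn and -maxn options control the n-gram lengths and are worth experimenting with for Japanese text.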

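I sometimes hear the view that fastText outperforms word2vec, so anyone using word2vec should just switch to fastText. I think that is a bit simplistic. It is better to understand the pros and cons of both word2vec and fastText, be clear about the task you want to solve and the kind of meaning you want to capture, and only then decide which one to use.

That's all.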
