Trying Out Morphological Analysis and Word Vectorization

Last updated at 2020-09-28, posted at 2020-09-28

Trying out Word2vec

Install gensim to use word2vec.
Install janome to tokenize the Japanese text (morphological analysis).

pip install gensim
pip install janome

Code for loading an Aozora Bunko text into word2vec

# Import the required libraries

from janome.tokenizer import Tokenizer
from gensim.models import word2vec
import re

# Open the txt file in binary mode ("rb") and read its contents as bytes
with open("kazeno_matasaburo.txt", "rb") as f:
    binarydata = f.read()

# For reference, I printed the type at each step to check what comes back
binarydata = open("kazeno_matasaburo.txt", "rb")
print(type(binarydata))

Output: <class '_io.BufferedReader'>

binarydata = open("kazeno_matasaburo.txt", "rb").read()
print(type(binarydata))

Output: <class 'bytes'>

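# Decode the bytes into a str (this file is Shift_JIS encoded)
text = binarydata.decode('shift_jis')
# Strip what we don't need: the header above the body and the colophon after it
text = re.split(r'\-{5,}', text)[2]
text = re.split(r'底本：', text)[0]
text = text.strip()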
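
To see why index `[2]` and index `[0]` pick out the body, here is a minimal sketch on a made-up miniature of an Aozora Bunko file. The sample string is hypothetical and only assumes the layout the code above relies on: a title block, a legend enclosed by two dashed rules, the body, and a colophon that starts with 底本：.

```python
import re

# Hypothetical miniature of the file layout (not the real text)
sample = (
    "Title\r\nAuthor\r\n\r\n"
    "-------------------------------------------------------\r\n"
    "notes about the markup\r\n"
    "-------------------------------------------------------\r\n"
    "\r\nBody line one.\r\nBody line two.\r\n\r\n"
    "底本：Some edition\r\n"
)

# Round-trip through Shift_JIS to mimic reading the file as bytes
raw = sample.encode('shift_jis')
text = raw.decode('shift_jis')

# Two dashed rules split the text into three parts:
# [title block, legend, body + colophon]
parts = re.split(r'\-{5,}', text)

# Keep everything before the colophon marker, then trim blank edges
body = re.split(r'底本：', parts[2])[0].strip()
print(repr(body))
```

If a file had a different number of dashed rules, the index `[2]` would no longer point at the body, so it is worth printing `len(parts)` once when trying another text.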

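# Run morphological analysis
t = Tokenizer()
results = []
lines = text.split("\r\n")  # the text is separated into lines by CRLF

for line in lines:
    s = line
    # Drop the ruby start marker (Aozora Bunko uses the full-width bar ｜)
    s = s.replace('｜', '').replace('|', '')
    s = re.sub(r'《.+?》', '', s)    # drop ruby readings
    s = re.sub(r'［＃.+?］', '', s)  # drop editorial annotations
    tokens = t.tokenize(s)  # holds the analyzed tokens
    r = []
    # Take the tokens one at a time; attributes such as .base_form and .surface are available
    for token in tokens:
        # Fall back to the surface form when no base form is available
        if token.base_form == "*":
            w = token.surface
        else:
            w = token.base_form
        ps = token.part_of_speech
        hinshi = ps.split(',')[0]  # top-level part of speech
        # Keep nouns, adjectives, verbs, and symbols
        if hinshi in ['名詞','形容詞','動詞','記号']:
            r.append(w)
    rl = (" ".join(r)).strip()
    results.append(rl)
    print(rl)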
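
The three cleanup steps at the top of the loop can be checked on their own, without janome. The sketch below uses a made-up sample line and a hypothetical helper name, `strip_aozora_markup`. It assumes the Aozora Bunko conventions the loop targets: `｜` marks where a ruby base starts, `《…》` holds the reading, and `［＃…］` holds an editorial note.

```python
import re

def strip_aozora_markup(line):
    """Remove ruby markers, ruby readings, and editorial notes from one line."""
    line = line.replace('｜', '')            # ruby start marker (full-width bar)
    line = re.sub(r'《.+?》', '', line)      # ruby reading, matched non-greedily
    line = re.sub(r'［＃.+?］', '', line)    # editorial annotation
    return line

# Hypothetical sample line, written for this illustration
sample = "｜谷川《たにがわ》の岸に小さな学校［＃「学校」に傍点］がありました。"
cleaned = strip_aozora_markup(sample)
print(cleaned)
```

The non-greedy `.+?` matters here: with a greedy `.+`, a line containing two ruby readings would lose everything between the first `《` and the last `》`.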

# Create the output file and write the tokenized text to it
wakachigaki_file = "matasaburo.wakati"
with open(wakachigaki_file,'w', encoding='utf-8') as fp:
    fp.write('\n'.join(results))
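
The file written here holds one line per sentence, with words separated by whitespace, which is the format `LineSentence` expects. As a rough illustration only (this is not gensim's implementation), the sketch below writes a few made-up tokenized lines to a temporary file and reads them back as lists of words. Skipping blank lines is a choice made in this sketch.

```python
import os
import tempfile

# Hypothetical tokenized lines, standing in for `results`
sample_results = ["学校 行く", "", "風 吹く"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.wakati")

    # Same write pattern as above: one tokenized line per row
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("\n".join(sample_results))

    # Read it back: each non-empty line becomes one list of words
    with open(path, encoding="utf-8") as fp:
        sentences = [line.split() for line in fp if line.split()]

print(sentences)
```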

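# Train the model
data = word2vec.LineSentence(wakachigaki_file)
# gensim 4.0 renamed `size` to `vector_size`; on gensim 3.x pass size=200 instead
model = word2vec.Word2Vec(data, vector_size=200, window=10, hs=1, min_count=2, sg=1)
model.save('matasaburo.model')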

# Try out the model: list the words most similar to 学校 (school)
model.wv.most_similar(positive=['学校'])
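
For a single query word, `most_similar` ranks the other words by the cosine similarity between their vectors and the query's vector. The sketch below reproduces that ranking in plain Python. The 3-dimensional vectors are made up for illustration (the trained model's vectors would be learned from the text), and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(word, vectors, topn=3):
    """Return the topn (word, score) pairs closest to `word`, best first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors, not output from the trained model
toy_vectors = {
    "学校": [1.0, 0.0, 0.0],
    "教室": [0.9, 0.1, 0.0],
    "先生": [0.6, 0.8, 0.0],
    "風": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity("学校", toy_vectors)
for other, score in ranking:
    print(other, round(score, 3))
```

Because the score depends only on direction, not length, a word whose vector points the same way as the query scores close to 1 and an orthogonal one scores 0.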

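Summary

① Get the text you want to analyze.
② Trim it down to the body text only, removing things like the bibliographic note at the end.
③ Loop over it line by line with a for statement and strip the parts you don't need.
④ Run morphological analysis with the Tokenizer and collect the words in a list.
⑤ Write that list to a file.
⑥ Build the model from the tokenized file.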
