
Tripped up by gensim's Dictionary function (df values)

Posted at 2018-11-20, last updated at 2018-11-20

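0. About this article

・Notes from investigating why every df value came out as 1 when I used gensim (a library mainly used for latent semantic analysis) from Python.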

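1. Summary

・Even if you put the same string into a single list 100 times, the df value is 1. gensim.corpora.Dictionary() is not a function that counts how often each word occurs in its input.
・When the tokens are split into one list per sentence, gensim.corpora.Dictionary() computes **how many sentences (documents) each word appears in**. That is what the df (document frequency) value means.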
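
To make the difference in the summary concrete, here is a minimal pure-Python sketch of the two kinds of counting. It is not gensim's implementation; the function names `document_frequency` and `term_frequency` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def document_frequency(documents):
    """For each token, count how many documents contain it at least once."""
    df = Counter()
    for tokens in documents:
        # set() collapses repeated tokens inside a single document
        df.update(set(tokens))
    return df


def term_frequency(documents):
    """Count every occurrence of each token across all documents."""
    tf = Counter()
    for tokens in documents:
        tf.update(tokens)
    return tf


one_doc = [["hoge"] * 100]                            # one document, 100 repeats
four_docs = [["hoge"], ["hoge"], ["hoge"], ["hoge"]]  # four documents, one token each

print(document_frequency(one_doc)["hoge"])    # 1
print(term_frequency(one_doc)["hoge"])        # 100
print(document_frequency(four_docs)["hoge"])  # 4
```

The df value only grows when the word shows up in another document, which is why the way the input is split into lists matters.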

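2. How gensim.corpora.Dictionary() works

Dictionary() expects a list of documents, where each document is itself a list of tokens.

sample1.py

・Dictionary([['hoge','hoge','hoge','hoge']])  # df value 1, Jesus!!

**No matter how many times hoge appears within one list, its df value is 1.** Watch out.

sample2.py

・Dictionary([['hoge'],['hoge']])  # df value 2
・Dictionary([['hoge'],['hoge'],['hoge'],['hoge']])  # df value 4

Split into two lists, the df value is 2; split into four lists, it is 4.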

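3. Solved by grouping the tokens into one list per sentence during MeCab morphological analysis

Pass the list "words" on to the following step and call corpora.Dictionary(words).
Note the comment lines marked with a double hash (##).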

sample3.py

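# The setup before reading the file is omitted (file_name is defined there).
# Assumed here: mecab-python3 with a dictionary whose output is "surface<TAB>POS,..." (e.g. IPAdic).
import re
import MeCab

mecab = MeCab.Tagger()
words = []  # one list of nouns per input line
wcnt = 0    # number of MeCab output lines processed

with open(file_name, 'r', encoding="utf-8") as f:
    reader = f.readline()
    while reader:
        ## Define readercolect, which collects the words of the current line
        readercolect = []
        # Run morphological analysis with MeCab
        parse = mecab.parse(reader)
        lines = parse.split('\n')
        items = (re.split('[\t,]', line) for line in lines)
        # Store the nouns in the list ('名詞' is the POS tag for nouns)
        for item in items:
            wcnt = wcnt + 1
            if (item[0] not in ('EOS', '', 't', 'RT') and item[1] == '名詞'):
                ## Store the word in the per-line list for now
                readercolect.append(item[0])
        ## Once every morpheme in the line is processed, append the per-line list for the following step
        words.append(readercolect)
        reader = f.readline()
print(words)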

That's all. Thank you.
