Visualizing Keywords in Documents with TF-IDF and Word Cloud

Posted at 2020-06-13

Notes on drawing word clouds.
Note: TF-IDF itself is covered in my earlier post:
https://qiita.com/y-s-y-s/items/c36498d40267555e116e
Note: Verified on Google Colaboratory (2020/06/13)

#Prepare the vocabulary (vocab) and TF-IDF

#All words (example below)
$ vocab
array(['a', 'able', 'at', ..., 'zebra', 'zone', 'zoo'], dtype='<U79')

#TF-IDF vector for each document
$ TF_IDF
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [61.9792226 ,  0.        ,  3.38385083, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  6.76770166, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 2.75463212,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.37731606,  2.84060202,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
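
For readers who do not have `vocab` and `TF_IDF` at hand, the following is a minimal sketch of how comparable data could be built with the standard library only. The toy corpus, the names `docs`, `vocab_list`, and `tf_idf_rows`, and the weighting (raw term count times log(N / df)) are all assumptions for illustration; the earlier post may compute TF-IDF differently.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["zebra", "zoo", "zoo"],
    ["zone", "zoo"],
    ["able", "zebra"],
]

# Sorted vocabulary, playing the role of `vocab`.
vocab_list = sorted({word for doc in docs for word in doc})

# Document frequency: the number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

# One row per document, one column per vocabulary word.
# Weight = raw term count * log(N / df) (an assumed formula).
tf_idf_rows = []
for doc in docs:
    counts = Counter(doc)
    row = [counts[word] * math.log(n_docs / df[word]) for word in vocab_list]
    tf_idf_rows.append(row)

print(vocab_list)
print(tf_idf_rows[0])
```

Wrapping `vocab_list` and `tf_idf_rows` in `numpy.array` would give arrays of the same shape as the ones shown above.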

#Create dic[document]=vec

words = vocab.tolist()
vecs = TF_IDF.tolist()
vecs_dic = []
for vec in vecs:
    temp_dic = {}  #word -> TF-IDF weight for this document
    for i in range(len(vec)):
        temp_dic[words[i]] = vec[i]
    vecs_dic.append(temp_dic)
$ len(vecs_dic)
(number of documents)

$ len(vecs_dic[0])
(number of vector dimensions)
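
As a quick sanity check before drawing anything, it can help to print the highest-weighted words of one document. The sketch below uses toy values in place of `vocab.tolist()` and `TF_IDF.tolist()`; the helper `top_keywords` is a hypothetical name, not part of any library.

```python
# Toy stand-ins for vocab.tolist() and TF_IDF.tolist() (hypothetical values).
words = ["able", "zebra", "zone", "zoo"]
vecs = [
    [0.0, 0.4, 0.0, 0.8],
    [0.0, 0.0, 1.1, 0.4],
]

# One {word: weight} dictionary per document.
vecs_dic = [dict(zip(words, vec)) for vec in vecs]

def top_keywords(freqs, k=3):
    """Return the k highest-weighted (word, weight) pairs."""
    return sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(len(vecs_dic))     # number of documents
print(len(vecs_dic[0]))  # number of vector dimensions
print(top_keywords(vecs_dic[0], k=2))
```

The words listed first should be the ones drawn largest in the word cloud for that document.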

#Visualization

#Visualize the 89th document in the document list
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(background_color='white', width=1024, height=674)
wordcloud.generate_from_frequencies(vecs_dic[88])  #index 88 = 89th document
plt.figure()  #create the figure before drawing into it
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

(Figure: the word cloud generated for the 89th document)

#If Word Cloud raises ZeroDivisionError or Segmentation fault
Following reference [2], I resolved this by adding a small value to every element.

words = vocab.tolist()
vecs = TF_IDF.tolist()
vecs_dic = []
for vec in vecs:
    temp_dic = {}
    for i in range(len(vec)):
        temp_dic[words[i]] = vec[i] + 1e-5  #keep any element from being exactly 0
    vecs_dic.append(temp_dic)
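
The same offset can be factored into a small helper so it is easy to adjust. This is only a sketch: the function name `to_frequencies` and the parameter `eps` are hypothetical, and the toy input is an all-zero vector chosen to show the edge case.

```python
def to_frequencies(words, vec, eps=1e-5):
    """Map each word to its weight plus a small offset so no value is exactly 0."""
    return {word: weight + eps for word, weight in zip(words, vec)}

# A document whose TF-IDF vector is entirely zero (toy example).
words = ["able", "zebra", "zoo"]
all_zero = [0.0, 0.0, 0.0]

freqs = to_frequencies(words, all_zero)
print(freqs)
```

Because the offset is uniform, an all-zero document ends up with every word at the same weight, so its cloud carries no keyword information even though it can be generated.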

#Create and save an image for each document
To save the images, add wordcloud.to_file and change the code as follows.

#Files are numbered from 1
for i, v in enumerate(vecs_dic, start=1):
    wordcloud = WordCloud(background_color='white', width=1024, height=674)
    wordcloud.generate_from_frequencies(v)
    wordcloud.to_file([PATH] + str(i) + ".png")  #replace [PATH] with your output path prefix
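
One possible way to build the output file names is with `pathlib`, using zero-padded numbers so that the files sort in document order. The directory name `wordclouds` and the variable names here are assumptions for illustration.

```python
from pathlib import Path

out_dir = Path("wordclouds")  # hypothetical output directory
n_docs = 3                    # stand-in for len(vecs_dic)

# Zero-padded names such as 001.png keep the files in document order.
paths = [out_dir / f"{i:03d}.png" for i in range(1, n_docs + 1)]

for path in paths:
    print(path)
```

Each entry can then be passed to `wordcloud.to_file` as `str(path)`, after creating the directory once with `out_dir.mkdir(parents=True, exist_ok=True)`.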

#References
[1] https://qiita.com/pma1013/items/d183b4b2504173ba037e
[2] https://github.com/amueller/word_cloud/issues/456