More than 3 years have passed since last update.

GINZAで形態素解析してWordCloud出力[少女地獄]

Last updated at 2022-10-02Posted at 2022-08-09

GINZAを使って形態素解析をし、WordCloudで出力します。青空文庫から「少女地獄」という本を選んで実行すると以下が出力されます。本は、全く読んだことも聞いたこともなくタイトルから中身がまるで想像できなかったものを選びました。病院で手術が関係しているという点だけがわかりました。

環境

Python3.7.13でGoogle Colaboratory上で動かしています。基本はGoogle Colabプリインストールされているパッケージはそのまま使っています。
spacyはGINZAとの互換性の関係で3.2.4にダウングレードしています。

Package	Version	備考
spacy	3.2.4	Google Colabプリインストールをダウングレード
matplotlib	3.2.2	Google Colabプリインストール
pandas	1.3.5	Google Colabプリインストール
ja_ginza_electra	5.1.0
ginza	5.1.1
wordcloud	1.8.2.2	Google Colabプリインストール

プログラム

0. インストール

0.1. Pythonパッケージインストール

pipでPythonパッケージインストール。spacyはGINZAとの互換性の関係で一度アンインストールして、3.2.4をインストールしています。
※ 2022/10/1に同じことを試したらGoogle Colabにプリインストールされているspacy3.4.1のままginza(ver5.1.2)とja_ginza_electra(ver5.1.2)をインストールできました。

!pip uninstall spacy --yes
!pip install spacy==3.2.4 ginza ja_ginza_electra

spacyインストール時にエラーが出たけどあまり重要ではなかったし、後続できたので無視しました。

Installing collected packages: spacy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.2.4 which is incompatible.
Successfully installed spacy-3.2.4
WARNING: The following packages were previously imported in this runtime:
  [spacy]
You must restart the runtime in order to use newly installed versions.

Mac(M1)でインストール時にはRustがなくてja_ginza_electraのインストール時にエラーとなったので公式ページを参考にRustをインストールあとにpip installしました。Macが原因かPython3.10が原因か不明。

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

0.2. 日本語フォントインストール

WordCloudで出力するための日本語フォントをインストール。

!apt -y install fonts-ipafont-gothic

1. Import

パッケージインポート。

import ginza
import matplotlib.pyplot as plt
import pandas as pd
import spacy
from wordcloud import WordCloud

2. ファイル読込

青空文庫からダウンロードしたテキストファイルを読み込みます。
正確には以下のステップ。

青空文庫ルビ削除ツールα版で「少女地獄」のXHTMLファイルリンクを入力してルビなし内容表示。
内容をテキストエディタにコピペしてファイル保存
Google Colabにアップロード

with open('shojo_jigoku.txt') as f:
    text = f.read()

3. 形態素解析

形態素解析してDataFrame化します。lemmaやposは使っていませんが、確認のためにDataFrameに入れました。1万文字限定にしても35秒程度かかっています。
set_split_modeでCを渡しているのは長めのTokenにしたかったからです(短い順にA, B, C)。

nlp = spacy.load('ja_ginza_electra')
ginza.set_split_mode(nlp, 'C')

doc = nlp(text[:10_000])
result_list = []
for sent in doc.sents:
    result_list = result_list + [[token.text, token.lemma_, token.pos_, token.tag_] for token in sent]
df = pd.DataFrame(result_list, columns = ['text', 'lemma', 'pos', 'tag'])

     token_no text lemma    pos           tag
0           0   少女    少女   NOUN    名詞-普通名詞-一般
1           1   地獄    地獄   NOUN    名詞-普通名詞-一般
2           2   \n    \n   NOUN            空白
3           3   夢野    夢野  PROPN  名詞-固有名詞-人名-姓
4           4   久作    久作  PROPN  名詞-固有名詞-人名-名
...       ...  ...   ...    ...           ...
6504     6504    前     前   NOUN  名詞-普通名詞-副詞可能
6505     6505    に     に    ADP        助詞-格助詞
6506     6506    先     先   NOUN  名詞-普通名詞-副詞可能
6507     6507    を     を    ADP        助詞-格助詞
6508     6508   越し    越す   VERB         動詞-一般

1万文字に絞らないとこんなエラーが出ました。今回は、処理可能な文字数を増やすことができるか調べていません。

Exception: Tokenization error: Input is too long, it can't be more than 49149 bytes, was 317947

4. WordCloud出力

名詞に絞ってWordCloud出力しました。

font_path_gothic = '/usr/share/fonts/opentype/ipafont-gothic/ipagp.ttf'
all_text = ' '.join(df[df['tag'].str.startswith('名詞')]['text'].to_list())

result = WordCloud(width=800, height=600, background_color='white', font_path=font_path_gothic).generate(all_text)
plt.figure(figsize=(12,10))
plt.imshow(result)
plt.axis('off')
plt.show()

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up