Coloring text by part of speech using morphological analysis

Posted at 2022-11-20

Introduction

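  • Goal: run morphological analysis on text and color each token by its part of speech, like this:
    English
    visualize_pos_en.png
    Japanese
    visualize_pos_jp.png

  • English is analyzed with nltk and Japanese with janome.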

Setup

Install the required libraries and import them.

!pip install spacy
!pip install janome # for JP
!pip install nltk # for EN
import pandas as pd
from spacy import displacy
# for EN
import nltk
from nltk.tokenize import TreebankWordTokenizer as twt
# for JP
from janome.tokenizer import Tokenizer

Code

【English】Morphological analysis with nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
text = 'Sentense Colorization with nltk Tokenizer'
tokens = nltk.word_tokenize(text)
print(tokens)

['Sentense', 'Colorization', 'with', 'nltk', 'Tokenizer']

nltk.pos_tag returns each word together with its part-of-speech tag as a tuple.

tags = nltk.pos_tag(tokens)
print(tags)

[('Sentense', 'JJ'), ('Colorization', 'NN'), ('with', 'IN'), ('nltk', 'JJ'), ('Tokenizer', 'NNP')]
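
The tags printed above (JJ, NN, IN, NNP) are Penn Treebank tags, which nltk.pos_tag returns by default, while the coloring code later in this article works with the coarser universal tags. As a rough illustration of how the two relate, here is a minimal sketch with a hand-written mapping. `PENN_TO_UNIVERSAL` and `to_universal` are hypothetical names and cover only a handful of tags; nltk does this conversion itself when `tagset="universal"` is passed.

```python
# Partial, hand-written mapping for illustration only (not nltk's full table).
PENN_TO_UNIVERSAL = {
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "NNP": "NOUN",
    "IN": "ADP",
    "VB": "VERB",
    "RB": "ADV",
    "DT": "DET",
    "CD": "NUM",
}


def to_universal(tagged_tokens):
    """Convert (word, Penn tag) pairs to (word, universal tag) pairs.

    Tags missing from the partial mapping fall back to "X" (other).
    """
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]


# The tagged tokens from the output above.
tags = [("Sentense", "JJ"), ("Colorization", "NN"), ("with", "IN"),
        ("nltk", "JJ"), ("Tokenizer", "NNP")]
print(to_universal(tags))
```

In practice, calling `nltk.pos_tag(tokens, tagset="universal")` is simpler and uses nltk's complete mapping.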

The part-of-speech tags are listed below.
english_tag_table.png
https://www.nltk.org/book/ch05.html

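Assign a color to each part-of-speech tag. The tags used here are those of the universal tagset, which nltk.pos_tag returns when called with tagset="universal".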

pos_tags = ["PRON", "VERB", "NOUN", "ADJ", "ADP", "ADV", "CONJ", "DET", "NUM", "PRT"]
colors = {"PRON": "blueviolet",
          "VERB": "lightpink",
          "NOUN": "turquoise",
          "ADJ" : "lime",
          "ADP" : "khaki",
          "ADV" : "orange",
          "CONJ" : "cornflowerblue",
          "DET" : "forestgreen",
          "NUM" : "salmon",
          "PRT" : "yellow"}
options = {"ents": pos_tags, "colors": colors}
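
A quick sanity check can confirm that every tag in pos_tags has a color and show which universal tags are left uncolored. The sketch below assumes the 12-tag universal tagset as listed in the NLTK book; `UNIVERSAL_TAGS`, `uncolored`, and `missing_color` are names introduced here for illustration.

```python
# The 12 tags of the universal tagset (assumed from the NLTK book, chapter 5).
UNIVERSAL_TAGS = ["ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRT", "PRON", "VERB", ".", "X"]

# Same tag list and color table as in the article.
pos_tags = ["PRON", "VERB", "NOUN", "ADJ", "ADP", "ADV", "CONJ", "DET", "NUM", "PRT"]
colors = {"PRON": "blueviolet", "VERB": "lightpink", "NOUN": "turquoise",
          "ADJ": "lime", "ADP": "khaki", "ADV": "orange",
          "CONJ": "cornflowerblue", "DET": "forestgreen",
          "NUM": "salmon", "PRT": "yellow"}

# Universal tags that will not be highlighted at all.
uncolored = [tag for tag in UNIVERSAL_TAGS if tag not in colors]
# Tags requested for display that have no color assigned.
missing_color = [tag for tag in pos_tags if tag not in colors]

print(uncolored)      # ['.', 'X']
print(missing_color)  # []
```

Punctuation (".") and other ("X") are therefore left unhighlighted, which matches the filter on pos_tags in the function that follows.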

Implement a function that visualizes the result by passing the tag-to-color mapping, together with each token's start, end, and tag, to displacy.render.

def visualize_pos(text):
    # Tokenize text and pos tag each token
    tokens = twt().tokenize(text)
    tags = nltk.pos_tag(tokens, tagset = "universal")

    # Get start and end index (span) for each token
    span_generator = twt().span_tokenize(text)
    spans = [span for span in span_generator]

    # Create dictionary with start index, end index, 
    # pos_tag for each token
    ents = []
    for tag, span in zip(tags, spans):
        if tag[1] in pos_tags:
            ents.append({"start" : span[0], 
                         "end" : span[1], 
                         "label" : tag[1] })

    doc = {"text" : text, "ents" : ents}
    displacy.render(doc, 
                    style = "ent", 
                    options = options, 
                    manual = True,
                   )
visualize_pos("Sentense Colorization with nltk Tokenizer")

visualize_pos_en.png
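
The doc dictionary handed to displacy.render with manual=True is just the text plus character offsets and labels. To make that data format concrete, here is a dependency-free sketch that turns the same structure into colored HTML. This is not displacy's implementation; `render_ents_html` and `default_color` are hypothetical names, and the markup is deliberately simplified.

```python
import html


def render_ents_html(text, ents, colors, default_color="#ddd"):
    """Wrap each labeled span of `text` in a colored <mark> element."""
    parts = []
    cursor = 0
    for ent in sorted(ents, key=lambda e: e["start"]):
        # Plain text between the previous span and this one.
        parts.append(html.escape(text[cursor:ent["start"]]))
        color = colors.get(ent["label"], default_color)
        word = html.escape(text[ent["start"]:ent["end"]])
        parts.append(
            f'<mark style="background: {color}">{word} '
            f'<small>{html.escape(ent["label"])}</small></mark>'
        )
        cursor = ent["end"]
    # Remaining text after the last span.
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


text = "Colorization with nltk"
ents = [
    {"start": 0, "end": 12, "label": "NOUN"},
    {"start": 13, "end": 17, "label": "ADP"},
]
colors = {"NOUN": "turquoise", "ADP": "khaki"}
page = render_ents_html(text, ents, colors)
print(page)
```

The returned string can be written to an .html file and opened in a browser, which may be handy outside a notebook.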

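【Japanese】Morphological analysis with janome

Morphological analysis can be done with janome.tokenizer.Tokenizer.
tokenize() returns the tokens as a generator.
Each token's surface form is available as token.surface and its part of speech as token.part_of_speech.
See the janome source code for the implementation of the Tokenizer class.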

tokenizer = Tokenizer()
text = '日本語の文章を形態素解析し、品詞を可視化します'
tokens = tokenizer.tokenize(text)
for token in tokens:
    print(token)

日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
文章 名詞,一般,*,*,*,*,文章,ブンショウ,ブンショー
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
、 記号,読点,*,*,*,*,、,、,、
品詞 名詞,一般,*,*,*,*,品詞,ヒンシ,ヒンシ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
可視 名詞,一般,*,*,*,*,可視,カシ,カシ
化 名詞,接尾,サ変接続,*,*,*,化,カ,カ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
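
Each printed line holds the surface form followed by nine comma-separated features. The sketch below splits one such line into named fields, assuming janome's usual string form: a tab between the surface and the features, with `*` marking an empty feature. `parse_token_line` and the dictionary keys are hypothetical names chosen to mirror janome's token attributes.

```python
def parse_token_line(line):
    """Split one printed janome token line into named fields.

    Assumes "surface<TAB>f0,f1,...,f8"; a surface or base form that
    itself contains an ASCII comma would need extra handling.
    """
    surface, features = line.rstrip("\n").split("\t", 1)
    fields = features.split(",")
    return {
        "surface": surface,
        # The first four features make up token.part_of_speech.
        "part_of_speech": ",".join(fields[:4]),
        "pos": fields[0],
        "infl_type": fields[4],
        "infl_form": fields[5],
        "base_form": fields[6],
        "reading": fields[7],
        "phonetic": fields[8],
    }


line = "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ"
parsed = parse_token_line(line)
print(parsed["pos"], parsed["base_form"])
```

When working with live tokens rather than printed text, reading the attributes directly (token.surface, token.part_of_speech) avoids any parsing.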

The visualization steps are the same as in the previous section, so the explanation is omitted.

pos_tags = ["名詞", "動詞", "形容詞", "助詞", "助動詞", "接続詞", "接頭詞", "記号", "副詞", "その他"]
colors = {"形容詞": "blueviolet",
          "動詞": "lightpink",
          "助詞": "turquoise",
          "名詞" : "lime",
          "助動詞" : "khaki",
          "接続詞" : "orange",
          "接頭詞" : "cornflowerblue",
          "その他" : "forestgreen",
          "副詞" : "salmon",
          "記号" : "yellow"}
options = {"ents": pos_tags, "colors": colors}

def get_pos_tag(token):
    return token.part_of_speech.split(',')[0]

def visualize_pos(text):
    # Tokenize text and pos tag each token
    tokens = tokenizer.tokenize(text)

    ents = []
    start_index = 0
    for token in tokens:
        ents.append({
            "start": start_index,
            "end": start_index + len(token.surface),
            "label": get_pos_tag(token)}
            )
        start_index += len(token.surface)

    doc = {"text" : text, "ents" : ents}
    displacy.render(doc, 
                    style = "ent", 
                    options = options, 
                    manual = True,
                   )
visualize_pos(text)

visualize_pos_jp.png
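
The function above advances start_index by the length of each surface, which assumes the token surfaces concatenate back to the exact input text. Should that ever not hold, a more defensive variant can look up each surface in the text and also filter labels the way the English version does. This is only a sketch: `FakeToken` is a hypothetical stand-in for janome tokens (it carries just the two attributes used), and `build_ents` is a name introduced here.

```python
from collections import namedtuple

# Stand-in for janome tokens: only the two attributes used here.
FakeToken = namedtuple("FakeToken", ["surface", "part_of_speech"])


def build_ents(text, tokens, allowed_tags):
    """Compute character spans by searching for each surface in order."""
    ents = []
    cursor = 0
    for token in tokens:
        start = text.find(token.surface, cursor)
        if start == -1:
            continue  # surface not found; skip rather than guess an offset
        end = start + len(token.surface)
        cursor = end
        label = token.part_of_speech.split(",")[0]
        if label in allowed_tags:
            ents.append({"start": start, "end": end, "label": label})
    return ents


# The space in the text has no matching token, simulating dropped whitespace.
text = "品詞を 可視化"
tokens = [
    FakeToken("品詞", "名詞,一般,*,*"),
    FakeToken("を", "助詞,格助詞,一般,*"),
    FakeToken("可視", "名詞,一般,*,*"),
    FakeToken("化", "名詞,接尾,サ変接続,*"),
]
ents = build_ents(text, tokens, {"名詞", "助詞"})
for ent in ents:
    print(ent, text[ent["start"]:ent["end"]])
```

The resulting list has the same shape as the ents list built in visualize_pos, so it could be placed in the doc dictionary unchanged.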
