
Displaying word origin types (wago, kango) with MeCab + UniDic

Posted at 2017-09-05

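Introduction

When judging the difficulty of a text, for example, you sometimes want to distinguish words of native Japanese origin (wago) from words that came from the continent (kango, Sino-Japanese words). For instance, 「代替案」 is kango, while 「今日」 is wago. At first glance there may seem to be no good way to tell the two apart, but if you use UniDic, created by the National Institute for Japanese Language and Linguistics, as the MeCab dictionary, morphological analysis also reports whether each word is kango, wago, gairaigo (loanword), and so on.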

Environment

Ubuntu 16.04
Python 3.5.3
MeCab 0.996
Unidic 2.12

This article assumes that MeCab and UniDic are already installed.

Preparation 1: Configuring the UniDic options

Go to /usr/local/lib/mecab/dic/unidic and edit the dicrc file in that directory.

output-format-type = unidic2

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-unidic  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic  =
eos-format-unidic  = EOS\n

node-format-chamame = \t%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
;unk-format-chamame = \t%m\t\t\t%m\tUNK\t\t\n
unk-format-chamame  = \t%m\t\t\t%m\t%F-[0,1,2,3]\t\t\n
bos-format-chamame  = B
eos-format-chamame  = 

node-format-unidic2 = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
unk-format-unidic2  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic2  =
eos-format-unidic2  = EOS\n

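The last four lines are the added part. By default, f[12], the field that holds the word origin type (goshu), is not included in the output format, so no word origin information is returned.
For that reason, we define a new format that includes f[12], name it unidic2, and set output-format-type to unidic2.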
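
As a rough illustration of what the unidic2 layout produces, the sketch below parses text in that layout into one dictionary per token. The function name parse_unidic2 and the column keys are labels chosen here for illustration, not part of MeCab or UniDic; columns other than the surface form, the part of speech, and goshu are simply named after their field index in the format string. The sample string is hand-written to match the layout, not captured from a real run.

```python
# Column keys follow the order in node-format-unidic2:
# %m, %f[9], %f[6], %f[7], %F-[0,1,2,3], %f[4], %f[5], %f[12]
FIELDS = ["surface", "f9", "f6", "f7", "pos", "f4", "f5", "goshu"]


def parse_unidic2(output):
    """Parse text in the unidic2 layout into one dict per token."""
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        cols = line.split("\t")
        # Unknown words use unk-format-unidic2, which has no goshu column,
        # so pad any missing trailing columns with None.
        cols += [None] * (len(FIELDS) - len(cols))
        tokens.append(dict(zip(FIELDS, cols)))
    return tokens


# Hand-written sample line (an assumption, not real MeCab output).
sample = "今日\tキョー\tキョウ\t今日\t名詞-普通名詞-副詞可能\t*\t*\t和\nEOS\n"
tokens = parse_unidic2(sample)
print(tokens[0]["surface"], tokens[0]["goshu"])
```

Padding with None keeps unknown words in the result while making it obvious that no word origin type was available for them.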

Preparation 2: Changing the MeCab dictionary

The default dictionary is ipadic; change it to unidic by editing the MeCab configuration file as follows.

/usr/local/etc/mecabrc

dicdir =  /usr/local/lib/mecab/dic/unidic
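
If changing the system-wide mecabrc is not desirable, a possible alternative is to pass the dictionary directory when the tagger is created. The sketch below only builds the option string: -d (dictionary directory) and -O (output format type) are mecab command-line options, while build_tagger_args is a hypothetical helper name. The path is the one used in this article and may differ on your system.

```python
def build_tagger_args(dicdir, output_format=None):
    """Build an option string for MeCab.Tagger (hypothetical helper)."""
    args = ["-d", dicdir]
    if output_format is not None:
        args += ["-O", output_format]
    return " ".join(args)


args = build_tagger_args("/usr/local/lib/mecab/dic/unidic", "unidic2")
print(args)

# With the MeCab Python binding installed, the string would be used as:
#   import MeCab
#   tagger = MeCab.Tagger(args)
```

This keeps the dictionary choice local to one script, so other programs that rely on the default dictionary are unaffected.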

Finishing up: Writing the Python code

For example, define a function like the following:

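import MeCab

def gosyu(text):
    t = MeCab.Tagger()
    node = t.parseToNode(text)
    words = []
    while node:
        info = node.feature.split(',')
        # Skip nodes without lexical information, such as BOS/EOS (field 6 is '*'),
        # and features too short to contain field 12
        if len(info) > 12 and info[6] != '*':
            word_type = info[0]
            if word_type not in ['記号']:  # ignore symbols
                words.append([info[8], info[12]])  # word form and word origin type
        node = node.next
    return words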

>>> gosyu("今日の会議でパートナーに代替案を提出する予定だ")

[['今日', '和'],
 ['の', '和'],
 ['会議', '漢'],
 ['で', '和'],
 ['パートナー', '外'],
 ['に', '和'],
 ['代替', '漢'],
 ['案', '漢'],
 ['を', '和'],
 ['提出', '漢'],
 ['する', '和'],
 ['予定', '漢'],
 ['だ', '和']]

The word origin types have been extracted!
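
For the difficulty-assessment use case mentioned in the introduction, one possible next step is to tally the share of each word origin type. The sketch below assumes the list-of-pairs shape returned by gosyu above; goshu_ratio is a hypothetical helper, and the input is copied from the example output rather than produced by a live MeCab call.

```python
from collections import Counter


def goshu_ratio(words):
    """Return the fraction of tokens for each word origin type."""
    counts = Counter(goshu for _, goshu in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {goshu: n / total for goshu, n in counts.items()}


# Copied from the example output of gosyu() in this article.
words = [['今日', '和'], ['の', '和'], ['会議', '漢'], ['で', '和'],
         ['パートナー', '外'], ['に', '和'], ['代替', '漢'], ['案', '漢'],
         ['を', '和'], ['提出', '漢'], ['する', '和'], ['予定', '漢'],
         ['だ', '和']]

ratio = goshu_ratio(words)
for goshu, share in ratio.items():
    print(goshu, round(share, 2))
```

Whether a higher share of kango actually corresponds to a harder text is a separate question that this sketch does not address; it only produces the counts such a judgment could start from.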
