
Notes on Using MeCab from Python, Part 2

Last updated at 2023-07-07 / Posted at 2023-07-07

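Introduction

I previously wrote up some notes on morphological analysis with MeCab from Python. I recently had a chance to work with it again and learned a few new things, so I am adding them here.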

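Prerequisites

[PC environment]
    Windows 10 Pro
[Python version, local environment]
    Python 3.9.13
[Python version, virtual environment]
    Python 3.9.13

A virtual environment is not required, but I plan to delete everything once testing is done, so this time I am working inside one.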

Contents

1. How Tagger behaves in the Python MeCab library
2. Why do options like -Oyomi sometimes stop working when MeCab is used from Python?

1. How Tagger behaves in the Python MeCab library

When using MeCab from Python, a common approach is to install the library with pip install mecab-python3 and then create an instance in Jupyter Notebook or similar with code like the following.

Sample code

import MeCab

tagger = MeCab.Tagger()

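To understand what actually happens here, I looked inside the Tagger class and found the following.

・It first checks whether import unidic or import unidic_lite succeeds.

・If one of them succeeds, the default paths for the resource file and the dictionary folder are set to that package's paths.

・The dictionary can still be changed by passing -r [resource file] -d [dictionary folder path] in the Tagger arguments.

・It appears to encode strings as UTF-8 before handing them to MeCab. (I have not fully worked through this part, but could this be why text gets garbled when Python uses a MeCab on the OS side that was not installed with UTF-8?)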
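To see which of these cases applies in a given environment, the same import check can be run by hand. The following is a minimal sketch, not part of mecab-python3: the helper name find_pip_unidic is made up for illustration, and it assumes only that both packages expose a DICDIR attribute, as the library code quoted in this section does.

```python
import importlib


def find_pip_unidic():
    """Return (package_name, dicdir) for the first pip-installed UniDic
    package found, or (None, None) if neither can be imported.

    Hypothetical helper: it tries unidic first and then unidic_lite,
    the same order as the Tagger code quoted in this article.
    """
    for name in ("unidic", "unidic_lite"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        return name, module.DICDIR
    return None, None


if __name__ == "__main__":
    name, dicdir = find_pip_unidic()
    if name is None:
        print("No pip-installed UniDic found; Tagger adds no -r/-d defaults.")
    else:
        print("Tagger would default to {} at {}".format(name, dicdir))
```

If this reports a package, a plain MeCab.Tagger() is probably using that dictionary rather than the one installed with MeCab on the OS side.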

[Python library folder]/site-packages/MeCab/__init__.py

class Tagger(_MeCab.Tagger):
    def __init__(self, rawargs=""):
        # First check for Unidic. (Note: this is where unidic or unidic_lite gets set as the default dictionary.)
        unidicdir = try_import_unidic()
        args = rawargs
        if unidicdir:
            mecabrc = os.path.join(unidicdir, 'mecabrc')
            args = '-r "{}" -d "{}" '.format(mecabrc, unidicdir) + args

        # The first argument here isn't used. In the MeCab binary the argc and
        # argv from the shell are re-used, so the first element will be the
        # binary name.
        args = ['', '-C'] + shlex.split(args)

        # need to encode the strings to bytes, see here:
        # https://stackoverflow.com/questions/48391926/python-swig-in-typemap-does-not-work
        args = [x.encode('utf-8') for x in args]

        try:
            super(Tagger, self).__init__(args)
        except RuntimeError as ee:
            raise RuntimeError(error_info(rawargs)) from ee


def try_import_unidic():
    """Import unidic or unidic-lite if available. Return dicdir.

    This is specifically for dictionaries installed via pip.
    """
    try:
        import unidic
        return unidic.DICDIR
    except ImportError:
        try:
            import unidic_lite
            return unidic_lite.DICDIR
        except ImportError:
            # This is OK, just give up.
            return
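The argument handling above can be tried out without MeCab installed. The sketch below copies only the string-building steps from the quoted __init__ into a standalone function; build_tagger_args is a hypothetical name, and the dictionary path in the usage lines is a placeholder.

```python
import os
import shlex


def build_tagger_args(rawargs="", unidicdir=None):
    """Reproduce how the quoted Tagger.__init__ assembles its arguments.

    Hypothetical standalone copy for inspection; it does not call MeCab.
    """
    args = rawargs
    if unidicdir:
        # The pip-installed dictionary is put in front of the user's arguments.
        mecabrc = os.path.join(unidicdir, "mecabrc")
        args = '-r "{}" -d "{}" '.format(mecabrc, unidicdir) + args
    # A dummy program name and -C are prepended, then everything is UTF-8 encoded.
    return [x.encode("utf-8") for x in ["", "-C"] + shlex.split(args)]


if __name__ == "__main__":
    # "/path/to/unidic_lite/dicdir" is a placeholder, not a real location.
    for arg in build_tagger_args("-Oyomi", "/path/to/unidic_lite/dicdir"):
        print(arg)
```

Because the user's arguments are appended after the defaults, an explicit -r or -d ends up later in the list than the UniDic one. That ordering is consistent with the explicit options taking effect, although this sketch does not verify how MeCab itself resolves duplicated options.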

2. Why do options like `-Oyomi` sometimes stop working when MeCab is used from Python?

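First, regarding what something like MeCab.Tagger("-Oyomi") means: -O**** is the option that selects the output format.

As described in section 1, unless -r [] -d [] are given explicitly, the default dictionary may be unidic or unidic_lite. If that dictionary's resource files (dicrc or mecabrc) do not define the requested format, that -O format cannot be used.

Default mecabrc of unidic_lite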
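One way to check this is to list the formats that a resource file declares. The sketch below assumes formats are declared with node-format-<name> = ... lines, as in the mecabrc snippet later in this article; defined_output_formats is a hypothetical helper, and formats built into MeCab itself are not detected by it.

```python
import re

# Matches declarations such as "node-format-yomi = %pS%f[7]".
_NODE_FORMAT = re.compile(r"node-format-([A-Za-z0-9_-]+)\s*=")


def defined_output_formats(rc_text):
    """Return the set of format names declared via node-format-<name> lines.

    Hypothetical helper: it inspects only the text it is given, so run it
    on both dicrc and mecabrc to see everything the dictionary declares.
    """
    formats = set()
    for line in rc_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (both ';' and '#' styles appear in this article).
        if not line or line.startswith((";", "#")):
            continue
        match = _NODE_FORMAT.match(line)
        if match:
            formats.add(match.group(1))
    return formats
```

For example, defined_output_formats(open(path, encoding="utf-8").read()) on the dummy mecabrc shown in this article returns an empty set, which is consistent with -Oyomi failing unless dicrc defines it.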

# This is a dummy file
# It has to exist, but it can be empty

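As shown above, the file contains no settings at all.
Rewriting it as follows should make -Oyomi and -Ochasen usable. Note that these definitions follow the ipadic format, so field indices such as %f[7] may need adjusting to match UniDic's feature layout.

Try adding the following to unidic_lite's mecabrc.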

# This is a dummy file
# It has to exist, but it can be empty

; yomi
node-format-yomi = %pS%f[7]
unk-format-yomi = %M
eos-format-yomi  = \n

; ChaSen
node-format-chasen = %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-chasen  = %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
eos-format-chasen  = EOS\n
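If the edit has to be repeated, for example after recreating the virtual environment, it can be scripted. The following is a minimal sketch that adds only the yomi definitions shown above and skips the change when they are already present; with_yomi_format is a hypothetical helper and the file path in the usage comment is a placeholder.

```python
# The yomi definitions from this article, kept as raw strings so that
# the "\n" in eos-format-yomi stays a literal backslash followed by "n".
YOMI_FORMAT = "\n".join([
    "; yomi",
    r"node-format-yomi = %pS%f[7]",
    r"unk-format-yomi = %M",
    r"eos-format-yomi  = \n",
]) + "\n"


def with_yomi_format(rc_text):
    """Return rc_text with the yomi definitions appended.

    Hypothetical helper: the text is returned unchanged when it already
    contains a node-format-yomi line, so applying it twice is harmless.
    """
    if "node-format-yomi" in rc_text:
        return rc_text
    if rc_text and not rc_text.endswith("\n"):
        rc_text += "\n"
    return rc_text + "\n" + YOMI_FORMAT


# Usage (placeholder path; back up the original file first):
#   from pathlib import Path
#   rc = Path("path/to/unidic_lite/dicdir/mecabrc")
#   rc.write_text(with_yomi_format(rc.read_text(encoding="utf-8")), encoding="utf-8")
```

Working on the text rather than the file directly keeps the function easy to test and makes it easy to preview the result before overwriting anything inside site-packages.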

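Even if the default dictionary is unidic_lite or similar, it is of course also fine to put the ipadic from the MeCab installed on the OS side in a location of your choice and load it explicitly.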

# If unidic is installed, it is used as the default dictionary, so it has to be overridden explicitly.
# tagger = MeCab.Tagger("")
tagger = MeCab.Tagger("-r C:/ipadic/dicrc -d C:/ipadic -Ochasen")

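Note: because the dictionary can be chosen explicitly this way, even if the MeCab on the OS side is built for SHIFT-JIS, you can recompile a dictionary (ipadic) in UTF-8 once, keep it somewhere, and point to it with the -r and -d options. It can then be used regardless of the character encoding of the MeCab currently installed on the OS.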
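When the dictionary location comes from a variable rather than a literal, the argument string can be assembled by a small helper. This is a sketch under two assumptions: dicrc sits directly inside the dictionary folder, as in the example above, and the helper name explicit_dictionary_args is made up for illustration. Paths are converted to forward slashes and quoted because Tagger passes the string through shlex.split, which treats unquoted backslashes as escape characters.

```python
import shlex
from pathlib import PureWindowsPath


def explicit_dictionary_args(dicdir, output_format=None):
    """Build a Tagger argument string that selects a dictionary explicitly.

    Hypothetical helper: assumes dicrc is located inside dicdir.
    """
    dic = PureWindowsPath(dicdir)
    parts = [
        '-r "{}"'.format((dic / "dicrc").as_posix()),
        '-d "{}"'.format(dic.as_posix()),
    ]
    if output_format:
        parts.append("-O" + output_format)
    return " ".join(parts)


if __name__ == "__main__":
    args = explicit_dictionary_args(r"C:\ipadic", "chasen")
    print(args)
    # What Tagger would see after splitting the string:
    print(shlex.split(args))
```

The resulting string can be passed as MeCab.Tagger(args). Whether the -Ochasen part works still depends on the chosen dictionary's dicrc defining that format.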
