
How to resolve a RuntimeError when using MeCab from Python

Posted at 2019-06-25

Notes on diagnosing a RuntimeError raised while using MeCab from Python.

Situation

Traceback (most recent call last):
  File "hoge.py", line 104, in <module>
    run()
  File "hoge.py", line 60, in run
    tagger = MeCab.Tagger('-u user_term.dic')
  File "/home/bwtakacy/Develop/hoge/venv/lib/python3.6/site-packages/MeCab.py", line 307, in __init__
    this = _MeCab.new_Tagger(*args)
RuntimeError

First thing to try

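The Python MeCab binding apparently cannot pass along MeCab's own error message, so the failure surfaces as a bare RuntimeError. To find out what actually went wrong, invoke MeCab from the command line with the same options.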
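The same check can be scripted. The following is a minimal sketch, assuming the `mecab` command is on PATH; the helper name `mecab_cli_error` and its `runner` parameter are hypothetical and exist only for illustration. It captures both stdout and stderr, on the assumption that the message could be written to either stream.

```python
import subprocess


def mecab_cli_error(options, runner=subprocess.run):
    """Run the `mecab` CLI with the given options and return what it printed.

    `runner` is a hypothetical injection point so the function can be
    exercised without MeCab installed; by default it is subprocess.run.
    """
    cmd = ["mecab", *options]
    try:
        # Pass empty stdin so mecab exits instead of waiting for input
        # when the options turn out to be valid.
        proc = runner(cmd, input="", capture_output=True, text=True)
    except FileNotFoundError:
        return "mecab command not found on PATH"
    # Collect both streams, since the message may land on either one.
    output = (proc.stderr or "") + (proc.stdout or "")
    return output.strip()
```

One way to use it is to wrap the `MeCab.Tagger(...)` call in `try` / `except RuntimeError`, print `mecab_cli_error(["-u", "user_term.dic"])` in the handler, and then re-raise.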

In this case

$ mecab -u user_term.dic
viterbi.cpp(50) [tokenizer_->open(param)] tokenizer.cpp(130) [sysdic->isCompatible(*d)] incompatible dictionary: user_term.dic

This error appears to occur when the version or encoding of the system dictionary and the user dictionary do not match.

ref: https://github.com/taku910/mecab/blob/3a07c4eefaffb4e7a0690a7f4e5e0263d3ddb8a3/mecab/src/tokenizer.cpp#L131

The -D option shows how the dictionaries are configured in the current MeCab environment.

$ mecab -D
filename:	/usr/local/lib/mecab/dic/mecab-ipadic-neologd/sys.dic
version:	102
charset:	UTF8
type:	0
size:	4542610
left size:	1316
right size:	1316

filename:	/home/bwtakacy/Develop/hoge/user_term.dic
version:	102
charset:	EUC-JP
type:	1
size:	6095
left size:	1316
right size:	1316
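Spotting the mismatch in this output can also be automated. The sketch below assumes the output keeps the `key:<TAB>value` layout shown above, with a blank line between dictionaries. The function names are hypothetical, and the charset normalization (uppercase, hyphens and underscores removed) is a simplification of my own, not MeCab's actual comparison logic.

```python
def parse_dictionary_info(text):
    """Split `mecab -D` style output into one dict per dictionary."""
    entries = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current dictionary block.
            if current:
                entries.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def normalize_charset(name):
    """Treat spellings such as 'UTF8' and 'utf-8' as the same charset."""
    return name.upper().replace("-", "").replace("_", "")


def distinct_charsets(entries):
    """Return the set of normalized charsets found across all dictionaries."""
    return {normalize_charset(e["charset"]) for e in entries if "charset" in e}
```

If `distinct_charsets(parse_dictionary_info(output))` contains more than one element, the dictionaries disagree on encoding, which is the situation shown in the output above (UTF8 versus EUC-JP).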

Solution

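In this case the cause was that the system dictionary and the user dictionary had different encodings.
I had not specified the output encoding when compiling the user dictionary with the mecab-dict-index command, so it was built with the default, EUC-JP. Specifying utf-8 with the -t option fixed the problem.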

The options of mecab-dict-index are as follows.

$ /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index --help
MeCab: Yet Another Part-of-Speech and Morphological Analyzer

Copyright(C) 2001-2012 Taku Kudo 
Copyright(C) 2004-2008 Nippon Telegraph and Telephone Corporation

Usage: /usr/local/Cellar/mecab/0.996/libexec/mecab/mecab-dict-index [options] files
 -d, --dicdir=DIR                    set DIR as dic dir (default ".")
 -o, --outdir=DIR                    set DIR as output dir (default ".")
 -m, --model=FILE                    use FILE as model file
 -u, --userdic=FILE                  build user dictionary
 -a, --assign-user-dictionary-costs  only assign costs/ids to user dictionary
 -U, --build-unknown                 build parameters for unknown words
 -M, --build-model                   build model file
 -C, --build-charcategory            build character category maps
 -s, --build-sysdic                  build system dictionary
 -m, --build-matrix                  build connection matrix
 -c, --charset=ENC                   make charset of binary dictionary ENC (default EUC-JP)
 -t, --charset=ENC                   alias of -c
 -f, --dictionary-charset=ENC        assume charset of input CSVs as ENC (default EUC-JP)
 -w, --wakati                        build wakati-gaki only dictionary
 -p, --posid                         assign Part-of-speech id
 -F, --node-format=STR               use STR as the user defined node format
 -v, --version                       show the version and exit.
 -h, --help                          show this help and exit.
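Using only the options listed above, the rebuild command can be assembled as in the sketch below. This is an illustration under assumptions, not the exact command from my environment: the helper name `build_userdic_command` and the input file name `user_term.csv` are hypothetical, and the sketch assumes the source CSV is itself UTF-8, so the same charset is passed to both -f and -t.

```python
def build_userdic_command(csv_path, dic_path, sysdic_dir,
                          charset="utf-8", index_cmd="mecab-dict-index"):
    """Build the argument list for compiling a user dictionary.

    index_cmd may need to be a full path, since mecab-dict-index is
    often installed outside PATH (for example under libexec/mecab).
    """
    return [
        index_cmd,
        "-d", sysdic_dir,   # system dictionary directory
        "-u", dic_path,     # user dictionary file to build
        "-f", charset,      # charset of the input CSV (assumed UTF-8 here)
        "-t", charset,      # charset of the compiled binary dictionary
        csv_path,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Afterwards, `mecab -D` should report the same charset for both dictionaries.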

That's all.
