
Working around the bug where node.surface cannot be read with python3 + mecab

Posted at 2016-01-28

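mecab is a morphological analyzer for Japanese.
It is excellent as a standalone tool, and it is also embedded in many programming languages and used in all sorts of places.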

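On Python 3, however, you can hit a case where node.surface, which should return the token's text, fails to return it and raises an error.
These are my notes on how to deal with that.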

Environment

Mac OS X Yosemite
Python 3.4.4 :: Anaconda 2.4.1
mecab-python3 0.7

Code like the following triggers the bug.

import MeCab

tagger = MeCab.Tagger('-Ochasen')
node = tagger.parseToNode(sentence)  # sentence: the Japanese text to analyze
while node:
    print(node.surface)  # <= the text cannot be read and an encoding error occurs
    node = node.next

The fix is to parse an empty string once before parsing the actual target string. (Reference: How to use MeCab with Ubuntu 14.04 and Python 3)

import MeCab

tagger = MeCab.Tagger('-Ochasen')
tagger.parse('')  # <= parse an empty string first
node = tagger.parseToNode(sentence)  # sentence: the Japanese text to analyze
while node:
    print(node.surface)  # <= the text can be read!
    node = node.next
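
So that the empty parse is never forgotten, the workaround can be wrapped in a small helper. This is only a minimal sketch: the name iter_surfaces is hypothetical and does not come from mecab-python3, and it assumes the tagger exposes parse and parseToNode exactly as in the snippet above.

```python
def iter_surfaces(tagger, sentence):
    """Yield the surface string of each token in `sentence`.

    `tagger` is assumed to behave like MeCab.Tagger, i.e. to provide
    parse() and parseToNode(). The helper name is hypothetical.
    """
    # Workaround from this article: parse an empty string once
    # before parsing the real input.
    tagger.parse('')
    node = tagger.parseToNode(sentence)
    while node:
        # The BOS/EOS nodes carry an empty surface, so skip them.
        if node.surface:
            yield node.surface
        node = node.next


# Usage with the real binding (requires mecab-python3 and a dictionary):
#
#   import MeCab
#   tagger = MeCab.Tagger('-Ochasen')
#   for surface in iter_surfaces(tagger, sentence):
#       print(surface)
```

Passing the tagger in as an argument keeps the helper independent of how the Tagger was constructed, so the same options ('-Ochasen' here) can be reused across calls.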

I don't really know why this works, but it appears to be a known bug.
It is far too easy a trap to fall into, so I hope it gets fixed soon...
