More than 5 years have passed since last update.

HuggingFaceのBertJapaneseTokenizerで分かち書きしようとするとMeCabのinitializingで落ちたりencodeでも落ちたりする

Last updated at 2020-07-29Posted at 2020-07-29

同じエラーで躓いた人の調べる時間が少しでも減りますように、メモとして記事に起こしておきます。

以下はGoogle Colab上で実行しています。

こちらを参考にColab上でMeCabとhuggingfaceのtransformersをインストールします。

!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3
!pip install transformers

日本語用BERTのtokenizerで分かち書きを試みます。

from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# 日本語BERT用のtokenizerを宣言
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

text = "自然言語処理はとても楽しい。"

wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)

以下のようなエラーがでてしまいました。

----------------------------------------------------------

Failed initializing MeCab. Please see the README for possible solutions:

    https://github.com/SamuraiT/mecab-python3#common-issues

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

    https://github.com/SamuraiT/mecab-python3/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: 
error message: [ifs] no such file or directory: /usr/local/etc/mecabrc
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-f828f6470517> in <module>()
      2 
      3 # 日本語BERT用のtokenizerを宣言
----> 4 tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
      5 
      6 text = "自然言語処理はとても楽しい。"

4 frames
/usr/local/lib/python3.6/dist-packages/MeCab/__init__.py in __init__(self, rawargs)
    122 
    123         try:
--> 124             super(Tagger, self).__init__(args)
    125         except RuntimeError:
    126             error_info(rawargs)

RuntimeError:

エラー出力に親切にこちらを見るように言ってくれているので、URL先の指示通りに、mecab-python3をインストールするときに

pip install unidic-lite

も実行すればMeCabのinitializingで落ちることはなくなりました。が、今度はencodeでValueError: too many values to unpack (expected 2)と怒られてしまいました。

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-f828f6470517> in <module>()
      6 text = "自然言語処理はとても楽しい。"
      7 
----> 8 wakati_ids = tokenizer.encode(text, return_tensors='pt')
      9 print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
     10 print(wakati_ids)

8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
    205                 break
    206 
--> 207             token, _ = line.split("\t")
    208             token_start = text.index(token, cursor)
    209             token_end = token_start + len(token)

ValueError: too many values to unpack (expected 2)

このエラーについてはこちらでmecab-python3の開発者？の方が言及されている通り、mecab-python3のバージョンを0.996.5に指定してやれば解決しました。

まとめると、pipであれこれインストールするときは以下のように宣言すればエラーでなくなると思います。

!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.996.5
!pip install unidic-lite
!pip install transformers

↑を実行する前に既にpipでmecab-python3の最新版をインストールしてしまっている場合は、一度colabのセッションを再接続するのをお忘れなく。セッションの切断の仕方はcolab画面の右上のRAMとかディスクとか書いてる▼をクリックして、セッションの管理から行えます。

from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# 日本語BERT用のtokenizerを宣言
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

text = "自然言語処理はとても楽しい。"

wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)

# Downloading: 100%
# 258k/258k [00:00<00:00, 1.58MB/s]
#
# ['[CLS]', '自然', '言語', '処理', 'は', 'とても', '楽しい', '。', '[SEP]']
# tensor([[    2,  1757,  1882,  2762,     9,  8567, 19835,     8,     3]])

無事にBertJapaneseTokenizerで分かち書きできました。

おわり

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up