
Successfully using Tohoku University's pretrained Japanese model 'cl-tohoku/bert-base-japanese-char-whole-word-masking' with Huggingface Transformers version 3.5.1

Posted: 2020-12-10

:frowning2:
Huggingface Transformers gets one version bump after another, and apparently the names of methods and pretrained models change from version to version...

This article presents code that successfully loads Tohoku University's published 'cl-tohoku/bert-base-japanese-char-whole-word-masking' under version 3.5.1.
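To jump ahead, here is a minimal sketch of the combination that eventually worked for me (transformers 3.5.1 plus the fugashi and ipadic packages, and the model name with its 'cl-tohoku/' namespace); each step is walked through below.

Python
# Minimal sketch of the working setup (details in the walkthrough below).
# Prerequisites: pip install "transformers==3.5.1" torch fugashi ipadic
import torch
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
from transformers.modeling_bert import BertModel

# The 'cl-tohoku/' namespace prefix is required under 3.5.1;
# the bare short name 'bert-base-japanese' no longer resolves.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    'cl-tohoku/bert-base-japanese-char-whole-word-masking')
model = BertModel.from_pretrained(
    'cl-tohoku/bert-base-japanese-char-whole-word-masking')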

Environment

Terminal
% python --version
Python 3.6.3
% python

What I ran

An article by @ichiroex: "Creating Japanese sentence vectors with the BertModel from huggingface/transformers"

Error

Python 3.6.3
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'transformers.tokenization_bert_japanese'
>>> quit()

I found sites saying things like this...

Python Error: No module named transformers.tokenization_bert_japanese

Install specific version:

pip install "transformers==2.5.1"

teratail: "A module that worked until recently can no longer be imported"

!pip install transformers
from transformers.modeling_bert import BertModel
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer


ModuleNotFoundError: No module named 'transformers.modeling_bert'
ModuleNotFoundError: No module named 'transformers.tokenization_bert_japanese'

It seems the transformers version had been updated a few days earlier. It worked with transformers==3.5.1.

Trying 2.5.1 first

Terminal
% pip install "transformers==2.5.1"
Collecting transformers==2.5.1
  Downloading transformers-2.5.1-py3-none-any.whl (499 kB)
     |████████████████████████████████| 499 kB 8.4 MB/s 
Successfully installed boto3-1.16.33 botocore-1.19.33 jmespath-0.10.0 python-dateutil-2.8.1 s3transfer-0.3.3 sentencepiece-0.1.94 tokenizers-0.5.2 transformers-2.5.1
% 

The imports succeed

>>> import torch
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
>>> from transformers.modeling_bert import BertModel

But the following does not work

>>> tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 390, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 493, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'bert-base-japanese' was not found in tokenizers model name list (bert-base-japanese, bert-base-japanese-whole-word-masking, bert-base-japanese-char, bert-base-japanese-char-whole-word-masking). We assumed 'bert-base-japanese' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
>>> 

Trying 3.5.1

% pip install "transformers==3.5.1"
Collecting transformers==3.5.1
  Downloading transformers-3.5.1-py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 9.2 MB/s 

( ... output omitted ... )


Still no good

Python 3.6.3
% python
Python 3.6.3 (default, Dec 10 2020, 22:43:16) 
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import torch
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
>>> from transformers.modeling_bert import BertModel
>>> tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1644, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'bert-base-japanese'. Make sure that:

- 'bert-base-japanese' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'bert-base-japanese' is the correct path to a directory containing relevant tokenizer files


>>> 

It turns out that the name of the Japanese pretrained model usable with Transformers 3.5.1 is 'cl-tohoku/bert-base-japanese-char-whole-word-masking'

https://huggingface.co/models?search=japanese
https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
https://huggingface.co/cl-tohoku
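In other words, under 3.5.1 the un-namespaced shortcut no longer resolves, and you need the full Hub identifier with the organization prefix:

Python
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# Fails under 3.5.1 -- the un-namespaced shortcut is no longer resolved:
#   BertJapaneseTokenizer.from_pretrained('bert-base-japanese')  # OSError

# Works -- the full identifier with the 'cl-tohoku/' organization prefix:
tokenizer = BertJapaneseTokenizer.from_pretrained(
    'cl-tohoku/bert-base-japanese-char-whole-word-masking')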

(Screenshots omitted: the Hugging Face model hub search results and the pretrained models list.)

https://huggingface.co/transformers/pretrained_models.html

It looks like installing fugashi will fix this

Python 3.6.3
>>> tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.7k/15.7k [00:00<00:00, 91.5kB/s]
Traceback (most recent call last):
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 230, in __init__
    import fugashi
ModuleNotFoundError: No module named 'fugashi'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1653, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1725, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 151, in __init__
    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 233, in __init__
    "You need to install fugashi to use MecabTokenizer."
ModuleNotFoundError: You need to install fugashi to use MecabTokenizer.See https://pypi.org/project/fugashi/ for installation.
>>> 

Install fugashi

Terminal
% pip install fugashi             
Collecting fugashi
  Downloading fugashi-1.0.5-cp36-cp36m-macosx_10_14_x86_64.whl (283 kB)
     |████████████████████████████████| 283 kB 8.7 MB/s 
Installing collected packages: fugashi
Successfully installed fugashi-1.0.5

It looks like installing ipadic will fix this

Python 3.6.3
>>> import torch
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
>>> from transformers.modeling_bert import BertModel
>>> tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')

Traceback (most recent call last):
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 242, in __init__
    import ipadic
ModuleNotFoundError: No module named 'ipadic'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1653, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1725, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 151, in __init__
    do_lower_case=do_lower_case, never_split=never_split, **(mecab_kwargs or {})
  File "/Users/ocean/.pyenv/versions/3.6.3/lib/python3.6/site-packages/transformers/tokenization_bert_japanese.py", line 245, in __init__
    "The ipadic dictionary is not installed. "
ModuleNotFoundError: The ipadic dictionary is not installed. See https://github.com/polm/ipadic-py for installation.
>>> 

Install ipadic

Terminal
% pip install ipadic
Collecting ipadic
  Downloading ipadic-1.0.0.tar.gz (13.4 MB)
     |████████████████████████████████| 13.4 MB 8.5 MB/s 
Building wheels for collected packages: ipadic
  Building wheel for ipadic (setup.py) ... done
  Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556723 sha256=7065d9649eb32a0b0d6dab0b266857ca6825322bc66a5982813185f8b9799c5b
  Stored in directory: /Users/ocean/Library/Caches/pip/wheels/99/39/4c/e723d99fed7aad240a3bea84ef21430209f58b313a9e70f7d6
Successfully built ipadic
Installing collected packages: ipadic
Successfully installed ipadic-1.0.0
% 
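As a quick sanity check (my own addition, not part of the original session), fugashi can be pointed at the IPAdic dictionary directly via its GenericTagger:

Python
import fugashi
import ipadic

# Build a MeCab tagger backed by the IPAdic dictionary installed above.
tagger = fugashi.GenericTagger(ipadic.MECAB_ARGS)

# Tokenize a short sentence and print each surface form.
for word in tagger('日本語の形態素解析'):
    print(word.surface)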

Success!

Python 3.6.3
% python
Python 3.6.3 (default, Dec 10 2020, 22:43:16) 
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
>>> from transformers.modeling_bert import BertModel
>>> tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-char-whole-word-masking')
>>> 
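Since this is the char variant, it splits text into individual characters after the MeCab word segmentation; a quick check (my own example, not in the original session):

Python
# The char model tokenizes per character rather than per subword,
# so something along these lines is expected:
print(tokenizer.tokenize('新型コロナウイルス'))
# -> ['新', '型', 'コ', 'ロ', 'ナ', 'ウ', 'イ', 'ル', 'ス'] (expected)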

From here on, I transcribe and run the code from @kenta1984's Qiita article.

This time, I use 'cl-tohoku/bert-base-japanese-whole-word-masking'.

The run succeeds

Python 3.6.3
Python 3.6.3 (default, Dec 10 2020, 22:43:16) 
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from transformers import BertJapaneseTokenizer, BertForMaskedLM
>>> from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
>>> from transformers.modeling_bert import BertModel
>>> tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258k/258k [00:00<00:00, 362kB/s]
>>> 
>>> text = '新型コロナウイルス対策について厚生労働省に助言する専門家の会合が開かれ、感染状況について引き続き最大限の警戒が必要な状況だと評価しました。そのうえで、これまでの対策について「全体として必ずしも新規感染者数を減少させることに成功しているとは言い難い」と評価し、今月中旬をめどに感染拡大が沈静化に向かうかどうか評価してさらなる対策について早急に検討する必要があるとしています。'
>>> 
>>> tokenized_text = tokenizer.tokenize(text)
>>> print(tokenized_text)
['新型', 'コロナ', 'ウイルス', '対策', 'について', '厚生', '労働省', 'に', '助言', 'する', '専門', '家', 'の', '会合', 'が', '開か', 'れ', '、', '感染', '状況', 'について', '引き続き', '最大限', 'の', '警戒', 'が', '必要', 'な', '状況', 'だ', 'と', '評価', 'し', 'まし', 'た', '。', 'その', '##う', '##え', '##で', '、', 'これ', 'まで', 'の', '対策', 'について', '「', '全体', 'として', '必ずしも', '新規', '感染', '者', '数', 'を', '減少', 'さ', 'せる', 'こと', 'に', '成功', 'し', 'て', 'いる', 'と', 'は', '言い', '難い', '」', 'と', '評価', 'し', '、', '今', '##月', '中旬', 'を', 'め', '##ど', 'に', '感染', '拡大', 'が', '沈', '##静', '化', 'に', '向かう', 'か', 'どう', 'か', '評価', 'し', 'て', 'さらなる', '対策', 'について', '早', '##急', 'に', '検討', 'する', '必要', 'が', 'ある', 'と', 'し', 'て', 'い', 'ます', '。']
>>> 
>>> masked_index = 2
>>> tokenized_text[masked_index] = '[MASK]'
>>> print(tokenized_text[masked_index])
[MASK]
>>> 
>>> indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
>>> tokens_tensor = torch.tensor([indexed_tokens])
>>> model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 479/479 [00:00<00:00, 305kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 445M/445M [00:26<00:00, 16.6MB/s]
Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
>>> 
>>> model.eval()
BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)

( ... omitted ... )

          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (cls): BertOnlyMLMHead(
    (predictions): BertLMPredictionHead(
      (transform): BertPredictionHeadTransform(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
      (decoder): Linear(in_features=768, out_features=32000, bias=True)
    )
  )
)
>>> 
>>> with torch.no_grad():
...     outputs = model(tokens_tensor)
...     predictions = outputs[0][0, masked_index].topk(5) # extract the top 5 predictions
... 
>>> 
>>> for i, index_t in enumerate(predictions.indices):
...     index = index_t.item()
...     token = tokenizer.convert_ids_to_tokens([index])[0]
...     print(i, token)
... 
0 
1 感染
2 
3 ##病
4 
>>> 
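For convenience, the REPL session above can be collected into a single script. This is a sketch under the same environment (transformers 3.5.1, torch, fugashi, ipadic); the shortened sample sentence and variable names are my own.

Python
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name = 'cl-tohoku/bert-base-japanese-whole-word-masking'
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.eval()

text = '新型コロナウイルス対策について厚生労働省に助言する。'
tokens = tokenizer.tokenize(text)

masked_index = 2            # mask the third token ('ウイルス')
tokens[masked_index] = '[MASK]'

tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    outputs = model(tokens_tensor)              # a tuple in 3.5.1; outputs[0] is the logits
    top5 = outputs[0][0, masked_index].topk(5)  # top five candidates at the masked position

for rank, index_t in enumerate(top5.indices):
    print(rank, tokenizer.convert_ids_to_tokens([index_t.item()])[0])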