
How does the Thai language model PhayaThaiBERT tokenize "แม่อย่าเก็บไว้คนเดียว"?

Posted at 2024-06-25, last updated at 2024-06-25

While reading "PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords" by Panyut Sriwirote, Jalinee Thapiang, Vasan Timtong, and Attapol T. Rutherford, I became curious about how the tokenizer of this PhayaThaiBERT behaves. The paper says:

Since most of the model’s parameters are transfered directly from WangchanBERTa, it is imperative that we also preprocess and tokenize our training data in the same manner. Accordingly, we preprocess and tokenize all of our training data following the procedure described in the Methodology section of the WangchanBERTa paper, with the exception of the tokenization section, where we will use our expanded tokenizer instead.

Even so, I suspected that it had inherited a weakness of WangchanBERTa, namely that it produces tokens that span multiple words.

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("clicknext/phayathaibert")
>>> print(tkz.convert_ids_to_tokens(tkz("แม่อย่าเก็บไว้คนเดียว")["input_ids"]))
['<s>', '▁แม่', 'อย่า', 'เก็บไว้', 'คนเดียว', '</s>']
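One way to check whether any of these tokens span more than one word is to compare character offsets against a word segmentation of the same sentence. The following is a minimal sketch, not part of the original experiment: the token list is hard-coded from the output above (without `<s>` and `</s>`), the segmentation แม่ / อย่า / เก็บ / ไว้ / คน / เดียว is assumed for illustration, and the helper names `to_spans` and `crossing_tokens` are hypothetical.

```python
# Tokens copied from the output above, without <s> and </s>.
# "▁" is the SentencePiece word-start marker and is stripped before comparing.
tokens = ["▁แม่", "อย่า", "เก็บไว้", "คนเดียว"]

# Word segmentation assumed for this illustration.
gold_words = ["แม่", "อย่า", "เก็บ", "ไว้", "คน", "เดียว"]


def to_spans(pieces):
    """Turn consecutive pieces into (start, end) character offsets."""
    spans, pos = [], 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def crossing_tokens(tokens, gold_words):
    """Return the tokens that have a gold word boundary strictly inside them."""
    surface = [t.replace("▁", "") for t in tokens]
    # Both sequences must spell out the same text for offsets to be comparable.
    assert "".join(surface) == "".join(gold_words)
    boundaries = {end for _, end in to_spans(gold_words)}
    return [
        tok
        for tok, (start, end) in zip(surface, to_spans(surface))
        if any(start < b < end for b in boundaries)
    ]


print(crossing_tokens(tokens, gold_words))  # ['เก็บไว้', 'คนเดียว']
```

Offsets here are counted in Unicode code points, so Thai vowel and tone marks each count as one position; that is enough for this comparison because both sides are measured the same way.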

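When I tokenized "แม่อย่าเก็บไว้คนเดียว", "แม่" and "อย่า" were tokenized correctly, but "เก็บ" and "ไว้" were merged into one token, as were "คน" and "เดียว". So it does seem to have inherited this weakness of WangchanBERTa. The paper uses the "LST20 corpus" as one of its evaluation targets, so why is it so indifferent to the "quality" of its Thai tokenizer?

Incidentally, I changed the second line of the script in my April 11 article to "models=["clicknext/phayathaibert","airesearch/wangchanberta-base-att-spm-uncased"]" and ran the tokenizer benchmark on UD_Thai-PUD. The results are as follows.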

*** clicknext/phayathaibert
Precision 0.3978728432994564
Recall    0.3770719469581579
F1 Score  0.3871932286036295

*** airesearch/wangchanberta-base-att-spm-uncased
Precision 0.40143651529193697
Recall    0.3880924648329003
F1 Score  0.3946517242950207

Frankly, these scores are quite low.
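To give a feel for what such numbers measure, here is a minimal sketch of one common way to score a tokenizer against a gold segmentation: a predicted token counts as correct only if its character span exactly matches a gold word span. This is an assumption for illustration and not necessarily how the benchmark script in the April 11 article computes its scores; the function names `to_span_set` and `span_prf` are hypothetical, and the gold segmentation is the one assumed for the example sentence.

```python
def to_span_set(pieces):
    """Turn consecutive pieces into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for piece in pieces:
        if piece:  # skip empty pieces so they do not count as tokens
            spans.add((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def span_prf(predicted, gold):
    """Exact-span precision, recall and F1 of predicted tokens against gold words."""
    pred_spans, gold_spans = to_span_set(predicted), to_span_set(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tokenizer output for the example sentence, with special tokens and "▁" removed.
predicted = ["แม่", "อย่า", "เก็บไว้", "คนเดียว"]
# Word segmentation assumed for this illustration.
gold = ["แม่", "อย่า", "เก็บ", "ไว้", "คน", "เดียว"]

precision, recall, f1 = span_prf(predicted, gold)
print(f"Precision {precision:.3f}")  # 2 of 4 predicted tokens match a gold word
print(f"Recall    {recall:.3f}")     # 2 of 6 gold words are recovered
print(f"F1 Score  {f1:.3f}")
```

Under this scoring, a single merged token such as "เก็บไว้" costs one precision miss and two recall misses, which is why tokens that span word boundaries pull all three numbers down quickly.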
