How does the Tibetan model TiBERT tokenize "ད་རིང་ཨ་མ་ཤི་བ།"?

Posted at 2024-01-20 (last updated at 2024-01-20)

While reading "TiBERT: Tibetan Pre-trained Language Model" by Yuan Sun, Sisi Liu, Junjie Deng, and Xiaobing Zhao, I got curious about how TiBERT's tokenizer actually works. The reason is that the paper says:

Sentencepiece provides four modes: bpe, unigram, char, and word. The bpe model can only generate a unique sub-word sequence for a sentence, while the unigram language model can generate multiple candidate sub-word sequences, which can make the model more robust to noise and sub-word segmentation errors. Therefore, this paper uses the unigram language model to generate a vocabulary which can cover 99.95% of the characters in the dataset. Finally, the Tibetan vocabulary we constructed contains 30,005 words.

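So the paper states explicitly that it uses the Unigram model of Sentencepiece. But when I downloaded the model, the archive contained only config.json, pytorch.bin, and vocab.txt, with no spiece.model in sight. This vocab.txt does have 30005 lines, and the first 20 are: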

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
#
<s>
</s>
།
▁
་
་དང་
་།
འི་
་ལ་
་གྱི་
ས
་ཀྱི་
་ལ
་ནས་

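Hmm. Sentencepiece is a probability-based tokenizer, so being handed a bare list of Unigram tokens without their probabilities is not much help. Well, there is no way around it, so let's try assembling a tokenizer with BertTokenizerFast. On Google Colaboratory, it looks like this.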

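# Download and unpack the TiBERT archive only if it is not already present.
!test -f TiBERT.zip || curl -LO https://tibert.cmli-nlp.com/model/TiBERT.zip
!test -d TiBERT（初始） || unzip TiBERT.zip
from transformers import BertTokenizerFast
# Build a BERT tokenizer from vocab.txt; wordpieces_prefix="" drops the usual "##" continuation prefix.
tkz=BertTokenizerFast(vocab_file="TiBERT（初始）/vocab.txt",wordpieces_prefix="")
# Tokenize the example sentence and map the ids back to tokens.
print(tkz.convert_ids_to_tokens(tkz("ད་རིང་ཨ་མ་ཤི་བ།")["input_ids"]))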

Tokenizing the Tibetan example sentence "ད་རིང་ཨ་མ་ཤི་བ།", I (Koichi Yasuoka) got the following result on my machine.

['[CLS]', 'ད', '་', 'རང', '་', 'ཨ', '་', 'མ', '་', 'ཤ', '་', 'བ', '།', '[SEP]']
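Two things stand out in this output: the vowel sign ི has disappeared (རིང came out as རང, ཤི as ཤ), and every tsheg ་ has become a token of its own. The sketch below is one possible way to look into that using only the standard library. It rests on an assumption about BertTokenizerFast's defaults that is not verified here: that, as in the original BERT, the normalizer drops nonspacing marks (Unicode category Mn) when lowercasing is enabled, and that the pre-tokenizer isolates every Unicode punctuation character. The helper names `strip_nonspacing_marks` and `split_on_punctuation` are hypothetical and only mimic that assumed behaviour; they are not part of transformers.

```python
import unicodedata

SENTENCE = "ད་རིང་ཨ་མ་ཤི་བ།"

def describe(text):
    """Return (character, code point, Unicode general category) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

def strip_nonspacing_marks(text):
    """Assumed BERT-style accent stripping: NFD-normalize, then drop category-Mn characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def split_on_punctuation(text):
    """Assumed BERT-style pre-tokenization: each punctuation character (category P*) stands alone."""
    pieces, current = [], ""
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces

# Show the category of every character in the example sentence.
for row in describe(SENTENCE):
    print(row)

# Apply the two assumed steps in order and print the resulting pieces.
print(split_on_punctuation(strip_nonspacing_marks(SENTENCE)))
```

The vowel sign ི (U+0F72) is category Mn, while the tsheg ་ (U+0F0B) and the shad ། (U+0F0D) are category Po, so under the assumptions above the pieces should line up with the tokens between [CLS] and [SEP]. If that reading is right, multi-syllable vocabulary entries such as ་དང་ can never be matched, because the text is cut at each tsheg before the vocabulary is consulted. BertTokenizerFast does accept a `strip_accents` argument, and passing `strip_accents=False` might keep the vowel signs, but that is untested here and would not change the splitting at tsheg.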

No, this is far too fine-grained. Hmm, how should the tokenizer be assembled so that it tokenizes properly?
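Part of the difficulty is the point raised earlier: a Unigram tokenizer picks one segmentation out of many candidates by probability, and vocab.txt lists the tokens without any probabilities. The toy sketch below illustrates why that matters. Everything in it is an assumption made for illustration: the Latin-letter vocabulary and its probabilities are made up and have nothing to do with TiBERT, and the search is a brute-force enumeration rather than the lattice-based search a real Sentencepiece model would use.

```python
import math

# Hypothetical toy vocabulary with made-up probabilities (not taken from TiBERT).
TOY_PROBS = {"a": 0.15, "b": 0.15, "c": 0.15, "ab": 0.30, "bc": 0.23, "abc": 0.02}

def all_segmentations(text, vocab):
    """Enumerate every way to split text into pieces that are all in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

def log_score(pieces, probs):
    """Unigram score of a segmentation: the sum of log-probabilities of its pieces."""
    return sum(math.log(probs[p]) for p in pieces)

def best_segmentation(text, probs):
    """Pick the candidate with the highest unigram score."""
    return max(all_segmentations(text, probs), key=lambda s: log_score(s, probs))

# With only the token list, every candidate is equally valid.
for candidate in all_segmentations("abc", TOY_PROBS):
    print(candidate)

# With probabilities, one candidate can be preferred over the others.
print(best_segmentation("abc", TOY_PROBS))
```

With these made-up numbers, "abc" has four candidate segmentations and the probabilities select ['ab', 'c']. Given only the token list, nothing distinguishes that candidate from the other three, which is the situation vocab.txt leaves us in.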
