
How does the Thai language model Typhoon-7B tokenize "แม่อย่าเก็บไว้คนเดียว"?

Posted at 2024-04-01

While reading "Typhoon: Thai Large Language Models" by Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpipitchai, I got curious about how the tokenizer of this "Typhoon-7B" actually works. The reason is that the paper says

In this work, we base our tokenizer on Mistral-7B tokenizer, but we further train an additional Thai subword tokenizer with 5,000 tokens and integrate it with the original tokenizer. This new tokenizer is created by training a SentencePiece model on around 8 million samples randomly selected from the Thai subset of MC4 data.

so the "Thai subword tokenizer" was built with SentencePiece, yet nothing is said about the "Thai Character Cluster" (คลัสเตอร์อักษรไทย). Let's try it out.

>>> from transformers import AutoTokenizer
>>> tkz=AutoTokenizer.from_pretrained("scb10x/typhoon-7b")
>>> print(tkz.convert_ids_to_tokens(tkz("แม่อย่าเก็บไว้คนเดียว")["input_ids"]))
['<s>', '▁แม', '่อย', '่า', 'เก็บ', 'ไว้', 'คน', 'เดียว']

Tokenizing "แม่อย่าเก็บไว้คนเดียว" shows that the boundaries around the first two words, "แม่" and "อย่า", are clearly wrong. Let's look a little closer.

>>> for t in tkz.convert_ids_to_tokens(tkz("แม่อย่าเก็บไว้คนเดียว")["input_ids"]):
...   print(" ",t)
...
  <s>
  ▁แม
  ่อย
  ่า
  เก็บ
  ไว้
  คน
  เดียว
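
To see which code point each token starts with, a small helper along the following lines may be useful. This is only a sketch using Python's standard `unicodedata` module: the function name `leading_codepoint` is made up for illustration, and the token list is copied by hand from the output above rather than recomputed with the tokenizer.

```python
import unicodedata

# Tokens copied from the output above; '<s>' is left out because it is a special token.
TOKENS = ['▁แม', '่อย', '่า', 'เก็บ', 'ไว้', 'คน', 'เดียว']

def leading_codepoint(token):
    """Return (code point, Unicode name, general category) of the first character,
    ignoring the SentencePiece word-boundary marker U+2581."""
    text = token.lstrip("\u2581")
    if not text:
        return None  # the token consisted only of the boundary marker
    ch = text[0]
    return (f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch))

for tok in TOKENS:
    print(repr(tok), leading_codepoint(tok))
```

A general category of `Mn` (nonspacing mark) on the first character means the token begins with a mark that belongs to the character before it.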

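The tone mark U+0E48 "THAI CHARACTER MAI EK" ends up at the head of a token instead of after its consonant, which is a pretty bad situation. I wonder what should be done about this kind of thing.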
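
One way to spot such splits automatically is to compare the token boundaries with a crude grouping that keeps every nonspacing mark attached to the character before it. The sketch below does only that. It is not the full Thai Character Cluster rule set (leading vowels such as เ or ไ are not handled), and the names `crude_clusters`, `cut_offsets`, and `cuts_inside_clusters` are hypothetical helpers invented for this example. The tokens are hard-coded from the output above.

```python
import unicodedata

TEXT = "แม่อย่าเก็บไว้คนเดียว"
# Tokens from the output above, with '<s>' dropped and the '▁' marker removed.
TOKENS = ['แม', '่อย', '่า', 'เก็บ', 'ไว้', 'คน', 'เดียว']

def crude_clusters(text):
    """Attach each nonspacing mark (category Mn) to the preceding character.
    This is a rough stand-in for real cluster rules, not an implementation of TCC."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def cut_offsets(pieces):
    """Character offsets at which one piece ends and the next begins."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

def cuts_inside_clusters(tokens, text):
    """Token boundaries that fall between a character and its nonspacing mark."""
    return sorted(cut_offsets(tokens) - cut_offsets(crude_clusters(text)))

for offset in cuts_inside_clusters(TOKENS, TEXT):
    print(offset, repr(TEXT[:offset]), "|", repr(TEXT[offset:]))
```

For the sentence above, the flagged offsets should be the two cuts that land immediately before a MAI EK. This only diagnoses the problem; it says nothing about how the tokenizer itself should be fixed.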
