How does XunziALLM tokenize "不入虎穴不得虎子"?


With one eye on "Overview of EvaHan2024: The First International Evaluation on Ancient Chinese Sentence Segmentation and Punctuation", I tried out the XunziALLM tokenizer. On Google Colaboratory, it goes like this.

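# Install the libraries the tokenizer needs
!pip install transformers tiktoken
# Clone the model repository once, skipping the large LFS files
!test -d Xunzi-Qwen || env GIT_LFS_SKIP_SMUDGE=1 git clone --depth=1 https://www.modelscope.cn/Xunzillm4cc/Xunzi-Qwen.git
from transformers import AutoTokenizer
# Load the tokenizer from the local clone (it ships its own tokenizer code)
tkz=AutoTokenizer.from_pretrained("Xunzi-Qwen",trust_remote_code=True)
# Encode the sentence, then map the token IDs back to tokens
print(tkz.convert_ids_to_tokens(tkz("不入虎穴不得虎子")["input_ids"]))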

When I (Koichi Yasuoka) tokenized "不入虎穴不得虎子" with Xunzi-Qwen-7B, I got the following result on my machine.

[b'\xe4\xb8\x8d', b'\xe5\x85\xa5', b'\xe8\x99\x8e', b'\xe7\xa9\xb4', b'\xe4\xb8\x8d\xe5\xbe\x97', b'\xe8\x99\x8e', b'\xe5\xad\x90']

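It is hard to read because the tokens are raw UTF-8 bytes, but the sentence appears to be tokenized as "不", "入", "虎", "穴", "不得", "虎", "子". Hmm, the tokenization granularity differs between "不" / "入" and "不得", which makes the tokenizer rather awkward to use.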
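To make the byte tokens readable, they can be decoded back to strings. The following is a minimal sketch that works on the output copied from above, so it runs without the model; the names `tokens` and `decoded` are mine, not part of the tokenizer API. It assumes each token is a complete UTF-8 sequence, which holds for this output but is not guaranteed for a byte-level tokenizer in general, hence `errors="replace"`.

```python
# Token list copied verbatim from the output above
tokens = [b'\xe4\xb8\x8d', b'\xe5\x85\xa5', b'\xe8\x99\x8e', b'\xe7\xa9\xb4',
          b'\xe4\xb8\x8d\xe5\xbe\x97', b'\xe8\x99\x8e', b'\xe5\xad\x90']

# Decode each byte token as UTF-8; a token that splits a character
# would show up as a replacement character instead of raising an error
decoded = [t.decode("utf-8", errors="replace") for t in tokens]
print(decoded)

# Number of characters per token, which makes the uneven granularity visible
print([len(s) for s in decoded])
```

In a live session, the same list comprehension could be applied directly to the result of `tkz.convert_ids_to_tokens(...)`, assuming it returns `bytes` objects as it did here.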
