How to add joyo kanji to Swallow's LlamaForCausalLM

Posted at 2023-12-30, last updated at 2023-12-30

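As I (Koichi Yasuoka) also wrote in my December 21 article, the Swallow tokenizer does not support 51 of the 2,136 joyo kanji, so these 51 characters fall back to bytes. Yesterday's article showed a way to rescue four of them, 「𠮟」「塡」「剝」「頰」, using Replace, but that does not solve the problem at its root. The only fundamental solution is to add tokens for these 51 characters to Swallow's LlamaForCausalLM model itself, and then to train the model further with the added tokens. In short, it looks like this.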

#! /usr/bin/python3
import urllib.request,json
from transformers import LlamaTokenizerFast,LlamaForCausalLM,DataCollatorForLanguageModeling,TrainingArguments,Trainer
# Fetch the kanji list: lines starting with "#" are skipped, the rest are parsed as hexadecimal code points
with urllib.request.urlopen("https://www.unicode.org/wg2/iso10646/edition6/data/JapaneseCoreKanji.txt") as r:
  joyo=[chr(int(t,16)) for t in r.read().decode().strip().split("\n") if not t.startswith("#")]
tkz=LlamaTokenizerFast.from_pretrained("tokyotech-llm/Swallow-7b-instruct-hf",pad_token="</s>")
# c: kanji whose encoding exceeds 3 input IDs (the unsupported ones that fall back to bytes)
c=[i for i,j in zip(joyo,tkz(joyo)["input_ids"]) if len(j)>3]
tkz.save_pretrained("mySwallow-7b-instruct-hf")
# Append the missing kanji to the vocabulary, with new IDs starting at len(tkz)
d=json.loads(tkz.backend_tokenizer.to_str())
for i,j in enumerate(c,len(tkz)):
  d["model"]["vocab"][j]=i
tkz.backend_tokenizer.from_str(json.dumps(d)).save("mySwallow-7b-instruct-hf/tokenizer.json")
# Load the model, reload the extended tokenizer, and grow the embeddings to the new vocabulary size
mdl=LlamaForCausalLM.from_pretrained("tokyotech-llm/Swallow-7b-instruct-hf",device_map="auto")
tkz=LlamaTokenizerFast.from_pretrained("mySwallow-7b-instruct-hf",model_max_length=mdl.config.max_position_embeddings)
mdl.resize_token_embeddings(len(tkz))
# One training line: "Joyo kanji can be used in children's names. X is a joyo kanji, so it can be used in children's names. ..."
q=[tkz("常用漢字は子供の名づけに使えます。"+"は常用漢字なので、子供の名づけに使えます。".join(c+["\n"]))]
arg=TrainingArguments(num_train_epochs=1,per_device_train_batch_size=1,output_dir="/tmp",overwrite_output_dir=True,save_total_limit=1)
trn=Trainer(args=arg,model=mdl,data_collator=DataCollatorForLanguageModeling(tkz,mlm=False),train_dataset=q)
trn.train()
trn.save_model("mySwallow-7b-instruct-hf")
# Try the trained model with an instruction prompt ("### 指示:" = instruction, "### 応答:" = response)
from transformers import TextGenerationPipeline
tgn=TextGenerationPipeline(model=mdl,tokenizer=tkz,max_new_tokens=128)
nlp=lambda txt:tgn(f"以下に、あるタスクを説明する指示があります。リクエストを適切に完了するための回答を記述してください。\n\n### 指示:{txt}\n\n### 応答:",do_sample=True)[0]["generated_text"]
# Question: "Which of 頬 and 頰 can be used in a child's name?"
print(nlp("頬と頰はどちらが子供の名づけに使えますか？"))
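The tokenizer-side steps of the script can be tried out without downloading the model. The following is only a minimal sketch under stated assumptions: the vocabulary is a tiny made-up dictionary (not Swallow's real one), and the helper names encode_char, find_unsupported, extend_vocab and build_training_line are hypothetical. The sketch counts a character as unsupported when it needs more than one token; the real script uses len(j)>3 instead, presumably because each encoding also carries a beginning-of-sentence token and a leading "▁" token.

```python
# Toy illustration only: a tiny hypothetical vocabulary, not Swallow's real one.

def encode_char(ch, vocab):
    """Return the token strings for one character, using byte fallback if needed."""
    if ch in vocab:
        return [ch]
    # Byte fallback: one token of the form <0xNN> per UTF-8 byte.
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]

def find_unsupported(chars, vocab):
    """Characters that need more than one token, i.e. that fall back to bytes."""
    return [ch for ch in chars if len(encode_char(ch, vocab)) > 1]

def extend_vocab(vocab, new_tokens):
    """Copy the vocabulary and append new tokens with IDs starting at len(vocab)."""
    extended = dict(vocab)
    for new_id, tok in enumerate(new_tokens, len(vocab)):
        extended[tok] = new_id
    return extended

def build_training_line(chars):
    """Build the single training line in the same way as the script does."""
    return "常用漢字は子供の名づけに使えます。" + "は常用漢字なので、子供の名づけに使えます。".join(chars + ["\n"])

toy_vocab = {"<s>": 0, "</s>": 1, "漢": 2, "頬": 3}   # assumed toy vocabulary
sample = ["漢", "頰", "𠮟"]

print(encode_char("頰", toy_vocab))    # ['<0xE9>', '<0xA0>', '<0xB0>']
missing = find_unsupported(sample, toy_vocab)
print(missing)                          # ['頰', '𠮟']

extended = extend_vocab(toy_vocab, missing)
print(encode_char("頰", extended))     # ['頰']

line = build_training_line(missing)
print(line, end="")
```

After the extension, each formerly missing character maps to a single new ID, which is why the embedding matrix of the model has to be resized to the new vocabulary size before training.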

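Although the script above trains Swallow-7b-instruct-hf further on just a single line of text, it needed as many as four NVIDIA A100-SXM4-40GB GPUs, which made me feel the sorrow of large language models rather keenly. What is more, the output I got on my machine is shown below. Asked which of 頬 and 頰 can be used in a child's name, the model answered that 「ほおずき」 (hōzuki) and 「ほたる」 (hotaru) can both be used in a child's name.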

以下に、あるタスクを説明する指示があります。リクエストを適切に完了するための回答を記述してください。

### 指示:頬と頰はどちらが子供の名づけに使えますか？

### 応答:「ほおずき」と「ほたる」はどちらも子どもの名づけに使えます。

I would like to try the 70b model as well, but I wonder how many GPUs that would take.
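A rough lower bound can be worked out on the back of an envelope. The sketch below is not a measurement. It assumes full fine-tuning with 32-bit weights (4 bytes), 32-bit gradients (4 bytes) and two AdamW moment buffers (8 bytes), so 16 bytes per parameter. It also assumes round parameter counts of 7 and 70 billion and 40 * 10**9 usable bytes per GPU. Activations and framework overhead are ignored. The function name min_gpus is hypothetical.

```python
def min_gpus(n_params, bytes_per_param=16, gpu_bytes=40 * 10**9):
    """Lower bound on the GPU count: weights + gradients + AdamW states only."""
    total = n_params * bytes_per_param
    return -(-total // gpu_bytes)   # integer ceiling division

# Assumed round parameter counts, not exact model sizes.
for name, n_params in [("7b", 7_000_000_000), ("70b", 70_000_000_000)]:
    total_gb = n_params * 16 // 10**9
    print(f"{name}: about {total_gb} GB, at least {min_gpus(n_params)} GPUs")
# 7b: about 112 GB, at least 3 GPUs
# 70b: about 1120 GB, at least 28 GPUs
```

Under these assumptions the 7b case needs at least three 40GB GPUs before counting activations, which is at least consistent with the four GPUs reported above. The same arithmetic puts the 70b model at 28 GPUs or more, unless lower precision or a parameter-efficient method is used.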
