
Building a Mini Japanese DeBERTa Model with Google Colaboratory's Free GPU and WikiText-JA

Posted at 2024-01-05

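As a rematch for yesterday's article, this time I decided to build a somewhat smaller model on a Google Colaboratory GPU. The program below tokenizes the 23,718,163 characters of WikiText-JA one character at a time and trains a mini Japanese DeBERTa model with an input/output width of 128 tokens, 256-dimensional input/output vectors, 12 layers, 4 attention heads, and 768-dimensional intermediate vectors. The single-character tokenizer is a heavily hacked BertTokenizerFast: it uses Nmt and NFKC to convert "full-width characters" to "half-width characters" internally, and its vocabulary is extended with the jōyō kanji. Meanwhile, train.txt is formatted so that each line has fewer than 128 characters.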

!pip install transformers accelerate
import re,urllib.request,unicodedata
from transformers import BertTokenizerFast,DebertaV2Config,DebertaV2ForMaskedLM,DataCollatorForLanguageModeling,TrainingArguments,Trainer
from tokenizers import pre_tokenizers,normalizers,Regex
# Download WikiText-JA, expand the *N* placeholders using the Exception_*.txt tables, and pack sentences into train.txt
url="http://www.lsta.media.kyoto-u.ac.jp/resource/data/wikitext-ja/"
p,i={"*447*":"\u30ab\u309a","*7003*":"","*8050*":"","*10789*":""},0
with open("train.txt","w",encoding="utf-8") as w:
  for t in ["Featured_Contents.txt","Good_Contents.txt"]:
    with urllib.request.urlopen(url+"Exception_"+t[0]+".txt") as r:
      e={"*"+s[0:-1]+"*":s[-1] if unicodedata.name(s[-1],False) else p["*"+s[0:-1]+"*"] for s in r.read().decode("utf-8").split("\n") if s.strip()>""}
    with urllib.request.urlopen(url+t) as r:
      for s in r.read().decode("utf-8").replace("。","。\n").split("\n"):
        for t in re.findall(r"\*[1-9][0-9]*\*",s):
          s=s.replace(t,e[t])
        if i+len(s)<128:
          print(s,end="",file=w)
          i+=len(s)
        else:
          print("\n"+s,end="",file=w)
          i=len(s)
  print("",file=w)
# Build a single-character vocabulary from the normalized corpus plus the kanji listed in JapaneseCoreKanji.txt
n=normalizers.Sequence([normalizers.Nmt(),normalizers.NFKC()])
with open("train.txt","r",encoding="utf-8") as r:
  v=set(c for c in n.normalize_str(r.read()) if not c.isspace())
with urllib.request.urlopen("https://www.unicode.org/wg2/iso10646/edition6/data/JapaneseCoreKanji.txt") as r:
  _=[v.add(chr(int(t,16))) for t in r.read().decode().strip().split("\n") if not t.startswith("#")]
with open("vocab.txt","w",encoding="utf-8") as w:
  print("\n".join(["[CLS]","[PAD]","[SEP]","[UNK]","[MASK]"]+sorted(v)),file=w)
# Rework BertTokenizerFast so that it splits every non-space character into its own token
tkz=BertTokenizerFast(vocab_file="vocab.txt",never_split=["[CLS]","[PAD]","[SEP]","[UNK]","[MASK]"],do_lower_case=False,strip_accents=False,tokenize_chinese_chars=True,model_max_length=128)
tkz.backend_tokenizer.pre_tokenizer=pre_tokenizers.Sequence([pre_tokenizers.Whitespace(),pre_tokenizers.Split(Regex("."),"isolated")])
tkz.backend_tokenizer.normalizer=n
tkz.backend_tokenizer.decoder.prefix=tkz.backend_tokenizer.model.continuing_subword_prefix=""
# Configure the mini DeBERTa model, train it as a masked language model, and save it with the tokenizer
cfg=DebertaV2Config(hidden_size=256,num_hidden_layers=12,num_attention_heads=4,intermediate_size=768,relative_attention=True,position_biased_input=False,pos_att_type=["p2c","c2p"],max_position_embeddings=tkz.model_max_length,vocab_size=len(tkz),tokenizer_class=type(tkz).__name__,bos_token_id=tkz.cls_token_id,pad_token_id=tkz.pad_token_id,eos_token_id=tkz.sep_token_id)
arg=TrainingArguments(num_train_epochs=3,per_device_train_batch_size=64,output_dir="/tmp",overwrite_output_dir=True,save_total_limit=2)
class ReadLineDS(object):
  def __init__(self,file,tokenizer):
    self.tokenizer=tokenizer
    with open(file,"r",encoding="utf-8") as r:
      self.lines=[s.strip() for s in r if s.strip()!=""]
  __len__=lambda self:len(self.lines)
  __getitem__=lambda self,i:self.tokenizer(self.lines[i],truncation=True,add_special_tokens=True,max_length=self.tokenizer.model_max_length-2)
trn=Trainer(args=arg,data_collator=DataCollatorForLanguageModeling(tkz),model=DebertaV2ForMaskedLM(cfg),train_dataset=ReadLineDS("train.txt",tkz))
trn.train()
trn.save_model("deberta-mini-wikitext-ja")
tkz.save_pretrained("deberta-mini-wikitext-ja")
# Try the saved model with a fill-mask pipeline
from transformers import pipeline
fmp=pipeline("fill-mask","deberta-mini-wikitext-ja")
print(fmp("酸素ボ[MASK]ベを充塡する。"))
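
The two preprocessing ideas described before the program, full-width to half-width conversion and packing sentences into short lines, can be tried in isolation. The following is only a minimal sketch under a few assumptions: it uses the standard library's `unicodedata` for NFKC and leaves out the Nmt normalizer entirely, and the helper names `to_halfwidth` and `pack_sentences` are hypothetical, not part of the program above.

```python
import unicodedata

def to_halfwidth(text):
    """Apply NFKC only (no Nmt step), which folds full-width ASCII to half-width."""
    return unicodedata.normalize("NFKC", text)

def pack_sentences(sentences, limit=128):
    """Greedily join sentences into lines shorter than `limit` characters.

    Hypothetical helper that mirrors the idea of the train.txt loop above.
    A single sentence of `limit` characters or more still gets a line of its own.
    """
    lines, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) >= limit:
            lines.append(current)  # the current line is full, so start a new one
            current = s
        else:
            current += s
    if current:
        lines.append(current)
    return lines

print(to_halfwidth("ＧＰＵ１２８"))  # GPU128

# Three 51-character sentences: the first two share a line, the third starts a new one
sentences = ["あ" * 50 + "。", "い" * 50 + "。", "う" * 50 + "。"]
print([len(line) for line in pack_sentences(sentences)])  # [102, 51]
```

In the program above, normalization happens inside the tokenizer, so train.txt itself stays unnormalized. Lines that are still too long are cut at tokenization time by `truncation=True`.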

On a Tesla T4, the model finished training in a little under an hour on my end (Koichi Yasuoka), and the following result was printed.

[{'score': 0.11631722003221512, 'token': 960, 'token_str': 'ー', 'sequence': '酸素ボーベを充塡する。'}, {'score': 0.08463738113641739, 'token': 955, 'token_str': 'ン', 'sequence': '酸素ボンベを充塡する。'}, {'score': 0.060339976102113724, 'token': 947, 'token_str': 'ル', 'sequence': '酸素ボルベを充塡する。'}, {'score': 0.0414130762219429, 'token': 897, 'token_str': 'ス', 'sequence': '酸素ボスベを充塡する。'}, {'score': 0.040101587772369385, 'token': 876, 'token_str': 'イ', 'sequence': '酸素ボイベを充塡する。'}]

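For the [MASK] in 「酸素ボ[MASK]ベを充塡する。」, the model proposes 「ン」, albeit only as the second candidate, so I would call this a decent result for a mini model. Mini Japanese DeBERTa models with a single-character tokenizer are (apparently) well suited as back ends for OCR and for correcting typos and dropped characters, so please give it a try.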
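
For the error-correction use case, one simple way to consume fill-mask output is to look up where a given character ranks among the candidates. The sketch below is an illustration under assumptions, not part of the original experiment: `rank_of` and `is_plausible` are hypothetical helpers, the `top_k=3` threshold is arbitrary, and the candidate list is copied from the output above with scores rounded to four decimals.

```python
def rank_of(results, expected):
    """Return the 1-based rank of `expected` among fill-mask candidates, or None."""
    for rank, cand in enumerate(results, start=1):
        if cand["token_str"] == expected:
            return rank
    return None

def is_plausible(results, observed, top_k=3):
    """Hypothetical check: treat a character as plausible if it ranks within top_k."""
    rank = rank_of(results, observed)
    return rank is not None and rank <= top_k

# Candidates copied from the output above (scores rounded)
results = [
    {"score": 0.1163, "token_str": "ー"},
    {"score": 0.0846, "token_str": "ン"},
    {"score": 0.0603, "token_str": "ル"},
    {"score": 0.0414, "token_str": "ス"},
    {"score": 0.0401, "token_str": "イ"},
]

print(rank_of(results, "ン"))       # 2
print(is_plausible(results, "ン"))  # True
print(is_plausible(results, "イ"))  # False
```

With the real pipeline, `results` would simply be the list returned by `fmp(...)`, since each candidate there also carries a `token_str` key.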
