POS Tagging of English Universal Dependencies with LtgbertForTokenClassification

I applied the method from yesterday's article to "HPLT Bert for English" and tried UPOS tagging with UD_English-EWT. On Google Colaboratory (GPU runtime), it goes like this.

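!pip install transformers datasets evaluate seqeval accelerate
# awk program: print "-b v<version>" for the installed transformers package
s='$1=="transformers"{printf("-b v%s",$2)}'
# {s} is expanded by the notebook to the awk program above, so the clone checks out the matching release tag
!test -d transformers || git clone `pip3 list | awk '{s}'` https://github.com/huggingface/transformers
# Shallow clone of the treebank (train/dev/test CoNLL-U files)
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT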
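import json
# Convert a CoNLL-U file into JSON Lines with "tokens" and "tags" per sentence
def makejson(conllu_file,json_file):
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    # Each sentence starts with [CLS]; the special tokens are tagged SYM
    d={"tokens":["[CLS]"],"tags":["SYM"]}
    for s in r:
      if s.strip()=="":
        # A blank line ends the sentence: close it with [SEP] and write it out
        if len(d["tokens"])>1:
          d["tokens"].append("[SEP]")
          d["tags"].append("SYM")
          print(json.dumps(d),file=w)
        d={"tokens":["[CLS]"],"tags":["SYM"]}
      else:
        t=s.split("\t")
        # Keep only ordinary word lines (integer ID); comments and range IDs such as "2-3" are skipped
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])  # FORM
          d["tags"].append(t[3])    # UPOS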
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path HPLT/hplt_bert_base_en --trust_remote_code --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./ltgbert-base-english-upos --overwrite_output_dir --do_train --do_eval --do_predict
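
To make the training data format concrete, here is a minimal sketch of the same conversion applied to an in-memory string. The helper name `conllu_to_records` and the sample sentence are hypothetical: the sample is hand-written for illustration and not copied from UD_English-EWT.

```python
import json

def conllu_to_records(conllu_text):
    """Hypothetical string-based variant of makejson, for illustration only."""
    records = []
    d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
    for line in conllu_text.split("\n"):
        if line.strip() == "":
            # Blank line: finish the current sentence if it has any words
            if len(d["tokens"]) > 1:
                d["tokens"].append("[SEP]")
                d["tags"].append("SYM")
                records.append(d)
            d = {"tokens": ["[CLS]"], "tags": ["SYM"]}
        else:
            t = line.split("\t")
            # Only lines with 10 columns and an integer ID are kept
            if len(t) == 10 and t[0].isdecimal():
                d["tokens"].append(t[1])
                d["tags"].append(t[3])
    return records

# Hand-written sample (an assumption, not taken from the treebank)
sample = "\n".join([
    "# text = I don't know.",
    "1\tI\tI\tPRON\tPRP\t_\t4\tnsubj\t_\t_",
    "2-3\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdo\tdo\tAUX\tVBP\t_\t4\taux\t_\t_",
    "3\tn't\tnot\tPART\tRB\t_\t4\tadvmod\t_\t_",
    "4\tknow\tknow\tVERB\tVBP\t_\t0\troot\t_\t_",
    "5\t.\t.\tPUNCT\t.\t_\t4\tpunct\t_\t_",
    "",
    "",
])
records = conllu_to_records(sample)
print(json.dumps(records[0]))
```

The range line `2-3` and the comment line are dropped, so the model is trained on the syntactic words "do" and "n't" rather than on the surface form "don't".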

In my (Koichi Yasuoka's) environment, the following metrics were printed after about 15 minutes, and ltgbert-base-english-upos was complete.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =   995877GF
  train_loss               =      0.053
  train_runtime            = 0:12:44.55
  train_samples            =      12544
  train_samples_per_second =     49.221
  train_steps_per_second   =      6.153

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9806
  eval_f1                 =     0.9769
  eval_loss               =     0.0808
  eval_precision          =     0.9763
  eval_recall             =     0.9776
  eval_runtime            = 0:00:08.64
  eval_samples            =       2001
  eval_samples_per_second =    231.576
  eval_steps_per_second   =     29.048

***** predict metrics *****
  predict_accuracy           =     0.9821
  predict_f1                 =     0.9785
  predict_loss               =     0.0757
  predict_precision          =     0.9773
  predict_recall             =     0.9797
  predict_runtime            = 0:00:09.58
  predict_samples_per_second =    216.764
  predict_steps_per_second   =     27.135
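
As a quick sanity check on the metrics above, the reported F1 values can be reproduced from the reported precision and recall. This is a sketch that assumes F1 here is the plain harmonic mean of the two; the function name `f1_score` is only for illustration, and the inputs are the rounded values from the log.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the eval and predict metrics
eval_f1 = f1_score(0.9763, 0.9776)
predict_f1 = f1_score(0.9773, 0.9797)
print(round(eval_f1, 4), round(predict_f1, 4))
```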

The F1 scores for both eval and predict are just under 0.98, so it seems to be working well. Let's try running it.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("ltgbert-base-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("ltgbert-base-english-upos",trust_remote_code=True)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

When I tagged "It don't mean a thing if it ain't got that swing" with the freshly built ltgbert-base-english-upos, I got the following result in my environment.

[{'entity': 'PRON', 'score': 0.99985194, 'index': 1, 'word': 'âĸģIt', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.9993647, 'index': 2, 'word': 'âĸģdon', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.40081334, 'index': 3, 'word': "'", 'start': 6, 'end': 7}, {'entity': 'PART', 'score': 0.9993881, 'index': 4, 'word': 't', 'start': 7, 'end': 8}, {'entity': 'VERB', 'score': 0.99983907, 'index': 5, 'word': 'âĸģmean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9998281, 'index': 6, 'word': 'âĸģa', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99954957, 'index': 7, 'word': 'âĸģthing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.99974984, 'index': 8, 'word': 'âĸģif', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.99986565, 'index': 9, 'word': 'âĸģit', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.9997147, 'index': 10, 'word': 'âĸģain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.36902195, 'index': 11, 'word': "'", 'start': 31, 'end': 32}, {'entity': 'PART', 'score': 0.998618, 'index': 12, 'word': 't', 'start': 32, 'end': 33}, {'entity': 'VERB', 'score': 0.9987697, 'index': 13, 'word': 'âĸģgot', 'start': 33, 'end': 37}, {'entity': 'DET', 'score': 0.98608893, 'index': 14, 'word': 'âĸģthat', 'start': 37, 'end': 42}, {'entity': 'NOUN', 'score': 0.9983407, 'index': 15, 'word': 'âĸģswing', 'start': 42, 'end': 48}]

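The "âĸģ" prepended to word-initial tokens can be dealt with somehow, but "don't" and "ain't" are still unsatisfactory. The UD_English-EWT convention is to split "don't" into "do" and "n't", and "ain't" into "ai" and "n't", tagging them AUX and PART respectively, but in the result above they are split into "don", "'", "t" and "ain", "'", "t". That is really a tokenizer issue, though, so perhaps it's about time I did something about the LTG-BERT Pull Request.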
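
One possible way to deal with the "âĸģ" prefix is to ignore the `word` field and recover each surface form from the `start` and `end` character offsets instead. The sketch below is an assumption about how this could be done, not part of the procedure above; the helper name `surface_tags` is hypothetical, and the input is the first five entries of the pipeline output shown above, with scores omitted.

```python
def surface_tags(text, results):
    """Return (surface, tag) pairs, slicing the original text by offsets
    rather than using the tokenizer's internal 'word' string."""
    return [(text[r["start"]:r["end"]].strip(), r["entity"]) for r in results]

text = "It don't mean a thing if it ain't got that swing"
# First five entries of the pipeline output (scores and indices omitted)
results = [
    {"entity": "PRON", "word": "âĸģIt", "start": 0, "end": 2},
    {"entity": "AUX", "word": "âĸģdon", "start": 2, "end": 6},
    {"entity": "PART", "word": "'", "start": 6, "end": 7},
    {"entity": "PART", "word": "t", "start": 7, "end": 8},
    {"entity": "VERB", "word": "âĸģmean", "start": 8, "end": 13},
]
pairs = surface_tags(text, results)
print(pairs)
```

The `strip()` is needed because the offsets of word-initial tokens include the preceding space. This only cleans up the display; it does not change the "don", "'", "t" segmentation, which comes from the tokenizer.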
