生成AI Advent Calendar 2024

Day 21

POS Tagging of English Universal Dependencies with Qwen2ForTokenClassification


I decided to apply the method from yesterday's article to Qwen2.5-0.5B and compare it with ModernBERT-base. On Google Colaboratory (GPU runtime), it goes something like this.

!pip install transformers datasets evaluate seqeval accelerate
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip3 list | awk "$s"` https://github.com/huggingface/transformers
!test -d UD_English-EWT || git clone --depth=1 https://github.com/UniversalDependencies/UD_English-EWT
import json
def makejson(conllu_file,json_file):
  # Convert a CoNLL-U file into JSON Lines: one {"tokens":…,"tags":…} object
  # per sentence, taking FORM (column 2) and UPOS (column 4)
  with open(conllu_file,"r",encoding="utf-8") as r, open(json_file,"w",encoding="utf-8") as w:
    d={"tokens":[],"tags":[]}
    for s in r:
      if s.strip()=="":
        if d["tokens"]:
          print(json.dumps(d),file=w)
        d={"tokens":[],"tags":[]}
      else:
        t=s.split("\t")
        if len(t)==10 and t[0].isdecimal():
          d["tokens"].append(t[1])
          d["tags"].append(t[3])
makejson("UD_English-EWT/en_ewt-ud-train.conllu","train.json")
makejson("UD_English-EWT/en_ewt-ud-dev.conllu","dev.json")
makejson("UD_English-EWT/en_ewt-ud-test.conllu","test.json")
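For reference, here is what makejson emits for a single sentence: one JSON object per line, with parallel "tokens" and "tags" arrays taken from the FORM and UPOS columns of the CoNLL-U file. A minimal sketch of the same loop body on an inline two-token fragment (no file I/O):

```python
import json

# A two-token CoNLL-U fragment; each token line has 10 tab-separated columns
# (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC)
conllu = "1\tIt\tit\tPRON" + "\t_" * 6 + "\n2\trains\train\tVERB" + "\t_" * 6 + "\n"
d = {"tokens": [], "tags": []}
for s in conllu.splitlines():
    t = s.split("\t")
    if len(t) == 10 and t[0].isdecimal():
        d["tokens"].append(t[1])  # FORM
        d["tags"].append(t[3])    # UPOS
print(json.dumps(d))  # {"tokens": ["It", "rains"], "tags": ["PRON", "VERB"]}
```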
!env WANDB_DISABLED=true python3 transformers/examples/pytorch/token-classification/run_ner.py --task_name pos --model_name_or_path Qwen/Qwen2.5-0.5B --train_file train.json --validation_file dev.json --test_file test.json --output_dir ./Qwen2.5-0.5B-english-upos --overwrite_output_dir --do_train --do_eval --do_predict

On my (Koichi Yasuoka's) machine, the following metrics were output in just under an hour, and Qwen2.5-0.5B-english-upos was built.

***** train metrics *****
  epoch                    =        3.0
  total_flos               =  3490858GF
  train_loss               =     0.1989
  train_runtime            = 0:53:51.30
  train_samples            =      12544
  train_samples_per_second =     11.646
  train_steps_per_second   =      1.456

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9099
  eval_f1                 =     0.8934
  eval_loss               =     0.4152
  eval_precision          =     0.8943
  eval_recall             =     0.8926
  eval_runtime            = 0:00:20.36
  eval_samples            =       2001
  eval_samples_per_second =     98.271
  eval_steps_per_second   =     12.327

***** predict metrics *****
  predict_accuracy           =      0.913
  predict_f1                 =     0.8958
  predict_loss               =     0.3968
  predict_precision          =     0.8959
  predict_recall             =     0.8956
  predict_runtime            = 0:00:24.82
  predict_samples_per_second =     83.682
  predict_steps_per_second   =     10.475

With an F1 of about 0.89 for both eval and predict, the result is frankly not very good. Let's try it out.

from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
tkz=AutoTokenizer.from_pretrained("Qwen2.5-0.5B-english-upos")
mdl=AutoModelForTokenClassification.from_pretrained("Qwen2.5-0.5B-english-upos")
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,device=0)
print(nlp("It don't mean a thing if it ain't got that swing"))

POS-tagging "It don't mean a thing if it ain't got that swing" with the freshly built Qwen2.5-0.5B-english-upos, I got the following result on my machine.

[{'entity': 'PRON', 'score': 0.9992797, 'index': 0, 'word': 'It', 'start': 0, 'end': 2}, {'entity': 'AUX', 'score': 0.91161925, 'index': 1, 'word': 'Ġdon', 'start': 2, 'end': 6}, {'entity': 'PART', 'score': 0.9997278, 'index': 2, 'word': "'t", 'start': 6, 'end': 8}, {'entity': 'VERB', 'score': 0.9998871, 'index': 3, 'word': 'Ġmean', 'start': 8, 'end': 13}, {'entity': 'DET', 'score': 0.9993187, 'index': 4, 'word': 'Ġa', 'start': 13, 'end': 15}, {'entity': 'NOUN', 'score': 0.99999917, 'index': 5, 'word': 'Ġthing', 'start': 15, 'end': 21}, {'entity': 'SCONJ', 'score': 0.9998536, 'index': 6, 'word': 'Ġif', 'start': 21, 'end': 24}, {'entity': 'PRON', 'score': 0.9999038, 'index': 7, 'word': 'Ġit', 'start': 24, 'end': 27}, {'entity': 'AUX', 'score': 0.57848704, 'index': 8, 'word': 'Ġain', 'start': 27, 'end': 31}, {'entity': 'PART', 'score': 0.99963856, 'index': 9, 'word': "'t", 'start': 31, 'end': 33}, {'entity': 'VERB', 'score': 0.9222952, 'index': 10, 'word': 'Ġgot', 'start': 33, 'end': 37}, {'entity': 'ADV', 'score': 0.8038623, 'index': 11, 'word': 'Ġthat', 'start': 37, 'end': 42}, {'entity': 'ADJ', 'score': 0.44765362, 'index': 12, 'word': 'Ġswing', 'start': 42, 'end': 48}]

The spaces getting mangled into "Ġ" is something we can work around, but "that" and "swing" are mistagged. As I also wrote in 『GPT系モデルの系列ラベリングによる品詞付与』 (POS tagging by sequence labeling with GPT-style models), Qwen2ForTokenClassification just doesn't seem to reach good POS-tagging accuracy. Hmm, disappointing.
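As for the "Ġ": it is the byte-level BPE marker for a preceding space (the same convention as GPT-2), so it can be stripped in post-processing. A minimal sketch, assuming the pipeline's output format shown above (clean_words and sample are illustrative names, not part of any library):

```python
def clean_words(results):
    """Map the byte-level BPE space marker 'Ġ' back to a space and strip it."""
    return [dict(r, word=r["word"].replace("Ġ", " ").strip()) for r in results]

# Illustrative slice of the pipeline output above
sample = [{"entity": "AUX", "word": "Ġdon"}, {"entity": "PART", "word": "'t"}]
print([r["word"] for r in clean_words(sample)])  # ['don', "'t"]
```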
