A paper by Yongrui Xu, Caixia Mao, Zhiyong Wang, Guonian Jin, Liangji Zhong & Tao Qian, "Semantic-enhanced graph neural network for named entity recognition in ancient Chinese books", appeared in Nature's Scientific Reports 14, so I gave it a quick read. It proposes a graph neural network for named entity recognition in classical Chinese and apparently uses C-CLUE's data_ner for its evaluation, but something seems off. For instance, in "Table 1. Statistics of the C-CLUE datatset", the word "datatset" in the caption makes no sense, and the contents of the table look wrong as well.
Let's check this on Google Colaboratory.
!test -d C-CLUE || git clone --depth=1 https://github.com/jizijing/C-CLUE
!tr " " "\012" < C-CLUE/data_ner/target.txt | sort | uniq -c | sed -n "s/^ *\([0-9]*\) B-\([A-Z]*\)$/\2 \1/p" > train.uniq
!tr " " "\012" < C-CLUE/data_ner/dev-label.txt | sort | uniq -c | sed -n "s/^ *\([0-9]*\) B-\([A-Z]*\)$/\2 \1/p" > dev.uniq
!tr " " "\012" < C-CLUE/data_ner/test_tgt.txt | sort | uniq -c | sed -n "s/^ *\([0-9]*\) B-\([A-Z]*\)$/\2 \1/p" > test.uniq
!join -a1 train.uniq dev.uniq | awk '{{if(NF==2)print $$0,0;else print}}' | join -a1 - test.uniq | awk '{{if(NF==3)print $$0,0;else print}}' > all.uniq
s='''BEGIN{printf("%5s %10s %10s %10s\\n","","Train","Dev","Test")}
{train+=$2;dev+=$3;test+=$4;printf("%5s %10d %10d %10d\\n",$1,$2,$3,$4)}
END{printf("%5s %10d %10d %10d\\n","Total",train,dev,test)}'''
!awk '{s}' all.uniq
On my machine, I (Koichi Yasuoka) got the following results:
           Train        Dev       Test
  BOO        119          0         16
  JOB       2252        448        349
  LOC       3625        220        236
  ORG       2041          4         45
  PER      11532        756        859
  WAR          6          0          0
Total      19575       1428       1505
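Just to be safe, the same tally can also be done in pure Python instead of the shell pipeline. Here is a minimal sketch (it assumes, as above, that the tag files are whitespace-separated BIO labels):
from collections import Counter
# Count B-* labels per split, mirroring the tr/sort/uniq/join pipeline above
files={"Train":"C-CLUE/data_ner/target.txt","Dev":"C-CLUE/data_ner/dev-label.txt","Test":"C-CLUE/data_ner/test_tgt.txt"}
counts={}
for split,path in files.items():
  with open(path,"r",encoding="utf-8") as r:
    counts[split]=Counter(t[2:] for t in r.read().split() if t.startswith("B-"))
print("{:5s} {:>10s} {:>10s} {:>10s}".format("","Train","Dev","Test"))
for lb in sorted(set().union(*counts.values())):
  print("{:5s} {:10d} {:10d} {:10d}".format(lb,counts["Train"][lb],counts["Dev"][lb],counts["Test"][lb]))
print("{:5s} {:10d} {:10d} {:10d}".format("Total",*[sum(counts[s].values()) for s in ("Train","Dev","Test")]))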
As you can see, no label called "OFI" is used anywhere in C-CLUE. Even supposing that "OFI" is meant to be "JOB", and even if we set "BOO" and "WAR" aside, the total for Dev (the validation set) still doesn't match. And if that doesn't match, the results in the other tables will be slightly off as well. Table 5 concerns me in particular, since its second row reports results for Roberta-Classical-Chinese. However, the end of the paper states
All codes and resources are released at the website: https://github.com/qtxcm/BAC-GNN-CRF.
yet, at the time of writing, there is nothing at the website indicated there. Frankly, this annoyed me, so I reused the idea from my diary entry of May 12, 2023 and tried named entity recognition on C-CLUE with transformers' run_ner.py myself. On Google Colaboratory (with a GPU), it goes like this:
!pip install transformers datasets evaluate seqeval accelerate
!test -d C-CLUE || git clone --depth=1 https://github.com/jizijing/C-CLUE
s='$1=="transformers"{printf("-b v%s",$2)}'
!test -d transformers || git clone `pip list | awk '{s}'` https://github.com/huggingface/transformers
def makejson(token_file,tag_file,json_file):
  # pair up one line of tokens with one line of tags and write them out
  # as JSON Lines ({"tokens":[...],"tags":[...]}) for run_ner.py
  with open(token_file,"r",encoding="utf-8") as r1, open(tag_file,"r",encoding="utf-8") as r2, open(json_file,"w",encoding="utf-8") as w:
    for s,t in zip(r1,r2):
      print('{"tokens":["'+s.rstrip().replace(' ','","')+'"],"tags":["'+t.rstrip().replace(' ','","')+'"]}',file=w)
makejson("C-CLUE/data_ner/source.txt","C-CLUE/data_ner/target.txt","train.json")
makejson("C-CLUE/data_ner/dev.txt","C-CLUE/data_ner/dev-label.txt","dev.json")
makejson("C-CLUE/data_ner/test1.txt","C-CLUE/data_ner/test_tgt.txt","test.json")
import sys,subprocess
for b in ["google-bert/bert-base-chinese","KoichiYasuoka/roberta-classical-chinese-base-char","ethanyt/guwenbert-base","SIKU-BERT/sikubert","Jihuai/bert-ancient-chinese"]:
  f=False
  # fine-tune each model with run_ner.py and keep only the metric sections of its log
  for s in subprocess.run([sys.executable,"transformers/examples/pytorch/token-classification/run_ner.py","--model_name_or_path",b,"--train_file","train.json","--validation_file","dev.json","--test_file","test.json","--output_dir","/tmp","--overwrite_output_dir","--do_train","--do_eval","--do_predict"],capture_output=True,text=True).stdout.split("\n"):
    if f and s.find("INFO")<0:
      print(s)
    elif s.startswith("***** train metrics "):
      f=True
      print("##### "+b+" #####\n"+s)
On my machine, I got the following results:
##### google-bert/bert-base-chinese #####
***** train metrics *****
epoch = 3.0
total_flos = 335817GF
train_loss = 0.1562
train_runtime = 0:02:30.87
train_samples = 1902
train_samples_per_second = 37.82
train_steps_per_second = 4.732
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9037
eval_f1 = 0.5964
eval_loss = 0.3251
eval_precision = 0.5417
eval_recall = 0.6634
eval_runtime = 0:00:02.21
eval_samples = 238
eval_samples_per_second = 107.595
eval_steps_per_second = 13.562
***** predict metrics *****
predict_accuracy = 0.9119
predict_f1 = 0.6513
predict_loss = 0.313
predict_precision = 0.5745
predict_recall = 0.752
predict_runtime = 0:00:02.06
predict_samples_per_second = 115.369
predict_steps_per_second = 14.542
##### KoichiYasuoka/roberta-classical-chinese-base-char #####
***** train metrics *****
epoch = 3.0
total_flos = 335817GF
train_loss = 0.2046
train_runtime = 0:03:19.43
train_samples = 1902
train_samples_per_second = 28.611
train_steps_per_second = 3.58
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9066
eval_f1 = 0.6265
eval_loss = 0.3079
eval_precision = 0.5608
eval_recall = 0.7096
eval_runtime = 0:00:02.15
eval_samples = 238
eval_samples_per_second = 110.426
eval_steps_per_second = 13.919
***** predict metrics *****
predict_accuracy = 0.9104
predict_f1 = 0.6592
predict_loss = 0.298
predict_precision = 0.5705
predict_recall = 0.7805
predict_runtime = 0:00:02.09
predict_samples_per_second = 113.744
predict_steps_per_second = 14.338
##### ethanyt/guwenbert-base #####
***** train metrics *****
epoch = 3.0
total_flos = 335817GF
train_loss = 0.2094
train_runtime = 0:02:57.02
train_samples = 1902
train_samples_per_second = 32.232
train_steps_per_second = 4.033
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9054
eval_f1 = 0.6258
eval_loss = 0.3065
eval_precision = 0.555
eval_recall = 0.7173
eval_runtime = 0:00:02.15
eval_samples = 238
eval_samples_per_second = 110.254
eval_steps_per_second = 13.898
***** predict metrics *****
predict_accuracy = 0.9077
predict_f1 = 0.652
predict_loss = 0.3033
predict_precision = 0.5622
predict_recall = 0.7759
predict_runtime = 0:00:02.44
predict_samples_per_second = 97.44
predict_steps_per_second = 12.282
##### SIKU-BERT/sikubert #####
***** train metrics *****
epoch = 3.0
total_flos = 335817GF
train_loss = 0.1446
train_runtime = 0:03:08.72
train_samples = 1902
train_samples_per_second = 30.234
train_steps_per_second = 3.783
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9088
eval_f1 = 0.6211
eval_loss = 0.3067
eval_precision = 0.5628
eval_recall = 0.6928
eval_runtime = 0:00:02.07
eval_samples = 238
eval_samples_per_second = 114.492
eval_steps_per_second = 14.432
***** predict metrics *****
predict_accuracy = 0.913
predict_f1 = 0.6561
predict_loss = 0.3037
predict_precision = 0.5796
predict_recall = 0.756
predict_runtime = 0:00:02.42
predict_samples_per_second = 98.247
predict_steps_per_second = 12.384
##### Jihuai/bert-ancient-chinese #####
***** train metrics *****
epoch = 3.0
total_flos = 335817GF
train_loss = 0.1322
train_runtime = 0:03:41.68
train_samples = 1902
train_samples_per_second = 25.74
train_steps_per_second = 3.221
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.9118
eval_f1 = 0.613
eval_loss = 0.2833
eval_precision = 0.5734
eval_recall = 0.6585
eval_runtime = 0:00:02.05
eval_samples = 238
eval_samples_per_second = 115.602
eval_steps_per_second = 14.572
***** predict metrics *****
predict_accuracy = 0.9174
predict_f1 = 0.661
predict_loss = 0.2658
predict_precision = 0.5929
predict_recall = 0.7467
predict_runtime = 0:00:02.01
predict_samples_per_second = 118.208
predict_steps_per_second = 14.9
Let's put the "predict metrics" into a table.
| | Precision | Recall | F1 |
|---|---|---|---|
| Bert-Base-Chinese | 57.45 | 75.20 | 65.13 |
| Roberta-Classical-Chinese | 57.05 | 78.05 | 65.92 |
| GuwenBert-Base | 56.22 | 77.59 | 65.20 |
| SikuBert | 57.96 | 75.60 | 65.61 |
| Bert-Ancient-Chinese | 59.29 | 74.67 | 66.10 |
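Incidentally, if one wanted to build this table automatically rather than copying the numbers by hand, a small parser over the captured output would do. This is only a sketch: it assumes the "##### model #####" headers and the "key = value" metric lines printed above have been collected into a single string, here called captured (a hypothetical variable):
import re
def predict_table(captured):
  # captured: the text printed by the loop above (headers plus metric lines)
  rows,model,metrics=[],None,{}
  for line in captured.split("\n"):
    m=re.match(r"##### (.+) #####",line)
    if m:
      if model:
        rows.append((model,metrics))
      model,metrics=m.group(1),{}
    else:
      m=re.match(r"\s*predict_(precision|recall|f1)\s*=\s*([0-9.]+)",line)
      if m:
        metrics[m.group(1)]=float(m.group(2))*100
  if model:
    rows.append((model,metrics))
  print("| | Precision | Recall | F1 |")
  print("|---|---|---|---|")
  for model,d in rows:
    print("| {} | {:.2f} | {:.2f} | {:.2f} |".format(model,d["precision"],d["recall"],d["f1"]))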
Hmm, these results are quite different from the paper's Table 5. Put another way, the Recall values in Table 5 fall short, across the board, of what these models can actually deliver. I wonder how that happened.