
I got a KeyError when computing BERTScore, so I removed the whitespace

BERTScore

Posted at 2022-09-27

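Introduction

I needed to compute BERTScore for a personal project and got stuck for a while on a KeyError. Removing the whitespace from the input fixed it, so I am leaving this note as a reminder.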

BERTScore

BERTScore is an automatic evaluation metric for text generation. It computes the similarity between sentences using embeddings obtained from a BERT model.
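To make the idea concrete, here is a minimal sketch of the greedy matching step (the "computing greedy matching" stage that appears in the log further down). It is not the bert_score library's implementation: the function names and the tiny 2-dimensional "embeddings" are hypothetical stand-ins for real BERT token embeddings, and IDF weighting and baseline rescaling are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_score(cand_emb, ref_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    Hypothetical helper for illustration only.
    """
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy vectors standing in for BERT token embeddings (assumed values).
toy_cand = [[1.0, 0.0], [0.0, 1.0]]
toy_ref = [[1.0, 0.0], [1.0, 1.0]]
p, r, f1 = greedy_match_score(toy_cand, toy_ref)
print(p, r, f1)
```

In the real library the embeddings come from a pretrained model, so the sketch above only shows the shape of the computation.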

Example error output

calculating scores...
computing bert embedding.
100%
1118/1118 [07:47<00:00, 4.79it/s]
computing greedy matching.
0%
0/1118 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_27455/569803675.py in <module>
     12 for i in range(11):
---> 14     P,R,F1 = score(orig_sentences,conv_sentences,lang="ja",verbose=True)
     16     P_list.append(P)

~/anaconda3/lib/python3.9/site-packages/bert_score/score.py in score(cands, refs, model_type, num_layers, verbose, idf, device, batch_size, nthreads, all_layers, lang, return_hash, rescale_with_baseline, baseline_path, use_fast_tokenizer)
    129         print("calculating scores...")
    130     start = time.perf_counter()
--> 131     all_preds = bert_cos_score_idf(
    132         model,
    133         refs,

~/anaconda3/lib/python3.9/site-packages/bert_score/utils.py in bert_cos_score_idf(model, refs, hyps, tokenizer, idf_dict, verbose, batch_size, device, all_layers)
    566             batch_refs = refs[batch_start : batch_start + batch_size]
    567             batch_hyps = hyps[batch_start : batch_start + batch_size]
--> 568             ref_stats = pad_batch_stats(batch_refs, stats_dict, device)
    569             hyp_stats = pad_batch_stats(batch_hyps, stats_dict, device)
    570 

~/anaconda3/lib/python3.9/site-packages/bert_score/utils.py in pad_batch_stats(sen_batch, stats_dict, device)
    539 
    540     def pad_batch_stats(sen_batch, stats_dict, device):
--> 541         stats = [stats_dict[s] for s in sen_batch]
    542         emb, idf = zip(*stats)
    543         emb = [e.to(device) for e in emb]

~/anaconda3/lib/python3.9/site-packages/bert_score/utils.py in <listcomp>(.0)
    539 
    540     def pad_batch_stats(sen_batch, stats_dict, device):
--> 541         stats = [stats_dict[s] for s in sen_batch]
    542         emb, idf = zip(*stats)
    543         emb = [e.to(device) for e in emb]

KeyError: ' 勇気が欲しいので、力を分けてくれませんか？'

Checking the input sentences

Key Error on Sentence Classification example #80
I encountered key error by using my own data set #333
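I looked through several Issues, including the ones above. Judging from the error message, the dictionary lookup `stats_dict[s]` was clearly failing, so I suspected that the input strings themselves were the cause.
When I checked the input sentences, I found that some of them contained a whitespace character, as in "( )勇気が欲しいので、力を分けてくれませんか？", where ( ) marks the position of the space.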
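A check along these lines can be sketched as follows. The helper name `find_padded_sentences` and the sample list are hypothetical; only the second sample string is taken from the KeyError message above.

```python
def find_padded_sentences(sentences):
    """Return (index, sentence) pairs whose text changes under str.strip().

    Hypothetical helper: flags leading or trailing whitespace only.
    """
    return [(i, s) for i, s in enumerate(sentences) if s != s.strip()]

# Sample data for illustration; the second entry has a leading space.
sample_sentences = [
    "勇気が欲しいので、力を分けてくれませんか？",
    " 勇気が欲しいので、力を分けてくれませんか？",
]

for index, sentence in find_padded_sentences(sample_sentences):
    # repr() makes the otherwise invisible whitespace visible.
    print(index, repr(sentence))
```

Printing with `repr()` is the useful part here, because a leading space is easy to miss when a sentence is printed normally.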

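Removing the leading whitespace

Since I was working with Japanese this time, I simply replaced every space with an empty string.
English text, however, uses spaces to separate words, so this approach cannot be applied as-is. For English, I think you could slice the string by index to drop the leading character instead.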

eliminate_space.py

# Remove every half-width space from each sentence (avoid shadowing the built-in `str`).
elim_sen = [s.replace(' ', '') for s in orig_sentences]
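For text where spaces between words must be kept, such as English, one alternative is to trim only the ends of each sentence. The sketch below uses Python's built-in `str.strip()` rather than the index slicing mentioned above; the helper name `strip_edges` and the sample sentences are hypothetical.

```python
def strip_edges(sentences):
    """Remove leading and trailing whitespace, keeping spaces between words."""
    return [s.strip() for s in sentences]

# Sample data for illustration.
english_samples = [" I need courage. ", "No padding here."]
cleaned = strip_edges(english_samples)
print(cleaned)
```

The cleaned list would then be passed to `score` in place of the original one. I have not checked whether interior whitespace can trigger the same KeyError, so treat this as a starting point.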

References

BERTScore: Evaluating Text Generation with BERT
