[Previous] Verifying the NLP Model BERT (3): The GLUE Benchmark (Part 1)
Introduction
Last time, we tried the CoLA grammatical-acceptability task from GLUE (General Language Understanding Evaluation), a benchmark for English language understanding.
This time, we try some of the other tasks.
Types of GLUE Tasks
- MNLI: determine whether two input sentences stand in entailment, contradiction, or neutrality
- QQP: determine whether two questions are semantically equivalent
- QNLI: determine whether a sentence contains the answer to a given question
- SST-2: sentiment analysis of movie reviews (positive or negative)
- CoLA: determine whether a sentence is grammatically acceptable English
- STS-B: rate the similarity of news-headline sentence pairs on a scale of 0 to 5
- MRPC: determine whether two sentences are paraphrases of each other
- RTE: determine whether one input sentence entails the other
- SQuAD: question answering (extract the answer to a question from a passage)
- NER: identify the role of each word (person, organization, location, etc.)
- SWAG: choose the sentence that plausibly follows the input sentence from four candidates
(Strictly speaking, SQuAD, NER, and SWAG are not GLUE tasks; they are additional benchmarks used to evaluate BERT in the original paper.)
Note: In NLP, an entailment relation holds between a text T and a hypothesis H when H can be inferred to be true whenever T is true. For example, T = "A man is playing a guitar on stage" entails H = "Someone is playing an instrument".
GLUE Task Datasets
The datasets are available at:
https://gluebenchmark.com/tasks
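The code in this article loads the same tasks through TensorFlow Datasets (TFDS) instead. As a quick check (a sketch; the exact list depends on your tensorflow_datasets version), you can list the GLUE configurations that TFDS bundles:

import tensorflow_datasets as tfds

# List the GLUE configurations bundled with TFDS.
print([config.name for config in tfds.builder_cls('glue').BUILDER_CONFIGS])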
SST-2: Sentiment Analysis of Movie Reviews (Positive or Negative)
The steps before the cell below are unchanged from the previous article, Verifying the NLP Model BERT (3): The GLUE Benchmark, so run them as-is (a sketch of the assumed setup follows).
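For reference, here is a minimal sketch of what those earlier steps provide, based on the TensorFlow Hub GLUE-on-TPU tutorial this series follows; the helper functions make_bert_preprocess_model, build_classifier_model, and load_dataset_from_tfds are defined in the previous article and are not repeated here.

import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
import tensorflow_text as text  # imported for its side effect: registers the BERT preprocessing ops
import tensorflow_addons as tfa
from official.nlp import optimization  # from the tf-models-official package

tf.get_logger().setLevel('ERROR')

# Distribution strategy: on a TPU runtime this would be a TPUStrategy;
# the sketch falls back to MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()

# The BERT encoder used in this article (see the fine-tuning log below).
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'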
Select a task from GLUE
tfds_name = 'glue/sst2' #@param ['glue/cola', 'glue/sst2', 'glue/mrpc', 'glue/qqp', 'glue/mnli', 'glue/qnli', 'glue/rte', 'glue/wnli']

tfds_info = tfds.builder(tfds_name).info

sentence_features = list(tfds_info.features.keys())
sentence_features.remove('idx')
sentence_features.remove('label')

available_splits = list(tfds_info.splits.keys())
train_split = 'train'
validation_split = 'validation'
test_split = 'test'
if tfds_name == 'glue/mnli':
    validation_split = 'validation_matched'
    test_split = 'test_matched'

num_classes = tfds_info.features['label'].num_classes
num_examples = tfds_info.splits.total_num_examples

print(f'Using {tfds_name} from TFDS')
print(f'This dataset has {num_examples} examples')
print(f'Number of classes: {num_classes}')
print(f'Features {sentence_features}')
print(f'Splits {available_splits}')

with tf.device('/job:localhost'):
    # batch_size=-1 is a way to load the whole dataset into memory
    in_memory_ds = tfds.load(tfds_name, batch_size=-1, shuffle_files=True)

# The code below just shows some samples from the selected dataset
print(f'Here are some sample rows from {tfds_name} dataset')
sample_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[train_split])

labels_names = tfds_info.features['label'].names
print(labels_names)
print()

sample_i = 1
for sample_row in sample_dataset.take(5):
    samples = [sample_row[feature] for feature in sentence_features]
    print(f'sample row {sample_i}')
    for sample in samples:
        print(sample.numpy())
    sample_label = sample_row['label']
    print(f'label: {sample_label} ({labels_names[sample_label]})')
    print()
    sample_i += 1
Select the glue/sst2 task and run the cell.
To train on the dataset, we decide the problem type (classification or regression) and the corresponding loss function.
def get_configuration(glue_task):
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    if glue_task == 'glue/cola':
        metrics = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
    else:
        metrics = tf.keras.metrics.SparseCategoricalAccuracy(
            'accuracy', dtype=tf.float32)

    return metrics, loss
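As a quick usage check (assuming the imports from the setup sketch above), SST-2 gets plain accuracy as its metric, while CoLA gets the Matthews correlation coefficient:

# SST-2 is scored by accuracy, CoLA by Matthews correlation.
metrics, loss = get_configuration('glue/sst2')
print(type(metrics).__name__)  # SparseCategoricalAccuracy

metrics, loss = get_configuration('glue/cola')
print(type(metrics).__name__)  # MatthewsCorrelationCoefficient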
Train the model
epochs = 3
batch_size = 32
init_lr = 2e-5

print(f'Fine tuning {tfhub_handle_encoder} model')
bert_preprocess_model = make_bert_preprocess_model(sentence_features)

with strategy.scope():
    # Metrics have to be created inside the strategy scope
    metrics, loss = get_configuration(tfds_name)

    train_dataset, train_data_size = load_dataset_from_tfds(
        in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
    steps_per_epoch = train_data_size // batch_size
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = num_train_steps // 10

    validation_dataset, validation_data_size = load_dataset_from_tfds(
        in_memory_ds, tfds_info, validation_split, batch_size,
        bert_preprocess_model)
    validation_steps = validation_data_size // batch_size

    classifier_model = build_classifier_model(num_classes)

    optimizer = optimization.create_optimizer(
        init_lr=init_lr,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        optimizer_type='adamw')

    classifier_model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])

    classifier_model.fit(
        x=train_dataset,
        validation_data=validation_dataset,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_steps=validation_steps)
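As a sanity check of the step arithmetic above: the SST-2 training split has 67,349 examples, so the numbers work out as follows (a standalone snippet, not part of the training code):

# Worked example of the step arithmetic for glue/sst2.
train_data_size = 67349
batch_size = 32
epochs = 3

steps_per_epoch = train_data_size // batch_size   # 2104
num_train_steps = steps_per_epoch * epochs        # 6312
num_warmup_steps = num_train_steps // 10          # 631 (10% warmup for AdamW)
print(steps_per_epoch, num_train_steps, num_warmup_steps)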
Fine-tuning took about 9 minutes, several times longer than the glue/cola task. Some UserWarnings are printed along the way, but we ignore them: the first says that unused input keys ('idx', 'label') are dropped, and the second warns that sparse embedding gradients are converted to dense tensors.
Fine tuning https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3 model
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['idx', 'label'] which did not match any model input. They will be ignored by the model.
inputs = self._flatten_to_reference_inputs(inputs)
Epoch 1/3
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/indexed_slices.py:446: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:1", shape=(None,), dtype=int32), values=Tensor("clip_by_global_norm/clip_by_global_norm/_0:0", dtype=float32), dense_shape=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:2", shape=(None,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
"shape. This may consume a large amount of memory." % value)
Export for inference
We create the final model, containing the preprocessing part and the fine-tuned BERT.
We save the model on Colab so that it can be downloaded later.
main_save_path = './my_models'
bert_type = tfhub_handle_encoder.split('/')[-2]
saved_model_name = f'{tfds_name.replace("/", "_")}_{bert_type}'
saved_model_path = os.path.join(main_save_path, saved_model_name)
preprocess_inputs = bert_preprocess_model.inputs
bert_encoder_inputs = bert_preprocess_model(preprocess_inputs)
bert_outputs = classifier_model(bert_encoder_inputs)
model_for_export = tf.keras.Model(preprocess_inputs, bert_outputs)
print('Saving', saved_model_path)
# Save everything on the Colab host (even the variables from TPU memory)
save_options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model_for_export.save(saved_model_path, include_optimizer=False,
                      options=save_options)
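To actually download the model later, one option (a sketch assuming a Colab runtime; the archive name is arbitrary) is to zip the SavedModel directory and fetch it through the Colab files API:

import shutil
from google.colab import files

# Zip the SavedModel directory and download the archive to the local machine.
archive = shutil.make_archive(saved_model_name, 'zip', saved_model_path)
files.download(archive)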
Test the model
As the last step, we test the results of the exported model.
Note: the test runs on the Colab host, not on the attached TPU worker.
with tf.device('/job:localhost'):
    reloaded_model = tf.saved_model.load(saved_model_path)
#@title Utility methods

def prepare(record):
    model_inputs = [[record[ft]] for ft in sentence_features]
    return model_inputs

def prepare_serving(record):
    model_inputs = {ft: record[ft] for ft in sentence_features}
    return model_inputs

def print_bert_results(test, bert_result, dataset_name):
    bert_result_class = tf.argmax(bert_result, axis=1)[0]

    if dataset_name == 'glue/cola':
        print('sentence:', test[0].numpy())
        if bert_result_class == 1:
            print('This sentence is acceptable')
        else:
            print('This sentence is unacceptable')

    elif dataset_name == 'glue/sst2':
        print('sentence:', test[0])
        if bert_result_class == 1:
            print('This sentence has POSITIVE sentiment')
        else:
            print('This sentence has NEGATIVE sentiment')

    elif dataset_name == 'glue/mrpc':
        print('sentence1:', test[0])
        print('sentence2:', test[1])
        if bert_result_class == 1:
            print('Are a paraphrase')
        else:
            print('Are NOT a paraphrase')

    elif dataset_name == 'glue/qqp':
        print('question1:', test[0])
        print('question2:', test[1])
        if bert_result_class == 1:
            print('Questions are similar')
        else:
            print('Questions are NOT similar')

    elif dataset_name == 'glue/mnli':
        print('premise   :', test[0])
        print('hypothesis:', test[1])
        if bert_result_class == 1:
            print('This premise is NEUTRAL to the hypothesis')
        elif bert_result_class == 2:
            print('This premise CONTRADICTS the hypothesis')
        else:
            print('This premise ENTAILS the hypothesis')

    elif dataset_name == 'glue/qnli':
        print('question:', test[0])
        print('sentence:', test[1])
        if bert_result_class == 1:
            print('The question is NOT answerable by the sentence')
        else:
            print('The question is answerable by the sentence')

    elif dataset_name == 'glue/rte':
        print('sentence1:', test[0])
        print('sentence2:', test[1])
        if bert_result_class == 1:
            print('Sentence1 DOES NOT entail sentence2')
        else:
            print('Sentence1 entails sentence2')

    elif dataset_name == 'glue/wnli':
        print('sentence1:', test[0])
        print('sentence2:', test[1])
        if bert_result_class == 1:
            print('Sentence1 DOES NOT entail sentence2')
        else:
            print('Sentence1 entails sentence2')

    print('BERT raw results:', bert_result[0])
    print()
Test
with tf.device('/job:localhost'):
    test_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[test_split])
    for test_row in test_dataset.shuffle(1000).map(prepare).take(5):
        if len(sentence_features) == 1:
            result = reloaded_model(test_row[0])
        else:
            result = reloaded_model(list(test_row))

        print_bert_results(test_row, result, tfds_name)
Here are the test results.
sentence: tf.Tensor([b"for anyone who grew up on disney 's 1950 treasure island , or remembers the 1934 victor fleming classic , this one feels like an impostor ."], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.5094056 -3.440254 ], shape=(2,), dtype=float32)
sentence: tf.Tensor([b'the only thing worse than your substandard , run-of-the-mill hollywood picture is an angst-ridden attempt to be profound .'], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.8467538 -3.1074412], shape=(2,), dtype=float32)
sentence: tf.Tensor([b'like a marathon runner trying to finish a race , you need a constant influx of liquid just to get through it .'], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 2.9375803 -2.4308932], shape=(2,), dtype=float32)
sentence: tf.Tensor([b'far from perfect , but its heart is in the right place ... innocent and well-meaning .'], shape=(1,), dtype=string)
This sentence has POSITIVE sentiment
BERT raw results: tf.Tensor([-5.709575 3.1275086], shape=(2,), dtype=float32)
sentence: tf.Tensor([b"it 's really yet another anemic and formulaic lethal weapon-derived buddy-cop movie , trying to pass off its lack of imagination as hip knowingness ."], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.471871 -3.522515], shape=(2,), dtype=float32)
The movie-review sentiment analysis works: each sentence gets a positive/negative judgment.
Still, only one of the five samples was judged positive.
And since that one opens with "far from perfect", even it seems borderline.
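Incidentally, the prepare_serving helper defined above goes unused in this excerpt. In the TensorFlow Hub tutorial this notebook follows, it feeds the SavedModel's named serving signature instead of calling the model directly, roughly like this (the 'prediction' output key comes from the name given to the classifier head in that tutorial):

# Call the exported model through its serving signature (dict inputs).
with tf.device('/job:localhost'):
    serving_model = reloaded_model.signatures['serving_default']
    for test_row in test_dataset.shuffle(1000).map(prepare_serving).take(5):
        result = serving_model(**test_row)
        # The 'prediction' key is the classifier's defined output name.
        print_bert_results(list(test_row.values()), result['prediction'],
                           tfds_name)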
Conclusion
We solved the GLUE SST-2 task.
I plan to run the other tasks in the same way.
The long-awaited Japanese counterpart to GLUE, the JGLUE benchmark, is being developed by the Kawahara Laboratory at Waseda University, and its benchmark results are expected to be published soon, so I am keeping watch.
Stay tuned.