Verifying the natural language processing model BERT (4): GLUE benchmark (part 2)

[Previous] Verifying the natural language processing model BERT (3): GLUE benchmark (part 1)

Introduction

Last time, we looked at GLUE (General Language Understanding Evaluation), a benchmark for English language understanding, and tried the grammatical-acceptability check of the CoLA task.
This time, let's try the other tasks.

Types of GLUE tasks

  • MNLI: decide whether two input sentences are in an entailment/contradiction/neutral relation
  • QQP: decide whether two questions are semantically equivalent
  • QNLI: decide whether a sentence contains the answer to a given question
  • SST-2: sentiment analysis of movie reviews (positive/negative)
  • CoLA: decide whether an input sentence is grammatically acceptable English
  • STS-B: rate the similarity of news-headline sentence pairs on a five-point scale
  • MRPC: decide whether two sentences are paraphrases of each other
  • RTE: decide whether one input sentence entails the other
  • SQuAD: extractive question answering (find the answer span in a passage)
  • NER: identify the role of each word (person/organization/location, etc.)
  • SWAG: choose the sentence that follows the input sentence from four candidates

(Strictly speaking, SQuAD, NER, and SWAG are not part of GLUE; they are separate benchmarks that the BERT paper evaluates alongside it.)

※ In natural language processing, an entailment relation between a text T and a hypothesis H means that whenever T is true, H can also be inferred to be true. For example, T: "A soccer game with multiple males playing." entails H: "Some men are playing a sport."

GLUE task datasets

They can be downloaded from:
https://gluebenchmark.com/tasks
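
As a side note, the same tasks are also packaged in TensorFlow Datasets as configurations of the glue builder. A minimal sketch to list them (the config names follow the TFDS catalog):

# List the GLUE task configs that TFDS provides.
import tensorflow_datasets as tfds

glue_configs = tfds.builder_cls('glue').builder_configs
print(sorted(glue_configs.keys()))
# e.g. ['ax', 'cola', 'mnli', ..., 'sst2', 'stsb', 'wnli']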

SST-2: sentiment analysis of movie reviews (positive/negative)

The steps up to the one below are the same as in the previous article, Verifying the natural language processing model BERT (3): GLUE benchmark (part 1), so run them as-is first.
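
For reference, the snippets below rely on that setup. Roughly, it provides the following (a sketch, not the complete notebook; `strategy`, `make_bert_preprocess_model`, `build_classifier_model`, and `load_dataset_from_tfds` are helper definitions from the previous article):

# Assumed setup carried over from part (3) of this series.
import os
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa   # Matthews correlation metric for CoLA
import tensorflow_hub as hub
import tensorflow_text as text    # registers ops used by the preprocessing model
from official.nlp import optimization  # AdamW optimizer with warmup

# BERT encoder fine-tuned below (this handle also appears in the training log)
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'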

Selecting a task from GLUE

tfds_name = 'glue/sst2'  #@param ['glue/cola', 'glue/sst2', 'glue/mrpc', 'glue/qqp', 'glue/mnli', 'glue/qnli', 'glue/rte', 'glue/wnli']

tfds_info = tfds.builder(tfds_name).info

sentence_features = list(tfds_info.features.keys())
sentence_features.remove('idx')
sentence_features.remove('label')

available_splits = list(tfds_info.splits.keys())
train_split = 'train'
validation_split = 'validation'
test_split = 'test'
if tfds_name == 'glue/mnli':
  validation_split = 'validation_matched'
  test_split = 'test_matched'

num_classes = tfds_info.features['label'].num_classes
num_examples = tfds_info.splits.total_num_examples

print(f'Using {tfds_name} from TFDS')
print(f'This dataset has {num_examples} examples')
print(f'Number of classes: {num_classes}')
print(f'Features {sentence_features}')
print(f'Splits {available_splits}')

with tf.device('/job:localhost'):
  # batch_size=-1 is a way to load the dataset into memory
  in_memory_ds = tfds.load(tfds_name, batch_size=-1, shuffle_files=True)

# The code below is just to show some samples from the selected dataset
print(f'Here are some sample rows from {tfds_name} dataset')
sample_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[train_split])

labels_names = tfds_info.features['label'].names
print(labels_names)
print()

sample_i = 1
for sample_row in sample_dataset.take(5):
  samples = [sample_row[feature] for feature in sentence_features]
  print(f'sample row {sample_i}')
  for sample in samples:
    print(sample.numpy())
  sample_label = sample_row['label']

  print(f'label: {sample_label} ({labels_names[sample_label]})')
  print()
  sample_i += 1

Select the glue/sst2 task and run the cell.

(Screenshot: the printed dataset summary and sample rows for glue/sst2.)

To train on this dataset, we decide the problem type (classification or regression) and the corresponding loss function.

def get_configuration(glue_task):

  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

  if glue_task == 'glue/cola':
    metrics = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
  else:
    metrics = tf.keras.metrics.SparseCategoricalAccuracy(
        'accuracy', dtype=tf.float32)

  return metrics, loss
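
The function above only covers classification. For the regression task STS-B (not used in this article), a mean-squared-error loss would be needed instead; a minimal sketch, assuming the glue/stsb config:

# Hypothetical extension of get_configuration for the regression task
# glue/stsb (the classifier head would also need a single linear output).
def get_configuration_v2(glue_task):
  if glue_task == 'glue/stsb':
    loss = tf.keras.losses.MeanSquaredError()
    metrics = tf.keras.metrics.MeanSquaredError('mse', dtype=tf.float32)
  elif glue_task == 'glue/cola':
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metrics = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
  else:
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metrics = tf.keras.metrics.SparseCategoricalAccuracy(
        'accuracy', dtype=tf.float32)
  return metrics, loss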

Training the model

epochs = 3
batch_size = 32
init_lr = 2e-5

print(f'Fine tuning {tfhub_handle_encoder} model')
bert_preprocess_model = make_bert_preprocess_model(sentence_features)

with strategy.scope():

  # metrics have to be created inside the strategy scope
  metrics, loss = get_configuration(tfds_name)

  train_dataset, train_data_size = load_dataset_from_tfds(
      in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
  steps_per_epoch = train_data_size // batch_size
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = num_train_steps // 10

  validation_dataset, validation_data_size = load_dataset_from_tfds(
      in_memory_ds, tfds_info, validation_split, batch_size,
      bert_preprocess_model)
  validation_steps = validation_data_size // batch_size

  classifier_model = build_classifier_model(num_classes)

  optimizer = optimization.create_optimizer(
      init_lr=init_lr,
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      optimizer_type='adamw')

  classifier_model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])

  classifier_model.fit(
      x=train_dataset,
      validation_data=validation_dataset,
      steps_per_epoch=steps_per_epoch,
      epochs=epochs,
      validation_steps=validation_steps)

Fine-tuning took 9 minutes (several times longer than the glue/cola task).
Some UserWarnings are printed along the way, but they can be ignored.

Fine tuning https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3 model
/usr/local/lib/python3.7/dist-packages/keras/engine/functional.py:559: UserWarning: Input dict contained keys ['idx', 'label'] which did not match any model input. They will be ignored by the model.
  inputs = self._flatten_to_reference_inputs(inputs)
Epoch 1/3
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/indexed_slices.py:446: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:1", shape=(None,), dtype=int32), values=Tensor("clip_by_global_norm/clip_by_global_norm/_0:0", dtype=float32), dense_shape=Tensor("AdamWeightDecay/gradients/StatefulPartitionedCall:2", shape=(None,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)

Exporting for inference

Create the final model, containing the preprocessing part and the fine-tuned BERT.
Save the model on Colab so that it can be downloaded later.

main_save_path = './my_models'
bert_type = tfhub_handle_encoder.split('/')[-2]
saved_model_name = f'{tfds_name.replace("/", "_")}_{bert_type}'

saved_model_path = os.path.join(main_save_path, saved_model_name)

preprocess_inputs = bert_preprocess_model.inputs
bert_encoder_inputs = bert_preprocess_model(preprocess_inputs)
bert_outputs = classifier_model(bert_encoder_inputs)
model_for_export = tf.keras.Model(preprocess_inputs, bert_outputs)

print('Saving', saved_model_path)

# Save everything on the Colab host (even the variables from TPU memory)
save_options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
model_for_export.save(saved_model_path, include_optimizer=False,
                      options=save_options)
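
To actually download the SavedModel from the Colab host, one option (a sketch, assuming the google.colab runtime) is to zip the directory first:

# Archive the exported model and download it from Colab.
import shutil
from google.colab import files

archive_path = shutil.make_archive(saved_model_name, 'zip', saved_model_path)
files.download(archive_path)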

Testing the model

As the final step, test the results of the exported model.
※ The test runs on the Colab host, not on the attached TPU worker.

with tf.device('/job:localhost'):
  reloaded_model = tf.saved_model.load(saved_model_path)
#@title Utility methods

def prepare(record):
  model_inputs = [[record[ft]] for ft in sentence_features]
  return model_inputs

def prepare_serving(record):
  model_inputs = {ft: record[ft] for ft in sentence_features}
  return model_inputs

def print_bert_results(test, bert_result, dataset_name):

  bert_result_class = tf.argmax(bert_result, axis=1)[0]

  if dataset_name == 'glue/cola':
    print('sentence:', test[0].numpy())
    if bert_result_class == 1:
      print('This sentence is acceptable')
    else:
      print('This sentence is unacceptable')

  elif dataset_name == 'glue/sst2':
    print('sentence:', test[0])
    if bert_result_class == 1:
      print('This sentence has POSITIVE sentiment')
    else:
      print('This sentence has NEGATIVE sentiment')

  elif dataset_name == 'glue/mrpc':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Are a paraphrase')
    else:
      print('Are NOT a paraphrase')

  elif dataset_name == 'glue/qqp':
    print('question1:', test[0])
    print('question2:', test[1])
    if bert_result_class == 1:
      print('Questions are similar')
    else:
      print('Questions are NOT similar')

  elif dataset_name == 'glue/mnli':
    print('premise   :', test[0])
    print('hypothesis:', test[1])
    if bert_result_class == 1:
      print('This premise is NEUTRAL to the hypothesis')
    elif bert_result_class == 2:
      print('This premise CONTRADICTS the hypothesis')
    else:
      print('This premise ENTAILS the hypothesis')

  elif dataset_name == 'glue/qnli':
    print('question:', test[0])
    print('sentence:', test[1])
    if bert_result_class == 1:
      print('The question is NOT answerable by the sentence')
    else:
      print('The question is answerable by the sentence')

  elif dataset_name == 'glue/rte':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Sentence1 DOES NOT entail sentence2')
    else:
      print('Sentence1 entails sentence2')

  elif dataset_name == 'glue/wnli':
    print('sentence1:', test[0])
    print('sentence2:', test[1])
    if bert_result_class == 1:
      print('Sentence1 DOES NOT entail sentence2')
    else:
      print('Sentence1 entails sentence2')

  print('BERT raw results:', bert_result[0])
  print()
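
Note that prepare_serving is not used in the test loop below; it builds the feature dict that the SavedModel's named serving signature (e.g. for TF Serving) expects. A sketch of that variant, following the original TF Hub tutorial (the 'serving_default' signature name and the dict-shaped result are assumptions based on that tutorial):

# Query the reloaded model through its serving signature.
with tf.device('/job:localhost'):
  serving_fn = reloaded_model.signatures['serving_default']
  serving_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[test_split])
  for test_row in serving_dataset.shuffle(1000).map(prepare_serving).take(5):
    result = serving_fn(**test_row)  # a dict keyed by the model's output name
    print(result)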

Running the test

with tf.device('/job:localhost'):
  test_dataset = tf.data.Dataset.from_tensor_slices(in_memory_ds[test_split])
  for test_row in test_dataset.shuffle(1000).map(prepare).take(5):
    if len(sentence_features) == 1:
      result = reloaded_model(test_row[0])
    else:
      result = reloaded_model(list(test_row))

    print_bert_results(test_row, result, tfds_name)

Here are the test results.

sentence: tf.Tensor([b"for anyone who grew up on disney 's 1950 treasure island , or remembers the 1934 victor fleming classic , this one feels like an impostor ."], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.5094056 -3.440254 ], shape=(2,), dtype=float32)

sentence: tf.Tensor([b'the only thing worse than your substandard , run-of-the-mill hollywood picture is an angst-ridden attempt to be profound .'], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.8467538 -3.1074412], shape=(2,), dtype=float32)

sentence: tf.Tensor([b'like a marathon runner trying to finish a race , you need a constant influx of liquid just to get through it .'], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 2.9375803 -2.4308932], shape=(2,), dtype=float32)

sentence: tf.Tensor([b'far from perfect , but its heart is in the right place ... innocent and well-meaning .'], shape=(1,), dtype=string)
This sentence has POSITIVE sentiment
BERT raw results: tf.Tensor([-5.709575   3.1275086], shape=(2,), dtype=float32)

sentence: tf.Tensor([b"it 's really yet another anemic and formulaic lethal weapon-derived buddy-cop movie , trying to pass off its lack of imagination as hip knowingness ."], shape=(1,), dtype=string)
This sentence has NEGATIVE sentiment
BERT raw results: tf.Tensor([ 3.471871 -3.522515], shape=(2,), dtype=float32)

The sentiment analysis of the movie reviews produced positive/negative judgments as expected.
Still, only one review was judged positive, which seems low.
And given that it opens with "far from perfect, but", even that one is borderline?

Conclusion

We solved the GLUE task SST-2.
I plan to run the other tasks in the same way.

The long-awaited Japanese version of the GLUE benchmark (JGLUE) is being built by the Kawahara Lab at Waseda University, and its benchmark results are said to be published soon, so I am keeping watch...

Stay tuned.

[Next] Verifying the natural language processing model BERT (5): GLUE benchmark (part 3)
