
NER on text with deep learning (making memorization more efficient with AI)

Posted at 2022-10-04

Google Colab code

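Overview

In this project I feed text to a deep learning model and aim to mask only specific keywords in that text.
For example, many people have, as students or while studying for a certification, highlighted text with a green pen and then covered it with a red plastic sheet to hide it. That is surprisingly tedious, though, and just drawing the lines before hiding anything can be tiring.

I worked on this to have deep learning do the highlighting automatically and make studying more efficient.

The simplest way to achieve this is to extract every proper noun in the text and hide them all.
However, this hides too much. Most text contains not only the keywords of the field being studied but also proper nouns from many other fields.

Someone studying machine learning would want to hide only the keywords closely related to machine learning. Likewise, a student studying history may want to memorize the dates of events but not yet the names of people, so the kinds of keywords to hide can change from one occasion to the next.

In short, the goal of this project is something that hiding every proper noun cannot achieve: selectively hiding only proper nouns with particular attributes.
Normally, humans train AI using datasets. Here I flipped that idea around and built something in which the AI helps humans learn.

※ This article accompanies a presentation I gave at the recent 技育展2022 (Giiku-ten 2022) and contains the code for the theme "Making memorization more efficient with AI". Honestly, the other presenters were at such a high level (~~and 技育展2022 is already over anyway~~) that I hesitate to write this up, but I am publishing it in the hope that it helps someone.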

Environment

Google Colaboratory
Python 3

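About the task

The task here is a natural language processing task called NER (named entity recognition, 固有表現抽出 in Japanese). It is a technique for extracting named entities such as person and place names, as well as dates, times, and numeric values, from text.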
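
To make the goal concrete, here is a minimal sketch of selective masking driven by NER-style annotations. It assumes entities are given as character spans with a type, in the same shape as the `span` and `type` fields used later in this article. The function name `mask_entities`, the mask character, and the hand-written sample annotation are hypothetical.

```python
def mask_entities(text, entities, target_types=None, mask_char="x"):
    """Replace the characters of selected entities with mask_char.

    entities: list of dicts with "span" ([start, end) character offsets)
              and "type" (entity label).
    target_types: labels to hide; None hides every entity.
    """
    chars = list(text)
    for ent in entities:
        if target_types is not None and ent["type"] not in target_types:
            continue
        start, end = ent["span"]
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


# Hand-written annotation for illustration only (not model output).
sample_text = "株式会社サポーターズは渋谷にあります"
sample_entities = [
    {"span": [0, 10], "type": "法人名"},
    {"span": [11, 13], "type": "地名"},
]

masked_all = mask_entities(sample_text, sample_entities)
masked_places = mask_entities(sample_text, sample_entities, target_types={"地名"})
print(masked_all)
print(masked_places)
```

Passing `target_types` is what turns "hide every entity" into "hide only the kinds of entity I want to memorize".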

About the model

The model is Tohoku University's bert-base-japanese-v2.

Code

1. Getting the data

This time I use the "Wikipediaを用いた日本語の固有表現抽出データセット" (a Japanese named entity recognition dataset built from Wikipedia).

!git clone https://github.com/stockmarkteam/ner-wikipedia-dataset.git

Since I use bert-base-japanese-v2, install the following libraries with pip.

!pip install unidic_lite
!pip install fugashi
!pip install transformers
!pip install janome

Next, import the libraries to be used.

import json
import pandas as pd
from collections import Counter
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer,AutoModelForTokenClassification,AutoConfig
from torch.utils.data import Dataset, DataLoader
import torch
from torch import cuda
from sklearn.metrics import accuracy_score
import os
from pathlib import Path
import spacy
from spacy import displacy
from pylab import cm, matplotlib
import gc
from scipy.special import softmax
os.environ["CUDA_VISIBLE_DEVICES"]="0"

Load the file used in this article into a variable.

with open('/content/ner-wikipedia-dataset/ner.json', "r", encoding="utf-8") as f:
    data = json.load(f)

2. Processing the data

The JSON format is awkward to work with as is, so convert it to a DataFrame.

text_id=[]
text=[]
ent=[]
for item in data:
   text_id.append(item["curid"])
   text.append(item["text"])
   ent.append(item["entities"])
text_df = pd.DataFrame({'id': text_id, 'text': text,"ent":ent})

Next, since I use BERT, define the functions used for text preprocessing.

tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')

def convert_examples_to_features(text):
  tok = tokenizer.encode_plus(text,
  max_length=512,
  truncation=True,
  add_special_tokens=False)
  tok=tokenizer.convert_ids_to_tokens(tok["input_ids"])
  return tok

def remove_symbol(tex):
  tex=tex.replace("##","")
  return tex

def mode(list_):
  counter = Counter(list_)
  return counter.most_common()[0][0]

Reshape the DataFrame created above so that it is easier to use for the NER task.

all_entities = []
all_text = []
all_id=[]
for ind,item in text_df.iterrows():
   if ind%1000==0:
       print(ind,', ',end='')
   t=convert_examples_to_features(item['text'])
   t=list(map(remove_symbol,t))
   t=" ".join(t)
   if "[UNK]" in t:
     continue
   total = len(item['text'])
   ent_str_list=["O" for _ in range(total)]
   for i in item["ent"]:
       ent_str_list[i["span"][0]:i["span"][1]]=[i["type"] for _ in range(i["span"][1]-i["span"][0])]
   cnt=0
   l=[]
   for word in t.split():
       l.append(ent_str_list[cnt:cnt+len(word)])
       cnt+=len(word)
   ent_word_list=list(map(mode,l))
   all_entities.append(ent_word_list)

   all_id.append(item["id"])
   all_text.append(item['text'])

text_df = pd.DataFrame({'id': all_id, 'text': all_text,"entities":all_entities})
text_df.head()
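
The loop above assigns each token the most frequent character-level label among the characters it covers. The sketch below reproduces that idea on a hand-written token list. The sentence, the spans, and the tokenization are assumptions for illustration, and a real tokenizer may split the sentence differently.

```python
from collections import Counter


def char_labels(text, entities):
    """Build one label per character; characters outside any span get "O"."""
    labels = ["O"] * len(text)
    for ent in entities:
        start, end = ent["span"]
        labels[start:end] = [ent["type"]] * (end - start)
    return labels


def token_labels(tokens, labels):
    """Give each token the majority label of the characters it covers."""
    result, pos = [], 0
    for token in tokens:
        piece = token.replace("##", "")  # drop the WordPiece continuation marker
        covered = labels[pos:pos + len(piece)]
        result.append(Counter(covered).most_common(1)[0][0])
        pos += len(piece)
    return result


text = "渋谷駅は東京にある"
entities = [{"span": [0, 3], "type": "施設名"}, {"span": [4, 6], "type": "地名"}]
tokens = ["渋谷", "##駅", "は", "東京", "に", "ある"]  # hypothetical tokenization

aligned = token_labels(tokens, char_labels(text, entities))
print(list(zip(tokens, aligned)))
```

This only works when the tokens, concatenated without the `##` markers, reproduce the original text character by character, which is presumably why the loop above skips sentences containing `[UNK]`.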

The training data contains 9 labels in total (8 entity types plus 'O' for everything else). Prepare them as dictionaries.

output_labels = ['O', "人名","法人名","政治的組織名","その他の組織名","地名","施設名","製品名","イベント名"]
labels_to_ids = {v:k for k,v in enumerate(output_labels)}
ids_to_labels = {k:v for k,v in enumerate(output_labels)}

Set the model configuration. The model is Tohoku University's bert-base-japanese-v2, which supports Japanese. The training dataset contains only short sentences, so the maximum token length is set to 256.

config = {'model_name': 'cl-tohoku/bert-base-japanese-v2',  
        'max_length': 256,
        'train_batch_size':8,
        'valid_batch_size':8,
        'epochs':3,
        'learning_rates': [2.5e-5, 2.5e-5, 2.5e-6],
        'max_grad_norm':10,
        'device': 'cuda' if cuda.is_available() else 'cpu'}

Since I use PyTorch, prepare a Dataset class for feeding data to the model.

class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len,inference=False):
       self.len = len(dataframe)
       self.data = dataframe
       self.tokenizer = tokenizer
       self.max_len = max_len
       self.inference = inference
  def __getitem__(self, index):
       text = self.data.text[index]       
       word_labels = self.data.entities[index]

       encoding = self.tokenizer.encode_plus(text,
                         max_length=self.max_len,
                         truncation=True,
                         padding='max_length')
       ids_length = encoding['input_ids']
      
       if self.inference:
         label_ids=[-1 for _ in range(self.max_len)]
       else:
         label_ids=[-100 for _ in range(self.max_len)]
         label_ids[1:len(word_labels)+1]=[labels_to_ids[label_] for label_ in word_labels]
       encoding['labels'] = label_ids

       item = {key: torch.as_tensor(val) for key, val in encoding.items()}
      
       return item

  def __len__(self):
       return self.len
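
The label handling in `__getitem__` can be illustrated in isolation. Position 0 is reserved for the `[CLS]` token, the real labels follow, and every remaining position is filled with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the loss. The helper below is a sketch with a hypothetical name. Unlike the class above, it also truncates labels that would not fit, which is an added assumption.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_label_ids(word_labels, label_map, max_len):
    """Return max_len label IDs aligned with [CLS] tokens... [SEP] [PAD]..."""
    label_ids = [IGNORE_INDEX] * max_len
    # Keep room for [CLS] at position 0 and [SEP] after the last token.
    usable = word_labels[:max_len - 2]
    for offset, label in enumerate(usable, start=1):
        label_ids[offset] = label_map[label]
    return label_ids


label_map = {"O": 0, "人名": 1, "地名": 5}
example = build_label_ids(["人名", "O", "地名"], label_map, max_len=8)
print(example)
```

Without the truncation step, a sentence with more tokens than `max_len - 2` would make the label list longer than the input IDs. The short sentences in this dataset are unlikely to hit that case.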

Split the data into train and validation sets.

IDS = text_df.id.unique()
np.random.seed(42)
train_idx = np.random.choice(np.arange(len(IDS)),int(0.9*len(IDS)),replace=False)
valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)
np.random.seed(None)
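
The split above draws 90% of the unique IDs for training and keeps the rest for validation. A dependency-free sketch of the same idea follows. It uses Python's `random` module instead of NumPy, so the selected IDs will differ from the split used in this article, and `split_ids` is a hypothetical helper.

```python
import random


def split_ids(ids, train_fraction=0.9, seed=42):
    """Split unique IDs into disjoint train and validation lists."""
    unique_ids = sorted(set(ids))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    train = set(rng.sample(unique_ids, int(train_fraction * len(unique_ids))))
    valid = [i for i in unique_ids if i not in train]
    return sorted(train), valid


train_ids, valid_ids = split_ids(range(100))
print(len(train_ids), len(valid_ids))
```

Splitting by ID rather than by row keeps all rows that share an ID on the same side of the split.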

Create Dataset objects with the Dataset class defined above.

data = text_df[['id','text', 'entities']]
train_dataset = data.loc[data['id'].isin(IDS[train_idx]),['text', 'entities']].reset_index(drop=True)
test_dataset = data.loc[data['id'].isin(IDS[valid_idx])].reset_index(drop=True)

tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')
training_set = dataset(train_dataset, tokenizer, config['max_length'])
testing_set = dataset(test_dataset, tokenizer, config['max_length'])

train_params = {'batch_size': config['train_batch_size'],
               'shuffle': True,
               'num_workers': 1,
               'pin_memory':True
               }
test_params = {'batch_size': config['valid_batch_size'],
               'shuffle': False,
               'num_workers': 1,
               'pin_memory':True
               }
training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

3. Defining the training functions

def train(epoch):
   tr_loss, tr_accuracy = 0, 0
   nb_tr_examples, nb_tr_steps = 0, 0

   model.train()
  
   for idx, batch in enumerate(training_loader):
      
       ids = batch['input_ids'].to(config['device'], dtype = torch.long)
       mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
       labels = batch['labels'].to(config['device'], dtype = torch.long)

       loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels,
                              return_dict=False)
       tr_loss += loss.item()

       nb_tr_steps += 1
       nb_tr_examples += labels.size(0)
      
       if idx % 200==0:
           loss_step = tr_loss/nb_tr_steps
           print(f"Training loss after {idx:04d} training steps: {loss_step}")
         
       flattened_targets = labels.view(-1)
       active_logits = tr_logits.view(-1, model.num_labels)
       flattened_predictions = torch.argmax(active_logits, axis=1)
      
       active_accuracy = labels.view(-1) != -100
      
       labels = torch.masked_select(flattened_targets, active_accuracy)
       predictions = torch.masked_select(flattened_predictions, active_accuracy)

       tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
       tr_accuracy += tmp_tr_accuracy
  
       # Backpropagate first, then clip the freshly computed gradients
       optimizer.zero_grad()
       loss.backward()
       torch.nn.utils.clip_grad_norm_(
           parameters=model.parameters(), max_norm=config['max_grad_norm']
       )
       optimizer.step()

   epoch_loss = tr_loss / nb_tr_steps
   tr_accuracy = tr_accuracy / nb_tr_steps
   print(f"Training loss epoch: {epoch_loss}")
   print(f"Training accuracy epoch: {tr_accuracy}")

   valid(epoch)

Next, define the validation function.

@torch.no_grad()  # gradients are not needed for validation
def valid(epoch):
   val_loss, val_accuracy = 0, 0
   nb_val_examples, nb_val_steps = 0, 0

   model.eval()
  
   for idx, batch in enumerate(testing_loader):
      
       ids = batch['input_ids'].to(config['device'], dtype = torch.long)
       mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
       labels = batch['labels'].to(config['device'], dtype = torch.long)

       loss, val_logits = model(input_ids=ids, attention_mask=mask, labels=labels,
                              return_dict=False)
       val_loss += loss.item()

       nb_val_steps += 1
       nb_val_examples += labels.size(0)
      
       if idx % 200==0:
           loss_step = val_loss/nb_val_steps
           print(f"Valid loss after {idx:04d} training steps: {loss_step}")
         
       flattened_targets = labels.view(-1)
       active_logits = val_logits.view(-1, model.num_labels)
       flattened_predictions = torch.argmax(active_logits, axis=1)
      
       active_accuracy = labels.view(-1) != -100
      
       labels = torch.masked_select(flattened_targets, active_accuracy)
       predictions = torch.masked_select(flattened_predictions, active_accuracy)

       tmp_val_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
       val_accuracy += tmp_val_accuracy

   epoch_loss = val_loss / nb_val_steps
   val_accuracy = val_accuracy / nb_val_steps
   print(f"Valid loss epoch: {epoch_loss}")
   print(f"Valid accuracy epoch: {val_accuracy}")
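
Both functions compute accuracy only over positions whose label is not -100. The same calculation, written in plain Python for a single flattened sequence, is sketched below. `masked_accuracy` is a hypothetical name. The article's code uses `torch.masked_select` and scikit-learn's `accuracy_score` instead.

```python
def masked_accuracy(labels, predictions, ignore_index=-100):
    """Fraction of correct predictions among positions that carry a real label."""
    pairs = [(y, p) for y, p in zip(labels, predictions) if y != ignore_index]
    if not pairs:
        return 0.0
    return sum(y == p for y, p in pairs) / len(pairs)


labels = [-100, 1, 0, 5, -100, -100]
predictions = [3, 1, 0, 0, 7, 7]  # values at ignored positions do not matter
acc = masked_accuracy(labels, predictions)
print(acc)
```

Note that the functions above average this value over batches, which can differ slightly from the accuracy over all tokens when batches contain different numbers of labeled tokens.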

4. Defining the model

Prepare the model and the optimizer.

config_model = AutoConfig.from_pretrained('cl-tohoku/bert-base-japanese-v2', add_prefix_space=True)
config_model.num_labels = 9
model = AutoModelForTokenClassification.from_pretrained(
                  'cl-tohoku/bert-base-japanese-v2',config=config_model)
model.to(config['device'])
optimizer = torch.optim.Adam(params=model.parameters(), lr=config['learning_rates'][0])

5. Training the model

for epoch in range(config['epochs']):
   print(f"### Training epoch: {epoch + 1}")
   for g in optimizer.param_groups:
       g['lr'] = config['learning_rates'][epoch]
   lr = optimizer.param_groups[0]['lr']
   print(f'### LR = {lr}\n')
  
   train(epoch)
   torch.cuda.empty_cache()
   gc.collect()
torch.save(model.state_dict(), f'model.pt')

The training output is shown below.

### Training epoch: 1
### LR = 2.5e-05

Training loss after 0000 training steps: 2.2537193298339844
Training loss after 0200 training steps: 0.36059665366710714
Training loss after 0400 training steps: 0.24726793236166536
Training loss epoch: 0.2050272538792342
Training accuracy epoch: 0.9423462963490461
Training loss after 0000 training steps: 0.005996639374643564
Valid loss epoch: 0.11047739142497234
Valid accuracy epoch: 0.9715233239395235
### Training epoch: 2
### LR = 2.5e-05

Training loss after 0000 training steps: 0.03339996561408043
Training loss after 0200 training steps: 0.06671449726225058
Training loss after 0400 training steps: 0.06751814809165653
Training loss epoch: 0.06462554331791277
Training accuracy epoch: 0.9818470349622725
Training loss after 0000 training steps: 0.00981813296675682
Valid loss epoch: 0.12009443794308669
Valid accuracy epoch: 0.9701264563013048
### Training epoch: 3
### LR = 2.5e-06

Training loss after 0000 training steps: 0.03403444215655327
Training loss after 0200 training steps: 0.03833096047321251
Training loss after 0400 training steps: 0.033768376736132956
Training loss epoch: 0.033311707269167525
Training accuracy epoch: 0.9913539572661758
Training loss after 0000 training steps: 0.001965571893379092
Valid loss epoch: 0.0973180867252214
Valid accuracy epoch: 0.9767143829766056

The validation accuracy is about 97.7% (token-level, counted over every labeled token including "O"), so the model looks reasonably good.
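
Because most tokens are labeled "O", token-level accuracy can look high even when entities are missed. One way to check this is to compute precision and recall over non-"O" tokens only. The sketch below assumes gold and predicted label IDs are available as flat lists. The function name and the toy data are hypothetical and are not results from this model.

```python
def entity_token_scores(gold, pred, outside_id=0):
    """Token-level precision/recall/F1 that ignores correct "O" predictions."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != outside_id)
    predicted = sum(1 for p in pred if p != outside_id)
    actual = sum(1 for g in gold if g != outside_id)
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: 10 tokens, 2 of them entities, 1 entity token missed.
gold = [0, 0, 0, 1, 0, 0, 5, 0, 0, 0]
pred = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = entity_token_scores(gold, pred)
print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is high while half of the entity tokens are missed, which is why an entity-focused metric is a useful complement.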

6. Checking the model

Now let's check the predictions of the trained model.

all_pred=[]
all_input_ids=[]

for idx, batch in enumerate(testing_loader):
   if idx==len(testing_loader):
       continue
   ids = batch['input_ids'].to(config['device'], dtype = torch.long)
   mask = batch['attention_mask'].to(config['device'], dtype = torch.long)

   logits = model(input_ids=ids, attention_mask=mask,return_dict=False)
   x=logits[0].to('cpu').detach().numpy().copy()
   x_mask=mask.to('cpu').detach().numpy().copy()
   x_ids=ids.to('cpu').detach().numpy().copy()
   pred_proba=softmax(x,axis=2)
   pred=np.argmax(pred_proba,axis=2)

   input_id=x_ids*x_mask
   ids=[ids_[1:np.sum(x_mask[ind])-1] for ind,ids_ in enumerate(input_id)]

   prediction=pred*x_mask
   prediction=[pred_[1:np.sum(x_mask[ind])-1] for ind,pred_ in enumerate(prediction)]
   all_pred.append(prediction)
   all_input_ids.append(ids)
   break
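
The slicing `[1:np.sum(x_mask[ind])-1]` keeps only the real tokens. The attention mask covers `[CLS]`, the tokens, and `[SEP]`, so dropping the first and the last attended position removes the special tokens and all padding. A plain-Python sketch of that step, with made-up IDs and a hypothetical helper name:

```python
def strip_special_and_padding(values, attention_mask):
    """Drop [CLS], [SEP], and padded positions from one sequence."""
    n_attended = sum(attention_mask)  # [CLS] + real tokens + [SEP]
    return values[1:n_attended - 1]


# Made-up example: [CLS] t1 t2 t3 [SEP] [PAD] [PAD]
input_ids = [2, 101, 102, 103, 3, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
predicted_labels = [0, 5, 0, 1, 0, 0, 0]

tokens_only = strip_special_and_padding(input_ids, attention_mask)
labels_only = strip_special_and_padding(predicted_labels, attention_mask)
print(tokens_only, labels_only)
```

Applying the same slice to the IDs and to the predictions keeps the two lists aligned token by token.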

Define colors for the visualization.

colors = {
           "人名": '#8000ff',
           "法人名": '#2b7ff6',
           "政治的組織名": '#2adddd',
           "その他の組織名": '#80ffb4',
           "地名": '#d4dd80',
           "施設名": '#ff8042',
           "製品名": '#ff0000',
           "イベント名": '#fa8072'
        }

 
decode_text=tokenizer.convert_ids_to_tokens(all_input_ids[0][1])
prediction=all_pred[0][1]
ents = []
cnt=0
new_text=[]
for i ,j in zip(prediction,decode_text):
 if i!=0:
   ents.append({
       'start': cnt,
       'end': cnt+len(j),
       'label': ids_to_labels[i]
                   })
   new_text.append("x"*len(j))
 else:
   new_text.append(j)
 cnt+=len(j)

Visualize the result.

text="".join(new_text)
doc2 = {
   "text": text,
   "ents": ents,
   "title":"mask"
}
options = {"ents": list(colors.keys()), "colors": colors}
displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)

text="".join(decode_text)
doc3 = {
   "text": text,
   "ents": ents,
   "title":"no_mask"
}
options = {"ents": list(colors.keys()), "colors": colors}
displacy.render(doc3, style="ent", options=options, manual=True, jupyter=True)

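Judging from the sentence above, it seems to work.

Next, let's visualize data that is not in the training dataset. The procedure is the same as preparing the training data.
I use the trained model to mask the sentence "株式会社サポーターズは渋谷にあります" (a sentence stating that the company 株式会社サポーターズ is located in 渋谷, Shibuya).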

text=["株式会社サポーターズは渋谷にあります"]
id=["test"]
ent=["O"]
test_df=pd.DataFrame({"id":id,"text":text,"ent":ent})

inference_params={'batch_size': 1,
               'shuffle': False,
               'num_workers': 1,
               'pin_memory':True
               }

all_entities = []
all_text = []
all_id=[]
for ind,item in test_df.iterrows():
   if ind%1000==0:
       print(ind,', ',end='')
   t=convert_examples_to_features(item['text'])
   t=list(map(remove_symbol,t))
   t=" ".join(t)
   if "[UNK]" in t:
     continue
   total = len(item['text'])
   ent_str_list=["O" for _ in range(total)]
   cnt=0
   l=[]
   for word in t.split():
       l.append(ent_str_list[cnt:cnt+len(word)])
       cnt+=len(word)
   ent_word_list=list(map(mode,l))
   all_entities.append(ent_word_list)

   all_id.append(item["id"])
   all_text.append(item['text'])

test_df = pd.DataFrame({'id': all_id, 'text': all_text})
test_df["entities"]=["O"]
test_df.head()

inference_set=dataset(test_df, tokenizer, config['max_length'],inference=True)
inference_loader = DataLoader(inference_set, **inference_params)

all_pred=[]
all_input_ids=[]

for idx, batch in enumerate(inference_loader):

   ids = batch['input_ids'].to(config['device'], dtype = torch.long)
   mask = batch['attention_mask'].to(config['device'], dtype = torch.long)

   logits = model(input_ids=ids, attention_mask=mask,return_dict=False)
   x=logits[0].to('cpu').detach().numpy().copy()
   x_mask=mask.to('cpu').detach().numpy().copy()
   x_ids=ids.to('cpu').detach().numpy().copy()
   pred_proba=softmax(x,axis=2)
   pred=np.argmax(pred_proba,axis=2)

   input_id=x_ids*x_mask
   ids=[ids_[1:np.sum(x_mask[ind])-1] for ind,ids_ in enumerate(input_id)]

   prediction=pred*x_mask
   prediction=[pred_[1:np.sum(x_mask[ind])-1] for ind,pred_ in enumerate(prediction)]
   all_pred.append(prediction)
   all_input_ids.append(ids)

target=None
def generate_masked_no_masked(prediction,decode_text,target=None):
   ents = []
   cnt=0
   masked_text=[]
   new_text=[]
   for i ,j in zip(prediction,decode_text):
     word_length=len(j)-j.count("#")
     j=j.replace("#","")
     new_text.append(j)
     if target:
       target_id=[labels_to_ids[target_ids_] for target_ids_ in target]
       if i in target_id:
         word_length=len(j)-j.count("#")
         ents.append({
               'start': cnt,
               'end': cnt+word_length,
               'label': ids_to_labels[i]
                           })
         masked_text.append("x"*word_length)
       else:
         masked_text.append(j)
     elif i!=0:
       word_length=len(j)-j.count("#")
       ents.append({
           'start': cnt,
           'end': cnt+word_length,
           'label': ids_to_labels[i]
                       })
       masked_text.append("x"*word_length)
     else:
       masked_text.append(j)
     cnt+=word_length
   return masked_text,new_text,ents

def visualize_text(masked_text,ents,title=None,target=None):
   if target:
     ents_color=target
   else:
     ents_color=list(colors.keys())
   text="".join(masked_text)
   doc2 = {
       "text": text,
       "ents": ents,
       "title":title
   }
   options = {"ents": ents_color, "colors": colors}
   displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)


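# Use the tokens and predictions of the inference sentence (first batch, first item)
decode_text=tokenizer.convert_ids_to_tokens(all_input_ids[0][0])
prediction=all_pred[0][0]
target=None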
masked_text,new_text,ents=generate_masked_no_masked(prediction,decode_text,target)
visualize_text(masked_text,ents,"masked",target)
visualize_text(new_text,ents,"no_masked",target)

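I did not train the model on the word サポーターズ, yet it is correctly classified as a corporation name (法人名), so the model seems to work. (サポーターズ is the company that organized 技育展2022.)
The original goal was to hide only specific elements when studying, so next I hide only place names (地名).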

target=["地名"]
masked_text,new_text,ents=generate_masked_no_masked(prediction,decode_text,target)
visualize_text(masked_text,ents,"masked",target)
visualize_text(new_text,ents,"no_masked",target)

Only the place name was hidden, as intended.

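7. Discussion and reflections

Honestly, I did not know how well this would work, but it hid the keywords better than I expected. The visualization colors each token separately, so merging adjacent tokens that share the same attribute would probably look better.
The dataset used here contains only "人名" (person), "法人名" (corporation), "政治的組織名" (political organization), "その他の組織名" (other organization), "地名" (place), "施設名" (facility), "製品名" (product), and "イベント名" (event), so those are the only attributes this model can hide. Building NER models specialized for individual domains should eventually make this useful for many people's studies.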
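
The idea of merging adjacent tokens that share a label could be sketched as follows: entries in `ents` that have the same label and touch each other are combined into one span before rendering. `merge_adjacent_ents` is a hypothetical helper and the sample spans are hand-written. It is not part of the code used for the results in this article.

```python
def merge_adjacent_ents(ents):
    """Merge consecutive spans with the same label where one ends as the next starts."""
    merged = []
    for ent in sorted(ents, key=lambda e: e["start"]):
        last = merged[-1] if merged else None
        if last and last["label"] == ent["label"] and last["end"] == ent["start"]:
            last["end"] = ent["end"]  # extend the previous span
        else:
            merged.append(dict(ent))  # copy so the input list is not modified
    return merged


token_ents = [
    {"start": 0, "end": 4, "label": "法人名"},
    {"start": 4, "end": 10, "label": "法人名"},
    {"start": 11, "end": 13, "label": "地名"},
]
merged_ents = merge_adjacent_ents(token_ents)
print(merged_ents)
```

The merged list has the same `start`, `end`, and `label` keys as `ents`, so it could be passed to `displacy.render` in its place.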

GitHub URL → under construction

8. References

(1) PyTorch - BigBird - NER - [CV 0.615]
https://www.kaggle.com/code/cdeotte/pytorch-bigbird-ner-cv-0-615
I referred to this for processing the dataset and building the model.
(2) Feedback Prize EDA with displacy
https://www.kaggle.com/code/thedrcat/feedback-prize-eda-with-displacy
I referred to this for visualizing the prediction results.
(3) Wikipediaを用いた日本語の固有表現抽出データセット
https://github.com/stockmarkteam/ner-wikipedia-dataset
This is the dataset used in this article.
