
Named Entity Recognition with spaCy/GiNZA

Last updated at 2024-06-23, posted at 2024-06-18

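Introduction

Recently I had the chance to work on tasks such as part-of-speech tagging and named entity recognition.
Below I summarize what I learned along the way, as a note to my future self.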

Quick table of contents

Set up an environment that can run ja_ginza_bert_large
Extract named entities
Extract specific parts of speech
Run batch processing

Other notes

I use pyenv and venv.
This is a GPU environment (honestly, running this on a CPU is painfully slow).

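spaCy and GiNZA: which is which? (a very brief explanation)

spaCy is an NLP library that can do morphological analysis (for example, finding the part of speech of each word in a text).
GiNZA is an add-on, Japanese-specific model that is used through spaCy.
In code, GiNZA only shows up at the very start, when the model is loaded. Everything after that is plain spaCy.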

Environment setup

Basically, follow this page.

Prepare the environment with venv

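# Create a virtual environment (the name venv_bert_large is arbitrary)
$ python -m venv venv_bert_large

# Activate the virtual environment you just created
$ source venv_bert_large/bin/activate

# Install ja_ginza_bert_large together with everything it needs
$ pip install "https://github.com/megagonlabs/ginza/releases/download/v5.2.0/ja_ginza_bert_large-5.2.0b1-py3-none-any.whl"

# To use a GPU, spaCy has to be upgraded with the CUDA extra
# (quoted so that shells such as zsh do not expand the brackets).
$ pip install -U "spacy[cuda117]"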

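Note that this model cannot coexist with ja_ginza or ja_ginza_electra.

I recommend creating a fresh environment for this. It is better to keep the environment for building and analyzing datasets separate from the one for training models, and conflicts between packages are a real risk.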
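
Before loading the model, it can help to confirm that the environment contains what it should and nothing that conflicts. The following is a minimal sketch using only the standard library. The helper name installed_packages is hypothetical, and the sketch assumes that each GiNZA model is importable under its package name.

```python
import importlib.util


def installed_packages(names):
    """Return {name: True/False} depending on whether each top-level package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Assumption: each GiNZA model is importable under its package name.
wanted = ["spacy", "ja_ginza_bert_large"]
conflicting = ["ja_ginza", "ja_ginza_electra"]

status = installed_packages(wanted + conflicting)
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'not installed'}")

# Warn if a model that cannot coexist with ja_ginza_bert_large is present.
clashes = [name for name in conflicting if status[name]]
if clashes:
    print("Warning: conflicting models present:", ", ".join(clashes))
```

Run it inside the activated virtual environment, otherwise it reports on whichever Python happens to be first on the PATH.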

Extracting named entities

Named entities can be extracted with the following steps.

Basic - named entity analysis of a single sentence

import spacy
print(spacy.prefer_gpu())  # If this prints True, the GPU will be used.

# Load the analysis pipeline
nlp = spacy.load('ja_ginza_bert_large')

text = '私は東京都に住んでいます．'
# Run the analysis
doc = nlp(text)

# doc.ents holds the named entities, so print them all in a for loop to inspect them.
for ent in doc.ents:
    print(ent.text, ent.label_)
# 東京都 Province

# To keep the result in a variable, the following also works.
ent_dict = {ent.text: ent.label_ for ent in doc.ents}  # {'東京都': 'Province'}

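The ent.label_ values are based on Sekine's Extended Named Entity Hierarchy.

The ner section of the following page may be a useful reference.
pypi ja-ginza-electra 5.2.0

Incidentally, nlp is a Language object (the pipeline, whose first stage is the Tokenizer) and doc is a Doc object.
spaCy documentation - Tokenizer
spaCy documentation - Doc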
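
The ent_dict above maps each entity string to its label. When a text contains many entities, it can be handier to look them up the other way round, by label. Here is a minimal sketch in plain Python that uses hand-written (text, label) pairs as a stand-in for doc.ents. The helper name group_by_label is hypothetical and not part of spaCy.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: [text, ...]}, keeping order and skipping repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hand-written stand-in for [(ent.text, ent.label_) for ent in doc.ents].
entities = [("東京都", "Province"), ("沖縄県", "Province"), ("東京都", "Province")]
print(group_by_label(entities))
# {'Province': ['東京都', '沖縄県']}
```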

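Advanced 1 - replace named entities with their ent.label_ and collect the replaced entities

import spacy
print(spacy.prefer_gpu())
nlp = spacy.load('ja_ginza_bert_large')

text = '私は東京都に住んでいます．'
entity_record = {}
doc = nlp(text)
for ent in doc.ents:
    # Replace every occurrence of the entity string with its label
    text = text.replace(ent.text, ent.label_)
    # Record which string was replaced by which label
    entity_record[ent.text] = ent.label_

# print(text)
# 私はProvinceに住んでいます．
# Prefecture names are replaced, so '私は沖縄県に住んでいます．' gives the same result.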
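
One thing to keep in mind is that text.replace rewrites every occurrence of the entity string, including occurrences that were not tagged as entities. An alternative, sketched here as an assumption rather than as the approach used in this article, is to replace by character offsets. In spaCy these offsets would come from ent.start_char and ent.end_char. The sketch below uses a hand-written span so that it runs without a model, and the helper name replace_spans is hypothetical.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label."""
    # Work from the end so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


# Hypothetical spans: in spaCy they would come from
# [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].
text = "私は東京都に住んでいます．"
spans = [(2, 5, "Province")]
print(replace_spans(text, spans))
# 私はProvinceに住んでいます．
```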

Advanced 2 - keeping certain strings out of the replacement in Advanced 1, or assigning a different ent.label_ to specific strings

import spacy
print(spacy.prefer_gpu())
nlp = spacy.load('ja_ginza_bert_large')

# Changes start here
patterns = [
    {'label': '東京都', 'pattern': '東京都'},
    {'label': 'okinawa', 'pattern': '沖縄県'},
]
# Insert the rule-based matcher before the statistical NER so its labels take priority
nlp.add_pipe('entity_ruler', before='ner')
nlp.get_pipe('entity_ruler').add_patterns(patterns)

# The rest is the same.
text = '私は東京都に住んでいますが，沖縄県が好きです．福島県も好きです．'
entity_record = {}
doc = nlp(text)
for ent in doc.ents:
    text = text.replace(ent.text, ent.label_)
    entity_record[ent.text] = ent.label_

# print(text)
# 私は東京都に住んでいますが，okinawaが好きです．Provinceも好きです．

The trick is that the string 東京都 is replaced with its own ent.label_, which is also 東京都, so it appears unchanged.

Extracting specific parts of speech

Basic - check word and part-of-speech pairs

import spacy
print(spacy.prefer_gpu())
nlp = spacy.load('ja_ginza_bert_large')

text = '私は東京都に住んでいます．'
doc = nlp(text)

for token in doc:
    print(token.text, token.pos_)

# 私 PRON
# は ADP
# 東京都 PROPN
# に ADP
# 住ん VERB
# で SCONJ
# い VERB
# ます AUX
# 。 PUNCT

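The attributes available on token are listed here:
spaCy documentation - Token

Attributes such as token.lemma_ are handy because they give the base form of an inflected word.

Tags such as PRON are apparently called Universal POS tags.
Universal POS tags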

Advanced - extract only nouns, verbs, and adjectives

import spacy
print(spacy.prefer_gpu())
nlp = spacy.load('ja_ginza_bert_large')

text = '私は東京都に住んでいます．'
doc = nlp(text)

# Keep only the tokens tagged as noun, adjective, or verb
navs = [token.text for token in doc if token.pos_ in ['NOUN', 'ADJ', 'VERB']]

# print(navs)
# ['住ん', 'い']

私 is a pronoun (PRON) and 東京都 is a proper noun (PROPN), so neither is selected here.
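
If proper nouns such as 東京都 should be kept as well, PROPN can be added to the set of accepted tags. The sketch below shows the effect in plain Python, using the (token.text, token.pos_) pairs from the output listed earlier in this section instead of a live doc. The helper name filter_by_pos is hypothetical.

```python
def filter_by_pos(tagged_tokens, wanted):
    """Return the surface forms whose POS tag is in `wanted`."""
    return [text for text, pos in tagged_tokens if pos in wanted]


# (token.text, token.pos_) pairs copied from the output listed earlier in this section.
tagged = [
    ("私", "PRON"), ("は", "ADP"), ("東京都", "PROPN"), ("に", "ADP"),
    ("住ん", "VERB"), ("で", "SCONJ"), ("い", "VERB"), ("ます", "AUX"),
]

print(filter_by_pos(tagged, {"NOUN", "ADJ", "VERB"}))
# ['住ん', 'い']
print(filter_by_pos(tagged, {"NOUN", "PROPN", "ADJ", "VERB"}))
# ['東京都', '住ん', 'い']
```

With a real doc, the same filter would be written as a list comprehension over token.pos_, exactly as in the code above.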

Batch processing

Multiple texts can be passed in at once.

Basic - processing multiple texts at once

import spacy
print(spacy.prefer_gpu())
nlp = spacy.load('ja_ginza_bert_large')

texts = ['私は東京都に住んでいます．', '私は沖縄県に住んでいます．']
docs = list(nlp.pipe(texts))

for doc in docs:
    for token in doc:
        print(token.text, token.pos_)

Advanced - applying processing in bulk to texts stored in a DataFrame

import pandas as pd
from tqdm import tqdm
import spacy
print(spacy.prefer_gpu())


class MyGinzaClass:
    def __init__(self):
        self.nlp = spacy.load('ja_ginza_bert_large')

    def __call__(self, texts):
        # nlp.pipe processes the texts in batches
        docs = list(self.nlp.pipe(texts))
        processed_texts = []
        for doc in docs:
            # Apply your own processing to each Doc
            processed_text = self._yaritaisyori(doc)
            processed_texts.append(processed_text)
        return processed_texts

    def _yaritaisyori(self, doc):
        # Write your own processing here; as a placeholder, return the text unchanged
        return doc.text


my_ginza = MyGinzaClass()

# df is assumed to be an existing DataFrame with a text column named 'sukina_retsu'.
for i in tqdm(range(0, len(df), 128)):
    # Process 128 rows at a time; the last batch may be shorter.
    j = min(i + 128, len(df))

    # Select rows by position, because df.loc[i:j] would also include row j.
    rows = df.index[i:j]
    temp_slice = df.loc[rows, 'sukina_retsu'].tolist()
    df.loc[rows, 'kakou_go'] = my_ginza(temp_slice)

spaCy's default batch_size is 1000, but ja_ginza_bert_large overrides it to 128 (see GitHub - ginza).
That is why the loop processes and stores 128 rows at a time.
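
The chunking itself does not depend on spaCy or pandas, so it can be checked in isolation. Here is a minimal sketch with a hypothetical helper named iter_batches, using dummy strings in place of the DataFrame column.

```python
def iter_batches(items, batch_size=128):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so the last batch is simply shorter.
        yield items[start:start + batch_size]


# 300 dummy texts stand in for a DataFrame column converted with .tolist().
texts = [f"text {n}" for n in range(300)]
print([len(batch) for batch in iter_batches(texts)])
# [128, 128, 44]
```

Because slicing handles the final partial batch on its own, no separate branch is needed for the remainder.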

Update history

None

I will update this article if anything comes up.
