
Replacing named entities in a loaded text file with their labels (using GiNZA)

Posted at 2020-05-13

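I struggled with this quite a bit, so I'm posting it here just in case.
There may well be a better way to do it.
If you're a beginner like me, I hope it serves as a reference.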

The environment is Python 3.6.9 on Ubuntu 18.04.4.

change_NER.py

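# coding: utf-8
import spacy

nlp = spacy.load('ja_ginza')                        # load the GiNZA model

with open('input.txt', 'r', encoding='utf-8') as f: # assumes the input file is UTF-8
    data = f.read()

doc = nlp(data)

with open('output.txt', 'w', encoding='utf-8') as f:

    text = list(data)                               # store the text one character per element
    entity = [ent.label_ for ent in doc.ents]       # named-entity labels
    start = [ent.start_char for ent in doc.ents]    # character offset where each entity starts
    end = [ent.end_char for ent in doc.ents]        # character offset where each entity ends (exclusive)
    num = 0                                         # index of the next entity to replace
    skip_until = 0                                  # offsets below this belong to an entity already replaced

    for i in range(len(text)):
        if num < len(start) and i == start[num]:    # the bounds check also covers text with no entities
            f.write(entity[num])                    # write the label instead of the entity text
            skip_until = end[num]
            num += 1

        elif i < skip_until:
            continue                                # skip the remaining characters of the entity

        else:
            f.write(text[i])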
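
The same replacement can probably also be written with string slices instead of walking the text one character at a time. The following is only a minimal sketch, not something I tested with GiNZA itself: `replace_spans` is a hypothetical helper name, and the offsets and labels in the sample are written by hand for illustration rather than taken from real model output. It assumes the spans are sorted by start offset and do not overlap, which is the case for spaCy's `doc.ents`.

```python
def replace_spans(text, spans):
    """Return text with each (start, end, label) span replaced by its label.

    Hypothetical helper for illustration. The spans must be sorted by
    start offset and must not overlap.
    """
    pieces = []
    pos = 0  # offset of the first character not yet copied
    for start, end, label in spans:
        pieces.append(text[pos:start])  # unchanged text before the entity
        pieces.append(label)            # label in place of the entity text
        pos = end
    pieces.append(text[pos:])           # unchanged text after the last entity
    return "".join(pieces)


if __name__ == "__main__":
    # Hand-written offsets and placeholder labels, standing in for real output.
    sample = "Alice met Bob in Tokyo."
    spans = [(0, 5, "PERSON"), (10, 13, "PERSON"), (17, 22, "GPE")]
    print(replace_spans(sample, spans))
```

With a spaCy `doc`, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]` and the result written to the output file in a single `f.write` call.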
