
Visualizing the dependency structure of a sentence as a graph (GiNZA)

Natural Language Processing

Posted at 2022-03-09. Last updated at 2022-03-09.

Overview

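Natural language has dependency structure, and a sentence can be turned into a graph based on that structure. Building on this, I would like to generate sentences non-autoregressively with graph generation methods such as VGAE (variational graph autoencoder). This has not been realized yet.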

This article covers the graph construction step, which serves as the preprocessing for that goal.
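
As a minimal illustration of the idea, assuming a hand-written toy parse rather than real GiNZA output: if every token records the index of its head, the edge list of the graph follows directly. The tokens and head indices below are made up for illustration, and `heads_to_edges` is a hypothetical helper, not part of GiNZA or spaCy.

```python
# Toy parse (hand-written, not GiNZA output): each token stores the index of its head.
# The root token points to itself, which is how spaCy marks the root.
tokens = ["私", "は", "本", "を", "読む"]
heads = [4, 0, 4, 2, 4]

def heads_to_edges(heads):
    """Return (child, head) index pairs, skipping the root's self-reference."""
    return [(child, head) for child, head in enumerate(heads) if child != head]

edges = heads_to_edges(heads)
for child, head in edges:
    print(f"{tokens[child]} -> {tokens[head]}")
```

A dependency tree over n tokens yields n - 1 such edges, which is a quick sanity check on any parse.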

Setup

GiNZA is used for dependency parsing.
pip install -U ginza ja-ginza
pip install -U japanize-matplotlib

Code

import scipy  # not used directly; nx.adjacency_matrix needs SciPy installed
import spacy
import networkx as nx
import japanize_matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

class DependencyGraph:
    def __init__(self):
        self.nlp = spacy.load('ja_ginza')
        
    def run(self, text):
        doc = self.nlp(text)

        token_head_list = []
        for sent in doc.sents:
            token_head = []
            for token in sent:
                token_head.append({"i":token.i, "orth":token.orth_, "base": token.lemma_, "head": token.head.i, "dep": token.dep_}) 
            token_head_list.append(token_head)

        return token_head_list

    def graph(self, text):
        token_head_list = self.run(text)
        
        graphs = []
        for token_head in token_head_list:
            G = nx.Graph()  # undirected graph; use nx.DiGraph() for a directed graph
            start_position = token_head[0]["i"]
            for head in token_head:
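                if head["orth"] not in list("。？！"):  # skip sentence-final punctuation
                    # Note: the root token's head is itself, so this adds a self-loop on the root.
                    G.add_edge(head["i"]-start_position, head["head"]-start_position)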
            
            # To label nodes with the lemma instead of the surface form, replace "orth" with "base".
            name_dict = {head["i"]-start_position: head["orth"]+"##"+str(head["i"]-start_position) for head in token_head}
            G = nx.relabel_nodes(G, name_dict)
            graphs.append(G)

        return graphs
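
The class above builds an undirected graph and, because the root token is its own head, leaves a self-loop on the root. A possible directed variant is sketched below, assuming records in the same shape as the output of `run`. The sample records are hand-written rather than real GiNZA output, and `build_digraph` is a hypothetical helper, not the method used in this article.

```python
import networkx as nx

# Hand-written records in the same shape as the output of `run` (not real GiNZA output).
token_head = [
    {"i": 0, "orth": "本", "base": "本", "head": 2, "dep": "obj"},
    {"i": 1, "orth": "を", "base": "を", "head": 0, "dep": "case"},
    {"i": 2, "orth": "読む", "base": "読む", "head": 2, "dep": "ROOT"},
    {"i": 3, "orth": "。", "base": "。", "head": 2, "dep": "punct"},
]

def build_digraph(token_head, skip=("。", "？", "！")):
    """Build a head -> dependent DiGraph, dropping punctuation and the root self-loop."""
    start = token_head[0]["i"]
    G = nx.DiGraph()
    for t in token_head:
        if t["orth"] in skip:
            continue
        child, head = t["i"] - start, t["head"] - start
        label = f'{t["orth"]}##{child}'
        G.add_node(label, dep=t["dep"])  # keep the dependency label as a node attribute
        if child != head:  # the root points to itself; do not add a self-loop
            G.add_edge(f'{token_head[head]["orth"]}##{head}', label)
    return G

G = build_digraph(token_head)
print(list(G.edges))
```

Edges here run from head to dependent; swapping the two arguments of `add_edge` gives the opposite convention.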

When the input consists of multiple sentences, one graph is produced per sentence (two in the example below).

analyzer = DependencyGraph()
graphs = analyzer.graph("アルバート湖から下流はアルバートナイルとも呼ばれる。ここから先が狭義のナイル川である。")

for g in graphs:
    print(nx.adjacency_matrix(g).toarray())
    print(g.nodes)
    nx.draw(g, with_labels=True, font_family='IPAexGothic', node_color="red", node_size=500)
    plt.show()
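
One thing to keep in mind when reading the printed matrix: its rows and columns follow the order of `g.nodes`, which is insertion order rather than token order. The sketch below, on a small hand-built graph using the same "surface##index" labels, shows one way to get the matrix in token order. `token_order` is a hypothetical helper, and `nx.to_numpy_array` is used here instead of `nx.adjacency_matrix` only to obtain a dense array directly.

```python
import networkx as nx

# Small hand-built graph with labels in the "surface##index" format.
g = nx.Graph()
g.add_edge("本##0", "読む##2")
g.add_edge("を##1", "本##0")

def token_order(graph):
    """Sort node labels by the integer after the last '##'."""
    return sorted(graph.nodes, key=lambda label: int(label.rsplit("##", 1)[1]))

order = token_order(g)
A = nx.to_numpy_array(g, nodelist=order, dtype=int)  # rows/columns follow `order`
print(list(g.nodes))  # insertion order
print(order)          # token order
print(A)
```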

Result

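The number after "##" is the token's position within its sentence, counted from 0. It keeps repeated occurrences of the same word distinct.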
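
To see why the suffix matters, here is a small sketch on a toy graph (hand-built, not GiNZA output) in which the same surface form appears twice. When `nx.relabel_nodes` maps several nodes to the same label, those nodes are merged, so the graph silently loses a node.

```python
import networkx as nx

# Toy graph on integer positions; positions 1 and 3 are both "の".
surface = {0: "私", 1: "の", 2: "本", 3: "の", 4: "表紙"}
G = nx.Graph([(0, 2), (1, 0), (2, 4), (3, 2)])

# Surface forms only: the two "の" nodes collapse into one.
merged = nx.relabel_nodes(G, surface)

# Surface form plus position: every label is unique, so no nodes are merged.
unique = nx.relabel_nodes(G, {i: f"{s}##{i}" for i, s in surface.items()})

print(merged.number_of_nodes(), unique.number_of_nodes())
```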

Notes

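Occasionally the Japanese font setup fails and the labels are not rendered in Japanese. I do not know the cause, but resetting the Colab runtime fixed it.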
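
One way to narrow this down, offered as a diagnostic sketch rather than a confirmed fix, is to check whether matplotlib knows the font named in the `nx.draw` call. `has_font` is a hypothetical helper. The assumption is that japanize_matplotlib registers IPAexGothic when it is imported, so the check should run after that import.

```python
from matplotlib import font_manager

def has_font(name):
    """Return True if matplotlib's font manager lists a font with this family name."""
    return any(f.name == name for f in font_manager.fontManager.ttflist)

# If this prints False after `import japanize_matplotlib`, the labels will
# likely fall back to a font without Japanese glyphs.
print(has_font("IPAexGothic"))
```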
