More than 5 years have passed since last update.

言語処理100本ノック-59:S式の解析

Posted at 2020-03-23

言語処理100本ノック 2015 「第6章: 英語テキストの処理」の59本目「S式の解析」記録です。
「S式」と呼ばれるフォーマットのパーサーを作ります。初めてパーサーを考えさせられましたが非常に奥が深いです。
今回のノックは、非常に時間がかかってしまいました。終わってみると50行程度なのですが、効率化の余地が非常に大きいノックです。今回は効率化を捨て、できるだけシンプルにしました。

参考リンク

リンク	備考
059.S式の解析.ipynb	回答プログラムのGitHubリンク
素人の言語処理100本ノック:59	多くのソース部分のコピペ元
Stanford Core NLP公式	最初に見ておくStanford Core NLPのページ

環境

種類	バージョン	内容
OS	Ubuntu18.04.01 LTS	仮想で動かしています
pyenv	1.2.16	複数Python環境を使うことがあるのでpyenv使っています
Python	3.8.1	pyenv上でpython3.8.1を使っていますパッケージはvenvを使って管理しています
Stanford CoreNLP	3.9.2	インストールしたのが1年前で詳しく覚えていないです・・・ 1年たってもそれが最新だったのでそのまま使いました
openJDK	1.8.0_242	他目的でインストールしていたJDKをそのまま使いました

第6章: 英語テキストの処理

学習内容

Stanford Core NLPを用いた英語のテキスト処理を通じて，自然言語処理の様々な基盤技術を概観します．

Stanford Core NLP, ステミング, 品詞タグ付け, 固有表現抽出, 共参照解析, 係り受け解析, 句構造解析, S式

ノック内容

英語のテキスト（nlp.txt）に対して，以下の処理を実行せよ．

59. S式の解析

Stanford Core NLPの句構造解析の結果（S式）を読み込み，文中のすべての名詞句（NP）を表示せよ．入れ子になっている名詞句もすべて表示すること．

課題補足(「S式」について)

Wikipedia「S式」によると以下の説明。

Lispで導入され、主にLispで用いられる、2分木ないしリスト構造の形式的な記述方式。SはSymbolに由来。

自然言語を「S式」で表現する仕組みはStanford Parserに書かれていて、オンラインテストツールもあります。
Pythonで「S式」を解析するパッケージもあったりしましたが、あまり使われていなそうなので、自作で頑張りました。

回答

回答プログラム 059.S式の解析.ipynb

import re
import xml.etree.ElementTree as ET

reg_split = re.compile(r'''
                         (      # グループ開始
                         \(|\)  # 分割文字のグループ(カッコ開始 or カッコ終了)
                         )      # グループ終了
                         ''', re.VERBOSE)

def output_np(chunks): 
    depth = 1
    output = []

    for chunk in chunks:
        
        # カッコの開始は深さを+1
        if chunk == '(':
            depth += 1

        # カッコの終了は深さを-1
        elif chunk == ')':
            depth -= 1
        else:
            
            # 分割して品詞とテキストのセットの場合は出力先に追加
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])
        
        # 深さが0になったら出力
        if depth == 0:
            print('\t', ' '.join(output))
            break

for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):
    
    depth = 0
    print(parse.text)
    
    # カッコの開始と終了で分割しリスト化(空白・空要素は除外)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text) 
                if chunk.strip() != '']
        
    # NPまで来たら出力開始
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])

回答解説

XMLファイルのパス

以下のXMLファイルのパスと目的とするS式とのマッピングです。parseタグの中にS式での内容が入っています。

出力	第1階層	第2階層	第3階層	第4階層	第5階層
S式	root	document	sentences	sentence	parse

XMLファイルはGitHubに置いています。

nlp.txt.xml(抜粋)

<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--中略--

        <parse>(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) </parse>

上記のS式を少し見やすいようにインデントしてみます。比較的短い文なのですが、それでも長いです・・・
このNP(名詞句)で囲まれた部分のテキスト部を結合して出力します。

(ROOT 
	(S 
		(PP 
			(IN As) 
			(NP 
				(JJ such)
			)
		) 
		(, ,) 
		(NP 
			(NN NLP)
		) 
		(VP 
			(VBZ is) 
			(ADJP 
				(VBN related) 
				(PP 
					(TO to) 
					(NP 
						(NP 
							(DT the) 
							(NN area)
						) 
						(PP 
							(IN of) 
							(NP 
								(JJ humani-computer) 
								(NN interaction)
							)
						)
					)
				)
			)
		) 
	(. .)
	)
)

メインループでのNPの探索

XMLを読み込んでループし、NPを探索している箇所です。
正規表現で分割リストし、邪魔な空白・空要素を除外しています。そして、NPが来たら出力用の関数output_npを呼んでいます。
上からNPを見つけたら出力する形にしているのですが、NPの入れ子の場合は同じようなロジックを複数回通るので効率悪いです。けど、シンプルにしておきたかったので効率悪いままにしています。

for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    depth = 0
    print(parse.text)

    # カッコの開始と終了で分割しリスト化(空白・空要素は除外)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text) 
                if chunk.strip() != '']

    # NPまで来たら出力開始
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])

NP出力部

カッコの開始と終了でS式の深さを判定して、NP部が終わったら出力をしています。

def output_np(chunks): 
    depth = 1
    output = []

    for chunk in chunks:

        # カッコの開始は深さを+1
        if chunk == '(':
            depth += 1

        # カッコの終了は深さを-1
        elif chunk == ')':
            depth -= 1
        else:

            # 分割して品詞とテキストのセットの場合は出力先に追加
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # 深さが0になったら出力
        if depth == 0:
            print('\t', ' '.join(output))
            break

出力結果(実行結果)

プログラム実行すると以下の結果が出力されます(先頭抜粋)。

出力結果(先頭抜粋)

(ROOT (S (PP (NP (JJ Natural) (NN language) (NN processing)) (IN From) (NP (NNP Wikipedia))) (, ,) (NP (NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-))) (VP (VBZ is) (NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages)))))))))) (. .))) 
	 Natural language processing
	 Wikipedia
	 the free encyclopedia Natural language processing -LRB- NLP -RRB-
	 the free encyclopedia Natural language processing
	 NLP
	 a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 a field of computer science
	 a field
	 computer science
	 artificial intelligence
	 linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 linguistics
	 the interactions between computers and human -LRB- natural -RRB- languages
	 the interactions
	 computers and human -LRB- natural -RRB- languages
	 computers
	 human -LRB- natural -RRB- languages
(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) 
	 such
	 NLP
	 the area of humani-computer interaction
	 the area
	 humani-computer interaction
(ROOT (S (S (NP (NP (JJ Many) (NNS challenges)) (PP (IN in) (NP (NN NLP)))) (VP (VBP involve) (S (NP (NP (JJ natural) (NN language) (NN understanding)) (, ,) (SBAR (WHNP (WDT that)) (S (VP (VBZ is)))) (, ,)) (VP (VBG enabling) (NP (NNS computers)) (S (VP (TO to) (VP (VB derive) (NP (NN meaning)) (PP (IN from) (NP (ADJP (JJ human) (CC or) (JJ natural)) (NN language) (NN input)))))))))) (, ,) (CC and) (S (NP (NNS others)) (VP (VBP involve) (NP (JJ natural) (NN language) (NN generation)))) (. .))) 
	 Many challenges in NLP
	 Many challenges
	 NLP
	 natural language understanding , that is ,
	 natural language understanding
	 computers
	 meaning
	 human or natural language input
	 others
	 natural language generation

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up