
An Amateur's Language Processing 100 Knocks: 58

Last updated at 2017-05-03, posted at 2017-01-05

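This is a record of my attempt at Language Processing 100 Knocks 2015 (言語処理100本ノック 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of my past knocks is available here.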

Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

58. Tuple extraction

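Based on the dependency parse produced by Stanford Core NLP (collapsed-dependencies), output "subject predicate object" triples in tab-separated format. Use the following definitions of subject, predicate, and object.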

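Predicate: a word that has children (dependents) in both an nsubj relation and a dobj relation

Subject: the child (dependent) in an nsubj relation with the predicate

Object: the child (dependent) in a dobj relation with the predicate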
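These three definitions can be read as a set operation: a predicate is any word that governs both an nsubj and a dobj relation. The following is a minimal sketch on hand-written relations. The `deps` list and the `extract_svo` helper are hypothetical names introduced for illustration and are not taken from nlp.txt.

```python
def extract_svo(deps):
    """Return (subject, predicate, object) triples.

    deps: iterable of (relation, governor_idx, governor_word, dependent_word).
    """
    subjects = {}   # governor idx -> subject word
    objects = {}    # governor idx -> object word
    words = {}      # governor idx -> predicate word
    for rel, gov_idx, gov_word, dep_word in deps:
        if rel == 'nsubj':
            subjects[gov_idx] = dep_word
        elif rel == 'dobj':
            objects[gov_idx] = dep_word
        else:
            continue            # other relations are irrelevant here
        words[gov_idx] = gov_word

    # A predicate must govern both an nsubj and a dobj dependent
    predicate_ids = subjects.keys() & objects.keys()
    return [(subjects[i], words[i], objects[i]) for i in sorted(predicate_ids)]


# Hand-written example relations (assumed, for illustration only)
deps = [
    ('nsubj', 2, 'published', 'Turing'),
    ('dobj', 2, 'published', 'article'),
    ('det', 4, 'article', 'an'),
    ('nsubj', 7, 'slept', 'cat'),   # no dobj, so no triple is produced
]
for s, p, o in extract_svo(deps):
    print('{}\t{}\t{}'.format(s, p, o))
```

Only the word with idx 2 governs both relation types, so a single tab-separated line is printed for it.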

The finished code:

main.py

# coding: utf-8
import os
import subprocess
import xml.etree.ElementTree as ET

fname = 'nlp.txt'
fname_parsed = 'nlp.txt.xml'


def parse_nlp():
	'''Parse nlp.txt with Stanford Core NLP and write the result to an XML file.
	Does nothing if the result file already exists.
	'''
	if not os.path.exists(fname_parsed):

		# Run StanfordCoreNLP; standard error is written to parse.out
		subprocess.run(
			'java -cp "/usr/local/lib/stanford-corenlp-full-2016-10-31/*"'
			' -Xmx2g'
			' edu.stanford.nlp.pipeline.StanfordCoreNLP'
			' -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref'
			' -file ' + fname + ' 2>parse.out',
			shell=True,		# run through the shell
			check=True		# raise an error on a non-zero exit status
		)


# Analyze nlp.txt
parse_nlp()

# Parse the XML produced by the analysis
root = ET.parse(fname_parsed)

# Enumerate sentences and process them one at a time
for sentence in root.iterfind('./document/sentences/sentence'):
	sent_id = int(sentence.get('id'))		# id of the sentence

	# Build a dictionary for each kind of word
	dict_pred = {}		# {predicate idx: predicate text}
	dict_nsubj = {}		# {predicate idx: text of its nsubj child (= subject)}
	dict_dobj = {}		# {predicate idx: text of its dobj child (= object)}

	# Enumerate dependencies
	for dep in sentence.iterfind(
		'./dependencies[@type="collapsed-dependencies"]/dep'
	):

		# Check the relation type
		dep_type = dep.get('type')
		if dep_type == 'nsubj' or dep_type == 'dobj':

			# Add to the predicate dictionary
			govr = dep.find('./governor')
			idx = govr.get('idx')
			dict_pred[idx] = govr.text		# may be set twice, which is harmless, so no check

			# Add to the subject or object dictionary
			if dep_type == 'nsubj':
				dict_nsubj[idx] = dep.find('./dependent').text
			else:
				dict_dobj[idx] = dep.find('./dependent').text

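	# Enumerate predicates; output only those with both a subject and an object
	# (idx is a string in the XML, so convert to int to sort in token order)
	for idx, pred in sorted(dict_pred.items(), key=lambda x: int(x[0])):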
		nsubj = dict_nsubj.get(idx)
		dobj = dict_dobj.get(idx)
		if nsubj is not None and dobj is not None:
			print('{}\t{}\t{}'.format(nsubj, pred, dobj))

Execution result:

Terminal

understanding	enabling	computers
others	involve	generation
Turing	published	article
experiment	involved	translation
ELIZA	provided	interaction
ELIZA	provide	response
patient	exceeded	base
which	structured	information
underpinnings	discouraged	sort
that	underlies	approach
Some	produced	systems
which	make	decisions
systems	rely	which
that	contains	errors
implementations	involved	coding
algorithms	take	set
Some	produced	systems
which	make	decisions
models	have	advantage
they	express	certainty
Systems	have	advantages
Automatic	make	use
that	make	decisions

Extracting subjects, predicates, and objects

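As in the previous problem, I use the dependency parse results from Stanford Core NLP. In the previous problem I checked whether the type attribute of the <dep> tag was "punct" in order to exclude punctuation; this time I check whether it is "nsubj" or "dobj".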

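When the type attribute is "nsubj" or "dobj", the <governor> tag is the predicate and the <dependent> tag is the subject or the object, respectively. I extract the <dep> tags in these relations, add the necessary information to dictionaries, and finally output the predicates for which both a subject and an object are present.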
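To make the roles of <governor> and <dependent> concrete, here is a minimal sketch that reads them from an inline XML string. The `SAMPLE_XML` content is hand-written and only imitates the elements and attributes that main.py reads; it is not actual Stanford Core NLP output for nlp.txt.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption), shaped like the elements main.py reads
SAMPLE_XML = '''<root><document><sentences><sentence id="1">
<dependencies type="basic-dependencies">
  <dep type="nsubj"><governor idx="2">skipped</governor><dependent idx="1">skipped</dependent></dep>
</dependencies>
<dependencies type="collapsed-dependencies">
  <dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
  <dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
  <dep type="punct"><governor idx="2">published</governor><dependent idx="5">.</dependent></dep>
</dependencies>
</sentence></sentences></document></root>'''

root = ET.fromstring(SAMPLE_XML)
path = ('./document/sentences/sentence'
        '/dependencies[@type="collapsed-dependencies"]/dep')

pairs = []
for dep in root.iterfind(path):
    dep_type = dep.get('type')
    if dep_type not in ('nsubj', 'dobj'):
        continue                            # e.g. punct is skipped
    governor = dep.find('./governor')       # the predicate
    dependent = dep.find('./dependent')     # the subject or the object
    pairs.append((dep_type, governor.get('idx'), governor.text, dependent.text))

for pair in pairs:
    print(pair)
```

The attribute filter in the path keeps only the collapsed-dependencies block, so the dep under basic-dependencies never reaches the loop. Note that `get('idx')` returns a string, not a number.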

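Enumerating the contents of a dictionary yields them in no particular order, so I sort by the predicate's position in the sentence before output.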
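One detail is worth checking when sorting by position: the idx attribute is read from the XML as a string, and strings compare character by character. The sketch below contrasts the two orderings on a hypothetical dictionary whose keys and words are made up for illustration.

```python
# Hypothetical {idx: predicate text}; the keys are strings, as read from XML
dict_pred = {'10': 'make', '2': 'produced', '7': 'rely'}

# Sorting on the raw string key compares character by character
by_string = [idx for idx, _ in sorted(dict_pred.items(), key=lambda x: x[0])]

# Converting the key to int sorts by token position
by_number = [idx for idx, _ in sorted(dict_pred.items(), key=lambda x: int(x[0]))]

print(by_string)   # ['10', '2', '7']
print(by_number)   # ['2', '7', '10']
```

The difference only shows up in sentences where a predicate sits at position 10 or later, which is why it is easy to miss in short test sentences.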

That's all for the 59th knock. If you find any mistakes, I would appreciate it if you pointed them out.

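The execution results include part of the data distributed as the corpus and data used in the 100 knocks. The data used in this Chapter 6 is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (Japanese translation).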
