An Amateur's 100 Language Processing Knocks: 54

Posted at 2016-12-23, last updated at 2017-05-03

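This is a record of my attempt at the 100 Language Processing Knocks 2015 (言語処理100本ノック 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). See here for a list of the previous knocks.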

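Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

54. Part-of-speech tagging

Read the analysis-result XML produced by Stanford Core NLP and output each word, lemma, and part-of-speech tag in tab-separated format.

The finished code: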

main.py

# coding: utf-8
import os
import subprocess
import xml.etree.ElementTree as ET

fname = 'nlp.txt'
fname_parsed = 'nlp.txt.xml'


def parse_nlp():
	'''Parse nlp.txt with Stanford Core NLP and write the result to an XML file.
	Does nothing if the result file already exists.
	'''
	if not os.path.exists(fname_parsed):

		# Run Stanford Core NLP; standard error goes to parse.out
		subprocess.run(
			'java -cp "/usr/local/lib/stanford-corenlp-full-2016-10-31/*"'
			' -Xmx2g'
			' edu.stanford.nlp.pipeline.StanfordCoreNLP'
			' -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref'
			' -file ' + fname + ' 2>parse.out',
			shell=True,		# run through the shell
			check=True		# raise an error on a non-zero exit status
		)


# Analyze nlp.txt
parse_nlp()

# Parse the resulting XML
root = ET.parse(fname_parsed)

# Extract each token
for token in root.iter('token'):

	# Extract the word, lemma, and POS tag
	word = token.findtext('word')
	lemma = token.findtext('lemma')
	pos = token.findtext('POS')
	print('{}\t{}\t{}'.format(word, lemma, pos))

Output:

The output is long, so here is an excerpt from the beginning.

Terminal: beginning of the output

Natural	natural	JJ
language	language	NN
processing	processing	NN
From	from	IN
Wikipedia	Wikipedia	NNP
,	,	,
the	the	DT
free	free	JJ
encyclopedia	encyclopedia	NN
Natural	natural	JJ
language	language	NN
processing	processing	NN
-LRB-	-lrb-	-LRB-
NLP	nlp	NN
-RRB-	-rrb-	-RRB-
is	be	VBZ
a	a	DT
field	field	NN
of	of	IN
computer	computer	NN
science	science	NN
,	,	,
artificial	artificial	JJ
intelligence	intelligence	NN
,	,	,
and	and	CC
linguistics	linguistics	NNS
concerned	concern	VBN
with	with	IN
the	the	DT
interactions	interaction	NNS
between	between	IN
computers	computer	NNS
and	and	CC
human	human	JJ
-LRB-	-lrb-	-LRB-
natural	natural	JJ
-RRB-	-rrb-	-RRB-
languages	language	NNS
.	.	.
As	as	IN
such	such	JJ
,	,	,
NLP	nlp	NN
is	be	VBZ
related	relate	VBN
to	to	TO
the	the	DT
area	area	NN
of	of	IN
humani-computer	humani-computer	JJ
interaction	interaction	NN
.	.	.

The full output is uploaded to GitHub.

Parsing the XML

Here I pick out the token tags, then search inside each one with Element.findtext() to pull out the text of the tags I need.
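To see the token lookup in isolation, here is a minimal sketch that runs without Java or the real result file. The XML string is a hand-written fragment that imitates the layout of nlp.txt.xml as I understand it (an assumption, not copied from the actual file), and `SAMPLE_XML` and `extract_rows` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the layout of nlp.txt.xml.
# This is an assumption for illustration, not the real Stanford Core NLP output.
SAMPLE_XML = """
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>is</word>
            <lemma>be</lemma>
            <POS>VBZ</POS>
          </token>
          <token id="2">
            <word>languages</word>
            <lemma>language</lemma>
            <POS>NNS</POS>
          </token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>
"""


def extract_rows(xml_text):
    """Return a list of (word, lemma, POS) tuples, one per token element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # iter('token') walks the whole tree, so the nesting depth does not matter
    for token in root.iter('token'):
        rows.append((
            token.findtext('word'),
            token.findtext('lemma'),
            token.findtext('POS'),
        ))
    return rows


rows = extract_rows(SAMPLE_XML)
for row in rows:
    print('\t'.join(row))

# findtext() returns None when the child tag is missing,
# unless a default value is passed as the second argument.
first_token = next(ET.fromstring(SAMPLE_XML.strip()).iter('token'))
missing = first_token.findtext('NER')
fallback = first_token.findtext('NER', default='(none)')
```

Because findtext() returns None for a missing child, passing a default can be handy if some tokens might lack a tag.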

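What is a lemma?

According to Wikipedia's article on lexemes, a lemma is the form of a word that corresponds to a dictionary headword.
It is somewhat similar to the stem that came up in problem 52, but stemming cuts off the part of the word that inflects, so the result is sometimes not a proper word, whereas a lemma is always a proper word.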
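The difference can be illustrated with a toy comparison. Note that `crude_stem` below is a deliberately naive suffix stripper written only for this example. It is not the Porter stemmer and not what problem 52 used. The (word, lemma) pairs are taken from the output excerpt in this article.

```python
def crude_stem(word):
    """Strip a few common suffixes.

    A toy stand-in for a stemmer (an assumption for illustration only);
    it is NOT the Porter algorithm.
    """
    for suffix in ('ing', 'ed', 's'):
        # Keep at least three characters so short words such as "is" survive
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# (word, lemma) pairs as they appear in the output excerpt
PAIRS = [
    ('related', 'relate'),
    ('languages', 'language'),
    ('computers', 'computer'),
    ('is', 'be'),
]

# Columns: word, crude stem, lemma
for word, lemma in PAIRS:
    print('{}\t{}\t{}'.format(word, crude_stem(word), lemma))
```

With this toy rule "related" is cut down to "relat", which is not a word, while its lemma is "relate". The irregular form "is" shows the other side: no suffix rule can reach the lemma "be".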

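Part-of-speech tags

Stanford Core NLP appears to use a library called the Stanford Log-linear Part-Of-Speech Tagger for its part-of-speech analysis, and the tag abbreviations it produces (NN, NNP, IN, and so on) appear to follow the University of Pennsylvania (Penn) Treebank tag set.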
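As a reading aid, here is a small lookup table for the word-class tags that show up in the output excerpt. The descriptions are my own paraphrase of the Penn Treebank tag names, the table covers only a handful of tags rather than the full tag set, and `PENN_TAGS` and `describe` are names invented for this sketch.

```python
# Partial table: only the word-class tags seen in the output excerpt.
# Descriptions are paraphrased, not quoted from the official tag set document.
PENN_TAGS = {
    'CC': 'coordinating conjunction',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'TO': 'to',
    'VBN': 'verb, past participle',
    'VBZ': 'verb, 3rd person singular present',
}


def describe(tag):
    """Return a short description, or a placeholder for tags not in the table."""
    return PENN_TAGS.get(tag, '(not in this small table)')


for tag in ('JJ', 'NNP', 'VBZ', '-LRB-'):
    print('{}\t{}'.format(tag, describe(tag)))
```

Punctuation and bracket tags such as "," or -LRB- are left out on purpose, so they fall through to the placeholder.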

　
That's all for the 55th knock. If you find any mistakes, I would be grateful if you could point them out.

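The output includes part of the data distributed as the corpus data used in the 100 knocks. The data used in this Chapter 6 is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (Japanese translation).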
