
An Amateur's Language Processing 100 Knocks: 51

Posted at 2016-12-15, last updated at 2017-05-03

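This is a record of my attempt at 言語処理100本ノック 2015 (Language Processing 100 Knocks 2015). The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). See the index of previous knocks for earlier entries.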

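Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

51. Word segmentation

Treating whitespace as the word delimiter, take the output of problem 50 as input and output one word per line. Output a blank line at the end of each sentence.

The finished code: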

main.py

# coding: utf-8
import re

fname = 'nlp.txt'


def nlp_lines():
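	'''Generator that reads nlp.txt one sentence at a time.
	Reads nlp.txt sequentially and yields it sentence by sentence.

	Yields:
	a string containing one sentence
	'''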
	with open(fname) as lines:

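		# compile the regular expression that splits off one sentence
		pattern = re.compile(r'''
			(
				^					# start of line
				.*?					# any characters, non-greedy
				[.;:?!]				# . or ; or : or ? or !
			)
			\s						# whitespace character
			(
				[A-Z].*				# from an uppercase letter on (= the following sentences)
			)
		''', re.MULTILINE | re.VERBOSE | re.DOTALL)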

		for line in lines:

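			line = line.strip()		# strip leading and trailing whitespace
			while len(line) > 0:

				# take one sentence from the line
				match = pattern.match(line)
				if match:

					# yield the extracted sentence
					yield match.group(1)		# the first sentence
					line = match.group(2)		# the remaining sentences

				else:

					# no delimiter found, so the rest is a single sentence
					yield line
					line = ''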


def nlp_words():
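	'''Generator that yields nlp.txt one word at a time.
	An empty string is yielded at the end of each sentence.

	Yields:
	one word, or an empty string at the end of a sentence
	'''
	for line in nlp_lines():

		# split into words and strip trailing punctuation before yielding
		for word in line.split(' '):
			yield word.rstrip('.,;:?!')

		# an empty string marks the end of the sentence
		yield ''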


# read the file and print one word per line
for word in nlp_words():
	print(word)
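
To see how the sentence-splitting pattern behaves on its own, the following is a minimal sketch that applies a compact variant of the pattern to an in-memory string instead of nlp.txt. The helper name `split_sentences` and the sample text are assumptions made for illustration and are not part of the original code.

```python
import re

# Compact variant of the sentence pattern in main.py:
# group(1) = shortest prefix that ends in . ; : ? or !
# group(2) = the rest, starting at an uppercase letter after one whitespace char
PATTERN = re.compile(r'(^.*?[.;:?!])\s([A-Z].*)', re.MULTILINE | re.DOTALL)


def split_sentences(line):
    '''Hypothetical helper: split one line of text into a list of sentences.'''
    sentences = []
    line = line.strip()
    while line:
        match = PATTERN.match(line)
        if match:
            sentences.append(match.group(1))   # the first sentence
            line = match.group(2)              # the remaining sentences
        else:
            sentences.append(line)             # no delimiter left
            line = ''
    return sentences


sample = 'This is e.g. fine. Is it? Yes: It is.'
print(split_sentences(sample))
# ['This is e.g. fine.', 'Is it?', 'Yes:', 'It is.']
```

Because the pattern requires an uppercase letter after the whitespace, the period in `e.g.` does not end a sentence here, while the colon in `Yes:` does.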

Execution result:

The output is long, so this is an excerpt from the beginning.

Terminal: beginning of the output

Natural
language
processing

From
Wikipedia
the
free
encyclopedia

Natural
language
processing
(NLP)
is
a
field
of
computer
science
artificial
intelligence
and
linguistics
concerned
with
the
interactions
between
computers
and
human
(natural)
languages

As
such
NLP
is
related
to
the
area
of
humani-computer
interaction

Many
challenges
in
NLP
involve
natural
language
understanding
that
is
enabling
computers
to
derive
meaning
from
human
or
natural
language
input
and
others
involve
natural
language
generation

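The full result is uploaded to GitHub.

More generators

In the previous problem I wrote nlp_lines(), a generator that returns one sentence at a time. This time I added nlp_words(), a generator that builds on it and returns one word at a time.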
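
The chaining of two generators can be sketched in isolation: one generator yields sentences, a second consumes it and yields words, and nothing is computed until the caller iterates. The sketch below uses an in-memory list of sentences rather than nlp.txt; the names `sentences` and `words` and the sample data are hypothetical.

```python
from itertools import islice


def sentences(source):
    '''Yield sentences one at a time (stand-in for nlp_lines()).'''
    for sentence in source:
        yield sentence


def words(source):
    '''Yield words one at a time, then '' at the end of each sentence.'''
    for sentence in sentences(source):
        for word in sentence.split(' '):
            # same trailing-punctuation stripping as nlp_words()
            yield word.rstrip('.,;:?!')
        yield ''


sample = [
    'Natural language processing (NLP) is a field.',
    'As such, NLP is related.',
]

# Only the first 9 items are requested, so the rest is never produced
first = list(islice(words(sample), 9))
print(first)
# ['Natural', 'language', 'processing', '(NLP)', 'is', 'a', 'field', '', 'As']
```

Note that `rstrip` only removes trailing characters from the given set, which is why `(NLP)` keeps its parentheses in the execution result.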

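(Updated 2016/12/17)
Reflected the fix made to nlp_lines() in the previous problem.

That's all for the 52nd knock (the problems are numbered from 00). If you find any mistakes, I would appreciate it if you could point them out.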

The execution result includes part of the data distributed as the corpus data for the 100 knocks. The data used in this Chapter 6 is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported.
