More than 5 years have passed since last update.

自然言語処理100本ノック第6章英語テキストの処理(前半)

Posted at 2015-12-30

第6章の前半の問題を解いた記録。
対象とするファイルはwebページにもある通り、nlp.txtとする。

英語のテキスト（nlp.txt）に対して，以下の処理を実行せよ．

50. 文区切り

(. or ; or : or ? or !) → 空白文字 → 英大文字というパターンを文の区切りと見なし，入力された文書を1行1文の形式で出力せよ．

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

punt = re.compile(r"(?P<punt>[\.:;!\?]) (?P<head>[A-Z])")

if __name__ == "__main__":
    f = open('nlp.txt', 'r')
    for line in f:
        l = line.strip()
        # if punt.search(l):
            # print punt.sub(r"\g<punt>\n\g<head>", l)
        print punt.sub(r"\g<punt>\n\g<head>", l)
    f.close()

51. 単語の切り出し

空白を単語の区切りとみなし，50の出力を入力として受け取り，1行1単語の形式で出力せよ．ただし，文の終端では空行を出力せよ．

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

if __name__ == "__main__":
    f = open('nlp_line.txt', 'r')
    for line in f:
        l = line.strip()
        for word in l.split():
            print re.sub(r"\W", "", word)
        print ""
    f.close()

52. ステミング

51の出力を入力として受け取り，Porterのステミングアルゴリズムを適用し，単語と語幹をタブ区切り形式で出力せよ． Pythonでは，Porterのステミングアルゴリズムの実装としてstemmingモジュールを利用するとよい．

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

from nltk.stem.porter import PorterStemmer

if __name__ == "__main__":
    f = open('nlp_word.txt')
    for line in f:
        stemmer = PorterStemmer()
        l = line.strip()
        if len(l) > 0:
            print "%s\t%s" % (l, stemmer.stem(l))
        else:
            print ""
    f.close()

53. Tokenization

Stanford Core NLPを用い，入力テキストの解析結果をXML形式で得よ．また，このXMLファイルを読み込み，入力テキストを1行1単語の形式で出力せよ．

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

WORD = re.compile(r"<word>(\w+)</word>")

f = open('nlp.txt.xml', 'r')
for line in f:
    word = WORD.search(line.strip())
    if word:
        print word.group(1)
f.close()

XMLファイルの作成コマンド

Stanford Core NLPをダウンロードし、そのフォルダまで移動。
次のコマンドを実行。

java -Xmx5g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-models-3.6.0.jar:* edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,mention,coref -file nlp_line.txt -outputFormat xml

なぜだかzsh上ではエラーを吐いて動かなかったので、bash上で実行。

54. 品詞タグ付け

Stanford Core NLPの解析結果XMLを読み込み，単語，レンマ，品詞をタブ区切り形式で出力せよ．

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

WORD = re.compile(r"<word>(\w+)</word>")
LEMMA = re.compile(r"<lemma>(\w+)</lemma>")
POS = re.compile(r"<POS>(\w+)</POS>")

f = open("nlp.txt.xml", "r")
words = []
for line in f:
    if len(words) == 3:
        print "\t".join(words)
        words = []
    else:
        line = line.strip()
        word = WORD.search(line)
        if len(words) == 0 and word:
            words.append(word.group(1))
            continue
        lemma = LEMMA.search(line)
        if len(words) == 1 and lemma:
            words.append(lemma.group(1))
            continue
        pos = POS.search(line)
        if len(words) == 2 and pos:
            words.append(pos.group(1))
f.close()

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

自然言語処理100本ノック 第6章 英語テキストの処理(前半)