
NLP 100 Knocks (自然言語処理100本ノック) Chapter 3: Regular Expressions (First Half)

Posted at 2015-10-29

A record of my solutions to the first half of the Chapter 3 problems.
As stated on the web page, the target file is jawiki-country.json, obtained by decompressing jawiki-country.json.gz.

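There is a file, jawiki-country.json.gz, in which Wikipedia articles are exported in the following format.
Each line stores the information for one article in JSON format.
Each line is a dictionary object that holds the article name under the "title" key and the article body under the "text" key, and that object is written out as JSON.
The whole file is compressed with gzip.
Write programs that perform the following tasks.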
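
As a rough illustration of the format described above, the sketch below builds a tiny gzip-compressed payload in memory (one JSON object per line) and reads it back article by article. The sample records and the helper name `read_articles` are made up for illustration; the real data is jawiki-country.json.gz. The sketch assumes Python 3.

```python
import gzip
import io
import json


def read_articles(fileobj):
    """Yield one dict per line from a gzip-compressed, line-delimited JSON stream."""
    with gzip.open(fileobj, mode='rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)


# Hypothetical sample records standing in for jawiki-country.json.gz.
sample = [
    {'title': 'イギリス', 'text': '{{基礎情報 国}}'},
    {'title': '日本', 'text': '[[Category:島国]]'},
]
raw = '\n'.join(json.dumps(a, ensure_ascii=False) for a in sample) + '\n'
payload = io.BytesIO(gzip.compress(raw.encode('utf-8')))

articles = list(read_articles(payload))
print([a['title'] for a in articles])
```

Reading the .gz file this way would skip the manual decompression step, although the scripts in this post work on the already decompressed jawiki-country.json.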

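20. Reading JSON data

Read the JSON file of Wikipedia articles and display the body of the article about "イギリス" (the United Kingdom). For problems 21-29, run against the article body extracted here.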

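# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import json
import re

inputfile = 'jawiki-country.json'
outputfile = 'jawiki-england.txt'

# Match the article whose title is exactly "イギリス". Searching the body
# instead would also select every article that merely mentions the UK.
target = re.compile(u'^イギリス$')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        res = json.loads(line)  # one JSON object (one article) per line
        if target.search(res[u'title']):
            g.write(res[u'text'] + u'\n')

# => (output written to jawiki-england.txt)

This uses the re module.
So that the Japanese text is handled as a Unicode string, the pattern is written as a u'...' literal (here u'^イギリス$').
re.compile compiles the regular expression, and the search method of the compiled pattern checks whether the title of the article on each line is the target (イギリス).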
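
To make the behaviour of a compiled pattern's search method more concrete, here is a small sketch that contrasts it with match. The sample strings are invented for illustration, and the snippet assumes Python 3.

```python
import re

target = re.compile('イギリス')

body = '日本とイギリスは島国である。'  # hypothetical text, not taken from the data

# search() scans the whole string for the first location that matches.
found_anywhere = target.search(body) is not None
# match() only succeeds if the pattern matches at the very start of the string.
found_at_start = target.match(body) is not None
# A string that does not contain the target at all.
found_elsewhere = target.search('フランス') is not None

print(found_anywhere, found_at_start, found_elsewhere)  # True False False
```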

21. Extracting lines that contain category names

Extract the lines in the article that declare category names.

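# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category.txt'

# Raw string so that \[ and \] reach the regex engine unchanged.
category = re.compile(r'\[\[Category:.+\]\]')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        # search() finds the declaration anywhere in the line, not only at the start.
        if category.search(line):
            g.write(line.strip() + u'\n')

# => (output written to jawiki-england_category.txt)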

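Same approach as the previous problem.
The check is whether a line contains [[Category:...]].
Because they appear inside a regular expression, [ and ] are escaped.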
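
As a hedged illustration of the escaping, the sketch below writes the pattern by hand with backslashes and also builds an equivalent one with re.escape, which escapes regex metacharacters in a literal string. The sample lines are made up, and the snippet assumes Python 3.

```python
import re

# Hand-escaped: \[ and \] stand for literal brackets instead of a character class.
manual = re.compile(r'\[\[Category:.+\]\]')

# Same idea, letting re.escape() add the backslashes to the literal parts.
built = re.compile(re.escape('[[Category:') + '.+' + re.escape(']]'))

category_line = '[[Category:イギリス|*]]'   # hypothetical category declaration
link_line = '[[イギリス]]'                  # an ordinary internal link

for pattern in (manual, built):
    print(bool(pattern.search(category_line)), bool(pattern.search(link_line)))
```

Both patterns should accept the category line and reject the ordinary link.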

22. Extracting category names

Extract the category names of the article (by name, not line by line).

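# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_category-name.txt'

# (.+?) captures the name only; an optional sort key such as "|*" is skipped.
category = re.compile(r'\[\[Category:(.+?)(?:\|.*)?\]\]')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        l = category.match(line)
        if l:
            g.write(l.group(1) + u'\n')

# => (output written to jawiki-england_category-name.txt)

The category name is obtained with the group method of the match object.
The text matched by the parenthesized part of the pattern, (.+?), can be retrieved this way.
With an argument of 0, group returns the entire match; with a positive integer n, it returns the n-th parenthesized group (an IndexError is raised if n is larger than the number of groups).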
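
The group semantics described above can be checked with a short sketch. It uses a simple one-group pattern and an invented sample line, and assumes Python 3.

```python
import re

category = re.compile(r'\[\[Category:(.+)\]\]')
m = category.match('[[Category:島国]]')  # hypothetical sample line

whole = m.group(0)   # the entire matched text
name = m.group(1)    # the first parenthesized group

print(whole)
print(name)

try:
    m.group(2)       # the pattern defines only one group
except IndexError as exc:
    print('IndexError:', exc)
```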

23. Section structure

Display the section names contained in the article and their levels (for example, 1 for "== Section name ==").

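# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_section.txt'

# Group 1: the run of "=" signs, group 2: the section name.
# \s* allows headings with or without spaces ("==歴史==" and "== 歴史 ==");
# \1 requires the closing run of "=" to have the same length as the opening one.
section = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        l = section.match(line)
        if l:
            level = len(l.group(1)) - 1  # "==" is level 1, "===" is level 2, ...
            g.write(u'sec%d : %s\n' % (level, l.group(2)))

# => (output written to jawiki-england_section.txt)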

As in the previous problem, this uses the group method.
The section level is determined from the number of = signs.
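
As a small worked example of deriving the level from the number of = signs, consider the sketch below. The helper name `section_level` and the sample lines are hypothetical, the pattern assumes that headings may be written with or without spaces around the name, and the snippet assumes Python 3.

```python
import re

# Group 1 is the run of "=" signs, group 2 is the section name.
section = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$')


def section_level(line):
    """Return (level, name) for a heading line, or None for any other line."""
    m = section.match(line)
    if m is None:
        return None
    # "==" is level 1, "===" is level 2, and so on.
    return len(m.group(1)) - 1, m.group(2)


for line in ['==歴史==', '=== 古代 ===', '本文の行']:
    print(section_level(line))
# (1, '歴史')
# (2, '古代')
# None
```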

24. Extracting file references

Extract all of the media files referenced from the article.

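# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import io
import re

inputfile = 'jawiki-england.txt'
outputfile = 'jawiki-england_media.txt'

# The u'' part keeps the Japanese namespace name a Unicode string;
# the raw-string part captures the file name up to "|" or "]".
mediafile = re.compile(u'(?:ファイル|File|file):' + r'([^|\]]+\.[a-zA-Z0-9]+)')

with io.open(inputfile, encoding='utf-8') as f, \
        io.open(outputfile, 'w', encoding='utf-8') as g:
    for line in f:
        # findall() returns every file name on the line, not just one of them.
        for name in mediafile.findall(line):
            g.write(name + u'\n')

# => (output written to jawiki-england_media.txt)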

Same approach as the problems so far.
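
A single line of wikitext can reference more than one media file, so one possible way to collect all of them is findall, which returns every non-overlapping match on the line. The pattern and the sample lines below are illustrative assumptions rather than a complete treatment of MediaWiki file links, and the snippet assumes Python 3.

```python
import re

# Capture the file name after the namespace prefix, up to "|" or "]",
# requiring it to end in an extension such as ".jpg".
mediafile = re.compile(r'(?:ファイル|File|file):([^|\]]+\.[a-zA-Z0-9]+)')

# Hypothetical lines: two references on one line, and one without a "|" option.
two_refs = '[[ファイル:Battle of Waterloo 1815.PNG|thumb|説明]] と [[File:Uk topo en.jpg|thumb]]'
no_pipe = '[[File:Flag.svg]]'

names = mediafile.findall(two_refs)
print(names)
print(mediafile.findall(no_pipe))
```

With a single capturing group in the pattern, findall returns a list of the captured file names rather than the full matches.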

The way group works felt confusing at first, but after solving a few problems I started to get the hang of it.
