100 Knocks of Natural Language Processing (自然言語処理100本ノック), Chapter 3: Regular Expressions (second half)

Posted at 2015-10-29

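A record of my solutions to the problems in the second half of Chapter 3.

25. Template extraction

Extract the field names and values of the "基礎情報" (basic information) template contained in the articles, and store them in a dictionary object.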

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re
import json

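def pass_string(string):
    """Identity function: leave each template value unchanged."""
    return string

def fundamental_data(file, func):
    """Collect every 基礎情報 template in `file` as a list of dicts.

    `func` is applied to each value as it is read.
    """
    response = []
    data = {}
    start = re.compile(r"\{\{基礎情報")
    end = re.compile(r"\}\}")
    row = re.compile(r"^\s?\|?\s?(.+?)\s?=(.*)\|?")
    flag = False  # True while the current line is inside a template

    pre_key = None  # most recent field name, used for continuation lines
    for line in file:
        if start.match(line):
            flag = True
            pre_key = None
            continue

        if end.match(line):
            flag = False

        if flag:
            field = row.match(line)

            if field:
                pre_key = field.group(1).strip()
                data[pre_key] = func(field.group(2).strip())

            else:
                # Continuation of the previous value; "}}" at the end of
                # the line also closes the template.
                m = re.match(r"(.*)\}\}$", line)
                text = m.group(1) if m else line

                # Skip stray lines that appear before the first field.
                if pre_key is not None:
                    data[pre_key] += func(text.strip())

                if m:
                    flag = False
        else:
            if len(data) > 0:
                response.append(data.copy())
                data.clear()

    # Keep a template that is closed on the very last line of the input.
    if len(data) > 0:
        response.append(data.copy())

    return response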

if __name__ == "__main__":
    inputfile = 'jawiki-england.txt'
    outputfile = 'jawiki-england_fundamental.json'
    # Read and write UTF-8 explicitly so the result does not depend on the locale.
    with open(inputfile, encoding='utf-8') as f:
        res = fundamental_data(f, pass_string)
    with open(outputfile, 'w', encoding='utf-8') as g:
        json.dump(res, g, ensure_ascii=False)

# => (written to the file jawiki-england_fundamental.json)

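The 基礎情報 template runs from {{基礎情報 to }}.
There are a few exceptions where }} sits at the end of a line instead of on a line of its own, so those lines get special handling.
(With the later problems in mind, fundamental_data takes a function as its second argument and applies it to every value.)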
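
To make the field-splitting step concrete, here is a small sketch that applies the same row pattern to a few lines. The sample lines and the helper name split_field are made up for illustration and are not taken from the data file. The last sample shows a caveat: any line containing "=" is treated as a new field, even if it is really the continuation of a value.

```python
import re

# The field pattern used in fundamental_data.
row = re.compile(r"^\s?\|?\s?(.+?)\s?=(.*)\|?")

def split_field(line):
    """Return (name, value) for a field line, or None for a continuation line."""
    m = row.match(line)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

# Hypothetical template lines, invented for this example.
sample = [
    "|略名 = イギリス\n",
    "|公式国名 = {{lang|en|United Kingdom}}\n",
    "continuation text without an equals sign\n",
    '<ref name="a">text</ref>\n',
]

fields = [split_field(line) for line in sample]
for item in fields:
    print(item)
# ('略名', 'イギリス')
# ('公式国名', '{{lang|en|United Kingdom}}')
# None
# ('<ref name', '"a">text</ref>')
```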

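26. Removing emphasis markup

While doing the processing in 25, remove MediaWiki emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (reference: markup quick-reference table).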

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re
import json
import problem25

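def remove_emphasis(string):
    # Two, three or five quotes open an emphasized span. \1 requires the
    # closing run to repeat the extra quotes captured by the first group.
    # (.+?) is non-greedy so that several spans in one value are each
    # unwrapped separately instead of being matched as one long span.
    emphasis = re.compile(r"''('*)(.+?)''\1")
    return emphasis.sub(r"\2", string)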

if __name__ == "__main__":
    inputfile = 'jawiki-england.txt'
    outputfile = 'jawiki-england_fundamental-rmEmpha.json'
    # Read and write UTF-8 explicitly so the result does not depend on the locale.
    with open(inputfile, encoding='utf-8') as f:
        res = problem25.fundamental_data(f, remove_emphasis)
    with open(outputfile, 'w', encoding='utf-8') as g:
        json.dump(res, g, ensure_ascii=False)

# => (written to the file jawiki-england_fundamental-rmEmpha.json)

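This passes a function that strips emphasis markup to the fundamental_data function written for the previous problem.
Inside a regular expression, \number is a backreference: it matches the same text as the number-th group, using the same numbering as the group method.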
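
As a quick check of how the backreference behaves, the sketch below runs an emphasis pattern over a few made-up strings. It uses a non-greedy (.+?) so that two emphasized spans in one string are handled one at a time; the sample strings are assumptions for illustration only.

```python
import re

# ('*) captures any quotes beyond the first two; \1 then demands the same
# number of extra quotes on the closing side.
emphasis = re.compile(r"''('*)(.+?)''\1")

samples = [
    "''weak''",
    "'''strong'''",
    "'''''strongest'''''",
    "''a'' and '''b'''",
]

cleaned = [emphasis.sub(r"\2", s) for s in samples]
for before, after in zip(samples, cleaned):
    print(before, "->", after)
# ''weak'' -> weak
# '''strong''' -> strong
# '''''strongest''''' -> strongest
# ''a'' and '''b''' -> a and b
```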

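27. Removing internal links

In addition to the processing in 26, remove MediaWiki internal-link markup from the template values and convert them to text (reference: markup quick-reference table).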

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re
import json
import problem25
import problem26

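def remove_internalLink(string):
    # [[target]] and [[target|label]] are both reduced to the text after
    # the optional "target|" part. That part may not contain "]" or "|",
    # so a value holding two links is not matched as one long link.
    internallink = re.compile(r"\[\[(?:[^\]|]+\|)?(.+?)\]\]")
    emphasis_removed = problem26.remove_emphasis(string)
    return internallink.sub(r"\1", emphasis_removed)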

if __name__ == "__main__":
    inputfile = 'jawiki-england.txt'
    outputfile = 'jawiki-england_fundamental-rmEmpha-rmLink.json'
    # Read and write UTF-8 explicitly so the result does not depend on the locale.
    with open(inputfile, encoding='utf-8') as f:
        res = problem25.fundamental_data(f, remove_internalLink)
    with open(outputfile, 'w', encoding='utf-8') as g:
        json.dump(res, g, ensure_ascii=False)

# => (written to the file jawiki-england_fundamental-rmEmpha-rmLink.json)

On top of the previous problem's processing, this adds a step that removes internal-link markup.
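
A point worth checking with any internal-link pattern is what happens when one value contains two links. The sketch below compares two patterns on a made-up string; the names loose and strict are hypothetical. With lazy groups on both sides of the pipe, the part before "|" can still run past the first "]]" and swallow the text between the links, while excluding "]" and "|" from that part keeps each match inside a single link.

```python
import re

# Lazy groups on both sides of the optional pipe.
loose = re.compile(r"\[\[((.+?)\|)?(.+?)\]\]")
# The part before "|" may not contain "]" or "|".
strict = re.compile(r"\[\[(?:[^\]|]+\|)?(.+?)\]\]")

# Invented sample: a plain link followed by a piped link.
text = "[[ロンドン]]と[[イギリス|英国]]"

loose_result = loose.sub(r"\3", text)
strict_result = strict.sub(r"\1", text)

print(loose_result)   # 英国
print(strict_result)  # ロンドンと英国
```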

28. Removing MediaWiki markup

In addition to the processing in 27, remove as much MediaWiki markup from the template values as possible and tidy up the basic country information.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re
import json
import problem25
import problem26
import problem27

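def remove_markup(string):
    # Each pattern keeps its first group and drops the markup around it.
    markups = [
        # external link: [http://example.com label] -> label
        re.compile(r"\[https?://[a-zA-Z0-9\./]+\s(.+)?\]"),
        # redirect: keep the redirect target
        re.compile(r"#REDIRECT\s?(.+)"),
        # comment: keep the text between <!-- and -->
        re.compile(r"<!--\s?(.+)\s?-->"),
        # language template: {{lang|en|text}} -> text
        re.compile(r"\{\{.*[Ll]ang\|[a-zA-Z\-]+\|(.+)\}\}"),
        # reference: keep the text in front of <ref ...>
        re.compile(r"(.*)<ref.+(</ref>)?>"),
        # line break: keep the text in front of <br>
        re.compile(r"(.*?)<br\s?/?>"),
        # other HTML elements: keep the enclosed text
        re.compile(r"<[a-z]+.*>(.*?)</[a-z]+>")
    ]
    removed_string = problem27.remove_internalLink(string)
    for m in markups:
        removed_string = m.sub(r"\1", removed_string)
    return removed_string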

if __name__ == "__main__":
    inputfile = 'jawiki-england.txt'
    outputfile = 'jawiki-england_fundamental-rmMUs.json'
    # Read and write UTF-8 explicitly so the result does not depend on the locale.
    with open(inputfile, encoding='utf-8') as f:
        res = problem25.fundamental_data(f, remove_markup)
    with open(outputfile, 'w', encoding='utf-8') as g:
        json.dump(res, g, ensure_ascii=False)

# => (written to the file jawiki-england_fundamental-rmMUs.json)

The markup I found in the data consisted of external links, redirects, comments, language tags, and HTML, so I added handling to remove those.
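
Because the patterns are applied one after another, each one sees the output of the previous one. A minimal sketch with two of the patterns from remove_markup shows the intermediate result; the sample string and the name strip_markup are made up for illustration.

```python
import re

# Two of the patterns from remove_markup, applied in the same order.
markups = [
    re.compile(r"\{\{.*[Ll]ang\|[a-zA-Z\-]+\|(.+)\}\}"),  # {{lang|en|text}}
    re.compile(r"(.*?)<br\s?/?>"),                         # <br>, <br/>, <br />
]

def strip_markup(string):
    """Apply every pattern in turn, keeping only its first group."""
    for pattern in markups:
        string = pattern.sub(r"\1", string)
    return string

sample = "{{lang|en|London}}<br />ロンドン"

after_first = markups[0].sub(r"\1", sample)
result = strip_markup(sample)

print(after_first)  # London<br />ロンドン
print(result)       # Londonロンドン
```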

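29. Getting the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)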

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import requests
import json

inputfile = 'jawiki-england_fundamental-rmMUs.json'
outputfile = 'jawiki-england_nationalflags.txt'

with open(inputfile, 'r', encoding='utf-8') as f:
    template = json.load(f)

wikipedia_api = "http://ja.wikipedia.org/w/api.php?"
prop = {
    'action': "query",
    'prop': "imageinfo",
    'iiprop': "url",
    'format': "json",
    'formatversion': '2',
    'utf8': '',
    'continue': ''
}

with open(outputfile, "w", encoding='utf-8') as g:
    for country in template:
        if '国旗画像' in country:
            countryname = country['略名']
            filename = country['国旗画像']
            prop['titles'] = "Image:" + filename
            res = requests.get(url=wikipedia_api, params=prop)
            datum = json.loads(res.text)
            try:
                file_url = datum['query']['pages'][0]['imageinfo'][0]['url']
            except (KeyError, IndexError):
                # The response has no image URL: show it and stop.
                print(datum)
                break
            print(filename, file_url)
            g.write(countryname.replace('|', ''))
            g.write(", %s\n" % file_url)

# => (written to the file jawiki-england_nationalflags.txt)

The requests module is used to call the API.
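
The lookup can be split into two small functions that are easy to check without network access: one builds the query parameters, the other pulls the URL out of the decoded response. This is only a sketch. The function names are hypothetical, and the sample response is hand-written to mirror the indexing used in the script above, with a placeholder URL rather than a real API result.

```python
def build_params(filename):
    """Query parameters for an imageinfo lookup of a single file."""
    return {
        'action': 'query',
        'prop': 'imageinfo',
        'iiprop': 'url',
        'format': 'json',
        'formatversion': '2',
        'titles': 'Image:' + filename,
    }

def extract_url(datum):
    """Return the first image URL in the decoded response, or None."""
    try:
        return datum['query']['pages'][0]['imageinfo'][0]['url']
    except (KeyError, IndexError):
        return None

# Hand-written stand-in for a decoded response (placeholder URL).
sample = {
    'query': {
        'pages': [
            {
                'title': 'Image:Flag.svg',
                'imageinfo': [{'url': 'https://example.org/Flag.svg'}],
            }
        ]
    }
}

# A page without imageinfo, for example when the file is not found.
missing = {'query': {'pages': [{'title': 'Image:Nothing.svg'}]}}

params = build_params('Flag.svg')
found_url = extract_url(sample)
missing_url = extract_url(missing)
print(params['titles'], found_url, missing_url)
```

Returning None instead of raising lets the caller decide whether to skip the country or stop.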
