
100 Language Processing Knocks (言語処理100本ノック) 2020: 25

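"""
25. Template extraction
Extract the field names and values of the "基礎情報" (basic information) template contained in the article, and store them in a dictionary object.
"""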

import json
import re

import utils


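def get_uk_text(path):
    """Return the text of the article titled 'イギリス' (United Kingdom)."""
    # The dump is read as JSON Lines: one article object per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line_data = json.loads(line)
            if line_data["title"] == "イギリス":
                return line_data["text"]
    # Fail with a clear message instead of a NameError when no article matches.
    raise ValueError(f"No article titled 'イギリス' found in {path}")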


uk_text = get_uk_text("jawiki-country.json")
# See uk_text.txt


# ans24
def get_basic_info(string: str) -> str:
    """Return the body of the '基礎情報' (basic information) template.
    """
    pattern = re.compile(
        r"""
            ^\{\{基礎情報.*?$   # line starting with '{{基礎情報'
            (.*?)       # capture group: any characters, zero or more, non-greedy
            ^\}\}$      # line consisting only of '}}'
        """,
        re.MULTILINE | re.DOTALL | re.VERBOSE,
    )

    return re.findall(pattern, string)[0]


def get_content(string: str) -> dict:
    r"""Extract the template fields as a {field name: value} dict.

    https://docs.python.org/3/library/re.html#regular-expression-syntax

    RE:
        - re.X (re.VERBOSE)     Ignore whitespace in the pattern and allow '#' comments that explain it.
        - re.M (re.MULTILINE)   Make '^' and '$' match at the start and end of every line. If not specified, they match only at the start and end of the whole string.
        - re.S (re.DOTALL)      Make '.' match any character, including '\n'.
        - ^\|       Line begins with '|'
        - ?         Causes the resulting RE to match 0 or 1 repetitions of the preceding RE

        - *?        The '*' quantifier is greedy.
                    Adding ? after the quantifier makes it match in non-greedy (minimal) fashion; as few characters as possible will be matched.
                    e.g. <.*> matched against '<a> b <c>' matches the entire string
                    e.g. <.*?> matches only '<a>'

        - (...)     Matches whatever regular expression is inside the parentheses and captures it as a group.
        - (?=...)   Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion.
                    For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
        - (?:...)   A non-capturing version of regular parentheses.

    Input:
        - '|国章リンク =([[イギリスの国章|国章]])'
    Return:
        - {"国章リンク": "([[イギリスの国章|国章]])"}
    """
    pattern = re.compile(
        r"""
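            ^\|         # line starting with '|'
            (.+?)       # capture group (field name): any characters, one or more, non-greedy
            \s*         # zero or more whitespace characters
            =
            \s*         # zero or more whitespace characters
            (.+?)       # capture group (value): any characters, one or more, non-greedy
            (?:         # start of non-capturing group
                (?=\n\|)    # just before a newline followed by '|' (positive lookahead)
                |           # or
                (?=\n$)     # just before a newline followed by the end of a line or of the string (positive lookahead)
            )               # end of group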
            """,
        re.MULTILINE | re.DOTALL | re.VERBOSE,
    )
    result = pattern.findall(string)
    return dict(result)  # dicts preserve insertion order since Python 3.7


basic_info = get_basic_info(uk_text)
# print(basic_info[-100:])
# |国際電話番号 = 44
# |注記 = <references/>

result = get_content(basic_info)
utils.save_json(result, "25_en_basic_info.json")

for r in result.items():
    print(r)
# ('略名', 'イギリス')
# ('日本語国名', 'グレートブリテン及び北アイルランド連合王国')
# ...
# ('国際電話番号', '44')
# ('注記', '<references/>')
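
To see why the value group needs both `re.DOTALL` and the lookahead, here is a minimal, self-contained sketch that applies the same field pattern as `get_content` to a tiny made-up template body. The sample text and the names `FIELD_PATTERN`, `sample`, and `fields` are assumptions for illustration only; they are not taken from `jawiki-country.json`.

```python
import re

# Same field pattern as in get_content above, with the two lookaheads
# written as one alternation inside a single lookahead.
FIELD_PATTERN = re.compile(
    r"""
        ^\|             # line starting with '|'
        (.+?)           # field name, non-greedy
        \s*=\s*         # '=' with optional whitespace on both sides
        (.+?)           # value, non-greedy; may span lines because of DOTALL
        (?=\n\||\n$)    # stop before a newline followed by '|' or by a line end
    """,
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Hypothetical template body (illustration only). The 'languages' value
# continues on a second line that does not start with '|'.
sample = (
    "\n"
    "|name = Example\n"
    "|languages = English\n"
    "*Welsh\n"
    "|code = 44\n"
)

fields = dict(FIELD_PATTERN.findall(sample))
for key, value in fields.items():
    print(repr(key), repr(value))
```

The continuation line `*Welsh` stays inside the `languages` value because the non-greedy `(.+?)` keeps growing across newlines until the lookahead finds a newline followed by `|`. By the same logic, a value that itself contains a line starting with `|` (a nested template, for example) would be cut at that line, so this pattern is best read as a sketch that fits simple templates.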
