NLP 100 Exercise 2020 (言語処理100本ノック): 22

"""
22. Extracting category names
Extract the category names of the article (by name, not line by line).
"""

import json
import re


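def get_uk_text(path):
    """Return the wikitext of the article titled "イギリス" from a JSON Lines file."""
    # The encoding is passed explicitly, assuming the dump is UTF-8.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line_data = json.loads(line)
            if line_data["title"] == "イギリス":
                return line_data["text"]
    # Previously a missing article surfaced as an UnboundLocalError on `data`.
    raise ValueError(f"no article titled 'イギリス' found in {path}")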


uk_text = get_uk_text("jawiki-country.json")
uk_text_list = uk_text.split("\n")
# Keep only the lines that begin with a category link.
ans = [x for x in uk_text_list if x.startswith("[[Category:")]
# ans:
# [[Category:イギリス|*]]
# [[Category:イギリス連邦加盟国]]
# [[Category:英連邦王国|*]]
# [[Category:G8加盟国]]
# [[Category:欧州連合加盟国|元]]
# [[Category:海洋国家]]
# [[Category:現存する君主国]]
# [[Category:島国]]
# [[Category:1801年に成立した国家・領域]]


# ans22
def extract_category_value(string: str) -> str:
    """
    https://docs.python.org/3/library/re.html#regular-expression-syntax

    - re.VERBOSE  allows whitespace and comments inside the pattern to explain it
    - re.S        makes '.' match any character, including a newline
    - (...)       matches whatever regular expression is inside the parentheses
                  and captures it as a group
    - (?:...)     a non-capturing version of regular parentheses
    - ?           causes the resulting RE to match 0 or 1 repetitions of the preceding RE
    - *?          the '*' quantifier is greedy; adding '?' after it makes the match
                  non-greedy (minimal), so as few characters as possible are matched
                  e.g. <.*> matched against '<a> b <c>' matches the entire string
                  e.g. <.*?> matches only '<a>'
    Input: [[Category:イギリス|*]]
    Output: 'イギリス'
    """
    pattern = re.compile(
        r"""
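        ^       # start of the line (each input is a single line)
        .*      # zero or more of any character
        \[\[Category:
        (       # start of the capturing group
        .*?     # zero or more of any character, non-greedy (a greedy match would swallow the trailing decoration that starts with '|')
        )       # end of the group
        (?:     # start of a non-capturing group
        \|.*    # '|' followed by zero or more characters
        )?      # end of the group, occurring 0 or 1 times
        \]\]
        .*      # zero or more of any character
        $       # end of the line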
        """,
        re.VERBOSE | re.S,
    )
    result = re.findall(pattern, string)[0]
    return result


category_values = [extract_category_value(s) for s in ans]
print(category_values)
# ['イギリス',
#  'イギリス連邦加盟国',
#  '英連邦王国',
#  'G8加盟国',
#  '欧州連合加盟国',
#  '海洋国家',
#  '現存する君主国',
#  '島国',
#  '1801年に成立した国家・領域']
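
The two steps above (filter the category lines, then strip the decoration from each one) can probably be folded into a single pass. The following is a minimal sketch of that idea, not the method used in this article: `CATEGORY_RE`, `extract_category_names`, and `sample_text` are hypothetical names introduced for illustration, and the sample string is a small stand-in rather than the real `jawiki-country.json` content. It swaps the non-greedy `.*?` for a negated character class, which stops at the first `|` or `]` without relying on backtracking.

```python
import re

# Hypothetical stand-in for the article text; not taken from the real dump.
sample_text = "\n".join(
    [
        "Some introductory text.",
        "[[Category:イギリス|*]]",
        "[[Category:イギリス連邦加盟国]]",
        "[[Category:欧州連合加盟国|元]]",
    ]
)

# With re.MULTILINE, ^ and $ anchor at every line, so one findall call
# scans the whole text.
# ([^|\]]+)       captures the name up to the first '|' or ']'
# (?:\|[^\]]*)?   optionally skips a sort key such as '|*' without capturing it
CATEGORY_RE = re.compile(
    r"^\[\[Category:([^|\]]+)(?:\|[^\]]*)?\]\]$",
    re.MULTILINE,
)


def extract_category_names(text: str) -> list:
    """Return every category name found on its own line in `text`."""
    # With exactly one capturing group, findall returns a list of strings.
    return CATEGORY_RE.findall(text)


print(extract_category_names(sample_text))
# ['イギリス', 'イギリス連邦加盟国', '欧州連合加盟国']
```

Unlike `re.findall(pattern, string)[0]`, this version returns an empty list instead of raising `IndexError` when the text contains no category line.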
