言語処理100本ノック (NLP 100 Exercises), Chapter 3: My Attempt

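Introduction

  • I got stuck right away on No. 20.
  • I managed to solve everything through to the end, but my answer to No. 25 is quite shaky.
    I suspect it discards more information than necessary, so I plan to redo it later.
  • I used pandas for the first time; until now I had only heard the name.
  • I'll keep working through the following chapters.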

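20. Reading JSON data

import pandas as pd

dirpath = "."  # assumption: directory containing jawiki-country.json.gz

# The file is JSON Lines (one article per line), hence lines=True.
df = pd.read_json(f"{dirpath}/jawiki-country.json.gz", lines=True)
uk_data = df[df.title == 'イギリス']  # the article on the United Kingdom
uk_text = uk_data['text'].values[0]
print(uk_text)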
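
Since this was the step where I stalled, it may help to see what `lines=True` amounts to. The following is a minimal sketch using only the standard library, assuming the file is gzip-compressed JSON Lines whose objects carry `title` and `text` keys (the two columns used above). The helper name `find_article_text` is hypothetical and is not part of the exercise.

```python
import gzip
import json


def find_article_text(path: str, title: str):
    """Return the 'text' of the first article whose 'title' matches, or None."""
    # "rt" opens the gzip stream in text mode; each line is one JSON object.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            article = json.loads(line)
            if article.get("title") == title:
                return article.get("text")
    return None
```

Calling `find_article_text(f"{dirpath}/jawiki-country.json.gz", "イギリス")` should then give the same string as `uk_text`, without loading every article into a DataFrame first.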

21. Extracting lines that contain category names

import re

# Raw string so the backslashes reach the regex engine unchanged.
pattern = r"\[\[Category:.*"
categories = re.findall(pattern, uk_text)
for line in categories:
  print(line)

22. Extracting category names

# Capture only the name; an optional sort key such as "|*" is matched but not captured.
pattern = r"\[\[Category:(.*?)(?:\|.*?)?\]\]"
categories = re.findall(pattern, uk_text)
for category in categories:
  print(category)

23. Section structure

# Capture the run of "=" and the title separately; \1 requires the same run on both sides.
pattern = r"^(={2,})\s*(.+?)\s*\1\s*$"
for marks, name in re.findall(pattern, uk_text, re.MULTILINE):
  # "==" is level 1, "===" is level 2, and so on.
  print(f"Level:{len(marks) - 1} {name}")

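24. Extracting file references

# Accept both the ファイル: and File: prefixes; the name ends at the first "|" or at "]]".
pattern = r"\[\[(?:ファイル|File):(.+?)(?:\||\]\])"
media_files = re.findall(pattern, uk_text)
for media_file in media_files:
  print(media_file)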

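25. Extracting the template

# Matches every "|key = value" line in the article, not only those inside the 基礎情報 template,
# and keeps only the first line of a multi-line value.
pattern = r"\|(.*?) = (.*)"
info_list = re.findall(pattern, uk_text)
info_dict = {}
for info in info_list:
  info_dict[info[0]] = info[1]
print(info_dict)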
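
This is the answer I flagged as shaky in the introduction. One possible way to lose less information is to cut out the 基礎情報 block first and then split it into fields, so that multi-line values stay whole and `|key = value` lines elsewhere in the article are ignored. The sketch below makes two assumptions about the wikitext: the template opens on a line starting with `{{基礎情報` and closes on a line containing only `}}`. The function name `extract_infobox` is hypothetical.

```python
import re


def extract_infobox(text: str) -> dict:
    """Return the fields of the first {{基礎情報 ...}} template as a dict."""
    # Assumption: the block ends at the first line that consists only of "}}".
    block = re.search(
        r"^\{\{基礎情報[^\n]*\n(.*?)^\}\}$", text, re.MULTILINE | re.DOTALL
    )
    if block is None:
        return {}
    # A field starts with "|name =" at the beginning of a line. Its value runs
    # until the next line that starts with "|" or until the end of the block.
    fields = re.findall(
        r"^\|([^=\n]+?)[ \t]*=[ \t]*(.*?)(?=\n\||\Z)",
        block.group(1),
        re.MULTILINE | re.DOTALL,
    )
    return {name.strip(): value.strip() for name, value in fields}
```

With this, `info_dict = extract_infobox(uk_text)` could stand in for the loop above. A nested template that happens to close with `}}` on a line of its own would end the block early, so this is a starting point rather than a complete parser.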

26. Removing emphasis markup

pattern = r"\|(.*?) = (.*)"
info_list = re.findall(pattern, uk_text)
info_dict = {}
for info in info_list:
  # Two to five consecutive apostrophes mark italic and bold text.
  info_dict[info[0]] = re.sub(r"'{2,5}", "", info[1])
print(info_dict)

27. Removing internal links

pattern = r"\|(.*?) = (.*)"
info_list = re.findall(pattern, uk_text)
info_dict = {}
for info in info_list:
  txt = re.sub(r"'{2,5}", "", info[1])
  # [[記事名]] and [[記事名|表示文字]] both become the displayed text; ファイル links are left alone.
  txt = re.sub(r"\[\[(?!ファイル)(?:[^|\]]*\|)?([^|\]]+)\]\]", r"\1", txt)
  info_dict[info[0]] = txt
print(info_dict)

28. Removing MediaWiki markup

pattern = r"\|(.*?) = (.*)"
info_list = re.findall(pattern, uk_text)
info_dict = {}
for info in info_list:
  txt = re.sub(r"'{2,5}", "", info[1])
  txt = re.sub(r"\[\[(?!ファイル)(?:[^|\]]*\|)?([^|\]]+)\]\]", r"\1", txt)

  # Templates such as {{lang|...}}
  txt = re.sub(r"\{\{.*?\}\}", "", txt)
  # Paired tags and their contents, e.g. <ref>...</ref>; \1 requires the matching closing tag
  txt = re.sub(r"<(\w+)[^>/]*>.*?</\1>", "", txt)
  # Remaining single tags, e.g. <br />
  txt = re.sub(r"<.*?>", "", txt)
  # File links
  txt = re.sub(r"\[\[ファイル:.*?\]\]", "", txt)

  info_dict[info[0]] = txt

for k, v in info_dict.items():
  print(k, " : ", v)

29. Getting the URL of the national flag image

import requests

session = requests.Session()

url = "https://ja.wikipedia.org/w/api.php"

image_name = info_dict.get("国旗画像")

params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "titles": "File:" + image_name,
    "iiprop": "url"
}

response_json = session.get(url=url, params=params).json()
image_url = response_json["query"]["pages"]["-1"]["imageinfo"][0]["url"]

print(image_url)
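
The lookup above indexes the page id `"-1"` directly. As a hedged alternative, the parsing can be pulled into a small function that walks whatever page entries come back, which also makes it possible to check the logic without calling the API. The helper name `first_image_url` is hypothetical, and the sample response is hand-written for illustration: it follows only the keys the code above indexes and uses a placeholder URL.

```python
def first_image_url(response_json: dict):
    """Return the first imageinfo URL found in a query response, or None."""
    pages = response_json.get("query", {}).get("pages", {})
    # Walk every returned page instead of assuming a particular page id key.
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                return info["url"]
    return None


# Hand-written sample for illustration only; the URL is a placeholder.
sample = {
    "query": {
        "pages": {
            "-1": {"imageinfo": [{"url": "https://example.org/flag.svg"}]}
        }
    }
}
print(first_image_url(sample))
```

In the script above, `image_url = first_image_url(response_json)` would replace the chained indexing, returning `None` instead of raising `KeyError` when the response has a different shape.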