Bulk-downloading national flag images from Wikipedia [Python] [BeautifulSoup]

Posted at 2020-03-22, last updated at 2020-09-13

Overview

I used Python to download (scrape) the flag images shown on the Japanese Wikipedia page 国旗の一覧 (List of national flags).

References

Environment

macOS Catalina
Python 3.8.0

Installing the libraries

pip install beautifulsoup4
pip install requests
pip install lxml

urllib is part of the Python standard library, so it does not need to be installed with pip.

Implementation

Create a folder named figs in the working directory before running the script.
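If creating the folder by hand is inconvenient, the script could create it itself. The following is a minimal sketch of that idea; the helper name `ensure_dir` is hypothetical and is not part of the original script.

```python
import os


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist.

    With exist_ok=True, os.makedirs does nothing when the directory
    is already there, so the call is safe to repeat on every run.
    """
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # "figs" is the output folder name used by the script in this article.
    ensure_dir("figs")
```

Calling `ensure_dir("figs")` once near the top of the script would make the manual preparation step unnecessary.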

from bs4 import BeautifulSoup
import requests
import urllib.request  # urlretrieve lives in the urllib.request submodule
import os
import time

# URL of the Japanese Wikipedia page "国旗の一覧" (List of national flags)
wiki_url = "https://ja.wikipedia.org/wiki/%E5%9B%BD%E6%97%97%E3%81%AE%E4%B8%80%E8%A6%A7"

# Fetch and parse the HTML source
html_text = requests.get(wiki_url).text
soup = BeautifulSoup(html_text,"lxml")

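# Collect all img tags
imgs = soup.find_all("img")
# Collect the flag image URLs
flag_urls = []
for tag in imgs:
    # As of 2020-03-22, the img tags for flags have an alt attribute of the form "〇〇の旗" ("flag of ..."),
    # so only tags whose alt text contains "旗" are processed. Tags without an alt attribute are skipped.
    alt = tag.get("alt") or ""
    if "旗" not in alt:
        continue
    url = tag.get("src")  # the src attribute is a protocol-relative URL ("//...")
    url = "https:"+url  # prepend "https:" to make it an absolute URL
    flag_urls.append(url)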

for url in flag_urls:
    # Build the destination path.
    # Each URL ends in something like "125px-Flag_of_<country>.svg.png"; "Flag_of_<country>.png" becomes the local file name.
    png_name = url.split("px-")[-1].split(".")[0]+".png"
    # Save under the figs directory, which must already exist
    png_name = os.path.join("./figs",png_name)
    # Download only if the file does not exist yet
    if os.path.exists(png_name):
        print("File",png_name,"already exists")
        continue
    urllib.request.urlretrieve(url,png_name)
    print("File",png_name,"downloaded")
    # Wait between requests to avoid putting load on the server
    time.sleep(1)

Some of the file names came out garbled, but the downloads themselves completed without problems.
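One plausible cause of the garbled names, offered here as an assumption rather than a confirmed diagnosis, is that non-ASCII characters in the image URLs are percent-encoded (for example `%C3%B4` for `ô`), and the script uses the URL text as the file name without decoding it. The sketch below decodes the last path segment with `urllib.parse.unquote` before building the name. The helper name `flag_filename` and the example URL are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def flag_filename(url):
    """Derive a local file name such as "Flag_of_X.png" from a thumbnail URL.

    Assumes the URL ends in "<width>px-<name>.svg.png", the pattern
    described in the article.
    """
    # Take the last path segment and decode percent-escapes (UTF-8 by default).
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    decoded = unquote(last_segment)
    # Drop the "<width>px-" prefix.
    name = decoded.split("px-", 1)[-1]
    # Drop the extension(s): ".svg.png" when present, otherwise the last suffix.
    if name.endswith(".svg.png"):
        name = name[: -len(".svg.png")]
    else:
        name = name.rsplit(".", 1)[0]
    return name + ".png"


if __name__ == "__main__":
    # Hypothetical URL used only to illustrate the decoding step.
    example = "https://example.org/thumb/125px-Flag_of_C%C3%B4te_d%27Ivoire.svg.png"
    print(flag_filename(example))
```

Unlike splitting on the first ".", stripping only the known suffix keeps names that contain a period intact.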

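Bonus

If you want a table that maps flag file names to Japanese country names, you can build it from the alt attributes of the img tags held in the imgs variable in the code above.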

from bs4 import BeautifulSoup
import requests

# URL of the Japanese Wikipedia page "国旗の一覧" (List of national flags)
wiki_url = "https://ja.wikipedia.org/wiki/%E5%9B%BD%E6%97%97%E3%81%AE%E4%B8%80%E8%A6%A7"

# Fetch and parse the HTML source
html_text = requests.get(wiki_url).text
soup = BeautifulSoup(html_text,"lxml")
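# Collect all img tags
imgs = soup.find_all("img")
# Build the table that maps country names to file names
table = []
for tag in imgs:
    # As of 2020-03-22, the img tags for flags have an alt attribute of the form "〇〇の旗" ("flag of ..."),
    # so only tags whose alt text contains "旗" are processed. Tags without an alt attribute are skipped.
    alt = tag.get("alt") or ""
    if "旗" not in alt:
        continue
    country_name = alt[:-2]  # take the "〇〇" part of "〇〇の旗" (everything except the last two characters)
    url = tag.get("src")  # the src attribute is a protocol-relative URL ("//...")
    # Each URL ends in something like "125px-Flag_of_<country>.svg.png"; "Flag_of_<country>.png" becomes the file name.
    png_name = url.split("px-")[-1].split(".")[0]+".png"
    table.append(country_name+','+png_name)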

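# Write as UTF-8 explicitly, since the country names are Japanese and the platform default encoding may differ
with open("name_to_img.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(table))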
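Joining fields with a bare comma works as long as no country name contains a comma. As a hedged alternative, the standard `csv` module quotes such fields automatically and can read the table back. The helper names `write_table` and `read_table` and the column headers are hypothetical and are not part of the original script.

```python
import csv
import io


def write_table(rows, fileobj):
    """Write (country_name, png_name) pairs as CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["country_name", "png_name"])
    writer.writerows(rows)


def read_table(fileobj):
    """Read the CSV back into a dict mapping country name to file name."""
    reader = csv.DictReader(fileobj)
    return {row["country_name"]: row["png_name"] for row in reader}


if __name__ == "__main__":
    # In-memory demonstration; for a real file, the csv documentation
    # recommends opening it with newline="" (plus encoding="utf-8" here).
    buffer = io.StringIO()
    write_table([("日本", "Flag_of_Japan.png")], buffer)
    buffer.seek(0)
    print(read_table(buffer))
```

The resulting dict makes it straightforward to look up a file name from a Japanese country name.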
