Bulk-Fetching Statistical Documents from the Japanese Bankers Association (全国銀行協会)

Python

Posted at 2024-08-23

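This article introduces a Python script that automatically extracts the links to the month-end statistics PDFs for each year from the Japanese Bankers Association monthly statistics page. The script uses the BeautifulSoup and requests libraries and works in two steps: it collects links from the main page, then gathers the PDF URLs from each linked page.

Step 1: Fetch the main page

First, we extract the per-year statistics sections from the base URL (https://www.zenginkyo.or.jp/stats/month1-01/).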

import requests
from bs4 import BeautifulSoup
import re

def extract_pdf_links_from_years(base_url):
    # Fetch the page at the base URL
    response = requests.get(base_url)
    response.raise_for_status()

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Select the sections for all years
    sections = soup.select('section.section-menu_tableSub')

    all_pdf_links = []

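Explanation:

requests.get(base_url): Uses the requests library to fetch the HTML content of the page at the given URL. The following raise_for_status() call raises an exception if the server returns an error status.
soup = BeautifulSoup(response.text, 'html.parser'): Converts the fetched HTML into an object that can be searched and traversed.
sections = soup.select('section.section-menu_tableSub'): Selects the per-year statistics sections (for example, 2024年（令和6年）) from the parsed HTML.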
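To make the selector step concrete, here is a minimal sketch that runs the same CSS selectors against a small inline HTML string instead of the live site. The markup in SAMPLE_HTML is a simplified assumption built from the class names used in this article, and the href values and the helper name list_year_headings are hypothetical; the real page is more complex.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics only the class names
# the selectors in this article rely on. The hrefs are placeholders.
SAMPLE_HTML = """
<section class="section-menu_tableSub">
  <header class="section-heading"><h3 class="heading-main">2024年（令和6年）</h3></header>
  <table><tr><td class="item"><a href="/placeholder/a/">1月末</a></td></tr></table>
</section>
<section class="section-menu_tableSub">
  <header class="section-heading"><h3 class="heading-main">2023年（令和5年）</h3></header>
  <table><tr><td class="item"><a href="/placeholder/b/">12月末</a></td></tr></table>
</section>
"""

def list_year_headings(html):
    """Return the heading text of every per-year section in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    sections = soup.select("section.section-menu_tableSub")
    headings = []
    for section in sections:
        heading = section.select_one("header.section-heading h3.heading-main")
        # select_one returns None when nothing matches, so guard before use
        if heading is not None:
            headings.append(heading.get_text(strip=True))
    return headings

headings = list_year_headings(SAMPLE_HTML)
print(headings)
```

Testing selectors against a saved or hand-written snippet like this avoids hitting the site repeatedly while you adjust them.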

Step 2: Follow the collected links

Next, we extract the month-end links from each year's section and collect the PDF links from each linked page.

    # Loop over each year's section
    for section in sections:
        # Extract the four-digit year from the section heading
        year_heading = section.select_one('header.section-heading h3.heading-main').get_text(strip=True)
        year = re.search(r'\d{4}', year_heading).group(0)

        # Collect the month-end links and their text
        month_links = section.select('td.item a')

        for link in month_links:
            # Convert the month-end label (e.g. "3月末") to YYYY-MM
            month_text = link.get_text(strip=True)
            month_match = re.search(r'(\d{1,2})月末', month_text)
            if month_match:
                month = month_match.group(1).zfill(2)
                formatted_month = f"{year}-{month}"

                # Fetch the month-end page
                full_month_url = f"https://www.zenginkyo.or.jp{link.get('href')}"
                month_response = requests.get(full_month_url)
                month_response.raise_for_status()

                # Parse it with BeautifulSoup and collect the PDF links
                month_soup = BeautifulSoup(month_response.text, 'html.parser')
                for article in month_soup.select('section.items .item a'):
                    pdf_link = article.get('href')
                    # Skip anchors without an href; keep only links to PDF files
                    if pdf_link and pdf_link.endswith('.pdf'):
                        full_pdf_url = f"https://www.zenginkyo.or.jp{pdf_link}"
                        all_pdf_links.append((formatted_month, full_pdf_url))

    return all_pdf_links

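Explanation:

Getting the year and the month-end links:

year_heading = section.select_one('header.section-heading h3.heading-main').get_text(strip=True): Gets the heading text of each section. The re.search call on the next line then extracts the four-digit year (YYYY) from it.
month_links = section.select('td.item a'): Extracts the month-end links from within each year's section.

Following the month-end links and collecting the PDFs:

full_month_url = f"https://www.zenginkyo.or.jp{link.get('href')}": Prefixes the site's origin to the link's href to form an absolute URL, which is then fetched with requests.get.
month_soup.select('section.items .item a'): Selects the links on each linked page. Only those whose href ends in .pdf are kept.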
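The two small transformations above (label to YYYY-MM, and href to absolute URL) can be tried in isolation. The sketch below uses only the standard library. The helper names to_year_month and absolute_url are hypothetical, and it swaps the string concatenation for urllib.parse.urljoin, which is an alternative to the article's approach that also copes with hrefs that are already absolute or are relative to the current page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.zenginkyo.or.jp/stats/month1-01/"

def to_year_month(year_heading, month_text):
    """Build a YYYY-MM string from a year heading and a month-end label.

    Returns None if either part cannot be found, instead of raising.
    """
    year_match = re.search(r"\d{4}", year_heading)
    month_match = re.search(r"(\d{1,2})月末", month_text)
    if not (year_match and month_match):
        return None
    return f"{year_match.group(0)}-{month_match.group(1).zfill(2)}"

def absolute_url(href, base=BASE_URL):
    """Resolve an href against the base page URL."""
    return urljoin(base, href)

print(to_year_month("2024年（令和6年）", "3月末"))
# The path below is a placeholder, not a real file on the site
print(absolute_url("/placeholder/sample.pdf"))
```

Returning None for labels that do not match mirrors the if month_match: check in the script, so unexpected link text is skipped rather than causing an error.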

Printing the results

Finally, we print each extracted year-month (in YYYY-MM format) together with its PDF link.

# Usage example
base_url = 'https://www.zenginkyo.or.jp/stats/month1-01/'  # Base URL
pdf_links_with_month = extract_pdf_links_from_years(base_url)

# Print each year-month with its PDF link
for month, pdf in pdf_links_with_month:
    print(f"{month}: {pdf}")

Explanation:

print(f"{month}: {pdf}"): Prints each extracted year-month with its PDF link, one pair per line. This format makes the results easy to scan and to organize afterwards.
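Since one month may yield several PDFs, it can help to group the (year-month, URL) pairs before working with them. The following is a minimal sketch, assuming the list-of-tuples shape returned by extract_pdf_links_from_years. The function name group_by_month is hypothetical and the sample URLs are placeholders, not real files.

```python
from collections import defaultdict

def group_by_month(pairs):
    """Group (year_month, url) pairs into a dict sorted by year-month."""
    grouped = defaultdict(list)
    for year_month, url in pairs:
        grouped[year_month].append(url)
    # YYYY-MM strings sort chronologically when sorted as text
    return dict(sorted(grouped.items()))

# Placeholder data in the same shape as the script's return value
sample_pairs = [
    ("2024-02", "https://www.zenginkyo.or.jp/placeholder/c.pdf"),
    ("2024-01", "https://www.zenginkyo.or.jp/placeholder/a.pdf"),
    ("2024-01", "https://www.zenginkyo.or.jp/placeholder/b.pdf"),
]

grouped = group_by_month(sample_pairs)
for year_month, urls in grouped.items():
    print(f"{year_month}: {len(urls)} PDF(s)")
```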

Full code

import requests
from bs4 import BeautifulSoup
import re

def extract_pdf_links_from_years(base_url):
    # Fetch the page at the base URL
    response = requests.get(base_url)
    response.raise_for_status()

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Select the sections for all years
    sections = soup.select('section.section-menu_tableSub')

    all_pdf_links = []

    # Loop over each year's section
    for section in sections:
        # Extract the four-digit year from the section heading
        year_heading = section.select_one('header.section-heading h3.heading-main').get_text(strip=True)
        year = re.search(r'\d{4}', year_heading).group(0)

        # Collect the month-end links and their text
        month_links = section.select('td.item a')

        for link in month_links:
            # Convert the month-end label (e.g. "3月末") to YYYY-MM
            month_text = link.get_text(strip=True)
            month_match = re.search(r'(\d{1,2})月末', month_text)
            if month_match:
                month = month_match.group(1).zfill(2)
                formatted_month = f"{year}-{month}"

                # Fetch the month-end page
                full_month_url = f"https://www.zenginkyo.or.jp{link.get('href')}"
                month_response = requests.get(full_month_url)
                month_response.raise_for_status()

                # Parse it with BeautifulSoup and collect the PDF links
                month_soup = BeautifulSoup(month_response.text, 'html.parser')
                for article in month_soup.select('section.items .item a'):
                    pdf_link = article.get('href')
                    # Skip anchors without an href; keep only links to PDF files
                    if pdf_link and pdf_link.endswith('.pdf'):
                        full_pdf_url = f"https://www.zenginkyo.or.jp{pdf_link}"
                        all_pdf_links.append((formatted_month, full_pdf_url))

    return all_pdf_links

# Usage example
base_url = 'https://www.zenginkyo.or.jp/stats/month1-01/'  # Base URL
pdf_links_with_month = extract_pdf_links_from_years(base_url)

# Print each year-month with its PDF link
for month, pdf in pdf_links_with_month:
    print(f"{month}: {pdf}")
