Getting PDF Links to the Federal Reserve Bank of Atlanta's Business Uncertainty Survey Results

Python

Posted at 2024-08-25

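This article builds a Python script that collects the PDF links on the Federal Reserve Bank of Atlanta's business uncertainty survey results page, pairing each link with a date in yyyy-mm format.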

Step 1: Fetching the Web Page and Parsing the HTML

First, fetch the page content and parse it.

import re

import requests
from bs4 import BeautifulSoup

def extract_pdf_links_from_atlantafed(base_url):
    # Fetch the page at the base URL
    response = requests.get(base_url)
    response.raise_for_status()

Here the HTML content is fetched from the given base_url. requests.get(base_url) sends the HTTP request, and raise_for_status() raises an exception if the response carries a 4xx or 5xx status code. Otherwise execution continues.

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

Next, BeautifulSoup parses the fetched HTML. The whole document becomes a navigable object, ready for extracting specific elements.

Step 2: Extracting PDF Links and Dates from the Target Section

Next, extract the PDF links and their corresponding dates from a specific HTML section.

    # Get the target <div>
    tab_content = soup.find('div', id='Tab2')

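The find method retrieves the specific <div> tag (the element whose id is Tab2), so that only the section containing the PDF links is processed. Note that find returns None if no such element exists.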
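Selecting the <div> explicitly matters because the #Tab2 fragment in the URL used in the usage example is never sent to the server, so the response always contains the whole page. The following is a minimal standard-library sketch of that split. It is an illustration only, not part of the original script.

```python
from urllib.parse import urldefrag

# URL taken from the usage example in this article.
url = "https://www.atlantafed.org/research/surveys/business-uncertainty.aspx#Tab2"

# urldefrag separates the part that is requested from the client-side fragment.
page_url, fragment = urldefrag(url)

print(page_url)  # the resource that is actually requested
print(fragment)  # used only on the client side (e.g. by a browser)
```

The fragment happens to match the id of the target <div>, but the filtering itself is done by soup.find.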

    # Among the list items, look for <a> tags that carry a link
    pdf_links = []
    for link in tab_content.find_all('a', href=True):
        href = link['href']
        if '/documents/datafiles/research/surveys/business-uncertainty/chart-pack/' in href:
            # Build the full URL
            full_url = f"https://www.atlantafed.org{href}"

            # Extract the date text and convert it to yyyy-mm format
            date_text = link.text.strip()
            date_match = re.search(r'(\w+)\s(\d{4})', date_text)
            if date_match:
                month_str = date_match.group(1)
                year_str = date_match.group(2)
                # Convert the month name to a number
                month_dict = {
                    'January': '01', 'February': '02', 'March': '03', 'April': '04',
                    'May': '05', 'June': '06', 'July': '07', 'August': '08',
                    'September': '09', 'October': '10', 'November': '11', 'December': '12'
                }
                month_num = month_dict.get(month_str, '01')
                formatted_date = f"{year_str}-{month_num}"

                # Append the date together with the link to the list
                pdf_links.append((formatted_date, full_url))

This part searches the section for every <a> tag that has an href attribute and keeps those whose href contains /documents/datafiles/research/surveys/business-uncertainty/chart-pack/. As a result, only the target PDF links are collected.
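The script builds the full URL by prefixing the host name, which assumes every href is a root-relative path. As a hedged alternative, the sketch below resolves the href with urljoin from the standard library, which also leaves already-absolute links unchanged. The helper name to_absolute_url and the file name example.pdf are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.atlantafed.org/research/surveys/business-uncertainty.aspx"
CHART_PACK_PATH = "/documents/datafiles/research/surveys/business-uncertainty/chart-pack/"


def to_absolute_url(href, page_url=PAGE_URL):
    """Resolve href against the page URL (hypothetical helper)."""
    return urljoin(page_url, href)


# "example.pdf" is a made-up file name, not a real document on the site.
relative = CHART_PACK_PATH + "example.pdf"
absolute = "https://www.atlantafed.org" + relative

print(to_absolute_url(relative))  # root-relative href gets the host prepended
print(to_absolute_url(absolute))  # absolute href is returned unchanged
```

With plain string concatenation, the second case would produce a URL with the host name repeated.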

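The date is taken from the link text: a regular expression captures the month name and the four-digit year, and a dictionary maps the month name to its two-digit number to produce the yyyy-mm form. A month name that is missing from the dictionary falls back to 01. The formatted date and the link are then appended to the list as a tuple.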
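Because the script maps any unrecognised word to 01, link text such as "Pack 2024" would be recorded as January. The sketch below shows one possible stricter variant that returns None instead, assuming link text of the form "Month YYYY" in English. The function name to_year_month is hypothetical.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
# Map each month name to a zero-padded number, e.g. "August" -> "08".
MONTH_TO_NUM = {name: f"{i:02d}" for i, name in enumerate(MONTHS, start=1)}
DATE_PATTERN = re.compile(r"([A-Za-z]+)\s+(\d{4})")


def to_year_month(text):
    """Return 'yyyy-mm' for text like 'August 2024', or None if no valid date is found."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month = MONTH_TO_NUM.get(match.group(1).capitalize())
    if month is None:
        return None  # unknown word before the year: do not guess a month
    return f"{match.group(2)}-{month}"


print(to_year_month("August 2024"))
print(to_year_month("Pack 2024"))
```

Callers can then skip links for which the function returns None rather than storing a misleading date.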

Displaying the Results

Finally, display the extracted dates and PDF links.

# Usage example
base_url = 'https://www.atlantafed.org/research/surveys/business-uncertainty.aspx#Tab2'
pdf_links_with_dates = extract_pdf_links_from_atlantafed(base_url)

# Print the extracted dates and PDF links
for date, pdf in pdf_links_with_dates:
    print(f"{date}: {pdf}")
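The results are printed in the order they appear on the page. If a chronological listing is preferred, yyyy-mm strings sort correctly as plain text, so the list of tuples can be sorted directly. The following is a small sketch under that assumption. The helper newest_first and the example.com URLs are hypothetical sample data, not real output.

```python
def newest_first(pairs):
    """Remove duplicate (date, url) tuples and sort them with the newest date first."""
    return sorted(set(pairs), reverse=True)


# Made-up sample data in the same (date, url) shape the function returns.
sample = [
    ("2024-06", "https://example.com/june.pdf"),
    ("2024-08", "https://example.com/august.pdf"),
    ("2024-06", "https://example.com/june.pdf"),
]

for date, pdf in newest_first(sample):
    print(f"{date}: {pdf}")
```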

The Complete Script

The complete script below scrapes the Federal Reserve Bank of Atlanta page with Python and extracts the PDF links and their dates.

import requests
from bs4 import BeautifulSoup
import re

def extract_pdf_links_from_atlantafed(base_url):
    # Step 1: Fetch the page at the base URL and parse it
    response = requests.get(base_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Step 2: Extract the PDF links and dates from the target section
    tab_content = soup.find('div', id='Tab2')
    pdf_links = []
    for link in tab_content.find_all('a', href=True):
        href = link['href']
        if '/documents/datafiles/research/surveys/business-uncertainty/chart-pack/' in href:
            full_url = f"https://www.atlantafed.org{href}"
            date_text = link.text.strip()
            date_match = re.search(r'(\w+)\s(\d{4})', date_text)
            if date_match:
                month_str = date_match.group(1)
                year_str = date_match.group(2)
                month_dict = {
                    'January': '01', 'February': '02', 'March': '03', 'April': '04',
                    'May': '05', 'June': '06', 'July': '07', 'August': '08',
                    'September': '09', 'October': '10', 'November': '11', 'December': '12'
                }
                month_num = month_dict.get(month_str, '01')
                formatted_date = f"{year_str}-{month_num}"
                pdf_links.append((formatted_date, full_url))

    return pdf_links

# Usage example
base_url = 'https://www.atlantafed.org/research/surveys/business-uncertainty.aspx#Tab2'
pdf_links_with_dates = extract_pdf_links_from_atlantafed(base_url)

# Print the extracted dates and PDF links
for date, pdf in pdf_links_with_dates:
    print(f"{date}: {pdf}")
