[Beginner 🔰] I tried web scraping in Python with Copilot

Posted at 2024-12-20

Introduction

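I finally opened an IT book that I had bought and left sitting on the shelf, and tried things out while skimming through it.
The result: it was so much fun that I wondered why I hadn't done this sooner (lol).
It must be because Copilot helped me (well, wrote everything) and all I did was verify the result!
So the code and my questions all went through prompts, and I also had Copilot put together the text of this post.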

Unfortunately, the very first prompt got lost, so I can't include it here!
Next time I post, I want to make sure I keep that part as well.

I used the book as a reference; I did not follow every step or setting in it.
■Reference
Author: いまにゅ
Title: 今日からできる！ Python業務効率化スキルが身につく本
Publisher: KADOKAWA Corporation

■What I used
Microsoft VSCode 1.96.1
Microsoft Copilot
Python 2024.22.0 (likely the version of the Python extension for VS Code)

■Points to watch out for
Check whether scraping is allowed for the target URL
Keep the access interval and the number of requests within limits
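
Both cautions can be checked in code. The following is a minimal sketch, not part of the script in this post: it uses the standard library's `urllib.robotparser` to ask whether a URL may be fetched and to read a `Crawl-delay` if the site declares one. The robots.txt content, the user-agent name `my-scraper`, and the helper names `build_parser` and `polite_delay` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download it from the site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def polite_delay(parser, user_agent, default_seconds=1.0):
    """Return the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds

parser = build_parser(SAMPLE_ROBOTS_TXT)
agent = "my-scraper"  # hypothetical user-agent name

allowed = parser.can_fetch(agent, "https://www.example.com/fake-path?p=1")
blocked = parser.can_fetch(agent, "https://www.example.com/private/list")
delay = polite_delay(parser, agent)
print(allowed, blocked, delay)
```

Calling `time.sleep(delay)` between requests would then keep the access interval at or above what the site asks for.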

Code (using a fictitious URL)

Note: this code comes with no guarantee that it works.
It may not run correctly depending on your environment and settings,
so use it at your own risk.

import re
import requests
from bs4 import BeautifulSoup
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Base URL of the fictitious e-commerce site to scrape
base_url = 'https://www.example.com/fake-path?p='

# Configure the retry strategy
session = requests.Session()
retries = Retry(
    total=5,  # upper limit on the number of retries
    backoff_factor=1,
    status_forcelist=[500, 502, 503, 504]
)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

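# Fetch robots.txt
robots_url = 'https://www.example.com/robots.txt'
site_root = robots_url.rsplit('/', 1)[0]
robots_response = session.get(robots_url, timeout=10)
robots_response.raise_for_status()
robots_text = robots_response.text

# List the paths that must not be scraped.
# Simplification: User-agent sections are ignored, so every Disallow rule is applied.
disallowed_paths = []
for line in robots_text.splitlines():
    if line.startswith('Disallow:'):
        disallowed_path = line.split(':', 1)[1].strip()
        if disallowed_path:  # an empty Disallow blocks nothing
            disallowed_paths.append(disallowed_path)

# Fetch the HTML of a page
def get_page(url):
    if not url:
        print("Error: invalid URL.")
        return None
    # Actually apply the robots.txt rules collected above
    if any(url.startswith(site_root + path) for path in disallowed_paths):
        print(f"Skipped {url}: disallowed by robots.txt")
        return None
    try:
        time.sleep(1)  # wait before each request to limit load on the server
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None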

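# Get the total number of books
def get_total_books(url):
    html = get_page(url)
    if html is None:
        return 0
    soup = BeautifulSoup(html, 'html.parser')
    counter_tag = soup.find('div', class_='counter')
    if counter_tag is None:
        print(f"Error: no counter element found at {url}")
        return 0
    try:
        # Take what follows '／全' and drop the trailing '件'
        total_books_text = counter_tag.text.split('／全')[-1].replace('件', '').strip()
        return int(total_books_text.split()[0])  # drop line breaks and extra text
    except (IndexError, ValueError):
        print(f"Error: could not read the total count at {url}")
        return 0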

# Scrape the book information on one page
def scrape_books(page_url):
    html = get_page(page_url)
    if html is None:
        return []

    soup = BeautifulSoup(html, 'html.parser')

    # Adjust the class names and tags to match the site's actual HTML structure
    book_items = soup.find_all('div', class_='details ml10')
    book_data = []
    for book in book_items:
        try:
            title_tag = book.find_previous('h3', class_='heightLine-2').find('a')
            title = title_tag.text.strip() if title_tag else 'No title found'

            author_tag = book.find('p', class_='clearfix')
            author = author_tag.text.strip() if author_tag else 'No author found'

            price_tag = book.find_next('span', class_='sale_price')
            price = price_tag.text.strip() if price_tag else 'No price found'

            # Get the publisher and release date
            details2_tag = book.find_next('div', class_='details2 select_section1 ml10')
            publisher_release_info = details2_tag.find('li').text.strip()
            match = re.match(r'^(.*)（(\d{4}/\d{2})発売）$', publisher_release_info)
            if match:
                publisher = match.group(1)
                release_date = match.group(2)
            else:
                publisher = 'No publisher found'
                release_date = 'No release date found'

            book_data.append([title, author, price, publisher, release_date])
        except AttributeError as e:
            print(f"Error parsing book data from {page_url}: {e}")
    return book_data

# Build the list of page URLs
def generate_page_urls(start_url, max_pages):
    if not start_url:
        print("Error: invalid base URL.")
        return []
    return [start_url + str(page) for page in range(1, max_pages + 1)]

# Main logic
start_url = base_url
total_books = get_total_books(start_url + '1')
items_per_page = 20
max_pages = min((total_books + items_per_page - 1) // items_per_page, 10)  # page count (rounded up), capped at 10 pages

all_book_data = []
data_found = False
page_urls = generate_page_urls(start_url, max_pages)

# Scrape the pages concurrently
with ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(scrape_books, url): url for url in page_urls if url}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            book_data = future.result()
            if book_data:
                all_book_data.extend(book_data)
                data_found = True
                print(f"Added {len(book_data)} books from {url}")
            else:
                print(f"No data found at {url}")
        except Exception as exc:
            print(f"An exception occurred for {url}: {exc}")

# Save to a CSV file only if some data was collected
if data_found:
    with open('books.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Title', 'Author', 'Price', 'Publisher', 'Release Date'])
        writer.writerows(all_book_data)
    print("All book data has been saved to the CSV file.")
else:
    print("No valid data was found. The CSV file was not created.")
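
Some of the parsing logic in the script can be tried without touching any website. The following is a minimal sketch, assuming the counter text ends with `／全<N>件` and the publisher line looks like `<publisher>（YYYY/MM発売）`, which is what the script's string handling implies. The helper names and the sample strings are made up for illustration, and `parse_total` uses a regular expression rather than the script's split-and-replace approach.

```python
import re

def parse_total(counter_text):
    """Extract N from text containing '／全N件'; return 0 if it is not there."""
    match = re.search(r'／全\s*(\d+)\s*件', counter_text)
    return int(match.group(1)) if match else 0

def page_count(total_items, items_per_page=20, page_limit=10):
    """Ceiling division, capped at page_limit as in the script."""
    pages = (total_items + items_per_page - 1) // items_per_page
    return min(pages, page_limit)

def split_publisher_release(text):
    """Split '<publisher>（YYYY/MM発売）' into (publisher, 'YYYY/MM')."""
    match = re.match(r'^(.*)（(\d{4}/\d{2})発売）$', text)
    return (match.group(1), match.group(2)) if match else (None, None)

total = parse_total('1～20件／全45件')  # made-up sample text
pages = page_count(total)
publisher, released = split_publisher_release('Example Press（2024/12発売）')
print(total, pages, publisher, released)
```

With 45 items and 20 items per page, the rounded-up division gives 3 pages; any total above 200 is capped at 10 pages.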

Overview and details of the scraping process

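No	Function / step	Role	Key points and cautions
1	`get_page`	Fetches the HTML from the given URL and prints an error message if the request fails.	Detects invalid URLs and handles request errors.
2	`get_total_books`	Gets the total number of books on the site.	Used at the start to work out how many pages to fetch.
3	`scrape_books`	Scrapes the book information (title, author, price, publisher, release date) from a page URL.	Depends on the HTML structure, so watch for changes to the site's HTML.
4	`generate_page_urls`	Builds the list of page URLs to scrape from the base URL and the maximum number of pages. Also detects an invalid base URL.	Produces a correct URL list and catches an invalid base URL.
5	`robots.txt` parsing	Fetches the `robots.txt` file and lists the paths where scraping is disallowed.	Needed to scrape in line with the site's policy.
6	Retry strategy	Sets how many times a failed request is retried (up to 5).	A sensible retry strategy with backoff keeps the load on the server down.
7	Concurrent processing	Scrapes several pages concurrently and collects the results, with error handling.	Makes scraping faster while keeping it reliable.
8	Saving to CSV	Saves the collected data to a CSV file. If there is no valid data, prints an error message without creating the file.	Keeps the saved data trustworthy and the save step stable.
9	Building an executable	Converts the Python script into an executable with `pyinstaller`.	Lets the script run without depending on the user's Python environment.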
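
One detail about row 7: `as_completed` yields results in the order the pages finish, so the rows in `books.csv` may not follow page order. If the order matters, `ThreadPoolExecutor.map` returns results in input order. The following is a minimal sketch that uses a stand-in function `fake_scrape` (hypothetical, no network access) in place of `scrape_books`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_scrape(page_url):
    """Stand-in for scrape_books: later pages finish sooner on purpose."""
    page = int(page_url.rsplit('=', 1)[1])
    time.sleep(0.05 * (4 - page))  # page 1 is the slowest
    return [[f'title-{page}']]

page_urls = [f'https://www.example.com/fake-path?p={n}' for n in range(1, 4)]

rows = []
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the order of page_urls, not completion order
    for page_rows in executor.map(fake_scrape, page_urls):
        rows.extend(page_rows)

print(rows)
```

Unlike the loop in the script, an exception raised for one page surfaces when its result is reached during iteration, so per-page error handling would have to live inside the worker function.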
