[Python] Running a scraper on a schedule with GitHub Actions

Last updated at 2025-01-21, posted at 2024-09-23

Introduction

This article builds a program that scrapes several websites and collects their data, run by GitHub Actions on a schedule trigger.

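Scraping

Data to collect
Each regional blood center's website publishes its current blood donation status. There are three donation methods (400 mL, 200 mL, and apheresis donation), and for each blood type (A, O, B, AB) the status is reported on a four-level scale: 「安心です」 (secure), 「心配です」 (concerned), 「困っています」 (in need), and 「非常に困っています」 (in urgent need). These are the values to scrape.

The scraping uses BeautifulSoup and requests, and the results are written to a CSV file.

Importing the libraries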

import csv

import requests
from bs4 import BeautifulSoup

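Storing the website URLs in a list
Each of the seven block blood centers (Hokkaido, Tohoku, Kanto-Koshinetsu, Tokai-Hokuriku, Kinki, Chugoku-Shikoku, and Kyushu) is given a url for its site and a code representing its region. The class name of the <p> tags to extract is identified with the browser's developer tools or a similar tool and stored under class.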

sites = [
    {'url':'https://www.bs.jrc.or.jp/hkd/hokkaido/index.html', 'code': 1, 'class': 'center-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/th/bbc/index.html', 'code': 2, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/ktks/bbc/index.html', 'code': 3, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/tkhr/bbc/index.html', 'code': 4, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/kk/bbc/index.html', 'code': 5, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/csk/bbc/index.html', 'code': 6, 'class': 'block-main-today-types-state'}, 
    {'url':'https://www.bs.jrc.or.jp/bc9/bbc/index.html', 'code': 7, 'class': 'block-main-today-types-state'}
]

Scraping logic
Using response.text produced garbled characters, so response.content is used instead.

for site in sites:
    response = requests.get(site['url'])
    soup = BeautifulSoup(response.content, "html.parser")
    
    p_tags = soup.find_all('p')

    target_p_tags = [tag for tag in p_tags if site['class'] in tag.get('class', [])]
    target_elements = [tag.text for tag in target_p_tags if tag.text.strip()]
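
The garbling most likely comes from how the bytes are decoded rather than from the page itself. As far as I know, requests falls back to ISO-8859-1 for text responses whose Content-Type header carries no charset, whereas passing response.content (raw bytes) lets BeautifulSoup detect the encoding itself. The following is a minimal sketch that reproduces the effect with plain bytes. It assumes the page is UTF-8, which I have not verified against the actual response headers of these sites.

```python
# Assumption: the page body arrives as UTF-8 encoded bytes.
raw = "非常に困っています".encode("utf-8")

# Decoding with the wrong codec produces garbled text (mojibake)...
garbled = raw.decode("iso-8859-1")
# ...while decoding with the right codec restores the original string.
restored = raw.decode("utf-8")

print(garbled == "非常に困っています")   # False
print(restored == "非常に困っています")  # True
```

If response.text is preferred, setting response.encoding = response.apparent_encoding before reading it is another common option.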

Writing to a CSV file
This adds CSV output to the three steps above (imports, URL list, scraping). The file used here is BloodStock.csv.

# Header row
header = ['block_code', '400-a', '400-o', '400-b', '400-ab', '200-a', '200-o', '200-b', '200-ab', 'com-a', 'com-o', 'com-b', 'com-ab']

# 'w' overwrites the file
with open('BloodStock.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)

# 'a' appends to the file
def write_to_csv(code, data, filename):
    """Append one row: the block code followed by the scraped statuses."""
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([code] + data)

for site in sites:
    response = requests.get(site['url'])
    soup = BeautifulSoup(response.content, "html.parser")

    p_tags = soup.find_all('p')

    target_p_tags = [tag for tag in p_tags if site['class'] in tag.get('class', [])]
    target_elements = [tag.text for tag in target_p_tags if tag.text.strip()]

    write_to_csv(site['code'], target_elements, 'BloodStock.csv')

The scraping results are now saved to the CSV file.
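
The header has 13 columns: the block code plus 12 statuses (3 donation methods x 4 blood types). If a site changes its markup and the class name no longer matches, the scraped list would be shorter and the row would be silently misaligned. The following is a minimal sketch of a check for that. build_row is a hypothetical helper that is not part of the original script, and the sample data is made up. Like the original, it assumes the elements arrive in the same order as the header.

```python
import csv
import io

HEADER = ['block_code', '400-a', '400-o', '400-b', '400-ab',
          '200-a', '200-o', '200-b', '200-ab',
          'com-a', 'com-o', 'com-b', 'com-ab']


def build_row(code, statuses):
    """Return one CSV row, or raise if the scrape did not yield 12 values.

    build_row is a hypothetical helper, not part of the original script.
    """
    expected = len(HEADER) - 1  # 3 donation methods x 4 blood types
    cleaned = [s.strip() for s in statuses]
    if len(cleaned) != expected:
        raise ValueError(
            f"block {code}: expected {expected} statuses, got {len(cleaned)}"
        )
    return [code] + cleaned


# Made-up sample data standing in for target_elements.
sample = ['安心です'] * 12

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerow(build_row(1, sample))
print(buffer.getvalue())
```

In the scraping loop, the row could then be built with build_row(site['code'], target_elements) before it is written, so that a layout change fails loudly instead of producing a shifted row.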

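Automating with GitHub Actions

The yml file that defines a GitHub Actions workflow goes in a workflows directory inside a .github directory at the repository root (that is, under .github/workflows/).

The yml file

name: the name of the workflow. Give it a name that is unique within the repository.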
```
name: Scrape
```
on: sets the trigger for the workflow.
```
on:
  workflow_dispatch:
  schedule:
    - cron: "0 15 * * 2,5"
```
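workflow_dispatch is the trigger for starting the workflow manually.
You can run it by clicking "Run workflow" in the repository's Actions tab.

schedule runs the workflow on a schedule defined by a cron expression. A cron expression has the following structure.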

| * | * | * | * | * |
|---|---|---|---|---|
| Minute | Hour | Day | Month | Day of week |
| 0 - 59 | 0 - 23 (UTC) | 1 - 31 | 1 - 12 | 0 - 6 (0 = Sunday) |

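Note that the time is in UTC (Coordinated Universal Time), which can be expressed as UTC = JST - 9 hours. UTC has largely replaced GMT, but for practical purposes the two show the same time. The goal here is to run at midnight JST every Wednesday and Friday. The expression "0 15 * * 2,5" fires at 15:00 UTC on Tuesday and Friday, which is 00:00 JST on Wednesday and Saturday, so the day-of-week field needs to be 2,4 (Tuesday and Thursday in UTC) for the second run to land on Friday in JST.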
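
Because subtracting nine hours can move a run to the previous day, the day-of-week field has to shift along with the hour. The following is a minimal sketch of that conversion. jst_to_utc_cron is a hypothetical helper written for illustration, and it uses the cron weekday numbering where 0 is Sunday. JST has no daylight saving time, so a fixed +9 hour offset is enough.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def jst_to_utc_cron(weekdays, hour, minute=0):
    """Build a UTC cron expression for a weekly time given in JST.

    weekdays use cron numbering: 0 = Sunday ... 6 = Saturday.
    """
    if not weekdays:
        raise ValueError("at least one weekday is required")
    utc_days = set()
    utc_hour = utc_minute = 0
    for cron_day in weekdays:
        # 2024-01-07 was a Sunday, so adding cron_day days lands on that weekday.
        local = datetime(2024, 1, 7, hour, minute, tzinfo=JST) + timedelta(days=cron_day)
        utc = local.astimezone(timezone.utc)
        utc_hour, utc_minute = utc.hour, utc.minute
        # isoweekday(): Monday = 1 ... Sunday = 7; "% 7" maps Sunday to 0.
        utc_days.add(utc.isoweekday() % 7)
    days = ",".join(str(d) for d in sorted(utc_days))
    return f"{utc_minute} {utc_hour} * * {days}"


# Wednesday (3) and Friday (5) at 00:00 JST
print(jst_to_utc_cron([3, 5], hour=0))  # 0 15 * * 2,4
```

Midnight JST falls on the previous day in UTC, which is why Wednesday and Friday become 2 and 4 rather than 3 and 5.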
jobs: defines the jobs.
```
jobs:
  Scrape:
    runs-on: ubuntu-latest
```
runs-on: sets the virtual environment that runs the job to the latest version of Ubuntu.

steps:

    steps:
        # Check out the GitHub repository
      - name: Checkout repository
        uses: actions/checkout@v2

        # Set up Python 3.9
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.9"

        # Install the libraries listed in requirements.txt
      - name: Install dependencies
        run: |
          pip install -r requirements.txt

        # Run robot.py
      - name: Run scraping script
        run: |
          python robot.py

        # Push the updated data to the develop branch
      - name: commit files
        run: |
          git config --global user.name "${{ github.actor }}"
          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
          git add BloodStock.csv
          git commit -m 'update BloodStock.csv'
          git push origin develop

Setting "${{ github.actor }}" and "${{ github.actor }}@users.noreply.github.com" makes the commit under the GitHub account of the user who triggered the workflow.

403 error

fatal: unable to access 'https://github.com/******/*****/': The requested URL returned error: 403
Error: Process completed with exit code 128.

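When I tried to run the workflow, the push was refused with a 403 error. It turned out that GitHub Actions did not have write permission on the repository, so I granted it.
In the Settings tab, go to Actions > General, change Workflow permissions to "Read and write permissions", and click Save. This resolved the error.

The workflow now runs on schedule and the CSV file is updated automatically.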

Conclusion

GitHub Actions made it easy to run the scraper automatically. Fetching a large amount of data at once can put a load on the server, so some care is needed.
The program created here is published in the following repository.
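
On the point about server load: one simple way to be gentler is to pause between consecutive requests. The following is a minimal sketch of that idea. fetch_politely, fetch, and delay_seconds are illustrative names that do not appear in the original script, which simply calls requests.get in a loop. A stand-in fetcher is used so the sketch runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    All names here are illustrative; the original script calls
    requests.get directly inside its for loop.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetcher so the sketch runs offline.
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

In the real script, fetch could be something like lambda url: requests.get(url, timeout=10), with the delay chosen to suit the target sites.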

References
