[Python] Running a Scraper on a Schedule with GitHub Actions

Last updated at 2024-09-23 / Posted at 2024-09-23

Introduction

This article builds a program that uses GitHub Actions with a schedule trigger to scrape several websites and collect their data.

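Scraping

The data to collect
Each regional blood center publishes its current blood donation status on its website. There are three donation types (400 mL, 200 mL, and apheresis donation), and for each blood type (A, O, B, AB) the status is reported on a four-level scale: 「安心です」 (adequate), 「心配です」 (a concern), 「困っています」 (short), and 「非常に困っています」 (critically short). This is the information we scrape.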
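
As a rough illustration of the shape of this data, the sketch below pairs 12 status strings with (donation type, blood type) keys. The identifiers `DONATION_TYPES`, `BLOOD_TYPES`, `STATUS_LEVELS`, and `label_readings` are hypothetical, the numeric levels are an arbitrary choice, and the ordering (donation type first, then blood type) is an assumption about how the values appear on the page.

```python
from itertools import product

# Illustrative labels; the ordering is an assumption about the page layout.
DONATION_TYPES = ["400mL", "200mL", "apheresis"]
BLOOD_TYPES = ["A", "O", "B", "AB"]

# The four status strings published on the sites, from best to worst.
# The numeric levels are an arbitrary encoding chosen for this sketch.
STATUS_LEVELS = {
    "安心です": 0,            # adequate
    "心配です": 1,            # a concern
    "困っています": 2,        # short
    "非常に困っています": 3,  # critically short
}


def label_readings(statuses):
    """Pair 12 status strings with (donation type, blood type) keys."""
    keys = list(product(DONATION_TYPES, BLOOD_TYPES))
    if len(statuses) != len(keys):
        raise ValueError(f"expected {len(keys)} statuses, got {len(statuses)}")
    return dict(zip(keys, statuses))


readings = label_readings(["安心です"] * 12)
print(len(readings))
```

Three donation types times four blood types gives the 12 values collected per site.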

We scrape with BeautifulSoup and requests, then write the results to a CSV file.

Importing the libraries

import csv

import requests
from bs4 import BeautifulSoup

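Storing the website URLs in a list

For the blood centers of the seven blocks (Hokkaido, Tohoku, Kanto-Koshinetsu, Tokai-Hokuriku, Kinki, Chugoku-Shikoku, and Kyushu), each site is given a url and a code that identifies the region. The class name of the <p> tags to extract is identified with the browser's developer tools or a similar tool and stored under class.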

sites = [
    {'url':'https://www.bs.jrc.or.jp/hkd/hokkaido/index.html', 'code': 1, 'class': 'center-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/th/bbc/index.html', 'code': 2, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/ktks/bbc/index.html', 'code': 3, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/tkhr/bbc/index.html', 'code': 4, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/kk/bbc/index.html', 'code': 5, 'class': 'block-main-today-types-state'},
    {'url':'https://www.bs.jrc.or.jp/csk/bbc/index.html', 'code': 6, 'class': 'block-main-today-types-state'}, 
    {'url':'https://www.bs.jrc.or.jp/bc9/bbc/index.html', 'code': 7, 'class': 'block-main-today-types-state'}
]
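
To show how the class entry is meant to be used, here is a minimal sketch that picks <p> tags by class from a small inline HTML string. The HTML is made up for illustration (the real pages' markup may differ), and `extract_states` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment; the real pages may be structured differently.
SAMPLE_HTML = """
<div>
  <p class="block-main-today-types-state">安心です</p>
  <p class="other">ignored</p>
  <p class="block-main-today-types-state is-alert">困っています</p>
  <p class="block-main-today-types-state">   </p>
</div>
"""


def extract_states(html, class_name):
    """Return the non-empty text of every <p> tag that carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (tag.get_text(strip=True) for tag in soup.find_all("p", class_=class_name))
    return [text for text in texts if text]


states = extract_states(SAMPLE_HTML, "block-main-today-types-state")
print(states)
```

Passing `class_` to `find_all` matches tags that have the class among several, so a tag with an extra class is still picked up.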

Scraping
Using response.text produced garbled characters (mojibake), so response.content is used instead.

for site in sites:
    response = requests.get(site['url'])
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Get every <p> tag
    p_tags = soup.find_all('p')
    # Keep only the tags that carry the class specified for this site
    target_p_tags = [tag for tag in p_tags if site['class'] in tag.get('class', [])]
    target_elements = [tag.text for tag in target_p_tags if tag.text.strip()]
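
One likely explanation for the garbled response.text (an assumption, not verified against these sites) is a charset mismatch: requests chooses the encoding for response.text from the Content-Type header and falls back to ISO-8859-1 for text responses that declare no charset, whereas response.content hands BeautifulSoup the raw bytes so that it can detect the encoding itself. The sketch below reproduces the effect with the standard library only; the byte string stands in for a page body.

```python
# Stand-in for a page body: UTF-8 bytes as a server might send them.
raw = "安心です".encode("utf-8")

# Decoding with the wrong charset (ISO-8859-1 here) produces mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```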

Writing to a CSV file
Add code that writes the scraping results to a CSV file to the three steps above. The file used here is named BloodStock.csv.

# Header row
header = ['block_code', '400-a', '400-o', '400-b', '400-ab', '200-a', '200-o', '200-b', '200-ab', 'com-a', 'com-o', 'com-b', 'com-ab']

# Write the header ('w' overwrites the file)
with open('BloodStock.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)

# Append the scraped data ('a' opens the file in append mode)
def write_to_csv(code, data, filename):
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([code] + data)

for site in sites:
    response = requests.get(site['url'])
    soup = BeautifulSoup(response.content, "html.parser")

    # Get every <p> tag
    p_tags = soup.find_all('p')
    # Keep only the tags that carry the class specified for this site
    target_p_tags = [tag for tag in p_tags if site['class'] in tag.get('class', [])]
    target_elements = [tag.text for tag in target_p_tags if tag.text.strip()]

    # Write the row to the CSV file
    write_to_csv(site['code'], target_elements, 'BloodStock.csv')

The scraping results are now saved to the CSV file.
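
If a site changes its markup, the number of extracted values may no longer match the 12 data columns of the header. A minimal sketch of a guard, assuming it is acceptable to skip such rows: `append_row` is a hypothetical helper, and an in-memory buffer replaces BloodStock.csv so the example runs without touching the file system.

```python
import csv
import io

HEADER = ['block_code', '400-a', '400-o', '400-b', '400-ab',
          '200-a', '200-o', '200-b', '200-ab',
          'com-a', 'com-o', 'com-b', 'com-ab']


def append_row(writer, code, data, expected=len(HEADER) - 1):
    """Write one row only if it has the expected number of status values."""
    if len(data) != expected:
        return False  # skip rows that would not line up with the header
    writer.writerow([code] + list(data))
    return True


buffer = io.StringIO()  # stands in for the CSV file
writer = csv.writer(buffer)
writer.writerow(HEADER)

written = append_row(writer, 1, ['安心です'] * 12)
skipped = append_row(writer, 2, ['安心です'] * 5)
print(buffer.getvalue())
```

Whether to skip, log, or fail on a mismatched row is a design choice; skipping keeps the CSV rectangular at the cost of silently missing a block.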

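Automating with GitHub Actions

The yml file that defines a GitHub Actions workflow goes in a workflows directory inside a .github directory at the repository root (that is, under .github/workflows/). The directory must be named workflows; the yml file itself can have any name.

The yml file

name: the name of the workflow. Give it a name that is unique within the repository.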
```
name: Scrape
```
on: sets the triggers for the workflow.
```
on:
  workflow_dispatch:
  schedule:
    - cron: "0 15 * * 2,5"
```
workflow_dispatch: a trigger for starting the workflow manually, which is handy for testing.
Open the Actions tab, select the workflow, and click "Run workflow" to run it.

schedule: this key and the cron entry under it run the workflow periodically; the timing is given as a cron expression.
A cron expression has the following structure:

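| Minute | Hour | Day of month | Month | Day of week |
| --- | --- | --- | --- | --- |
| * | * | * | * | * |
| 0 - 59 | 0 - 23 (UTC) | 1 - 31 | 1 - 12 | 0 - 6 (0 = Sunday) |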

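Note that the time is in UTC.
UTC (Coordinated Universal Time) is the world's standard time reference and is kept by atomic clocks. UTC = JST - 9 hours. For practical purposes it shows the same time as GMT, which it has largely replaced.
The goal here is to run at 00:00 JST every Wednesday and Friday. Because the day-of-week field is also evaluated in UTC, 00:00 JST on Wednesday and Friday is 15:00 UTC on Tuesday and Thursday, which is written "0 15 * * 2,4". Note that "0 15 * * 2,5", as used in the snippet above, fires at 15:00 UTC on Tuesday and Friday, that is, at 00:00 JST on Wednesday and Saturday.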
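
Since both the hour and the day-of-week fields are in UTC, the conversion from a JST schedule can be sketched as a small calculation. `jst_to_utc_cron` is a hypothetical helper written for this example; it only handles weekly schedules and uses cron's day numbering (0 = Sunday).

```python
def jst_to_utc_cron(minute, hour, weekdays):
    """Convert a weekly JST schedule to a UTC cron expression.

    weekdays are JST days in cron numbering (0 = Sunday ... 6 = Saturday).
    """
    utc_hour = (hour - 9) % 24
    # Subtracting 9 hours from a time before 09:00 lands on the previous day.
    day_shift = -1 if hour < 9 else 0
    utc_days = sorted((day + day_shift) % 7 for day in weekdays)
    days = ",".join(str(day) for day in utc_days)
    return f"{minute} {utc_hour} * * {days}"


# Wednesday (3) and Friday (5) at 00:00 JST
print(jst_to_utc_cron(0, 0, [3, 5]))  # 0 15 * * 2,4
```
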
jobs: defines the jobs.
```
jobs:
  Scrape:
    runs-on: ubuntu-latest
```
runs-on: sets the virtual environment the job runs on, here the latest Ubuntu image.

steps:

    steps:
      # Check out the GitHub repository
      - name: Checkout repository
        uses: actions/checkout@v2

      # Set up Python 3.9
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.9"

      # Install the libraries listed in requirements.txt
      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      # Run robot.py
      - name: Run scraping script
        run: |
          python robot.py

      # Push the updated data to the develop branch
      - name: commit files
        run: |
          git config --global user.name "${{ github.actor }}"
          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
          git add BloodStock.csv
          git commit -m 'update BloodStock.csv'
          git push origin develop

Setting user.name to "${{ github.actor }}" and user.email to "${{ github.actor }}@users.noreply.github.com" makes the commit under the GitHub account that triggered the run.

403 error

fatal: unable to access 'https://github.com/******/*****/': The requested URL returned error: 403
Error: Process completed with exit code 128.

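When the workflow ran, the push was refused with a 403 error.

It turned out that GitHub Actions did not have write permission on the repository, so write permission has to be granted.

In the Settings tab, open Actions > General. Changing Workflow permissions to "Read and write permissions" and clicking Save resolved the error.

The workflow now runs on schedule and the CSV file is updated automatically.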

Conclusion

GitHub Actions made it easy to run the scraper automatically. Fetching a large amount of data at once can put load on the server, so the request volume and frequency deserve care.
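
One simple way to reduce that load is to pause between requests. The sketch below is a hypothetical helper, not part of the original script: `fetch_politely` takes the fetch function as a parameter so the example runs without network access, and the one-second delay is an arbitrary choice. In the scraper, something like `lambda url: requests.get(url, timeout=10)` could be passed as `fetch`.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetch and sleep functions so the sketch runs offline and instantly.
pauses = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"body of {url}",
    delay_seconds=1.0,
    sleep=pauses.append,
)
print(pages)
```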
