Scraping a published CSV with GitHub Actions and publishing it on GitHub Pages

Posted at 2020-04-16, last updated at 2020-04-16

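Introduction

About this program

This program takes the open data (CSV) published by Gifu Prefecture and:
・scrapes it periodically with GitHub Actions,
・writes it out as a JSON file, unedited, as a plain array of dictionaries,
・pushes to the gh-pages branch when there are differences, and
・makes the JSON files directly accessible through GitHub Pages.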

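Background

This program was developed while building the Gifu Prefecture COVID-19 response site.
Other projects have published similar scripts, but they apply their own processing during the CSV-to-JSON conversion,
so using them as a reference required many modifications.
This program therefore keeps processing to a minimum and outputs the original CSV data as JSON essentially unchanged, which makes it easier for other developers to build on.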

Product

GitHub

JSON output on GitHub Pages

http://code-for-gifu.github.io/covid19-scraping/patients.json
http://code-for-gifu.github.io/covid19-scraping/testcount.json
http://code-for-gifu.github.io/covid19-scraping/callcenter.json
http://code-for-gifu.github.io/covid19-scraping/advicecenter.json
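
As a rough sketch of how a client might consume these files: each one is a JSON object with a `data` array of rows and a `last_update` timestamp, which is the shape returned by `import_csv_from` in this article. The helper names `parse_payload` and `fetch_payload`, and the sample column names, are hypothetical and not part of the project.

```python
import json
import urllib.request


def parse_payload(text):
    """Split a published JSON document into (rows, last_update)."""
    payload = json.loads(text)
    return payload["data"], payload["last_update"]


def fetch_payload(url):
    """Download one of the published JSON files and parse it (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_payload(response.read().decode("utf-8"))


# Sample document in the same shape; the column names are made up for illustration.
SAMPLE = '{"data": [{"No": "1", "age": "40s"}], "last_update": "2020-04-16T12:00:00+09:00"}'
rows, last_update = parse_payload(SAMPLE)
```

Calling `fetch_payload` with one of the URLs above should return the same kind of pair, assuming the files are still being served.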

Source CSV files

Gifu Prefecture Open Data
https://data.gifu-opendata.pref.gifu.lg.jp/dataset/c11223-001

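How to use

Running on GitHub

Starting

Fork the repository into your own account.
The action defined in .github/workflows/main.yml then starts automatically on a schedule (every 10 minutes).
It cannot be run manually.

Stopping

Delete .github/workflows/main.yml,
or
comment out the following three lines in .github/workflows/main.yml.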

main.yml

on:
  schedule:
    - cron: "*/10 * * * *"

Hosting on GitHub Pages

In Settings -> GitHub Pages -> Source, select the gh-pages branch.

For details, refer to the official GitHub Actions documentation.
https://help.github.com/ja/actions

Running locally

pip install -r requirements.txt
python3 main.py

The JSON files are generated in the data folder (./data).

Technical notes

Note: see the source code on GitHub for the complete code.

Python

Main

main.py

os.makedirs('./data', exist_ok=True)
for remotes in REMOTE_SOURCES:
    data = import_csv_from(remotes['url'])
    dumps_json(remotes['jsonname'], data)

Reads every CSV listed in a separate file (settings.py) and writes a JSON file for each.

Data definitions

settings.py

# External resource definitions
REMOTE_SOURCES = [
    {
        'url': 'https://opendata-source.com/source1.csv',
        'jsonname': 'source1.json',
    },
    {
        'url': 'https://opendata-source.com/source2.csv',
        'jsonname': 'source2.json',
    },
    {
        'url': 'https://opendata-source.com/source3.csv',
        'jsonname': 'source3.json',
    },
    {
        'url': 'https://opendata-source.com/source4.csv',
        'jsonname': 'source4.json',
    }
]

url: the link to the CSV to read
jsonname: the name of the JSON file to write

CSV loading

main.py

def import_csv_from(csvurl):
    request_file = urllib.request.urlopen(csvurl)
    # Give up (returning None) unless the server answered 200 OK
    if request_file.getcode() != 200:
        return

    # Decode the raw bytes, then turn the CSV text into a list of dicts
    f = decode_csv(request_file.read())
    datas = csvstr_to_dicts(f)
    # Use the Last-Modified response header as the data's update time
    timestamp = request_file.getheader('Last-Modified')

    return {
        'data': datas,
        # Converted to JST and serialized as ISO 8601
        'last_update': dateutil.parser.parse(timestamp).astimezone(JST).isoformat()
    }

The CSV is fetched with urllib.
data: holds the decoded CSV data as-is.
last_update: the file's last-modified date, taken from the Last-Modified response header.
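
The `last_update` conversion can be tried in isolation. The sketch below assumes `JST` is a fixed UTC+9 timezone (its definition is not shown in this article) and swaps `dateutil.parser.parse` for the standard library's `email.utils.parsedate_to_datetime`, which parses HTTP-style dates, so that it runs without extra packages. The function name `last_modified_to_jst` is hypothetical.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: JST is defined as a fixed UTC+9 offset.
JST = timezone(timedelta(hours=9), "JST")


def last_modified_to_jst(header_value):
    """Convert a Last-Modified header value to an ISO 8601 string in JST."""
    if header_value is None:
        # The server sent no Last-Modified header.
        return None
    return parsedate_to_datetime(header_value).astimezone(JST).isoformat()


print(last_modified_to_jst("Thu, 16 Apr 2020 03:00:00 GMT"))
# 2020-04-16T12:00:00+09:00
```

Returning `None` when the header is absent is a choice made for this sketch; the function shown in this article does not handle that case.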

CSV decoding

main.py

def decode_csv(csv_data):
    print('csv decoding')
    for codec in CODECS:
        try:
            csv_str = csv_data.decode(codec)
            print('ok:' + codec)
            return csv_str
        except UnicodeDecodeError:
            print('ng:' + codec)
            continue
    print('Appropriate codec is not found.')

Tries the codecs defined in a separate file, in order, and returns the first successful decoding.
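
A minimal sketch of the same try-each-codec idea, runnable on its own. The contents of `CODECS` are an assumption (the project's actual list lives in its settings file and is not shown here), and `decode_with_fallback` is a hypothetical variant that also reports which codec succeeded and raises instead of returning `None` when nothing fits.

```python
# Assumption: the project's CODECS setting is a list along these lines.
CODECS = ["utf-8", "shift_jis"]


def decode_with_fallback(raw, codecs=CODECS):
    """Return (text, codec) for the first codec that decodes raw cleanly."""
    for codec in codecs:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            # This codec does not fit the bytes; try the next candidate.
            continue
    raise ValueError("no candidate codec could decode the data")


text, used = decode_with_fallback("データ".encode("shift_jis"))
```

The order of the list matters: the first codec that decodes without error wins, even if a later one would have been the intended encoding.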

CSV to JSON data conversion

main.py

def csvstr_to_dicts(csvstr):
    datas = []
    rows = [row for row in csv.reader(csvstr.splitlines())]
    header = rows[0]
    for i in range(len(header)):
        for j in range(len(UNUSE_CHARACTER)):
            header[i] = header[i].replace(UNUSE_CHARACTER[j], '')

    maindatas = rows[1:]
    for d in maindatas:
        # Skip empty rows
        if d == []:
            continue
        data = {}
        for i in range(len(header)):
            data[header[i]] = d[i]
        datas.append(data)
    return datas

Converts the CSV string to a list of dicts.
Characters that cannot be used in keys are removed from the header by simple replacement.
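
The conversion can be sketched more compactly with `zip`. The list `UNWANTED_CHARACTERS` is a hypothetical stand-in for the project's `UNUSE_CHARACTER` setting, whose actual contents are not shown in this article, and `csv_text_to_dicts` is an illustrative name.

```python
import csv

# Hypothetical stand-in for the project's UNUSE_CHARACTER setting.
UNWANTED_CHARACTERS = ["\ufeff", " ", "\u3000"]


def clean_header(name, unwanted=UNWANTED_CHARACTERS):
    """Remove every unwanted character from a header cell."""
    for ch in unwanted:
        name = name.replace(ch, "")
    return name


def csv_text_to_dicts(csv_text):
    """Turn CSV text into a list of {header: value} dicts, skipping empty rows."""
    rows = [row for row in csv.reader(csv_text.splitlines()) if row]
    if not rows:
        return []
    header = [clean_header(h) for h in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


records = csv_text_to_dicts("No, Name\n1,a\n\n2,b")
```

One behavioral difference to be aware of: `zip` silently truncates a row that has fewer cells than the header, whereas indexing each row by header position raises an `IndexError`.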

JSON output

main.py

def dumps_json(file_name: str, json_data: Dict):
    with codecs.open("./data/" + file_name, "w", "utf-8") as f:
        f.write(json.dumps(json_data, ensure_ascii=False,
                           indent=4, separators=(',', ': ')))

A JSON dump helper that includes measures against garbled Japanese text (ensure_ascii=False and UTF-8 output).
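
To illustrate what `ensure_ascii=False` changes, here is a small sketch that compares the two settings and writes a file with the built-in `open`. The function name `dump_json` and the sample record are made up for illustration.

```python
import json
import os
import tempfile


def dump_json(path, payload):
    """Write payload as indented UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=4)


payload = {"data": [{"居住地": "岐阜市"}]}

escaped = json.dumps(payload)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(payload, ensure_ascii=False)  # Japanese text is kept as-is

# Write to a temporary folder and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    dump_json(path, payload)
    with open(path, encoding="utf-8") as f:
        written = f.read()
```

Both forms are valid JSON and load back to the same data; the unescaped form is simply easier to read when the file is opened directly in a browser.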

GitHub Actions

The workflow is defined in a yml file.

Schedule

main.yml

on:
  schedule:
    - cron: "*/10 * * * *"

Runs on a schedule, currently every 10 minutes.
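
To make the schedule expression concrete: in the minute field, `*/10` is a step value that matches every minute divisible by 10. The sketch below only expands that one field form and is not a full cron parser; `expand_step_field` is a hypothetical helper.

```python
def expand_step_field(field, minimum=0, maximum=59):
    """Expand a cron field of the form '*' or '*/N' into the values it matches."""
    if field == "*":
        return list(range(minimum, maximum + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(minimum, maximum + 1, step))
    raise ValueError("only '*' and '*/N' are handled in this sketch")


minute_field = "*/10 * * * *".split()[0]
print(expand_step_field(minute_field))
# [0, 10, 20, 30, 40, 50]
```

GitHub evaluates workflow schedules in UTC, and scheduled runs can start later than the nominal time when the service is busy, so the 10-minute interval is best treated as approximate.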

Running the Python script

main.yml

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run script
        run: |
          python main.py

The Python dependencies are listed in requirements.txt and installed automatically.
After installation, main.py is run automatically.

Pushing to gh-pages

main.yml

      - name: deploy
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./data
          publish_branch: gh-pages

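Automatically pushes the resulting JSON files to a specific branch.
secrets.GITHUB_TOKEN is the token GitHub Actions supplies to the workflow automatically, so the push is made with the repository's own credentials.
publish_dir sets the folder to publish; here it is the data folder where the JSON files are written.
publish_branch specifies the branch to push to.

References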

Hokkaido: Python scraping script - covid19hokkaido_scraping
https://github.com/Kanahiro/covid19hokkaido_scraping/blob/master/main.py
