
Scraping static web pages with Python 3

Last updated at 2022-10-30. Posted at 2022-10-29.

Development environment and prerequisites

macOS Ventura 13.0
Rancher Desktop 1.6.1
Prerequisite: the docker and docker-compose commands are available
A terminal, or Visual Studio Code

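What this does

Input: a text file listing links to web pages (see Note below)
Output: a text file containing the parsed information

Processing

Download the HTML of the web page listed on each line
Parse the HTML and extract the needed information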
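
The flow above can be sketched without any network access. This is only a rough illustration: `run_pipeline`, `fake_fetch`, and `extract_title` are hypothetical names that do not appear in the article, and the download step is replaced by a stub that returns canned HTML.

```python
import io
import re
from typing import Callable, Iterable, TextIO


def run_pipeline(links: Iterable[str],
                 fetch: Callable[[str], str],
                 extract: Callable[[str], str],
                 out: TextIO) -> None:
    """For each non-blank link: download HTML, extract one value, write one line."""
    for raw in links:
        link = raw.strip()
        if not link:
            continue  # skip blank lines in the input text
        html = fetch(link)        # step 1: download the HTML
        out.write(extract(html))  # step 2: extract the needed information
        out.write("\n")


# Hypothetical stand-ins so the sketch runs offline.
def fake_fetch(link: str) -> str:
    return f"<html><title>page for {link}</title></html>"


def extract_title(html: str) -> str:
    match = re.search(r"<title>(.*?)</title>", html)
    return match.group(1) if match else ""


buffer = io.StringIO()
run_pipeline(["https://example.com/a\n", "\n", "https://example.com/b\n"],
             fake_fetch, extract_title, buffer)
result = buffer.getvalue()
```

Passing the fetch and extract steps in as functions keeps the loop itself testable; a real run would swap in an HTTP download and an HTML parser.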

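Note

As an experiment, this article fetches the Nikkei Stock Average and the Dow Jones Industrial Average from Yahoo! Finance.
In general, before scraping web pages periodically, exhaustively, or in bulk, you need to confirm that doing so complies with domestic law and the site's terms of service.
For this article, the author has not checked whether Yahoo! Finance may be scraped periodically, exhaustively, or in bulk, and does not recommend doing so.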
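
One machine-readable signal that can be checked before fetching is a site's robots.txt. The sketch below uses the standard library's `urllib.robotparser` on made-up rules; `SAMPLE_ROBOTS_TXT`, `is_allowed`, and the `my-scraper` user agent are assumptions for illustration, and passing this check does not replace reading the terms of service or the relevant law.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would load the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/quote/1")
blocked = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/x")
```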

Directory structure

scraping-py3 % tree 
.
├── Dockerfile
├── docker-compose.yml
├── out
│   └── (scrape-result.txt is created here when the script runs.)
├── scraping-py3.code-workspace (Visual Studio Code settings file; not directly relevant.)
└── src
    ├── links.txt
    └── scrape.py3

Files

Dockerfile

FROM python:3.11-buster

RUN apt-get update
RUN pip install requests beautifulsoup4

RUN useradd app --create-home
USER app
WORKDIR /home/app

Compose file (version 3)

docker-compose.yml

services:
  scrape:
    build: 
      context: "."
      dockerfile: "./Dockerfile"
    container_name: "scrape-c"
    tty: true # keep container running
    volumes:
      - type: "bind"
        source: "./src"
        target: "/home/app/src"
      - type: "bind"
        source: "./out"
        target: "/home/app/out"

Web pages to scrape

links.txt

https://finance.yahoo.co.jp/quote/998407.O
https://finance.yahoo.co.jp/quote/^DJI
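
The second link contains `^`, which is not one of the characters a URL may carry unescaped. The article does not say how the HTTP client treats it, so the following is only a sketch of what the percent-encoded form looks like, using the standard library's `urllib.parse.quote`; `encode_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def encode_url(url: str) -> str:
    # Keep ":" and "/" so the scheme and path separators survive; every other
    # character outside letters, digits and "_.-~" is percent-encoded.
    return quote(url, safe=":/")


encoded = encode_url("https://finance.yahoo.co.jp/quote/^DJI")
unchanged = encode_url("https://finance.yahoo.co.jp/quote/998407.O")
```

Here `encoded` ends in `%5EDJI`, while the first link contains nothing that needs escaping and comes back as-is.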

scrape.py3

import requests
from bs4 import BeautifulSoup

WORKDIR = "/home/app"
MARKER = "ポートフォリオに追加"  # span text used as the anchor on the page

def main():
    print('Hello, scrape!')
    scrape()
    print('Bye, scrape!')

def scrape():
    # Read links.txt, dropping trailing newlines and blank lines
    with open(WORKDIR + "/src/links.txt") as f:
        links = [line.strip() for line in f if line.strip()]

    # Write scrape-result.txt, one line per link ("w" truncates any old result)
    with open(WORKDIR + "/out/scrape-result.txt", "w") as out:
        for link in links:
            out.write(parse(link))
            out.write('\n')

def parse(link: str) -> str:
    response = requests.get(link, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect the text of every <span>, then take the two spans that come
    # just before the marker span.
    span_texts = [elem.get_text() for elem in soup.find_all("span")]
    add_to_pf = span_texts.index(MARKER)
    parsed = span_texts[add_to_pf - 2] + " " + span_texts[add_to_pf - 1]

    # For inspecting the page structure
    # parsed = "\n".join(span_texts)

    return parsed

if __name__ == '__main__':
    main()
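
The extraction in parse relies on position: it finds the marker text among the span texts and takes the two entries just before it. A minimal sketch of that idea on a plain list, with explicit handling for a missing marker, might look like this. `values_before_marker` and the sample texts are made up for illustration and are not part of the article's script.

```python
from typing import Optional, Sequence


def values_before_marker(texts: Sequence[str], marker: str,
                         count: int = 2) -> Optional[str]:
    """Join the `count` items that precede `marker`, or return None if unavailable."""
    try:
        pos = texts.index(marker)
    except ValueError:
        return None  # marker not found, e.g. the page layout changed
    if pos < count:
        return None  # not enough items before the marker
    return " ".join(texts[pos - count:pos])


# Made-up span texts standing in for a real page.
sample = ["Menu", "Index A", "12,345.67", "ポートフォリオに追加", "Footer"]
found = values_before_marker(sample, "ポートフォリオに追加")
missing = values_before_marker(["Menu", "Footer"], "ポートフォリオに追加")
```

Checking the position matters because a negative index in Python silently wraps around to the end of the list, so a marker that appears too early would otherwise yield unrelated text instead of an error.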

Run

zsh terminal

docker-compose up --build

The container keeps running until you press Control+C.

Open another terminal and start bash inside the running container.

zsh terminal

docker exec -it scrape-c /bin/bash

bash in the container

python3 /home/app/src/scrape.py3

/home/app/out/scrape-result.txt is created.
