
Scraping with Python

Posted at 2020-08-15

I needed to scrape a web page with Python, so these are my notes.
The environment is set up with Docker.

Directory layout

$ ls
README.md  docker-compose.yaml  scraping
$ ls scraping/
Dockerfile  requirements.txt  scrap.py  scraping.py

docker-compose.yaml

version: '3.8'

services:
  scraping:
    build: ./scraping

Dockerfile

FROM python:latest

COPY . /work
WORKDIR /work

RUN apt-get update

# Install Beautiful Soup and requests
RUN pip install -U pip
RUN pip install -r requirements.txt

ENTRYPOINT ["python"]
CMD ["scrap.py"]

requirements.txt

beautifulsoup4
requests

Fetch the h1 elements as a quick test

scrap.py

import requests
from bs4 import BeautifulSoup

url = "https://www.yahoo.co.jp"
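# Set a timeout so the container does not hang on a stalled connection
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

soup = BeautifulSoup(response.text, "html.parser")

# Collect every <h1> element on the page
titles = soup.find_all("h1")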

for title in titles:
    print(title.text)

Output

$ docker-compose up --build
.
.
.
scraping_1  | Yahoo! JAPAN
scraping_1  | 検索
scraping_1  | JavaScriptの設定について
scraping_1  | 推奨ブラウザーについて
scraping_1  | お知らせ
scraping_1  | 主なサービス
scraping_1  | ニュース
scraping_1  | 主要 ニュース
scraping_1  | 天皇陛下「深い反省」今年も
scraping_1  | さびしい 戦地から届いた恋文
scraping_1  | 戦わず死者5000 見放された島
scraping_1  | 午後は災害級の暑さ 警戒を
scraping_1  | 元徴用工 文氏「日本と努力」
scraping_1  | ローソン 印紙購入横行の背景
scraping_1  | 速報交流試合 磐城vs.国士舘
scraping_1  | 履正社 一度も負けず夏終える
scraping_1  | 追悼式で黙とう
scraping_1  | 個人に関わる情報
scraping_1  | あなたのステータス
scraping_1  | 今日の日付
b-model_scraping_1 exited with code 0
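
The h1 extraction in scrap.py can also be tried without network access, which is handy when checking the parsing step on its own. The following is a minimal sketch, not part of the original setup: `SAMPLE_HTML` and `extract_h1_texts` are hypothetical names introduced for illustration, and the markup is made up. It uses `get_text(strip=True)` rather than `.text` so that surrounding whitespace is trimmed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Site title</h1>
    <div><h1>  News  </h1><p>Not a heading</p></div>
    <h2>Subheading</h2>
  </body>
</html>
"""


def extract_h1_texts(html):
    """Return the text of every <h1> element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all searches the whole tree, so nested <h1> tags are included.
    # get_text(strip=True) trims leading and trailing whitespace.
    return [h1.get_text(strip=True) for h1 in soup.find_all("h1")]


if __name__ == "__main__":
    for text in extract_h1_texts(SAMPLE_HTML):
        print(text)
```

Keeping the parsing in a function that takes an HTML string means the same code could be fed `response.text` from requests or a saved file, whichever is convenient.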
