
Scraping with Python


I needed to do some scraping with Python, so here are my notes.
The environment is set up with Docker.

Project structure
$ ls
README.md  docker-compose.yaml  scraping
$ ls scraping/
Dockerfile  requirements.txt  scrap.py  scraping.py
docker-compose.yaml
version: '3.8'

services:
  scraping:
    build: ./scraping
Dockerfile
FROM python:latest

COPY . /work
WORKDIR /work

RUN apt-get update

# Install the Python dependencies (Beautiful Soup and requests)
RUN pip install -U pip
RUN pip install -r requirements.txt

ENTRYPOINT ["python"]
CMD ["scrap.py"]
requirements.txt
bs4
requests

As a quick test, fetch the h1 elements.

scrap.py
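import requests
from bs4 import BeautifulSoup

url = "https://www.yahoo.co.jp"
# Fetch the page; the timeout keeps the container from hanging on a stalled connection.
response = requests.get(url, timeout=10)
# Stop with an HTTPError instead of parsing an error page.
response.raise_for_status()

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(response.text, "html.parser")

# Collect every <h1> element and print its text.
titles = soup.find_all("h1")

for title in titles:
    print(title.text)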
Result
$ docker-compose up --build
.
.
.
scraping_1  | Yahoo! JAPAN
scraping_1  | 検索
scraping_1  | JavaScriptの設定について
scraping_1  | 推奨ブラウザーについて
scraping_1  | お知らせ
scraping_1  | 主なサービス
scraping_1  | ニュース
scraping_1  | 主要 ニュース
scraping_1  | 天皇陛下「深い反省」今年も
scraping_1  | さびしい 戦地から届いた恋文
scraping_1  | 戦わず死者5000 見放された島
scraping_1  | 午後は災害級の暑さ 警戒を
scraping_1  | 元徴用工 文氏「日本と努力」
scraping_1  | ローソン 印紙購入横行の背景
scraping_1  | 速報交流試合 磐城vs.国士舘
scraping_1  | 履正社 一度も負けず夏終える
scraping_1  | 追悼式で黙とう
scraping_1  | 個人に関わる情報
scraping_1  | あなたのステータス
scraping_1  | 今日の日付
b-model_scraping_1 exited with code 0
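
The parsing step in scrap.py can also be pulled out into a small function, which makes it possible to try the extraction on a fixed HTML string without hitting the network. The following is only a minimal sketch: the function name `extract_headings`, its `tag` parameter, and the `SAMPLE_HTML` string are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_headings(html: str, tag: str = "h1") -> list[str]:
    """Return the stripped text of every <tag> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text(strip=True) trims surrounding whitespace from each heading.
    texts = (element.get_text(strip=True) for element in soup.find_all(tag))
    # Skip headings that contain no text at all.
    return [text for text in texts if text]


# Hypothetical sample HTML standing in for response.text.
SAMPLE_HTML = """
<html><body>
  <h1> Yahoo! JAPAN </h1>
  <h1>Search</h1>
  <h1></h1>
  <h2>Not a top-level heading</h2>
</body></html>
"""

if __name__ == "__main__":
    for heading in extract_headings(SAMPLE_HTML):
        print(heading)
```

In the real script, the same function could presumably be called as `extract_headings(response.text)` after the `requests.get` call.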