Scraping a Website with Python's BeautifulSoup

Posted at 2020-05-13

About this article

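As I wrote on Qiita before, I once wrote Java code to scrape a website.
Reading it again now, it meets the requirements, but the code can hardly be called clean.
It was embarrassing to look at, so I decided to rewrite it in Python, and this is my memo of that.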

There are already many similar articles on Qiita, but I am writing this one as a note to myself.

About BeautifulSoup

When scraping in Java I used a library called jsoup;
this time I use BeautifulSoup.

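BeautifulSoup is a Python library for parsing HTML and XML, and it is widely used for scraping.
It can extract elements from a page with CSS selectors, which makes it convenient for pulling out only the data you want.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/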
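
As a minimal sketch of the CSS-selector support mentioned above, the snippet below runs `select()` and `select_one()` against a small HTML fragment. The fragment and the selectors are made up for illustration and are not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for this illustration.
html = """
<div class="block">
  <dl>
    <dt>2019.08.04</dt>
    <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
dates = [dt.get_text(strip=True) for dt in soup.select("div.block dt")]

# select_one() returns the first match, or None if nothing matches.
link = soup.select_one("div.block dd > a")

print(dates)
print(link["href"], link.get_text(strip=True))
```

`find()` and `find_all()`, which the script later in this article uses, reach the same elements by tag name or class instead of a selector string.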

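Since it is a Python library, install it with pip.
The sample script below also uses requests to fetch the page; install it the same way if it is not already available.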

pip install beautifulsoup4

Usage example

As in my earlier article, I want to extract the date, title, and URL of each notice ("お知らせ") from the following page.

<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">お知らせ その3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">お知らせ その2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">お知らせ その1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>

The following code extracts the notices and prints them.

scraping.py

# -*- coding: utf-8 -*-
import sys

import requests
from bs4 import BeautifulSoup


def main():

    print("Scraping Program Start")

    # Send a GET request to the target URL and fetch the page content
    res = requests.get('http://www.example.com/news.html')

    # Parse the fetched HTML page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, "html.parser")

    # Extract the first element with class "block" as a whole
    block = soup.find(class_="block")
    if block is None:
        print("ERROR! Couldn't find the block element.")
        print("Scraping Program Abend")
        sys.exit(1)

    # Extract the dt elements (dates) and dd elements inside the block
    dt = block.find_all("dt")
    dd = block.find_all("dd")

    # Each date must have exactly one matching dd element
    if len(dt) != len(dd):
        print("ERROR! The number of DTs and DDs didn't match up.")
        print("Scraping Program Abend")
        sys.exit(1)

    for date_tag, item in zip(dt, dd):
        try:
            date = date_tag.text
            link = item.find("a")
            title = link.get_text()
            url = link.attrs['href']

            print("Got a news. Date:" + date + ", title:" + title + ", url:" + url)

        except (AttributeError, KeyError):
            # Raised when the dd has no <a> element or the <a> has no href
            print("ERROR! Couldn't get a news.")

    print("Scraping Program End")


if __name__ == "__main__":
    main()
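
The script above needs network access to a page that actually serves the sample HTML. As a hedged alternative for trying the extraction offline, the sketch below parses an inline copy of the sample page and pairs each `dt` with the `dd` that follows it. The names `SAMPLE_HTML` and `extract_notices` are hypothetical helpers for this illustration, and the sketch assumes every `dt` is immediately followed by its own `dd`, as in the sample. It also converts the date string to a `date` object, assuming the `YYYY.MM.DD` format shown in the sample.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Inline copy of the sample page so the sketch runs without network access.
SAMPLE_HTML = """
<body>
 <div class="section">
  <div class="block">
   <dl>
    <dt>2019.08.04</dt>
    <dd><a href="http://www.example.com/notice/0003.html">お知らせ その3</a></dd>
    <dt>2019.08.03</dt>
    <dd><a href="http://www.example.com/notice/0002.html">お知らせ その2</a></dd>
    <dt>2019.08.02</dt>
    <dd><a href="http://www.example.com/notice/0001.html">お知らせ その1</a></dd>
   </dl>
  </div>
 </div>
</body>
"""


def extract_notices(html):
    """Return a list of (date, title, url) tuples found inside div.block."""
    soup = BeautifulSoup(html, "html.parser")
    notices = []
    for dt in soup.select("div.block dt"):
        # Assumption: the dd belonging to this dt is its next dd sibling.
        dd = dt.find_next_sibling("dd")
        link = dd.find("a") if dd is not None else None
        if link is None or not link.has_attr("href"):
            continue  # skip entries without a usable link
        # Assumption: dates are written as YYYY.MM.DD, as in the sample page.
        date = datetime.strptime(dt.get_text(strip=True), "%Y.%m.%d").date()
        notices.append((date, link.get_text(strip=True), link["href"]))
    return notices


notices = extract_notices(SAMPLE_HTML)
for date, title, url in notices:
    print(date.isoformat(), title, url)
```

Returning tuples instead of printing inside the loop keeps the parsing step separate from the output, so the same function could be fed `res.text` from a real request.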

The expected output when running the code above is as follows.

Scraping Program Start
Got a news. Date:2019.08.04, title:お知らせ その3, url:http://www.example.com/notice/0003.html
Got a news. Date:2019.08.03, title:お知らせ その2, url:http://www.example.com/notice/0002.html
Got a news. Date:2019.08.02, title:お知らせ その1, url:http://www.example.com/notice/0001.html
Scraping Program End

Conclusion

Compared with the earlier version I wrote in Java with Spring Boot, it is nice that Python needs far less code.
If you spot any mistakes in the content, please let me know.
