Fetching JavaScript-rendered pages with Splash and BeautifulSoup

Posted at 2022-07-31

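BeautifulSoup alone cannot retrieve dynamic elements that are rendered by JavaScript, because it only parses the HTML it is given and never executes scripts.
However, if the request is routed through a Splash server, the response contains the HTML after JavaScript has run, and those elements become available.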

Conceptual diagram of Splash

The usual pattern with BeautifulSoup

code1.py
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # Source site
    url = "https://www.travel.co.jp/stay/city/kusatsu-100020271"
    # Send the request
    res = requests.get(url)
    # Parse the HTML
    soup = BeautifulSoup(res.content, "html.parser")

    # 0 elements are retrieved
    result_items = soup.select("div.result > div")
    print(f"{len(result_items) = }")

    for item in result_items:
        hotel_name = item.select_one("p.result_item_hotel_name a").text
        print(f"{hotel_name = }")

Elements that are rendered dynamically cannot be retrieved.

$ python code1.py 
len(result_items) = 0
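
The count is 0 because `requests` returns the HTML exactly as the server sent it, and an HTML parser does not run the scripts inside it. The sketch below illustrates this with a made-up page: the markup, the `item` class, and the helper names `ItemCounter` and `count_items` are assumptions for illustration, not the real site's HTML. It uses the standard library's `html.parser` module, which is the backend selected by `"html.parser"` in code1.py, so that the example has no dependencies.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and a script
# would fill it in only when a browser executes it.
STATIC_HTML = """
<div class="result"></div>
<script>
  document.querySelector(".result").innerHTML =
    '<div class="item">Hotel A</div>';
</script>
"""


class ItemCounter(HTMLParser):
    """Counts <div class="item"> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "div" and ("class", "item") in attrs:
            self.count += 1


def count_items(html: str) -> int:
    counter = ItemCounter()
    counter.feed(html)
    counter.close()
    return counter.count


if __name__ == "__main__":
    # The markup inside <script> is treated as text, not as elements,
    # so no item is found in the static HTML.
    print(count_items(STATIC_HTML))
```

The same markup placed directly in the document body would be counted, which is the situation Splash creates by executing the script before returning the HTML.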

Using Splash

Start Splash beforehand with docker-compose.

docker-compose.yml
splash:
  image: scrapinghub/splash
  ports:
    - 8050:8050

$ docker-compose up -d
Creating splash-bs4_splash_1 ... done
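
The container may need a moment before it accepts requests. As a minimal sketch, assuming Splash is reachable at `http://localhost:8050` as in the compose file above and that its `_ping` endpoint answers with HTTP 200 once it is ready, a script could wait for the server like this. The helper names `ping_splash` and `wait_for_splash` are hypothetical, and the standard library's `urllib` is used only to keep the sketch dependency-free.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ping_splash(base_url: str = "http://localhost:8050") -> bool:
    """Return True if Splash answers its _ping endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=3) as res:
            return res.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_for_splash(ping: Callable[[], bool] = ping_splash,
                    attempts: int = 10, delay: float = 1.0) -> bool:
    """Call ping() up to `attempts` times, sleeping `delay` seconds between tries."""
    for i in range(attempts):
        if ping():
            return True
        if i < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print("Splash ready:", wait_for_splash())
```

Passing the ping function as a parameter keeps the retry loop separate from the network call, so the loop can be exercised without a running container.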

code2.py
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    # Source URL
    url = "https://www.travel.co.jp/stay/city/kusatsu-100020271/"
    # Send the request through the Splash server
    res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 0.5})
    # Parse the HTML that was rendered after JavaScript execution
    soup = BeautifulSoup(res.content, "html.parser")

    # 10 elements are retrieved
    result_items = soup.select("div.result > div")
    print(f"{len(result_items) = }")

    for item in result_items:
        hotel_name = item.select_one("p.result_item_hotel_name a").text
        print(f"{hotel_name = }")

Because the response now contains the HTML after JavaScript has run, the elements can be retrieved.

$ python code2.py 
len(result_items) = 10
hotel_name = '草津温泉\u3000望雲'
hotel_name = '草津ナウリゾートホテル'
hotel_name = '草津温泉\u3000ホテル櫻井'
hotel_name = 'ホテル一井'
hotel_name = '草津ホテル'
hotel_name = '草津温泉\u3000ペンションヴァンベール'
hotel_name = '草津温泉\u3000お豆の小宿\u3000花いんげん'
hotel_name = '草津温泉\u3000益成屋旅館'
hotel_name = '湯畑草菴'
hotel_name = '草津温泉\u3000薬師の湯\u3000湯元館'
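
The `\u3000` in the output is the ideographic (full-width) space contained in some hotel names. It appears escaped because the `=` specifier in the f-string prints the `repr` of the value. If plain ASCII spaces are preferred, one possible approach is sketched below; `normalize_name` is a hypothetical helper, not part of the original scripts.

```python
def normalize_name(name: str) -> str:
    """Collapse any run of whitespace, including U+3000, into one ASCII space."""
    # str.split() with no arguments splits on all Unicode whitespace
    # and drops leading and trailing whitespace.
    return " ".join(name.split())


if __name__ == "__main__":
    print(normalize_name("草津温泉\u3000望雲"))  # 草津温泉 望雲
```

In code2.py this would be applied to the value taken from `.text` before printing.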