
[Python] Splash: a crawling browser as an alternative to Headless Chrome

Last updated at 2019-05-08, posted at 2019-04-06

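What is Splash?

Splash is a headless browser specialized for crawling, developed by Scrapinghub, the company behind Scrapy. With the scrapy-splash module it can also be used together with Scrapy. Its capabilities include:

Getting the HTML after JavaScript has run, page screenshots, and so on
Getting rendering information in HAR format
Processing multiple pages in parallel
Running your own JavaScript code in the page
Writing browsing scripts in Lua

among others.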

Installing Splash

1. Install Docker.

2. Pull the Docker image.

# Linux
sudo docker pull scrapinghub/splash
# Mac
docker pull scrapinghub/splash

3. Create and start the container.

# Linux
sudo docker run -it -p 8050:8050 scrapinghub/splash
# Mac
docker run -it -p 8050:8050 scrapinghub/splash

4. Open http://localhost:8050/ to confirm that Splash is running.

Note: press Ctrl-C to stop it.
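
The startup check can also be done from Python instead of a browser. The following is a minimal sketch, assuming Splash is listening on http://localhost:8050 and exposes its `_ping` endpoint, which is expected to answer with a JSON body containing `"status": "ok"`. The helper names `ping_url` and `is_splash_up` are hypothetical, and the standard library is used here only so the sketch has no extra dependencies.

```python
import json
import urllib.request


def ping_url(base_url):
    """Build the URL of Splash's _ping endpoint from its base URL."""
    return base_url.rstrip('/') + '/_ping'


def is_splash_up(base_url='http://localhost:8050', timeout=3.0):
    """Return True if Splash answers _ping with status 'ok'."""
    try:
        with urllib.request.urlopen(ping_url(base_url), timeout=timeout) as res:
            body = json.load(res)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(body, dict) and body.get('status') == 'ok'
```

Calling `is_splash_up()` right after `docker run` may return False for a moment while the container is still starting, so retrying a few times is probably sensible.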

HTTP API

With Splash, page information is easy to obtain through its HTTP API. A few examples are shown here; see the official documentation for the details.

render.html

Get the HTML of a page

import requests

url = 'https://qiita.com/derodero24'
res = requests.get('http://localhost:8050/render.html',
                   params={'url': url, 'wait': 0.5})

with open('test.html', 'wb') as f:
    f.write(res.content)

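Options (partial)

url: the URL to render
timeout: rendering timeout in seconds (default 30). The maximum allowed value is 90 seconds by default, but it can be overridden with the --max-timeout command-line option when starting Splash.
wait: time to wait after the page has loaded, in seconds (default 0). It must be smaller than timeout.
js_source: JavaScript code to execute in the page
viewport: browser viewport size, "<width>x<height>" (default "1024x768")
images: whether to load images, 0 or 1 (default 1)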
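
To show how these options combine, here is a small sketch that builds the query parameters for render.html and rejects a wait that is not smaller than timeout. `render_html_params` is a hypothetical helper, not part of Splash or requests, and the JavaScript snippet is only an illustration.

```python
from urllib.parse import urlencode


def render_html_params(url, wait=0.0, timeout=30.0, images=True, js_source=None):
    """Build query parameters for render.html, checking wait against timeout."""
    if wait >= timeout:
        raise ValueError('wait must be smaller than timeout')
    params = {'url': url, 'wait': wait, 'timeout': timeout,
              'images': 1 if images else 0}
    if js_source is not None:
        params['js_source'] = js_source
    return params


params = render_html_params(
    'https://qiita.com/derodero24',
    wait=0.5,
    images=False,  # skip image loading, which usually speeds up rendering
    js_source="document.title = 'changed by js_source';",
)
query = urlencode(params)  # the query string that would be sent

# The request itself would then be, for example:
# requests.get('http://localhost:8050/render.html', params=params)
```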

render.png

Get a screenshot in PNG format

import requests

url = 'https://qiita.com/derodero24'
res = requests.get('http://localhost:8050/render.png',
                   params={'url': url, 'wait': 1, 'viewport': '2560x1600', 'render_all': 1})

with open('test.png', 'wb') as f:
    f.write(res.content)

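Options (partial)

In addition to the render.html options, the following can be set:

width: image width in pixels; the image is resized with its aspect ratio preserved
height: image height in pixels; the image is cropped from the top of the page
render_all: whether to capture the whole page, 0 or 1 (default 0)
scale_method: how the image is rescaled when width is given, "raster" (pixel-wise) or "vector" (element-wise during rendering) (default "raster")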
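
When a request fails, Splash is expected to answer with an error body rather than an image, so it can be worth checking the bytes before saving them. The sketch below builds thumbnail parameters from width and height and checks for the standard PNG file signature. `thumbnail_params` and `is_png` are hypothetical helpers, and the default sizes are arbitrary.

```python
# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def thumbnail_params(url, width=320, height=240, wait=1.0):
    """Parameters for a small screenshot: resize to width, then crop to height."""
    return {'url': url, 'wait': wait, 'width': width, 'height': height}


def is_png(data):
    """Return True if the bytes look like a PNG image."""
    return data.startswith(PNG_SIGNATURE)


# Intended use, assuming Splash is running locally:
# res = requests.get('http://localhost:8050/render.png',
#                    params=thumbnail_params('https://qiita.com/derodero24'))
# if is_png(res.content):
#     with open('thumb.png', 'wb') as f:
#         f.write(res.content)
```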

render.jpeg

Get a screenshot in JPEG format

import requests

url = 'https://qiita.com/derodero24'
res = requests.get('http://localhost:8050/render.jpeg',
                   params={'url': url, 'wait': 1, 'width': 1000, 'quality': 100})

with open('test.jpg', 'wb') as f:
    f.write(res.content)

Options (partial)

In addition to the render.png options, the following can be set:

quality: JPEG image quality, 0 to 100 (default 75)

render.har

Get rendering information in HAR format

import requests

url = 'https://qiita.com/derodero24'
res = requests.get('http://localhost:8050/render.har',
                   params={'url': url, 'wait': 0.5})

with open('test.har', 'wb') as f:
    f.write(res.content)

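Options (partial)

In addition to the render.html options, the following can be set:

request_body: whether to include request contents in the HAR file, 0 or 1 (default 0)
response_body: whether to include response contents in the HAR file, 0 or 1 (default 0)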
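
A HAR file is plain JSON, so once it is saved it can be inspected with ordinary dictionary access. The sketch below lists the method, status, and URL of each recorded request, assuming the usual HAR layout (`log` → `entries` → `request` / `response`). `summarize_har` is a hypothetical helper and `sample_har` is made-up data standing in for `res.json()`.

```python
def summarize_har(har):
    """Return (method, status, url) for every entry in a HAR document."""
    entries = har.get('log', {}).get('entries', [])
    return [
        (e['request']['method'], e['response']['status'], e['request']['url'])
        for e in entries
    ]


# Made-up stand-in for the JSON returned by render.har.
sample_har = {
    'log': {
        'entries': [
            {'request': {'method': 'GET', 'url': 'https://example.com/'},
             'response': {'status': 200}},
            {'request': {'method': 'GET', 'url': 'https://example.com/app.js'},
             'response': {'status': 404}},
        ]
    }
}

for method, status, url in summarize_har(sample_har):
    print(method, status, url)
```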

render.json

Get the page title, HTML, screenshot image, and other information together as JSON
Note: image data is Base64-encoded

import requests
import base64

url = 'https://qiita.com/derodero24'
res = requests.get('http://localhost:8050/render.json',
                   params={'url': url, 'wait': 0.5, 'html': 1, 'jpeg': 1, 'height': 800})

data = res.json()

# Page title
print(data['title'])

# Save the HTML (explicit encoding so non-ASCII pages are written safely)
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(data['html'])

# Save the JPEG after decoding it from Base64
jpeg_data = base64.b64decode(data['jpeg'])
with open('test.jpg', 'wb') as f:
    f.write(jpeg_data)

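Options (partial)

In addition to the render.jpeg options, the following can be set:

html: whether to include the HTML, 0 or 1 (default 0)
png: whether to include a PNG screenshot, 0 or 1 (default 0)
jpeg: whether to include a JPEG screenshot, 0 or 1 (default 0)
har: whether to include HAR data, 0 or 1 (default 0)
iframes: whether to include information about iframes, 0 or 1 (default 0)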
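
Since png and jpeg are both optional, the response may contain either, both, or neither image field. As a rough sketch, the helper below decodes whichever of the two fields is present. `decode_images` is a hypothetical name, and `sample` is made-up data standing in for `res.json()`.

```python
import base64


def decode_images(data):
    """Decode whichever of the 'png' / 'jpeg' fields are present."""
    return {
        key: base64.b64decode(data[key])
        for key in ('png', 'jpeg')
        if key in data
    }


# Made-up stand-in for a render.json response requested with jpeg=1 only.
sample = {
    'title': 'example',
    'jpeg': base64.b64encode(b'\xff\xd8fake-jpeg-bytes').decode('ascii'),
}
images = decode_images(sample)
```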

execute

More flexible browsing with a Lua script

import requests

# Lua script that gets the page title
lua_source = '''
function main(splash, args)
  splash:go("https://qiita.com/derodero24")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end
'''

res = requests.get('http://localhost:8050/execute',
                   params={'lua_source': lua_source})
print(res.json()['title'])

Options (partial)

lua_source: the Lua browsing script
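
Instead of hard-coding the URL in the script, extra request parameters can be read inside Lua through the `args` table. The sketch below builds such a payload, on the assumption that it is sent as a JSON POST body so that `wait` stays a number. `build_execute_payload` is a hypothetical helper, and `url` and `wait` are parameter names chosen for this example rather than anything execute requires.

```python
import json

# The script reads its inputs from args instead of hard-coded values.
LUA_SOURCE = '''
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {title = splash:evaljs("document.title")}
end
'''


def build_execute_payload(url, wait=0.5):
    """Bundle the Lua script with the arguments it will read from args."""
    return {'lua_source': LUA_SOURCE, 'url': url, 'wait': wait}


payload = build_execute_payload('https://qiita.com/derodero24')
body = json.dumps(payload)  # the JSON body that would be sent

# The request itself would then be, for example:
# res = requests.post('http://localhost:8050/execute', json=payload)
# print(res.json()['title'])
```

Keeping the script constant and varying only the arguments makes it easier to reuse one script across many pages.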

run

Basically the same as execute, but with run the Lua script is automatically wrapped in function main(splash, args) ... end.

import requests

# Lua script that gets the page title
lua_source = '''
splash:go("https://qiita.com/derodero24")
splash:wait(0.5)
local title = splash:evaljs("document.title")
return {title=title}
'''

res = requests.get('http://localhost:8050/run',
                   params={'lua_source': lua_source})
print(res.json()['title'])
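
The relationship between the two endpoints can be illustrated by doing the wrapping by hand. The sketch below turns a run-style script body into the form execute expects. `wrap_for_execute` is a hypothetical helper that only mirrors the wrapping described above; it is not how Splash implements it internally.

```python
def wrap_for_execute(run_body):
    """Wrap a run-style script body so it can be sent to execute instead."""
    return 'function main(splash, args)\n' + run_body.strip('\n') + '\nend\n'


run_body = '''
splash:go("https://qiita.com/derodero24")
splash:wait(0.5)
local title = splash:evaljs("document.title")
return {title=title}
'''

execute_source = wrap_for_execute(run_body)
print(execute_source)
```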

References

https://splash.readthedocs.io/en/stable/install.html
https://github.com/scrapinghub/splash
