
Getting the equivalent of the Chrome DevTools Network tab with Python + Selenium

Posted at 2019-11-23, last updated at 2019-11-23

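#What I want to do
The Network tab of Chrome DevTools (the thing you open with Ctrl+Shift+I on Windows) is a fun tool in many ways: you can view a timeline of the data the browser fetched, simulate different connection speeds, and so on.

This time I will keep it simple and use Python + Selenium to get the list of URLs of the files shown in the Network tab.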

#Environment
Chrome 79.0.3945.45 beta
Python 3.7.3
selenium 3.141.0
chromedriver-binary 79.0.3945.36.0

Debian GNU/Linux 9
(Docker container)

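#Implementation

Up to the point of fetching the page with Selenium, the code looks like this.
Set options however you like (headless mode and so on).
The page is fetched with driver.get(); for the basics around this, the following excellent article was very helpful.

Python + Selenium で Chrome の自動操作を一通り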

netlogs.py

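from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import chromedriver_binary  # chromedriver-binary package: puts chromedriver on PATH

# NOTE: _headers (used below) is assumed to be a dict defined elsewhere with a "User-Agent" entry

# Enable the performance log; copy so the shared CHROME template is not mutated
caps = DesiredCapabilities.CHROME.copy()
caps["goog:loggingPrefs"] = {"performance": "ALL"}
# caps["loggingPrefs"] = {"performance": "ALL"}  # legacy name, needed in some environments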

# options
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--user-agent='+_headers["User-Agent"])

# get driver
driver = Chrome(options=options, desired_capabilities=caps)
driver.implicitly_wait(5)
driver.get("https://qiita.com/")

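The log that contains the URLs is named performance, so I configure DesiredCapabilities to collect that log¹
and pass it in when creating the driver².

Depending on the environment, there were cases where the capability name had to be "loggingPrefs" rather than "goog:loggingPrefs" for this to work.
Perhaps it differs by Chrome version...? (ChromeDriver 75 and later run in W3C mode by default, where the vendor-prefixed "goog:loggingPrefs" is the expected name, so the ChromeDriver version is a likely cause.)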
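To keep the two capability names from being scattered through a script, the choice can be made in one place. This is only a minimal sketch: `performance_logging_caps` and its `prefixed` flag are hypothetical names, not part of Selenium, and the `base` dictionary is a stand-in for `DesiredCapabilities.CHROME`.

```python
def performance_logging_caps(base_caps, prefixed=True):
    """Return a copy of base_caps with the performance log enabled.

    prefixed=True uses "goog:loggingPrefs"; prefixed=False uses the
    legacy "loggingPrefs" name that some older setups expect.
    (Hypothetical helper for illustration, not a Selenium API.)
    """
    caps = dict(base_caps)  # copy so the caller's template dict is not mutated
    key = "goog:loggingPrefs" if prefixed else "loggingPrefs"
    caps[key] = {"performance": "ALL"}
    return caps


# Stand-in for DesiredCapabilities.CHROME (assumed shape, for illustration only)
base = {"browserName": "chrome"}
caps = performance_logging_caps(base)
legacy_caps = performance_logging_caps(base, prefixed=False)
print(caps)
print(legacy_caps)
```

The returned dictionary would then be passed as `desired_capabilities=caps`, exactly as in the snippet above.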

netlogs.py

time.sleep(2)

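This waits for the page to finish loading.
Waiting with driver.implicitly_wait() is supposedly the standard approach, but
I could not reliably get the data I wanted that way, so I added a sleep (implicitly_wait() only controls how long element lookups wait, not network activity).
If there is a smarter way, please let me know...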
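One possible alternative to a fixed sleep is to keep polling the log until it has been quiet for a while. The sketch below is not from the referenced articles: `collect_until_idle` and its parameters are hypothetical names, and it assumes that each call to `driver.get_log("performance")` returns only the entries accumulated since the previous call. The demo uses a fake log source so that it runs without a browser.

```python
import time


def collect_until_idle(fetch_entries, idle_seconds=1.0, timeout=15.0, poll=0.2):
    """Accumulate log entries until none arrive for idle_seconds, or timeout expires.

    fetch_entries is any zero-argument callable returning a list of new entries.
    """
    collected = []
    start = last_activity = time.monotonic()
    while True:
        batch = fetch_entries()
        now = time.monotonic()
        if batch:
            collected.extend(batch)
            last_activity = now  # new traffic seen, restart the idle window
        if now - last_activity >= idle_seconds or now - start >= timeout:
            return collected
        time.sleep(poll)


# Demo with a fake log source standing in for driver.get_log("performance")
_batches = iter([[{"n": 1}], [], [{"n": 2}]])
entries = collect_until_idle(lambda: next(_batches, []), idle_seconds=0.3, poll=0.05)
print(len(entries))

# With a real driver, the call might look like:
# netLog = collect_until_idle(lambda: driver.get_log("performance"))
```

The idle window is a heuristic: pages that poll a server continuously never go quiet, which is why the overall timeout is kept as a second exit condition.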

netlogs.py

netLog = driver.get_log("performance")

The log returned by driver.get_log("performance") is in a JSON-like format and looks like this.

performance

[
    {'level': 'INFO', 'message': '{
            "message": {
                "method": "Page.frameResized",
                "params": {}
            },
            "webview": "***"
        }', 'timestamp': ***
    },
    {'level': 'INFO', 'message': '{

    ...
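Because each entry's 'message' field is itself a JSON string, it has to be decoded once more before the event name can be read. As a rough way to see which kinds of events a log contains, the methods can be counted. The entries below are hand-written in the same shape as the output above, with made-up values; `decode_entry` is a hypothetical helper name.

```python
import json
from collections import Counter


def decode_entry(entry):
    """Decode the JSON string in entry['message'] and return the inner event dict."""
    return json.loads(entry["message"])["message"]


def make_entry(method):
    """Build a fake log entry shaped like driver.get_log('performance') output."""
    payload = {"message": {"method": method, "params": {}}, "webview": "X"}
    return {"level": "INFO", "message": json.dumps(payload), "timestamp": 0}


# Made-up sample log for illustration
sample_log = [
    make_entry("Page.frameResized"),
    make_entry("Network.requestWillBeSent"),
    make_entry("Network.responseReceived"),
    make_entry("Network.responseReceived"),
]

method_counts = Counter(decode_entry(e)["method"] for e in sample_log)
for method, count in method_counts.most_common():
    print(method, count)
```

With a real driver, `sample_log` would be replaced by the list returned from `driver.get_log("performance")`.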

Next, extract only the parts we need from the performance log.

netlogs.py

import json

def process_browser_log_entry(entry):
    # entry['message'] is itself a JSON string; decode it and unwrap the inner 'message'
    response = json.loads(entry['message'])['message']
    return response

events = [process_browser_log_entry(entry) for entry in netLog]
events = [event for event in events if 'Network.response' in event['method']]

detected_url = []
for item in events:
    if "response" in item["params"]:
        if "url" in item["params"]["response"]:
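            detected_url.append(item["params"]["response"]["url"])

From the decoded "message" objects, I select only those whose "method" name contains Network.response (Network.responseReceived, Network.responseReceivedExtraInfo, and so on).
Then I pick out the items that have a "url" under "params" => "response" and store that URL in detected_url.
The extracted events are a collection of items like the following; note that some of them, such as the Network.responseReceivedExtraInfo item shown here, have no "response" key, which is why the code checks for it.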

network.response

[
    {
        "method": "Network.responseReceivedExtraInfo",
        "params": {
            "blockedCookies": [],
            "headers": {
                "cache-control": "max-age=0, private, must-revalidate",
                "content-encoding": "gzip",
                "content-type": "text/html; charset=utf-8",
                "date": "Sat, 23 Nov 2019 07:41:40 GMT",
                "etag": "W/\"***\"",
                "referrer-policy": "strict-origin-when-cross-origin",
                "server": "nginx",
                "set-cookie": "***",
                "status": "200",
                "strict-transport-security": "max-age=2592000",
                "x-content-type-options": "nosniff",
                "x-download-options": "noopen",
                "x-frame-options": "SAMEORIGIN",
                "x-permitted-cross-domain-policies": "none",
                "x-request-id": "***",
                "x-runtime": "***",
                "x-xss-protection": "1; mode=block"
            },
            "requestId": "***"
        }
    },
    {
    ...
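As a variation on the extraction above, the same loop could also collect the status code and MIME type, which gets a little closer to what the Network tab shows. This is a sketch under assumptions: `extract_responses` is a hypothetical name, the sample entries are made up (example.com URLs), and it matches only Network.responseReceived exactly and skips duplicate URLs, which differs slightly from the substring match used in this article.

```python
import json


def extract_responses(log_entries):
    """Return (url, status, mimeType) tuples from Network.responseReceived events."""
    seen = set()
    results = []
    for entry in log_entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        response = event.get("params", {}).get("response", {})
        url = response.get("url")
        if url and url not in seen:  # keep the first response per URL
            seen.add(url)
            results.append((url, response.get("status"), response.get("mimeType")))
    return results


def make_entry(method, params):
    """Build a fake log entry shaped like driver.get_log('performance') output."""
    return {"level": "INFO", "timestamp": 0,
            "message": json.dumps({"message": {"method": method, "params": params}})}


# Made-up sample data for illustration
sample_log = [
    make_entry("Network.responseReceivedExtraInfo", {"headers": {}, "requestId": "1"}),
    make_entry("Network.responseReceived", {"requestId": "1", "response": {
        "url": "https://example.com/", "status": 200, "mimeType": "text/html"}}),
    make_entry("Network.responseReceived", {"requestId": "2", "response": {
        "url": "https://example.com/app.js", "status": 200,
        "mimeType": "application/javascript"}}),
    make_entry("Network.responseReceived", {"requestId": "3", "response": {
        "url": "https://example.com/", "status": 200, "mimeType": "text/html"}}),
]

responses = extract_responses(sample_log)
for url, status, mime_type in responses:
    print(status, mime_type, url)
```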

##Full code

netlogs.py

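import json
import time

from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import chromedriver_binary  # chromedriver-binary package: puts chromedriver on PATH

# NOTE: _headers (used below) is assumed to be a dict defined elsewhere with a "User-Agent" entry

# Enable the performance log; copy so the shared CHROME template is not mutated
caps = DesiredCapabilities.CHROME.copy()
caps["goog:loggingPrefs"] = {"performance": "ALL"}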

options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--user-agent='+_headers["User-Agent"])

driver = Chrome(options=options, desired_capabilities=caps)
driver.implicitly_wait(5)
driver.get("https://qiita.com/")

time.sleep(2)

netLog = driver.get_log("performance")

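def process_browser_log_entry(entry):
    # entry['message'] is itself a JSON string; decode it and unwrap the inner 'message'
    response = json.loads(entry['message'])['message']
    return response

events = [process_browser_log_entry(entry) for entry in netLog]
# Keep only Network.response* events (responseReceived, responseReceivedExtraInfo, ...)
events = [event for event in events if 'Network.response' in event['method']]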

detected_url = []
for item in events:
    if "response" in item["params"]:
        if "url" in item["params"]["response"]:
            detected_url.append(item["params"]["response"]["url"])

#Another approach
It also seems possible to get equivalent information by executing a script³.

netlogs_js.py

scriptToExecute = "var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return JSON.stringify(network);"
netData = driver.execute_script(scriptToExecute)
netJson = json.loads(str(netData))

detected_url = []
for item in netJson:
    detected_url.append(item["name"])

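This approach also gave me the list of URLs.

However, the file I was looking for was sometimes missing, so it does not feel like a reliable method.
(I have not properly verified this.)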
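One thing to watch with this approach is that performance.getEntries() returns every kind of performance entry, including ones such as paint entries whose "name" is not a URL. A small filter on "entryType" might help; the sketch below uses a hypothetical function name (`resource_urls`) and a made-up JSON string standing in for the value returned by driver.execute_script().

```python
import json


def resource_urls(entries_json):
    """Return the names of navigation and resource entries from a JSON string."""
    entries = json.loads(entries_json)
    return [
        entry["name"]
        for entry in entries
        if entry.get("entryType") in ("navigation", "resource")
    ]


# Made-up stand-in for the string returned by driver.execute_script(scriptToExecute)
net_data = json.dumps([
    {"name": "https://example.com/", "entryType": "navigation", "duration": 120.5},
    {"name": "https://example.com/style.css", "entryType": "resource", "duration": 30.1},
    {"name": "first-paint", "entryType": "paint", "duration": 0},
])

detected_url = resource_urls(net_data)
print(detected_url)
```

This only cleans up the list; it does not address the missing-file issue described above.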

If there is a better way, please point it out!

¹ I used this as a reference (almost copy-paste): Selenium - python. how to capture network traffic's response [duplicate]
² Selenium API docs - selenium.webdriver.chrome.webdriver
³ This describes how to get the data by executing JS: Using Selenium how to get network request
