
[Python] Notes on saving web pages as PDF with Selenium

Posted at 2021-06-05

These notes describe how to use Python + Selenium to access multiple web pages and save each of them as a PDF.
- The procedure is carried out on Windows using Google Chrome.

Preparation

Install Selenium and the Chrome driver beforehand by following **this procedure**.
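
Before running the script, it can help to confirm that the installation actually succeeded. The following is a minimal sketch, not part of the original procedure: the helper names `installed_version` and `check_setup` are hypothetical, and the default driver path `./chromedriver.exe` is assumed from the code later in this article.

```python
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_setup(driver_path="./chromedriver.exe"):
    """Report the Selenium version and whether the driver binary exists.

    The default driver path is an assumption taken from the script in this article.
    """
    return {
        "selenium": installed_version("selenium"),
        "chromedriver_found": Path(driver_path).is_file(),
    }
```

Calling `check_setup()` from the folder that holds the script returns a small dictionary; a `None` version or a `False` flag suggests that the corresponding installation step still needs to be done.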

Processing flow

Prepare a list of the target web page URLs (urls.txt) like the following:
```
https://hogehoge.com/hoge.html
https://fugafuga.com/fuga.html
...
```
1. Access a URL listed in the file above.
2. Save the accessed page as a PDF.
3. Repeat steps 1 and 2 for each URL.
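
Reading the URL list is the one part of this flow that does not need a browser, so it can be sketched on its own. This is only an illustration: the function name `load_urls` is hypothetical, and skipping blank lines, duplicates, and lines that do not start with `http://` or `https://` is an assumption about how the file should be cleaned, not something the flow above requires.

```python
from pathlib import Path


def load_urls(path="urls.txt"):
    """Read one URL per line, dropping whitespace, blank lines, and duplicates.

    Lines that do not look like http(s) URLs are skipped (an assumption).
    """
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url.startswith(("http://", "https://")):
            continue
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls
```

Stripping each line matters because a trailing newline left on a URL is passed to the browser as part of the address.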

Code

Note: Do not run this against paid journals or similar sites.

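```python
import json
import time
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait

# Load the list of target URLs, dropping trailing newlines and blank lines
with open('urls.txt', mode='rt', encoding='utf-8') as f:
    urls = [line.strip() for line in f if line.strip()]

# Save pages as PDF through Chrome's print feature
options = webdriver.ChromeOptions()
# Print settings: select "Save as PDF" as the destination
appState = {
    "recentDestinations": [
        {
            "id": "Save as PDF",
            "origin": "local",
            "account": ""
        }
    ],
    "selectedDestinationId": "Save as PDF",
    "version": 2,
    "pageSize": 'A4'
}
# Chrome does not expand "~", so build an absolute path to the output folder
download_dir = str(Path.home() / 'Downloads')
# Apply the PDF print settings to the driver
options.add_experimental_option("prefs", {
    "printing.print_preview_sticky_settings.appState":
    json.dumps(appState),
    "download.default_directory": download_dir,
    # Folder that Chrome's save dialogs default to, including "Save as PDF"
    "savefile.default_directory": download_dir
})
# Print immediately without showing the print dialog
options.add_argument('--kiosk-printing')

# Selenium 4 style; Selenium 3 took the driver path as the first positional argument
with webdriver.Chrome(service=Service("./chromedriver.exe"), options=options) as driver:
    driver.implicitly_wait(10)
    # Explicit wait, used to block until the page reaches a given state
    wait = WebDriverWait(driver, 15)
    for url in urls:
        driver.get(url)
        # Wait until the document has finished loading
        wait.until(lambda d: d.execute_script('return document.readyState') == 'complete')
        # Print the page to PDF
        driver.execute_script('window.print()')
        # Give Chrome time to write the file
        time.sleep(10)
# Leaving the with block quits the driver, so no explicit driver.quit() is needed
```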
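
The fixed `time.sleep(10)` in the loop either wastes time on small pages or is too short for large ones. One possible refinement, sketched here under assumptions, is to poll the output folder until a new PDF appears. The helper name `wait_for_new_pdf` and its parameters are hypothetical, and treating any non-empty new `*.pdf` file as finished is a simplification: a file that is still being written could be returned early.

```python
import time
from pathlib import Path


def wait_for_new_pdf(directory, known, timeout=30.0, poll=0.5):
    """Poll `directory` until a non-empty PDF whose name is not in `known` appears.

    Returns the new file's Path, or None if the timeout expires first.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        for path in sorted(directory.glob("*.pdf")):
            if path.name not in known and path.stat().st_size > 0:
                return path
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the loop, one would record `known = {p.name for p in Path(folder).glob("*.pdf")}` before calling `window.print()` and then call `wait_for_new_pdf(folder, known)` in place of the sleep, where `folder` is the directory Chrome saves to.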

The PDF files are saved in the Downloads folder.
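
When the output path or file name needs to be chosen per URL, a different technique from the `window.print()` approach in this article is the Chrome DevTools Protocol command `Page.printToPDF`, which Selenium's Chrome driver exposes through `execute_cdp_cmd`. The following is a sketch under assumptions rather than a drop-in replacement: the function name `save_page_as_pdf` is hypothetical, `driver` is assumed to be a Chrome WebDriver (this command is typically used with headless Chrome), and the A4 dimensions are given in inches because that is the unit the command expects.

```python
import base64
from pathlib import Path

# A4 paper size in inches, the unit Page.printToPDF expects
A4_WIDTH_IN = 8.27
A4_HEIGHT_IN = 11.69


def save_page_as_pdf(driver, url, out_path):
    """Load `url` and write Chrome's PDF rendering of it to `out_path`."""
    driver.get(url)
    # The command returns a dict whose "data" field is the PDF, base64-encoded
    result = driver.execute_cdp_cmd("Page.printToPDF", {
        "paperWidth": A4_WIDTH_IN,
        "paperHeight": A4_HEIGHT_IN,
        "printBackground": True,
    })
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(base64.b64decode(result["data"]))
    return out_path
```

A loop such as `for i, url in enumerate(urls, 1): save_page_as_pdf(driver, url, f"pdf/page_{i:03d}.pdf")` would then produce predictably named files, and because the PDF bytes come back directly, no waiting on the download folder is involved.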

