
Scraping pages that use JavaScript (Python 3)


This approach downloads the HTML with a headless browser, then parses that HTML with beautifulsoup4.

Example of running the scripts

./fetch_html.py https://ekzemplaro.org/storytelling/ storytelling.html
#
./get_table.py storytelling.html
fetch_html.py
#! /usr/bin/python3
# ------------------------------------------------------------------
#
#   fetch_html.py
#
#                       Aug/23/2018
# ------------------------------------------------------------------
import sys

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as expected
from selenium.webdriver.support.wait import WebDriverWait

# ------------------------------------------------------------------
def file_write_proc(file_name, str_out):
    with open(file_name, mode='w', encoding='utf-8') as fp_out:
        fp_out.write(str_out)
#
# ------------------------------------------------------------------
sys.stderr.write("*** start ***\n")
url_target = sys.argv[1]
file_html = sys.argv[2]
sys.stderr.write("url_target = " + url_target + "\n")
#
options = Options()
options.add_argument('-headless')
driver = Firefox(executable_path='/usr/bin/geckodriver', options=options)
ttx = 100
wait = WebDriverWait(driver, timeout=ttx)
driver.get(url_target)
# wait until the table rendered by JavaScript is present in the DOM
wait.until(expected.presence_of_element_located((By.TAG_NAME, "table")))
driver.save_screenshot("out.png")
html = driver.page_source
driver.quit()

file_write_proc(file_html, html)
#
sys.stderr.write("*** end ***\n")
# ------------------------------------------------------------------
get_table.py
#! /usr/bin/python3
# -*- coding: utf-8 -*-
#
#   get_table.py
#
#                  Aug/23/2018
#
# ------------------------------------------------------------------
import sys
from bs4 import BeautifulSoup
# ------------------------------------------------------------------
def file_to_str_proc(file_in):
    str_out = ""
    try:
        with open(file_in, encoding='utf-8') as fp_in:
            str_out = fp_in.read()
    except Exception as ee:
        sys.stderr.write("*** error *** file_to_str_proc ***\n")
        sys.stderr.write(str(ee) + "\n")
#
    return str_out
# ------------------------------------------------------------------
def tr_proc(tr):
    no = ""
    title = ""
    tds = tr.find_all("td")
    if 2 <= len(tds):
        no = tds[0].get_text()
        title = tds[1].get_text()
#
    return no, title
# ------------------------------------------------------------------
file_html = sys.argv[1]
#
html = file_to_str_proc(file_html)
try:
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        no, title = tr_proc(tr)
        if no != "":
            print(no, title)
#
except Exception as ee:
    sys.stderr.write("*** error *** in BeautifulSoup ***\n")
    sys.stderr.write(str(ee) + "\n")
#
# ------------------------------------------------------------------
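If beautifulsoup4 is not available, the same row extraction can be sketched with the standard library's html.parser instead. This is a minimal sketch, not part of the original scripts; the class name TableParser and the sample HTML are made up for illustration:

```python
# Minimal sketch: extract (no, title) pairs from table rows using only
# the standard library, as an alternative to beautifulsoup4.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []      # <td> texts of the current row
        self.rows = []       # finished (no, title) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells = []
        elif tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and len(self.cells) >= 2:
            self.rows.append((self.cells[0], self.cells[1]))

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

html = "<table><tr><td>s001</td><td>Bremen Town Musicians</td></tr></table>"
parser = TableParser()
parser.feed(html)
print(parser.rows)
# → [('s001', 'Bremen Town Musicians')]
```

Unlike BeautifulSoup, html.parser is event-driven, so the row state has to be tracked by hand, but no third-party package is needed.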

Execution results

*** start ***
url_target = https://ekzemplaro.org/storytelling/
*** end ***
s001 ブレーメンの音楽隊
s002 ラプンツェル
s003 おおかみときつね
s004 おいしいおかゆ
s005 かしこいグレーテル
s006 ホレおばさん
s007 みつけどり
s008 ならずもの
s009 はちの女王
s010 いばらひめ
s011 青いあかり
s012 ヨリンデとヨリンゲル
s013 ねことねずみのともぐらし
s014 星の銀貨
s015 三人の糸つむぎ女
s016 金のがちょう
s017 ルンペルシュティルツヘン
s018 死神の名付け親
s019 おおかみと七ひきの子やぎ
s020 千枚皮
s021 ものしり博士
s022 あわれな粉やの若者とねこ
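Instead of printing, the extracted pairs can also be collected and written out as JSON with the standard library. A minimal sketch; the file name rows.json and the sample rows are hypothetical:

```python
import json

# Hypothetical sample: (no, title) pairs as tr_proc() would return them.
rows = [("s001", "ブレーメンの音楽隊"), ("s002", "ラプンツェル")]

# Convert to a list of dicts and write as UTF-8 JSON.
# ensure_ascii=False keeps the Japanese titles readable in the file.
data = [{"no": no, "title": title} for no, title in rows]
with open("rows.json", mode="w", encoding="utf-8") as fp:
    json.dump(data, fp, ensure_ascii=False, indent=2)
```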