
Scraping pages that use JavaScript (Python 3)


This approach downloads the HTML with a headless browser, then parses that HTML with beautifulsoup4.

Example of running the scripts

./fetch_html.py https://ekzemplaro.org/storytelling/ storytelling.html
#
./get_table.py storytelling.html
fetch_html.py
#! /usr/bin/python3
# ------------------------------------------------------------------
#
#   fetch_html.py
#
#                       Aug/23/2018
# ------------------------------------------------------------------
import sys

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as expected
from selenium.webdriver.support.wait import WebDriverWait

# ------------------------------------------------------------------
def file_write_proc(file_name, str_out):
    with open(file_name, mode='w', encoding='utf-8') as fp_out:
        fp_out.write(str_out)
#
# ------------------------------------------------------------------
sys.stderr.write("*** start ***\n")
url_target = sys.argv[1]
file_html = sys.argv[2]
sys.stderr.write("url_target = " + url_target + "\n")
#
options = Options()
options.add_argument('-headless')
driver = Firefox(executable_path='/usr/bin/geckodriver', options=options)
ttx = 100
wait = WebDriverWait(driver, timeout=ttx)
driver.get(url_target)
# wait until the table rendered by JavaScript is present in the DOM
wait.until(expected.presence_of_element_located((By.TAG_NAME, "table")))
driver.save_screenshot("out.png")
html = driver.page_source
driver.quit()

file_write_proc(file_html, html)
#
sys.stderr.write("*** end ***\n")
# ------------------------------------------------------------------
get_table.py
#! /usr/bin/python3
# -*- coding: utf-8 -*-
#
#   get_table.py
#
#                  Aug/23/2018
#
# ------------------------------------------------------------------
import sys
from bs4 import BeautifulSoup
# ------------------------------------------------------------------
def file_to_str_proc(file_in):
    str_out = ""
    try:
        with open(file_in, encoding='utf-8') as fp_in:
            str_out = fp_in.read()
    except Exception as ee:
        sys.stderr.write("*** error *** file_to_str_proc ***\n")
        sys.stderr.write(str(ee) + "\n")
#
    return str_out
# ------------------------------------------------------------------
def tr_proc(tr):
    no = ""
    title = ""
    tds = tr.find_all("td")
    if 2 <= len(tds):
        no = tds[0].get_text()
        title = tds[1].get_text()
#
    return no, title
# ------------------------------------------------------------------
file_html = sys.argv[1]
#
html = file_to_str_proc(file_html)
try:
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        no, title = tr_proc(tr)
        if no != "":
            print(no, title)
#
except Exception as ee:
    sys.stderr.write("*** error *** in BeautifulSoup ***\n")
    sys.stderr.write(str(ee) + "\n")
#
# ------------------------------------------------------------------
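If beautifulsoup4 is not available, the same row extraction can be sketched with the standard library's html.parser instead. This is a minimal sketch, not part of the original scripts; the class name TableParser and the sample HTML are made up for illustration:

```python
# Minimal sketch: extract (no, title) pairs from table rows using only
# the standard library, as an alternative to beautifulsoup4.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []      # <td> texts of the current row
        self.rows = []       # finished (no, title) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells = []
        elif tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and len(self.cells) >= 2:
            self.rows.append((self.cells[0], self.cells[1]))

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

html = "<table><tr><td>s001</td><td>Bremen Town Musicians</td></tr></table>"
parser = TableParser()
parser.feed(html)
print(parser.rows)
# → [('s001', 'Bremen Town Musicians')]
```

Unlike BeautifulSoup, html.parser is event-driven, so the row state has to be tracked by hand, but no third-party package is needed.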

Execution results

*** start ***
url_target = https://ekzemplaro.org/storytelling/
*** end ***
s001 ブレーメンの音楽隊
s002 ラプンツェル
s003 おおかみときつね
s004 おいしいおかゆ
s005 かしこいグレーテル
s006 ホレおばさん
s007 みつけどり
s008 ならずもの
s009 はちの女王
s010 いばらひめ
s011 青いあかり
s012 ヨリンデとヨリンゲル
s013 ねことねずみのともぐらし
s014 星の銀貨
s015 三人の糸つむぎ女
s016 金のがちょう
s017 ルンペルシュティルツヘン
s018 死神の名付け親
s019 おおかみと七ひきの子やぎ
s020 千枚皮
s021 ものしり博士
s022 あわれな粉やの若者とねこ
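Instead of printing, the extracted pairs can also be collected and written out as JSON with the standard library. A minimal sketch; the file name rows.json and the sample rows are hypothetical:

```python
import json

# Hypothetical sample: (no, title) pairs as tr_proc() would return them.
rows = [("s001", "ブレーメンの音楽隊"), ("s002", "ラプンツェル")]

# Convert to a list of dicts and write as UTF-8 JSON.
# ensure_ascii=False keeps the Japanese titles readable in the file.
data = [{"no": no, "title": title} for no, title in rows]
with open("rows.json", mode="w", encoding="utf-8") as fp:
    json.dump(data, fp, ensure_ascii=False, indent=2)
```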