
Crawler / Web Scraping Advent Calendar 2015

@_akisato(AKisato Kimura)

Easy Scraping with Python (JavaScript, Proxy, and Cookie Edition)

Last updated at 2019-05-06, posted at 2015-12-05

Good morning, everyone. This is @_akisato.

I am writing this as the day 6 article of the Crawler / Web Scraping Advent Calendar http://qiita.com/advent-calendar/2015/crawler .

Today I will show how to scrape web pages that cannot be loaded unless JavaScript and cookies are enabled.

The implementation is available on GitHub at https://github.com/akisato-/pyScraper .

First, scraping without any tricks

The procedure is (1) fetch the web page with requests and (2) scrape it with BeautifulSoup4. Python's standard HTML parser is not particularly good, so lxml is used here.
For the basics of BeautifulSoup4, see http://qiita.com/itkr/items/513318a9b5b92bd56185 , for example.

Installing the required packages

Use pip.

pip install requests
pip install lxml
pip install beautifulsoup4

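Source

The code looks like the following.
Given the URL of the page to scrape and an output file name, it writes the page title and description to that file in JSON format.
The function scraping is the core.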

scraping.py

import sys
import json
import requests
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # get the HTML response
    response = requests.get(url)
    html = response.content  # raw bytes; BeautifulSoup detects the encoding itself
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # already a str, so no .text
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        sys.exit(1)
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
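
The extraction in scraping() assumes that the page has both a title and a description meta tag; when one of them is missing, find returns None and the next attribute access raises an error. A minimal sketch of a more defensive variant follows, run on an inline HTML string so that no network access is needed. The names make_soup, extract_title_and_description, and SAMPLE_HTML are hypothetical and not part of the original script, and falling back to html.parser when lxml is not installed is an added assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


def extract_title_and_description(html):
    """Return the title and meta description, using None for missing parts."""
    soup = make_soup(html)
    title_tag = soup.find("title")
    meta_tag = soup.find("meta", attrs={"name": "description"})
    title = title_tag.get_text(strip=True) if title_tag else None
    description = meta_tag.get("content") if meta_tag else None
    return {"title": title, "description": description}


# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html><head>
<title>Example page</title>
<meta name="description" content="A page used for testing.">
</head><body><p>Hello</p></body></html>
"""

if __name__ == "__main__":
    print(extract_title_and_description(SAMPLE_HTML))
    print(extract_title_and_description("<html><head></head><body></body></html>"))
```

The returned dictionary has the same keys as the output dictionary in scraping(), so it could be passed to json.dump unchanged; a missing value is then written as null.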

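Supporting JavaScript

The number of web pages that cannot be viewed without JavaScript enabled has grown considerably. Accessing such a page with the code above returns only a page saying "Please enable JavaScript".

To handle such pages, replace the page fetching done with requests by a combination of Selenium and PhantomJS.
Selenium is a tool for automating browser operation, and PhantomJS is a Qt-based headless browser.¹

Installing PhantomJS

On Mac and Linux, it can be installed right away with a package manager such as brew or yum.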

Mac

brew install phantomjs

CentOS

yum install phantomjs

On Windows, download the binary from http://phantomjs.org/download.html , place it in a suitable location, and add that location to PATH.
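
After installing, it may help to confirm from Python that the phantomjs binary can be found on PATH, since Selenium looks it up there by default. The sketch below uses shutil.which from the standard library; the function name find_phantomjs is hypothetical.

```python
import shutil


def find_phantomjs(name="phantomjs"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_phantomjs()
    if path is None:
        print("phantomjs was not found on PATH")
    else:
        print("phantomjs found at", path)
```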

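Installing Selenium

This is quick with pip. Note that webdriver.PhantomJS, which the code in this article uses, was removed in Selenium 4, so a Selenium 3.x or earlier release is required.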

pip install selenium

Source

With Selenium and PhantomJS, the scraping code changes as follows.
Selenium sets up a PhantomJS web driver, and the HTML is fetched through that driver. The steps after fetching the web page need no changes.
To record the driver's log, replace os.path.devnull with a file name.

scraping_js.py

import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # Selenium settings
    driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
    # get the rendered HTML
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    driver.quit()  # stop the PhantomJS process once the HTML has been captured
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # already a str, so no .text
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_js.py [url] [output]")
        sys.exit(1)
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
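
A PhantomJS process started by Selenium keeps running until the driver's quit() is called, which also applies when an exception interrupts the function. One way to guarantee the cleanup is a small context manager, sketched below. The names managed_driver, fetch_html, and FakeDriver are hypothetical; FakeDriver is only a stand-in so that the sketch runs without a browser, and it mimics just the get, page_source, and quit members of a Selenium driver.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self):
        self.page_source = ""
        self.quit_called = False

    def get(self, url):
        # Pretend the page was rendered; a real driver would load the URL.
        self.page_source = "<html><head><title>{}</title></head></html>".format(url)

    def quit(self):
        self.quit_called = True


def fetch_html(url, factory):
    """Load url through a driver and return the rendered HTML."""
    with managed_driver(factory) as driver:
        driver.get(url)
        return driver.page_source


if __name__ == "__main__":
    print(fetch_html("http://example.com/", FakeDriver))
```

With Selenium installed, the factory could be something like lambda: webdriver.PhantomJS(service_log_path=os.path.devnull), leaving the rest of the function unchanged.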

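Supporting a proxy

Proxy settings can be passed as arguments to PhantomJS. The address below is a placeholder; replace it with the host and port of your proxy server.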

phantomjs_args = [ '--proxy=proxy.server.no.basho:0000' ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
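
If the proxy needs a type or credentials, the argument list can be assembled by a small helper, as in the sketch below. The helper name build_proxy_args and the host proxy.example.com are hypothetical. --proxy-type and --proxy-auth are PhantomJS command-line options that the original article does not use, so treat their use here as an assumption about your setup.

```python
def build_proxy_args(host, port, proxy_type=None, username=None, password=None):
    """Return the PhantomJS command-line arguments for a proxy server."""
    if not 0 < int(port) < 65536:
        raise ValueError("port must be between 1 and 65535")
    args = ["--proxy={}:{}".format(host, port)]
    if proxy_type is not None:
        # PhantomJS accepts http (its default), socks5, or none.
        if proxy_type not in ("http", "socks5", "none"):
            raise ValueError("unsupported proxy type: {}".format(proxy_type))
        args.append("--proxy-type={}".format(proxy_type))
    if username is not None:
        args.append("--proxy-auth={}:{}".format(username, password or ""))
    return args


if __name__ == "__main__":
    print(build_proxy_args("proxy.example.com", 8080))
    print(build_proxy_args("proxy.example.com", 1080, proxy_type="socks5",
                           username="user", password="secret"))
```

The returned list can be passed as service_args to webdriver.PhantomJS in the same way as phantomjs_args above.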

Supporting cookies

PhantomJS has cookies enabled by default. If you want to keep the cookie file locally, you can specify it as a PhantomJS argument.

phantomjs_args = [ '--cookies-file={}'.format("cookie.txt") ]  # the PhantomJS option is spelled --cookies-file
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)

Final version of the source

Covering all of the features gives the following.

scraping_complete.py

import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # Selenium settings
    phantomjs_args = [ '--proxy=proxy.server.no.basho:0000', '--cookies-file={}'.format("cookie.txt") ]  # placeholder proxy address
    driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
    # get the rendered HTML
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    driver.quit()  # stop the PhantomJS process once the HTML has been captured
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_complete.py [url] [output]")
        sys.exit(1)
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
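
The final listing hard-codes the proxy address and the cookie file name. A minimal sketch of taking them from the command line instead, using argparse from the standard library, is shown below. The option names --proxy and --cookies-file on the script itself, and the functions parse_args and to_phantomjs_args, are hypothetical and not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the URL, the output file, and optional PhantomJS settings."""
    parser = argparse.ArgumentParser(
        description="Scrape the title and description of a web page.")
    parser.add_argument("url", help="URL of the page to scrape")
    parser.add_argument("output", help="path of the JSON file to write")
    parser.add_argument("--proxy", default=None,
                        help="proxy as host:port (omit for a direct connection)")
    parser.add_argument("--cookies-file", default=None,
                        help="file in which PhantomJS keeps its cookies")
    return parser.parse_args(argv)


def to_phantomjs_args(options):
    """Translate the parsed options into the list passed as service_args."""
    phantomjs_args = []
    if options.proxy:
        phantomjs_args.append("--proxy={}".format(options.proxy))
    if options.cookies_file:
        phantomjs_args.append("--cookies-file={}".format(options.cookies_file))
    return phantomjs_args


if __name__ == "__main__":
    # An explicit argument list is used here so the sketch runs as-is.
    demo = parse_args(["http://example.com/", "out.json",
                       "--proxy", "127.0.0.1:8080"])
    print(demo.url, demo.output, to_phantomjs_args(demo))
```

Calling parse_args() without arguments reads sys.argv, so it could replace the manual len(argvs) check, and the list from to_phantomjs_args could be passed to scraping as an extra parameter.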

¹ Because PhantomJS is a browser, it can be replaced with a commonly used web browser such as IE, Firefox, or Chrome. For details, see the official documentation at http://docs.seleniumhq.org/docs/03_webdriver.jsp#selenium-webdriver-s-drivers .
