
A Story About Trying Web Scraping

Last updated at 2019-04-17, posted at 2018-08-18

Preface

These are a student's notes, recording the results of day-to-day study.

Environment

macOS 10.12.6
Python 3.7.0
beautifulsoup4 4.6.3

What I did

Used beautifulsoup to extract article URLs from Google News.

The code I ran:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request\
            .urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html,
                           parser)
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com/"
Scraper(news).scrape()

Book I referred to:
独学プログラマー Python言語の基本から仕事のやり方まで

Result

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1016, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1392, in connect
    server_hostname=server_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 412, in wrap_socket
    session=session
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 850, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1108, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 24, in <module>
    Scraper(news).scrape()
  File "scraper.py", line 11, in scrape
    .urlopen(self.site)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>

Discussion

This is an SSL-related error; perhaps certificate verification failed?
It is not an area I know well, so I decided to look into it.
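
To check that guess in code rather than by reading the traceback, one option is to inspect the `reason` attribute of the `URLError`. The following is a minimal sketch, not the method used in this article; `describe_failure` and `fetch` are hypothetical helper names, and `ssl.SSLCertVerificationError` assumes Python 3.7 or later.

```python
import ssl
import urllib.error
import urllib.request


def describe_failure(err):
    """Return a short label for a urllib.error.URLError (hypothetical helper)."""
    reason = err.reason
    if isinstance(reason, ssl.SSLCertVerificationError):
        return "certificate verification failed"
    if isinstance(reason, ssl.SSLError):
        return "other SSL error"
    return "non-SSL error"


def fetch(url):
    """Fetch url, printing the failure category before re-raising."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.URLError as err:
        print("request failed:", describe_failure(err))
        raise
```

Calling `fetch("https://news.google.com/")` in the environment above would presumably print the first label, since the traceback shows the `URLError` wrapping an `ssl.SSLCertVerificationError`.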

Handling the error

Following the comments I received, I changed the code as follows:

import urllib.request
import ssl
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

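    def scrape(self):
        # Workaround: disable certificate verification for every later HTTPS
        # request in this process. This is insecure; use it only for testing.
        ssl._create_default_https_context = ssl._create_unverified_context
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)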
        for tag in sp.find_all("a"):
            url = tag.get("href")
            if url is None:
                continue
            if "html" in url:
                print("\n" + url)

news = "https://news.google.com"
Scraper(news).scrape()

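Result

No error occurred, and the program ended peacefully without printing anything.

Reconsideration

Assuming the program itself ran as intended, I suspected that the scraping target
was not what I had expected, so I examined the source of Google News.
I found the following: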

No URL ended with ".html".
The links to articles looked auto-generated.

Come to think of it, I cannot remember seeing a URL that ends in "html" recently.
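
Since filtering on "html" matches nothing, a different filter is needed. The sketch below resolves each href against the page URL and keeps same-host links under a path prefix. It is only an illustration under assumptions: `resolve_links`, the `/articles/` prefix, and the sample hrefs are made up, and the real link layout of Google News should be checked in its page source.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs, path_prefix="/"):
    """Resolve hrefs against base_url; keep same-host links under path_prefix."""
    base_host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in hrefs:
        if not href:  # skip None and empty strings, as the original loop skips None
            continue
        absolute = urljoin(base_url, href)  # turns "./x" or "/x" into a full URL
        parts = urlparse(absolute)
        if parts.netloc != base_host:
            continue  # link points to another site
        if not parts.path.startswith(path_prefix):
            continue  # not under the section of interest
        if absolute not in seen:  # drop duplicates, keep first-seen order
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical href values, invented for illustration only.
sample = [
    "./articles/abc123",
    None,
    "https://example.org/x.html",
    "./articles/abc123",
    "/topics/foo",
]
print(resolve_links("https://news.example.com/", sample, path_prefix="/articles/"))
```

In the scraper, the `hrefs` argument would be the values returned by `tag.get("href")` in the loop over `sp.find_all("a")`.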

About _create_default_https_context

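From what I found, since Python 2.7.9 an error is apparently raised by default when a site's SSL certificate cannot be verified.
Reference: SSL証明書が正しくないサイトに対してPythonでアクセスする (Accessing a site with an invalid SSL certificate from Python)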

So ssl._create_default_https_context = ssl._create_unverified_context appears to have been there to work around that error.
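
Replacing `ssl._create_default_https_context` changes behavior for the whole process. A narrower alternative is to pass a context to a single `urlopen` call through its `context` parameter. The sketch below is one possible shape, not what this article ran; `build_context` and `fetch` are hypothetical names. The message "unable to get local issuer certificate" suggests the local CA store was not found, so supplying a CA bundle via `cafile` (for example, the path returned by the third-party certifi package, an assumption here) may allow verification to stay on. On python.org installers for macOS, running the bundled "Install Certificates.command" is often reported to fix this as well.

```python
import ssl
import urllib.request


def build_context(verify=True, cafile=None):
    """Return an SSLContext intended for a single request (hypothetical helper).

    cafile is an optional path to a CA bundle; None means the
    system default certificates are loaded.
    """
    if verify:
        return ssl.create_default_context(cafile=cafile)
    # Private helper used in the article; performs no certificate checks.
    return ssl._create_unverified_context()


def fetch(url, context):
    """Fetch url, applying the given context to this request only."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()


verified = build_context()
unverified = build_context(verify=False)
# Compare how the two contexts treat certificates and host names.
print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)
```

With this shape, other HTTPS requests in the same program keep the default verification.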
