
Frequent errors from BeautifulSoup's getText

Posted at 2018-10-14

Scraping fails intermittently

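In a previous article, "Automatically Sending News Articles to a Kindle Paper White" (Kindle Paper Whiteにニュース記事を自動転送してみた),
there is a part that scrapes each article's title and body,
but roughly one fetch in three fails with an AttributeError.
The code I first wrote looked like this.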

Scraping

        with urllib.request.urlopen(url) as response:
            html = response.read()

        # Decode the bytes as UTF-8; undecodable bytes are replaced
        html = html.decode('utf-8','replace')

        # Parse it to elements
        soup = BeautifulSoup(html, "html.parser")

        # Get title and content of article
        self.title = soup.find("h1").get_text()
        self.article = soup.find("div", {"itemprop":"articleBody"}).get_text()

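I wondered whether I was using BeautifulSoup incorrectly,
but the content should not change from one fetch to the next, so I went back and forth without getting anywhere.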

Comparing the response headers

So I compared the response headers from a successful fetch with those from a failed one.

On failure

200
OK
Server: nginx/1.15.5
X-Cache-Status: HIT
Cache-Control: max-age=0, public, s-maxage=180
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Date: Sat, 13 Oct 2018 02:05:29 GMT
Expires: Sat, 13 Oct 2018 02:03:31 GMT
Transfer-Encoding: chunked
Connection: close
Set-Cookie: X-Mapping-fjhppofk=D375F6D40F5119E7EE3B8D6F2D84ED5F; path=/
Last-Modified: Sat, 13 Oct 2018 02:03:31 GMT

On success

200
OK
Server: nginx/1.15.5
Cache-Control: max-age=0, public, s-maxage=180
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Oct 2018 02:05:28 GMT
Expires: Sat, 13 Oct 2018 02:05:28 GMT
Transfer-Encoding: chunked
Connection: close
Set-Cookie: X-Mapping-fjhppofk=EBCD18ED55EF4AE9B8B1DC281AE205B0; path=/
Set-Cookie: japantoday=977tqlh5ob1ebc9fffikjoskq0; expires=Sun, 14-Oct-2018 02:05:27 GMT; Max-Age=86400; path=/; domain=japantoday.com
Last-Modified: Sat, 13 Oct 2018 02:05:28 GMT

These are probably the relevant differences:

Content-Encoding: gzip
X-Cache-Status: HIT
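
The same comparison can be done mechanically. The following is a minimal sketch, assuming a hypothetical helper `diff_headers` (not part of the original code) and using only header names copied from the two listings above. Headers that appear in both responses and merely vary in value (Date, Expires, Set-Cookie, Last-Modified) are left out for brevity.

```python
def diff_headers(failed, succeeded):
    """Return the header names that appear in only one of the two responses."""
    only_failed = sorted(set(failed) - set(succeeded))
    only_succeeded = sorted(set(succeeded) - set(failed))
    return only_failed, only_succeeded


# Header values copied from the failed response above.
failed_headers = {
    'Server': 'nginx/1.15.5',
    'X-Cache-Status': 'HIT',
    'Cache-Control': 'max-age=0, public, s-maxage=180',
    'Content-Type': 'text/html; charset=UTF-8',
    'Content-Encoding': 'gzip',
    'Transfer-Encoding': 'chunked',
    'Connection': 'close',
}

# Header values copied from the successful response above.
succeeded_headers = {
    'Server': 'nginx/1.15.5',
    'Cache-Control': 'max-age=0, public, s-maxage=180',
    'Content-Type': 'text/html; charset=UTF-8',
    'Transfer-Encoding': 'chunked',
    'Connection': 'close',
}

only_failed, only_succeeded = diff_headers(failed_headers, succeeded_headers)
print(only_failed)
print(only_succeeded)
```

With a live `urllib` response, the header names could be taken from `response.headers.keys()` instead of hand-written dictionaries.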

When the fetch fails, the body arrives gzip-compressed.
That is probably the cause.
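
To illustrate why a compressed body would break the parsing step, here is a small sketch using only the standard library. The page content is a made-up stand-in for the real article HTML. It shows that decoding gzip bytes directly as UTF-8 with `'replace'`, as the first version of the code does, yields text in which the `<h1>` tag is no longer present, which would explain why `soup.find("h1")` has nothing to match.

```python
import gzip

# Hypothetical page body standing in for the real article HTML.
page = (
    '<html><body><h1>Title</h1>'
    + '<p>Some article text.</p>' * 20
    + '</body></html>'
).encode('utf-8')

# What a server sends when it answers with Content-Encoding: gzip.
compressed = gzip.compress(page)

# gzip streams begin with the magic bytes 0x1f 0x8b.
is_gzip = compressed[:2] == b'\x1f\x8b'

# Decoding the compressed bytes directly does not raise, because
# undecodable bytes are replaced, but the markup is gone.
garbled = compressed.decode('utf-8', 'replace')
h1_in_garbled = '<h1>' in garbled

# Decompressing first restores the original markup.
h1_in_restored = '<h1>' in gzip.decompress(compressed).decode('utf-8')

print(is_gzip, h1_in_garbled, h1_in_restored)
```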

Perhaps the response is gzipped when it is served from the cache?
(If anyone knows the details, please tell me.)

Telling the server through the request header that gzip is acceptable,
and then decompressing what is received, settles the matter.

Scraping with gzip (excerpt from JapanTodayArticle.py)

import gzip
import urllib.request

    def fetch(self):
        """
        fetch html data and parse title and content
        """
        headers = {'Accept-Encoding': 'gzip'}
        request = urllib.request.Request(url=self.url, headers=headers)

        with urllib.request.urlopen(request) as response:
            html = response.read()

        # Decompress gzip-fetched html
        html = gzip.decompress(html)
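
One caveat: `gzip.decompress` raises an error when the data is not gzip, so the excerpt relies on the server always honouring `Accept-Encoding: gzip`. A more defensive variant is sketched below, assuming a hypothetical helper `decode_body` that is not in the original repository. It decompresses only when the `Content-Encoding` header or the gzip magic bytes indicate a compressed body.

```python
import gzip


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Decompress raw bytes only if they are gzip-encoded, then decode to str."""
    # Trust the header first; fall back to the gzip magic bytes 0x1f 0x8b.
    if content_encoding == 'gzip' or raw[:2] == b'\x1f\x8b':
        raw = gzip.decompress(raw)
    # Undecodable bytes are replaced rather than raising an exception.
    return raw.decode(charset, 'replace')


# Both the plain and the compressed form decode to the same text.
sample = '<h1>Title</h1>'
plain_result = decode_body(sample.encode('utf-8'))
gzip_result = decode_body(gzip.compress(sample.encode('utf-8')), 'gzip')
print(plain_result == gzip_result)
```

Inside the `with` block of `fetch`, this could be called as `decode_body(response.read(), response.headers.get('Content-Encoding'))`.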

Automatically Sending News Articles to a Kindle Paper White (Kindle Paper Whiteにニュース記事を自動転送してみた)
All of the source code above is available on GitHub.
