
How to speed up BeautifulSoup instance creation

Posted at 2020-10-25

These are notes from improving the execution speed of an image search bot built with BeautifulSoup.
I hope they help anyone struggling with slow scraping.

The method

Environment

Python 3.7.9
BeautifulSoup 4.9.3

Script

Passing the appropriate character encoding to BeautifulSoup's from_encoding argument speeds up instance creation.

from urllib import request
import bs4

page = request.urlopen("https://news.yahoo.co.jp/")
html = page.read()
# Pass the character encoding of the site being scraped to from_encoding (utf-8 for Yahoo News in this case)
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")

How to find the character encoding

It is usually declared after charset= in a meta tag.

<!-- Yahoo News, for example -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
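
If you would rather not open the page source by hand, the declared charset can also be read out of the downloaded bytes. The following is a minimal sketch, not something BeautifulSoup itself provides: the function name find_meta_charset and the 4096-byte search limit are assumptions made for illustration, and the regular expression only covers the two common meta tag forms.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def find_meta_charset(html_bytes, search_limit=4096):
    """Return the charset declared in a <meta> tag, or None if none is found.

    Hypothetical helper: only the first `search_limit` bytes are inspected,
    on the assumption that the declaration sits near the top of the document.
    """
    match = _META_CHARSET.search(html_bytes[:search_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8"></head></html>'
)
print(find_meta_charset(sample))  # prints: utf-8
```

The returned value could then be passed as from_encoding, falling back to no argument when the function returns None.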

Comparing execution times

I verified this with the following script. The time is measured immediately before and after each instance is created.

verification_bs4.py

from urllib import request as req
from urllib import parse
import bs4
import time
import copy

url = "https://news.yahoo.co.jp/"
page = req.urlopen(url)
html = page.read()
page.close()

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, None)")

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, utf-8")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-8")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-8")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, utf-8)")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="utf-16")
# The encoding is wrong, so the resulting soup is empty
print('{:.5f}'.format(time.time() - start) + "[s] lxml, utf-16")

Here is the output.

% python verification_bs4.py
2.10937[s] html.parser, None
2.00081[s] lxml, None
0.04704[s] copy(lxml, None)
0.03124[s] html.parser, utf-8
0.03115[s] lxml, utf-8
0.04188[s] copy(lxml, utf-8)
0.01651[s] lxml, utf-16
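
The figures above come from a single run of each case, so they include whatever noise the machine had at that moment. As a hedged alternative, not what the original measurement used, the sketch below repeats a call several times with time.perf_counter() and keeps the shortest time. The helper name best_of and the stand-in workload are assumptions made for illustration.

```python
import time


def best_of(func, repeat=5):
    """Call func() `repeat` times and return the shortest elapsed time in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without network access or bs4.
# In the script above this would be something like:
#   lambda: bs4.BeautifulSoup(html, "html.parser", from_encoding="utf-8")
elapsed = best_of(lambda: sum(range(100_000)))
print("{:.5f}[s] stand-in workload".format(elapsed))
```

Taking the minimum rather than the mean is a design choice: it reports the run least disturbed by other processes, which makes the cases easier to compare.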

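Summary

Passing the character encoding to from_encoding sped up instance creation.
When I looked at the code of people who say BeautifulSoup is slow, they were not setting from_encoding, so I think that is the cause.

For those with time to spare

I was curious why it behaves this way, so I looked at the source code.
That said, I do not usually write Python, so I may be off the mark here.
The source code is here.

Why it is slow

The cause is probably the EncodingDetector class defined in bs4/dammit.py.
An excerpt of the code follows.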

class EncodingDetector:
    """Suggests a number of possible encodings for a bytestring.

    Order of precedence:

    1. Encodings you specifically tell EncodingDetector to try first
    (the override_encodings argument to the constructor).

    2. An encoding declared within the bytestring itself, either in an
    XML declaration (if the bytestring is to be interpreted as an XML
    document), or in a <meta> tag (if the bytestring is to be
    interpreted as an HTML document.)

    3. An encoding detected through textual analysis by chardet,
    cchardet, or a similar external library.

    4. UTF-8.

    5. Windows-1252.
    """
    @property
    def encodings(self):
        """Yield a number of encodings that might work for this markup.

        :yield: A sequence of strings.
        """
        tried = set()
        for e in self.override_encodings:
            if self._usable(e, tried):
                yield e

        # Did the document originally start with a byte-order mark
        # that indicated its encoding?
        if self._usable(self.sniffed_encoding, tried):
            yield self.sniffed_encoding

        # Look within the document for an XML or HTML encoding
        # declaration.
        if self.declared_encoding is None:
            self.declared_encoding = self.find_declared_encoding(
                self.markup, self.is_html)
        if self._usable(self.declared_encoding, tried):
            yield self.declared_encoding

        # Use third-party character set detection to guess at the
        # encoding.
        if self.chardet_encoding is None:
            self.chardet_encoding = chardet_dammit(self.markup)
        if self._usable(self.chardet_encoding, tried):
            yield self.chardet_encoding

        # As a last-ditch effort, try utf-8 and windows-1252.
        for e in ('utf-8', 'windows-1252'):
            if self._usable(e, tried):
                yield e
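
The excerpt above only yields candidate encodings; something else has to try them one by one. To make that flow concrete, here is a minimal standalone sketch of a candidate list being consumed until a decode succeeds. It is not BeautifulSoup's actual implementation: the function names and the simplified arguments are assumptions made for illustration, and the byte-order-mark step from the excerpt is left out.

```python
def candidate_encodings(override_encodings, declared_encoding, detected_encoding):
    """Yield candidate encodings in priority order, skipping None and duplicates."""
    tried = set()
    ordered = list(override_encodings) + [
        declared_encoding,   # e.g. taken from a <meta> tag
        detected_encoding,   # e.g. guessed by a detection library
        "utf-8",             # last-ditch fallbacks
        "windows-1252",
    ]
    for name in ordered:
        if name is None:
            continue
        key = name.lower()
        if key not in tried:
            tried.add(key)
            yield key


def decode_with_fallback(data, candidates):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    raise ValueError("no candidate encoding could decode the data")


data = "価格".encode("shift_jis")
text, used = decode_with_fallback(data, candidate_encodings([], "shift_jis", None))
print(used, text)
```

Because the candidates come from a generator, later and potentially expensive steps are only reached when every earlier candidate has failed.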

Summarizing the docstring at the top of the class (originally run through DeepL), candidate encodings are suggested in the following order of precedence:

    1. Encodings you explicitly tell EncodingDetector to try first
    (the override_encodings constructor argument).

    2. An encoding declared in the bytestring itself: in the XML
    declaration when it is interpreted as XML, or in a <meta> tag
    when it is interpreted as HTML.

    3. An encoding detected through textual analysis by chardet,
    cchardet, or a similar external library.

    4. UTF-8.

    5. Windows-1252.

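Judging from the comment and the code, I think it is slow because items 1 to 5 above are tried in order until one succeeds.
Item 2 shows that it also guesses the encoding from the meta tag mentioned earlier, so I assume this is a convenience that lets you use the library without reading the site's source to find the encoding.
Still, when scraping you usually look at the page source anyway, so if it costs this much time I feel it is unnecessary.
(I have not verified which step is the actual bottleneck, so I would appreciate it if someone could check.)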
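
One way to investigate the open question of which step is the bottleneck would be to profile the constructor call. The sketch below uses the standard library's cProfile and pstats; the helper name profile_call is an assumption, and a stand-in workload is used so that it runs without bs4 or network access. It shows how such a measurement could be set up and makes no claim about what the result would be.

```python
import cProfile
import io
import pstats


def profile_call(func, top=10):
    """Run func() under cProfile and return a report of the costliest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so that slow call chains appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Stand-in workload; to investigate the slowdown one could instead pass
#   lambda: bs4.BeautifulSoup(html, "html.parser")
report = profile_call(lambda: sorted(range(50_000), reverse=True))
print(report)
```

If functions from bs4/dammit.py or an external detection library dominated the cumulative column, that would support the guess above; if not, the cause lies elsewhere.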

Why copy is fast

The timing script above duplicates the instance with copy.copy(). The reason this is fast lies in __copy__ in bs4/__init__.py.
An excerpt of the code follows.

__init__.py

class BeautifulSoup(Tag):

    def __copy__(self):
        """Copy a BeautifulSoup object by converting the document to a string and parsing it again."""
        copy = type(self)(
            self.encode('utf-8'), builder=self.builder, from_encoding='utf-8'
        )

        # Although we encoded the tree to UTF-8, that may not have
        # been the encoding of the original markup. Set the copy's
        # .original_encoding to reflect the original object's
        # .original_encoding.
        copy.original_encoding = self.original_encoding
        return copy

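The encoding is hard-coded to utf-8 here, which is why it is fast.
Conversely, it gets slower when the site being scraped uses an encoding other than utf-8.
The following measurement script uses kakaku.com, whose encoding is shift-jis.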

verification_bs4_2.py

from urllib import request as req
from urllib import parse
import bs4
import time
import copy

url = "https://kakaku.com/"
page = req.urlopen(url)
html = page.read()
page.close()

start = time.time()
soup = bs4.BeautifulSoup(html, "html.parser")
print('{:.5f}'.format(time.time() - start) + "[s] html.parser, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, None")

start = time.time()
soup = bs4.BeautifulSoup(html, "lxml", from_encoding="shift_jis")
print('{:.5f}'.format(time.time() - start) + "[s] lxml, shift_jis")

start = time.time()
hoge = copy.copy(soup)
print('{:.5f}'.format(time.time() - start) + "[s] copy(lxml, shift_jis)")

Here is the output.

% python verification_bs4_2.py
0.11084[s] html.parser, None
0.08563[s] lxml, None
0.08643[s] lxml, shift_jis
0.13631[s] copy(lxml, shift_jis)

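As shown above, copy is slower than in the utf-8 case. However, with shift-jis the execution speed barely changes even when nothing is passed to from_encoding. ~~At this point I have no idea what is going on.~~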

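Closing

Thank you for reading this far! I apologize that the last part got a bit sloppy.
I do wonder whether it is acceptable for this to be slow when more than 90% of websites worldwide use utf-8. I wrote this article because it bothered me that the sites ranking at the top of search results for BeautifulSoup do not mention this.
If you found it useful, an "LGTM" would be encouraging.

References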
https://stackoverrun.com/ja/q/12619706
