
Causes of and fixes for the BeautifulSoup error "Couldn't find a tree builder"

Posted at 2019-06-29

Problem

In Python 3, calling BeautifulSoup 4 like this:

soup = BeautifulSoup(html, "lxml")

can raise the following error:

Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

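Cause 1: the parser is not installed

Reference: Installing a parser - Beautiful Soup 4.4.0 Documentation

BeautifulSoup supports several parsers, including html.parser, lxml, and html5lib, but only html.parser is available out of the box, because it ships with Python's standard library. lxml and the others are not dependencies of BeautifulSoup, so they have to be installed separately.

Fix

Install the lxml package.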

$ pip install lxml
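
When installing lxml is not always possible, one defensive option is to check which parser packages are importable and fall back to html.parser. The following is a minimal sketch using only the standard library; pick_parser and its preferred argument are hypothetical names introduced for illustration, not part of BeautifulSoup.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in "html.parser".

    Hypothetical helper: the names in `preferred` are treated as top-level
    package names and checked with importlib, without importing them.
    """
    for name in preferred:
        # find_spec returns None when no top-level module of that name is found.
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


if __name__ == "__main__":
    # Prints the parser name that could be passed to BeautifulSoup(html, ...).
    print(pick_parser())
```

The returned string can be passed as the second argument to BeautifulSoup, for example BeautifulSoup(html, pick_parser()).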

Cause 2: a bug in BeautifulSoup

<project root>
├───foo.py
└───lib
    ├───bs4
    └───lxml

With a directory layout like the one above, calling BeautifulSoup from foo.py raises the same error. Unlike the previous case, selecting the default html.parser does not fix it.

foo.py

from lib.bs4 import BeautifulSoup

html = ...
soup = BeautifulSoup(html, "html.parser")

The culprit is this part of the BeautifulSoup source code:

def register_treebuilders_from(module):
    """Copy TreeBuilders from the given module into this module."""
    # I'm fairly sure this is not the best way to do this.
    this_module = sys.modules['bs4.builder']

    for name in module.__all__:
        obj = getattr(module, name) 

        if issubclass(obj, TreeBuilder):
            setattr(this_module, name, obj)
            this_module.__all__.append(name)
            # Register the builder while we're at it.
            this_module.builder_registry.register(obj)

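The function is meant to copy the parsers defined in module into this_module. However, the module name bs4.builder is hardcoded, so when bs4 is not on PYTHONPATH (that is, when it is imported as lib.bs4 rather than bs4), this_module ends up empty.

As a result, when the BeautifulSoup constructor looks up the requested parser, nothing has been registered in the dictionary, and the error is raised.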
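
The naming mismatch behind this can be reproduced without bs4. The sketch below is a toy illustration, not BeautifulSoup's code: it builds a throwaway package in a temporary directory (vendor_demo and toypkg are made-up names), imports it through its parent package, and then checks which keys exist in sys.modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_hardcoded_name():
    """Import a toy package as vendor_demo.toypkg and look it up both ways."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = Path(root) / "vendor_demo" / "toypkg"
        pkg_dir.mkdir(parents=True)
        (pkg_dir.parent / "__init__.py").write_text("")
        (pkg_dir / "__init__.py").write_text("REGISTRY = []\n")

        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            importlib.import_module("vendor_demo.toypkg")
            # The module is registered under its full dotted name only.
            found_full = "vendor_demo.toypkg" in sys.modules
            found_short = "toypkg" in sys.modules
        finally:
            # Undo the side effects of the demo.
            sys.path.remove(root)
            sys.modules.pop("vendor_demo.toypkg", None)
            sys.modules.pop("vendor_demo", None)
    return found_full, found_short


if __name__ == "__main__":
    print(demo_hardcoded_name())
```

In this toy setup, code that looks up sys.modules under the short name will not find the module that was actually imported, which is the same kind of mismatch as a hardcoded 'bs4.builder' lookup when the package was imported as lib.bs4.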

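Fix 1 (recommended)

Add the directory that contains bs4 to sys.path, so that it can be imported under its real name, bs4.

foo.py

import os
import sys

# sys.path entries are directories, so add lib/ (the parent of bs4), not "lib.bs4".
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), "lib"))
from bs4 import BeautifulSoup

html = ...
soup = BeautifulSoup(html, "html.parser")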

Fix 2

Place bs4 and lxml directly under the project root.

<project root>
├───foo.py
├───bs4
└───lxml

What does not work

Passing the parser directly to the builder argument of the BeautifulSoup constructor.

foo.py

from lib.bs4 import BeautifulSoup
from lib.bs4.builder._lxml import LXMLTreeBuilder

html = ...
soup = BeautifulSoup(html, builder=LXMLTreeBuilder)

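With this approach, the step that looks up the parser by name in the dictionary is skipped, so the BeautifulSoup instance is created successfully.

However, as soon as you actually use it, for example with soup.select("..."), a different error occurs.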

TypeError: Expected a BeautifulSoup 'Tag', but instead recieved type <class 'lib.bs4.BeautifulSoup'>

bs4.BeautifulSoup is a subclass of bs4.Tag, so this looks like it should work. In this case, though, the class was imported under the name lib.bs4.BeautifulSoup, so it is treated as a different type.
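
Why the same source can yield "different" classes can be shown with a toy example. This is a sketch of the general Python behavior, not of bs4 or its selector library: it loads one small source file twice under two module names (toy and vendored.toy, both made up here) and runs isinstance across the two copies.

```python
import importlib.util
import tempfile
from pathlib import Path

# Toy stand-ins for a Tag base class and a subclass of it.
SOURCE = "class Tag:\n    pass\n\n\nclass Soup(Tag):\n    pass\n"


def load_as(name, path):
    """Execute the file at `path` as a fresh module object called `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo_identity():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.py"
        path.write_text(SOURCE)
        plain = load_as("toy", path)
        vendored = load_as("vendored.toy", path)
    soup = vendored.Soup()
    # Same source code, but each load created its own class objects.
    return isinstance(soup, vendored.Tag), isinstance(soup, plain.Tag)


if __name__ == "__main__":
    print(demo_identity())
```

Each load runs the class statements again and produces new class objects, so an isinstance check against the Tag from one copy rejects instances created from the other copy, even though the code is identical.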
