More than 3 years have passed since last update.

[Python] BeautifulSoup silently changes the structure when parsing HTML

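Problem

When BeautifulSoup parses HTML that is written incorrectly (for example a p element nested inside an h3, which HTML does not allow), it can silently rearrange the structure.

Code example that reproduces the problem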

from bs4 import BeautifulSoup

# <p> inside <h3> is invalid HTML; the lxml parser "repairs" it.
html = '<h3><p>テキスト</p></h3>'
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
Expected output:
<html>
 <body>
  <h3>
   <p>
    テキスト
   </p>
  </h3>
 </body>
</html>
Actual output:
<html>
 <body>
  <h3>
  </h3>
  <p>
   テキスト
  </p>
 </body>
</html>
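
The difference between the two outputs above can also be checked programmatically rather than by eye. The following is a minimal sketch: `nesting_preserved` is a hypothetical helper written for this illustration (it is not part of bs4), and the loop assumes that a missing parser should simply be reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def nesting_preserved(html, parser):
    """Return True if the first <h3> still contains a <p> after parsing.

    Hypothetical helper for this sketch; not part of bs4.
    """
    soup = BeautifulSoup(html, parser)
    h3 = soup.find('h3')
    return h3 is not None and h3.find('p') is not None


html = '<h3><p>テキスト</p></h3>'

# Compare parsers; lxml is optional, so skip it if it is not installed.
for parser in ('lxml', 'html.parser'):
    try:
        print(parser, nesting_preserved(html, parser))
    except FeatureNotFound:
        print(parser, 'is not installed')
```

A check like this can be dropped into a test so that a change of parser does not go unnoticed.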

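Solution

Switching the parser from lxml to html.parser solves the problem.
Note, however, that the html and body tags are then no longer added automatically at the root, so add them yourself if you need them.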

from bs4 import BeautifulSoup

# html.parser keeps the nesting exactly as written.
html = '<h3><p>テキスト</p></h3>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Output:
<h3>
 <p>
  テキスト
 </p>
</h3>
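
If the html and body wrappers are needed, one simple way to add them by hand is to wrap the fragment in a string before parsing. This is only a sketch of one possible approach, assuming the input is a body fragment; the variable names `fragment` and `wrapped` are made up for the example.

```python
from bs4 import BeautifulSoup

fragment = '<h3><p>テキスト</p></h3>'

# html.parser does not add <html>/<body>, so wrap the fragment manually.
wrapped = '<html><body>' + fragment + '</body></html>'
soup = BeautifulSoup(wrapped, 'html.parser')

print(soup.prettify())
```

With this, the p element stays inside the h3, and the h3 sits under body as it would in a full document.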