
[Python] beautifulsoup4 cheat sheet

Posted at 2020-01-24, last updated at 2020-01-24

Introduction

Personal notes on searching for HTML tags with beautifulsoup4.

Environment

python: 3.7
beautifulsoup4: 4.8.2

Basic search

# soup is an existing BeautifulSoup object
# All p tags
soup.find_all("p")

# Only the first p tag found
soup.find("p")

# a tags whose href starts with "hogehoge"
import re
soup.find_all("a", href=re.compile("^hogehoge"))
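As a minimal sketch of the calls above, the following runs them against a small HTML string. The sample markup, the variable names, and the choice of the built-in "html.parser" are assumptions made for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real site
html = """
<html><body>
  <p>first</p>
  <p>second</p>
  <a href="hogehoge/page1">match</a>
  <a href="other/page2">no match</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")    # every <p> tag, in document order
first_p = soup.find("p")      # only the first <p> tag
# <a> tags whose href attribute starts with "hogehoge"
links = soup.find_all("a", href=re.compile("^hogehoge"))

print([p.get_text() for p in all_p])
print(first_p.get_text())
print([a["href"] for a in links])
```

Note that find() returns None when nothing matches, while find_all() returns an empty list.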

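Search with CSS selectors

# Ancestor/descendant relationship, loose (any depth)
soup.select('body div p')

# Parent/child relationship, strict (direct children only)
soup.select('body > div > p')

# Class name
soup.select('.myclass')

# id
soup.select('#myid')

# AND condition (element has both classes)
soup.select('.myclass1.myclass2')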
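To make the difference between these selectors concrete, here is a small sketch that applies each one to the same document. The markup and variable names are hypothetical, and "html.parser" is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html = """
<body>
  <div>
    <p id="myid">direct child</p>
    <section><p class="myclass">nested</p></section>
  </div>
  <span class="myclass myclass1 myclass2">both</span>
  <span class="myclass1">one only</span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: matches <p> at any depth under body div
loose = soup.select("body div p")
# Child combinator: matches only <p> that is a direct child of the div
strict = soup.select("body > div > p")
by_class = soup.select(".myclass")
by_id = soup.select("#myid")
# Chained classes: the element must carry both myclass1 and myclass2
both = soup.select(".myclass1.myclass2")

print(len(loose), len(strict), len(by_class), len(by_id), len(both))
```

select() always returns a list, so an unmatched selector yields an empty list rather than None.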

The nth tag

# Find the third <li> tag in the HTML below
# <html>
# <body>
#   <ul>
#     <li>not selected</li>
#     <li>not selected</li>
#     <li>selected</li>
#     <li>not selected</li>
#   </ul>
# </body>
# </html>

soup.select('body > ul > li:nth-of-type(3)')
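The selector above can be tried on the same list as a runnable sketch. The English item text and the use of "html.parser" are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# The list from the comment above, as a string
html = """<html><body><ul>
<li>not selected</li>
<li>not selected</li>
<li>selected</li>
<li>not selected</li>
</ul></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# nth-of-type counts from 1 among sibling elements with the same tag name
third = soup.select("body > ul > li:nth-of-type(3)")
print([li.get_text() for li in third])
```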

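What to do when nth-of-type() does not work

In my case it did not work because the HTML of the site being scraped contained elements with an opening tag but no closing tag.
The fix is to remove those opening tags.
(Incidentally, the closing tags did appear in Chrome's developer tools, which show the DOM after the browser has repaired it, so I did not notice the problem until I viewed the page source...)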

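import requests
from bs4 import BeautifulSoup

url = "http://hogehoge/"
# Fetch the page first (assuming the requests library); a URL string has no .text
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# The dd tags have no closing tags, so strip the dd tags themselves
for tag in soup.find_all('dd'):
    tag.unwrap()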

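This removes every <dd> tag.
Note that .decompose() would also delete the elements that follow each <dd>, presumably because the parser treats them as the contents of the unclosed tag, so .unwrap() is used to remove only the tags themselves.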
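The difference between .unwrap() and .decompose() can be reproduced offline with a small sketch. The markup below is a hypothetical stand-in for the real page, and "html.parser" is used instead of "lxml" because it makes no attempt to repair the unclosed tag. Other parsers may build a different tree from the same input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <dd> is opened but never closed
html = '<div id="box"><dd><p>one</p><p>two</p><p>three</p></div>'
selector = "#box > p:nth-of-type(3)"

# The <p> tags end up inside <dd>, so they are not direct children of #box
soup = BeautifulSoup(html, "html.parser")
before = soup.select(selector)

# unwrap() removes each <dd> tag but leaves its contents in place
for tag in soup.find_all("dd"):
    tag.unwrap()
after = soup.select(selector)

# decompose() on a fresh parse removes <dd> together with everything inside it
soup2 = BeautifulSoup(html, "html.parser")
for tag in soup2.find_all("dd"):
    tag.decompose()
remaining = soup2.find_all("p")

print(before)
print([p.get_text() for p in after])
print(remaining)
```

In this sketch the selector matches nothing before the unwrap, matches the third paragraph afterwards, and no paragraphs survive the decompose.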

References

https://www.sukerou.com/2019/01/python3-beautifulsoup4web.html
