
BeautifulSoup4: Extracting a specific element from among multiple elements with the same HTML tag and attributes, using the surrounding information

Last updated at 2022-09-18, posted at 2022-09-16

Goal

Suppose an HTML file contains the following markup.

sample.html

...
<h3>朝</h3>
<p class="greeting">おはよう</p>
<h3>昼</h3>
<p class="greeting">こんにちは</p>
<h3>夜</h3>
<p class="greeting">こんばんは</p>
...

From this, we want to use "昼" (noon) as the search key and extract the corresponding p element, <p class="greeting">こんにちは</p>.
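To see why the tag and class alone are not enough, here is a minimal sketch that inlines the sample markup as a string (an assumption made for self-containment; the article reads it from sample.html) and uses Python's built-in "html.parser" instead of lxml. Searching by tag and class returns all three paragraphs, so nothing in the result identifies the one that belongs to "昼".

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup (assumed here so the sketch runs on its own).
html = """
<h3>朝</h3>
<p class="greeting">おはよう</p>
<h3>昼</h3>
<p class="greeting">こんにちは</p>
<h3>夜</h3>
<p class="greeting">こんばんは</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Every p element shares the same tag and class, so this matches all of them.
greetings = soup.find_all("p", class_="greeting")
texts = [p.get_text() for p in greetings]
print(texts)  # ['おはよう', 'こんにちは', 'こんばんは']
```

The h3 heading just before each paragraph is the only thing that tells them apart, which is why the search has to start from the heading.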

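Solution

Using BeautifulSoup4, a Python HTML parsing library, first extract the <h3>昼</h3> element as shown below, then call find_next on that element with the search conditions.
Incidentally, find_previous searches the elements that come before the target element.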

from bs4 import BeautifulSoup

# The file contains Japanese text, so state the encoding explicitly
with open("sample.html", "r", encoding="utf-8") as f:
  text = f.read()
  soup = BeautifulSoup(text, features="lxml")

  # Extract every h3 element
  # h3s = [<h3>朝</h3>, <h3>昼</h3>, <h3>夜</h3>]
  h3s = soup.find_all("h3")

  # Keep only the elements whose text is "昼"
  # h3 = [<h3>昼</h3>]
  h3 = list(filter(lambda x: x.text == "昼", h3s))

  # Among the elements after h3[0], extract the first one that meets the conditions
  # greeting_noon = <p class="greeting">こんにちは</p>
  greeting_noon = h3[0].find_next("p", attrs={"class": "greeting"})

  # Among the elements before h3[0], extract the nearest one that meets the conditions
  # greeting_morning = <p class="greeting">おはよう</p>
  greeting_morning = h3[0].find_previous("p", attrs={"class": "greeting"})
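The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name find_greeting_after is hypothetical, the markup is inlined as a string rather than read from sample.html, and the built-in "html.parser" is used so that lxml is not required. It also looks up the heading with find("h3", string=...) instead of filtering a list, and returns None when no heading matches rather than raising an IndexError.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup (assumed here so the sketch runs on its own).
SAMPLE_HTML = """
<h3>朝</h3>
<p class="greeting">おはよう</p>
<h3>昼</h3>
<p class="greeting">こんにちは</p>
<h3>夜</h3>
<p class="greeting">こんばんは</p>
"""


def find_greeting_after(soup, heading_text):
    """Hypothetical helper: return the first p.greeting after the matching h3."""
    # string= matches an h3 whose only content is exactly heading_text.
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return None  # no such heading in the document
    # find_next walks forward in document order from the heading.
    return heading.find_next("p", class_="greeting")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

noon = find_greeting_after(soup, "昼")
missing = find_greeting_after(soup, "深夜")  # heading that does not exist

print(noon.get_text())  # こんにちは
print(missing)          # None
```

Note that find_next searches every element that comes later in the document, not only siblings of the heading. If a heading had no paragraph of its own, the call would return the paragraph of the following section, so it may be worth checking the result when the markup is irregular.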

Reference site

↑ Besides find_next and find_previous, which this article covers, it contains many other practical examples and is a useful reference.
