
Removing or replacing specific elements in HTML source [BeautifulSoup]

Posted at 2015-08-06

How to remove or replace elements that match a given condition when scraping HTML.

(For example, when you want to skip every link, or leave out figures and tables.)

With Python's BeautifulSoup, use the .extract() and .replace_with() methods.

from bs4 import BeautifulSoup

txt = """<p>I have a dog.  His name is <span class="secret">Ken</span>.</p>"""
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(txt, "html.parser")

# Plain get_text() still includes the unwanted information
soup.get_text()
# : 'I have a dog.  His name is Ken.'


# Remove the first element matching the tag and class
soup.find("span", {"class": "secret"}).extract()
soup.get_text()
# : 'I have a dog.  His name is .'


# Or replace it with something else (re-parse first, since extract() modified soup)
soup = BeautifulSoup(txt, "html.parser")
soup.find("span", {"class": "secret"}).replace_with("confidential")
soup.get_text()
# : 'I have a dog.  His name is confidential.'
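
The example above handles a single element found with find(). For the cases mentioned at the top (skipping every link, leaving out figures and tables), the same two methods can be applied to each result of find_all(). The following is only a minimal sketch: the helper name strip_elements, its parameters, the "[link]" placeholder and the sample HTML are all made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup, invented for this illustration
html = """
<div>
  <p>See the <a href="https://example.com/a">first link</a> and the
  <a href="https://example.com/b">second link</a>.</p>
  <figure><img src="chart.png"><figcaption>Figure 1</figcaption></figure>
  <p>Closing paragraph.</p>
</div>
"""


def strip_elements(markup, drop=("figure", "table"), mask=("a",), placeholder="[link]"):
    """Hypothetical helper: remove every `drop` tag and replace every `mask` tag.

    The parameter names and defaults are assumptions chosen for this sketch.
    """
    soup = BeautifulSoup(markup, "html.parser")

    # find_all() accepts a list of tag names and returns a list-like result,
    # so removing elements while looping over it is safe.
    for tag in soup.find_all(list(drop)):
        tag.extract()

    # Swap each remaining matching element for a plain-text placeholder.
    for tag in soup.find_all(list(mask)):
        tag.replace_with(placeholder)

    return soup


cleaned = strip_elements(html)
text = cleaned.get_text()
print(text)
```

Note that extract() returns the element it removed, so it is the one to use if you still want to inspect that element afterwards. If you only want to discard it, BeautifulSoup also provides decompose().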
