Notes on manipulating HTML with Beautiful Soup

Last updated at 2017-07-13, posted at 2017-06-15

For the basics of Beautiful Soup, the article 「PythonとBeautiful Soupでスクレイピング」 (scraping with Python and Beautiful Soup) is a good place to start.

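I recently had occasion to process HTML with Beautiful Soup, so these are tips, notes, a memo to myself, or whatever you want to call it, about the operations I used. In the snippets below, content is a BeautifulSoup object.
I may update this from time to time.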

Replace a specific tag with a different tag

for texttag in content.find_all('text'):
	texttag.name = 'p'

This replaces every <text> with <p>.
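
As a quick check of the rename, here is a minimal self-contained sketch. The sample HTML and the "note" class are made up for illustration, and the html.parser backend is my assumption since the original does not say which parser was used.

```python
from bs4 import BeautifulSoup

# Made-up sample input (not from the original article)
html = "<div><text>first</text><text class='note'>second</text></div>"
content = BeautifulSoup(html, "html.parser")

# Renaming a tag only changes its name; attributes and children stay in place
for texttag in content.find_all("text"):
    texttag.name = "p"

paragraphs = content.find_all("p")
print(content)
```

Attributes such as class survive the rename, so this is usually enough when only the element name is wrong.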

Add a wrapper tag to elements not enclosed in a specific tag

for imgtag in content.find_all('img'):
    # Only wrap images whose direct parent is not already a <figure>
    if imgtag.parent.name != 'figure':
        imgtag.wrap(content.new_tag('figure'))

This finds each <img> whose parent is not a <figure> and wraps it in a <figure>.
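
A runnable sketch of the same idea on made-up HTML, with html.parser assumed as the parser. One image is already inside a <figure> and one is not, so only the second should get a new wrapper.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first <img> is already wrapped, the second is not
html = '<div><figure><img src="a.png"></figure><img src="b.png"></div>'
content = BeautifulSoup(html, "html.parser")

for imgtag in content.find_all("img"):
    # Only the direct parent is checked here
    if imgtag.parent.name != "figure":
        imgtag.wrap(content.new_tag("figure"))

figures = content.find_all("figure")
print(content)
```

Because only the direct parent is checked, an <img> nested deeper inside a <figure> (for example inside an <a>) would be wrapped again. Checking imgtag.find_parent("figure") instead would probably be the safer test in that case.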

Similar wrapping can also be done with a CSS selector:

for notwrap_a in content.select("p ~ a"):
	notwrap_a.wrap(content.new_tag("p"))

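This finds each <a> that follows a <p> as a sibling (and is therefore not inside that <p>) and wraps it in a <p>. An <a> with no earlier <p> sibling is not matched by this selector.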
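
To see what the "p ~ a" selector does and does not match, here is a small sketch on made-up HTML (the hrefs are hypothetical, and html.parser is assumed). The /w link comes before any <p>, /y is already inside a <p>, and /x and /z are later siblings of a <p>.

```python
from bs4 import BeautifulSoup

# Made-up sample with four links in different positions
html = (
    '<div><a href="/w">w</a><p>intro</p><a href="/x">x</a>'
    '<p><a href="/y">y</a></p><a href="/z">z</a></div>'
)
content = BeautifulSoup(html, "html.parser")

# "p ~ a" matches <a> elements that have an earlier <p> sibling
matched = [a["href"] for a in content.select("p ~ a")]

for notwrap_a in content.select("p ~ a"):
    notwrap_a.wrap(content.new_tag("p"))

print(matched)
print(content)
```

The /w link stays unwrapped because no <p> precedes it, so this selector is only a partial stand-in for "every <a> that is not inside a <p>".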

### Remove all but the first item from a list

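# Strip the <li> tag from the first item of each <ul>, keeping its contents
for tag in content.find_all('ul'):
    tag.find('li').unwrap()

# Remove the <ul> tags themselves
for unwrap_ul in content.find_all('ul'):
    unwrap_ul.unwrap()

# Delete every <li> that is still left, contents included
for delete_li in content.find_all('li'):
    delete_li.decompose()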

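The first loop finds each <ul> and uses find('li').unwrap() to strip the <li> tag from the first item in the list.
The next loop unwraps the <ul> itself, and the last one deletes the <li> elements that remain. Note that the last loop deletes every <li> left in the document, including any that belong to an <ol>.
After this, the first item no longer has an <li> tag around it, so if you want to give it a new tag, replace

tag.find('li').unwrap()

with

first_li = tag.find('li')
first_li.name = 'p'

or something similar.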
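
Putting the three loops and the rename variant together, a minimal sketch on a made-up list looks like this. The None guard for a <ul> with no <li> is my addition, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Made-up sample list
html = "<div><ul><li>first</li><li>second</li><li>third</li></ul></div>"
content = BeautifulSoup(html, "html.parser")

# Turn the first <li> of each <ul> into a <p> (guard added for empty lists)
for tag in content.find_all("ul"):
    first_li = tag.find("li")
    if first_li is not None:
        first_li.name = "p"

# Remove the <ul> tags, leaving their children in place
for unwrap_ul in content.find_all("ul"):
    unwrap_ul.unwrap()

# Delete the remaining <li> elements together with their contents
for delete_li in content.find_all("li"):
    delete_li.decompose()

result = str(content)
print(result)  # <div><p>first</p></div>
```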

Unwrap the parent of a specific element

for p in content.find_all('p'):
    p.parent.unwrap()

This unwraps the parent element of each <p>.
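
One thing to watch for: when two <p> elements share a parent, the plain loop unwraps that parent for the first <p>, and by the time it reaches the second <p> its parent is already the grandparent, which then gets unwrapped too. The sketch below is a variation of my own rather than the original approach. It collects each distinct parent first and unwraps it once. The sample HTML is made up and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Made-up sample: two <p> elements share the same <div> parent
html = "<section><div><p>one</p><p>two</p></div></section>"
content = BeautifulSoup(html, "html.parser")

# Collect each parent once. Identity ("is") is used on purpose, because
# Beautiful Soup compares tags by content when "==" is used.
parents = []
for p in content.find_all("p"):
    if all(p.parent is not seen for seen in parents):
        parents.append(p.parent)

for parent in parents:
    parent.unwrap()

result = str(content)
print(result)  # <section><p>one</p><p>two</p></section>
```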

Wrap an element together with the element next to it

Suppose we have HTML like the following:

<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>

If you want to wrap each <img> in a <figure>, taking the <figcaption> along whenever one comes right after the <img>, the following works:

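from bs4 import BeautifulSoup

# Triple quotes allow the double quotes and line breaks inside the HTML
html = """<img src="00001.jp">
<figcaption>caption string1</figcaption>

<img src="00002.jp">

<img src="00003.jp">
<figcaption>caption string3</figcaption>"""

# Name the parser explicitly so the result does not depend on what is installed
content = BeautifulSoup(html, 'html.parser')

for img_tag in content.find_all('img'):
    # Wrap every <img> in a new <figure>
    fig = content.new_tag('figure')
    img_tag.wrap(fig)

    # If the next tag in the document is a <figcaption>, move it into the <figure>
    next_node = img_tag.find_next()
    if next_node and next_node.name == 'figcaption':
        fig.append(next_node)

print(content)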

With this, the HTML ends up structured as follows (indented here for readability):

<figure>
   <img src="00001.jp"/>
   <figcaption>caption string1</figcaption>
</figure>
<figure><img src="00002.jp"/></figure>
<figure>
   <img src="00003.jp"/>
   <figcaption>caption string3</figcaption>
</figure>
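
A possible refinement, offered as a sketch rather than as the original method: find_next() returns the next tag in document order, so it can pick up a <figcaption> that is not actually a sibling of the <img>. Looking at the next sibling before wrapping avoids that. The sample HTML below is made up, and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first <figcaption> follows a <p>, not the <img> itself
html = (
    '<p><img src="a.png"></p><figcaption>not adjacent</figcaption>'
    '<img src="b.png">\n<figcaption>adjacent</figcaption>'
)
content = BeautifulSoup(html, "html.parser")

for img_tag in content.find_all("img"):
    # Look up the next sibling tag before wrapping; text nodes are skipped
    caption = img_tag.find_next_sibling()
    fig = img_tag.wrap(content.new_tag("figure"))  # wrap() returns the wrapper
    if caption is not None and caption.name == "figcaption":
        fig.append(caption)

figures = content.find_all("figure")
print(content)
```

Here only the second image takes its caption along, because the first caption sits outside the <p> that contains the first image.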
