Getting the text between repeated tags when scraping with Python

Posted at 2021-05-07

Goal

When scraping, you often run into pages where the content is not grouped inside a <div>, and the structure instead looks like this:

<h2>title01</h2>
<span>content11</span>
<span>content12</span>
<h2>title02</h2>
<span>content21</span>
<span>content22</span>

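This post summarizes how to handle that case when you want to walk through the data in order, as an array split at each occurrence of a given tag, like this: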

[
  # title
  '<h2>title01</h2>',
  # content
  '<span>content11</span>\n<span>content12</span>',
  # title
  '<h2>title02</h2>',
  # content
  '<span>content21</span>\n<span>content22</span>'
]
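As a quick check of that target shape, here is a minimal sketch that applies a regex split to the sample HTML using only the standard library. The names `SAMPLE_HTML` and `split_by_h2` are hypothetical and not from the original post, and the sketch assumes the `<h2>` tags carry no attributes.

```python
import re

# Hypothetical sample input, copied from the HTML snippet above
SAMPLE_HTML = (
    '<h2>title01</h2>\n'
    '<span>content11</span>\n'
    '<span>content12</span>\n'
    '<h2>title02</h2>\n'
    '<span>content21</span>\n'
    '<span>content22</span>\n'
)


def split_by_h2(html):
    """Split html into alternating '<h2>...</h2>' and content strings."""
    # Drop only the text before the first <h2>; keep the trailing content
    parts = re.split(r'<h2>|</h2>', html)[1:]
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Even positions are titles: put the tags back
            result.append('<h2>' + part + '</h2>')
        else:
            # Odd positions are the content between two headings
            result.append(part.strip())
    return result


for item in split_by_h2(SAMPLE_HTML):
    print(repr(item))
```

Stripping the content strings is a choice made here so the result matches the array shown above; drop the `strip()` call if the surrounding whitespace matters.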

Code

from bs4 import BeautifulSoup
import requests
import re

def scrape():
    # Standard scraping boilerplate: fetch the page and parse it
    url = 'https://qiita.com/Azunyan1111/items/9b3d16428d2bcc7c9406'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

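    # Split on the delimiter tags
    # The first and last elements are not between two h2 tags, so drop them
    # * To keep the content below the last h2 tag, use [1:] instead
    # Note: this pattern only matches <h2> tags that have no attributes
    elements = re.split(r'<h2>|</h2>', str(soup))[1:-1]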

    # Use this for loop to put the h2 tags back around the titles
    # If you do not need the tags, the loop can be removed
    for index, element in enumerate(elements):
        if index % 2 == 1:
            continue
        elements[index] = '<h2>' + element + '</h2>'

    # Returns a list like this:
    # ['<h2>title1</h2>', 'content01', '<h2>title2</h2>', 'content02', '<h2>title3</h2>']
    return elements

print(scrape())
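The regex split works on the serialized HTML, so it misses `<h2>` tags that have attributes. As an alternative (not the approach used in this post), the following sketch walks the parsed tree with BeautifulSoup's `find_all` and `next_siblings`. It assumes each heading and its content are siblings under the same parent; `group_by_h2` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample input with the same shape as the HTML in this post
SAMPLE_HTML = """
<h2>title01</h2>
<span>content11</span>
<span>content12</span>
<h2 class="section">title02</h2>
<span>content21</span>
<span>content22</span>
"""


def group_by_h2(html):
    """Return a list of (title, [content texts]) pairs, one per h2."""
    soup = BeautifulSoup(html, 'html.parser')
    groups = []
    for heading in soup.find_all('h2'):
        contents = []
        for sibling in heading.next_siblings:
            if not isinstance(sibling, Tag):
                # Skip bare text nodes such as the newlines between tags
                continue
            if sibling.name == 'h2':
                # The next heading starts a new group
                break
            contents.append(sibling.get_text())
        groups.append((heading.get_text(), contents))
    return groups


print(group_by_h2(SAMPLE_HTML))
```

This version returns text rather than HTML strings, so there is no need to parse the pieces again afterwards.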

Note

scrape() returns plain strings (a list of str). To scrape inside one of those strings again, create a new BeautifulSoup instance from it:

BeautifulSoup(scrape()[0], 'html.parser')
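For instance, a content string can be parsed again to pull out the text of each `<span>`. This is a minimal sketch on a hard-coded string; `content` is a hypothetical value standing in for one element of the list returned by scrape().

```python
from bs4 import BeautifulSoup

# Hypothetical content element, shaped like the strings between two h2 tags
content = '<span>content11</span>\n<span>content12</span>'

# Parse the string again so the usual BeautifulSoup methods are available
fragment = BeautifulSoup(content, 'html.parser')
texts = [span.get_text() for span in fragment.find_all('span')]
print(texts)
```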
