Web Scraping with Beautiful Soup

Posted at 2024-06-14

In the previous article, I described how to fetch web data with Python's Requests library.
This time, I cover the basic usage of Beautiful Soup, a library for parsing the web data you have fetched.

Installing Beautiful Soup

First, install Beautiful Soup and the Requests library.
Run the following command.

pip install beautifulsoup4 requests

Basic Usage of Beautiful Soup

1. Fetching a Web Page

Before you can use Beautiful Soup, you need to fetch the content of a web page.
For this, we use the Requests library introduced in the previous article.

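import requests
from bs4 import BeautifulSoup

# Target URL
url = 'https://example.com'

# Fetch the page (the timeout keeps the request from hanging indefinitely)
response = requests.get(url, timeout=10)

# Continue only if the status code is 200 (success)
if response.status_code == 200:
    html = response.content
else:
    # Stop here so that the later steps never run with `html` undefined
    raise SystemExit(f'Failed to retrieve the webpage (status {response.status_code})')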
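The status check above only covers responses that actually arrive. As a minimal sketch, the same step could be wrapped in a small helper that also handles connection errors and timeouts. The name fetch_html and the 10-second default timeout are assumptions made for illustration; they are not part of Requests. Note that raise_for_status() only rejects 4xx and 5xx responses, which is slightly more lenient than comparing the status code with 200.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch_html` is a hypothetical helper, not part of Requests.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts, and HTTP errors
        print(f'Failed to retrieve {url}: {exc}')
        return None
    return response.content
```

A caller would then write html = fetch_html('https://example.com') and continue to the parsing step only when the result is not None.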

2. Parsing the HTML

Next, parse the fetched HTML with Beautiful Soup.
To do so, create a BeautifulSoup object.

# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

# Print the HTML with indentation
print(soup.prettify())
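To try the parsing step without any network access, the constructor can also be given an HTML string directly. The sketch below uses a made-up document (sample_html is an assumption for illustration) and the same 'html.parser' parser as above.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration
sample_html = '<html><head><title>Sample</title></head><body><p>Hello</p></body></html>'

# BeautifulSoup accepts either str or bytes as its first argument
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns the parse tree as an indented string
pretty = sample_soup.prettify()
print(pretty)

# Tags can be reached as attributes of the soup object
first_paragraph = sample_soup.p.string
print(first_paragraph)
```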

3. Extracting Elements

Beautiful Soup makes it easy to extract specific elements.
Here, we extract the content of the title tag (<title>).

# Get the <title> tag
title_tag = soup.title

# Get the title text (soup.title is None when the page has no <title>)
title = title_tag.string if title_tag else None
print(f"Page title: {title}")
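Not every page has a <title> tag, and in that case soup.title is None. A minimal sketch of a defensive version follows; page_title is a hypothetical helper name and the two HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def page_title(soup):
    """Return the stripped title text, or None if there is no <title> tag.

    `page_title` is a hypothetical helper, not part of Beautiful Soup.
    """
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


with_title = BeautifulSoup(
    '<html><head><title> Sample Page </title></head></html>', 'html.parser')
without_title = BeautifulSoup('<p>No title here</p>', 'html.parser')

print(page_title(with_title))
print(page_title(without_title))
```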

Next, let's look at how to extract elements by tag name or class.
For example, the following extracts every link (<a> tag).

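# Extract all links
links = soup.find_all('a')

# Display the links
for link in links:
    href = link.get('href')
    # get_text() also works when the link contains nested tags,
    # where .string would return None
    text = link.get_text(' ', strip=True)
    print(f"Link text: {text}, URL: {href}")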
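Links found this way are often relative, and some <a> tags carry no href at all. As a minimal sketch, the loop above could be extended to skip such tags and resolve relative URLs with urljoin from the standard library. The function name extract_links, the sample HTML, and the base URL are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href.

    `extract_links` is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # href=True keeps only <a> tags that actually have an href attribute
    for link in soup.find_all('a', href=True):
        text = link.get_text(' ', strip=True)
        # Resolve relative paths against the page URL
        results.append((text, urljoin(base_url, link['href'])))
    return results


sample_html = (
    '<a href="/about">About <b>us</b></a>'
    '<a href="page.html">Page</a>'
    '<a name="top">No href</a>'
)
found_links = extract_links(sample_html, 'https://example.com/docs/')
for text, url in found_links:
    print(f'{text}: {url}')
```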

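4. Extracting Elements with a Specific Class or ID

You can also extract elements that have a specific class or ID.
For example, the following extracts <div> tags that have a specific class.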

# Extract <div> tags with a specific class
divs = soup.find_all('div', class_='example-class')

# Display the extracted elements
for div in divs:
    print(div.prettify())

In the same way, you can extract an element that has a specific ID.

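# Extract the element with a specific ID
element = soup.find(id='example-id')

# Display the extracted element (find() returns None when nothing matches)
if element is not None:
    print(element.prettify())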
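The same lookups can also be written with CSS selectors through select() and select_one(), which some readers may find more familiar. The sketch below compares both styles on a made-up HTML fragment; the class and ID names simply mirror the placeholders used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = (
    '<div class="example-class">A</div>'
    '<div class="other">B</div>'
    '<div class="example-class extra">C</div>'
    '<ul id="example-id"><li>x</li></ul>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ matches any tag whose class list contains the given name
by_find = [d.get_text() for d in sample_soup.find_all('div', class_='example-class')]

# The equivalent CSS selector
by_select = [d.get_text() for d in sample_soup.select('div.example-class')]

# select_one() returns the first match, or None if nothing matches
by_id = sample_soup.select_one('#example-id')
missing = sample_soup.find(id='no-such-id')

print(by_find, by_select, by_id.name, missing)
```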

5. Other Useful Features of Beautiful Soup

Beautiful Soup has many other useful features.
Examples include retrieving child elements, parent elements, and sibling elements.

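# Get the <li> elements inside `element` (find_all searches all descendants,
# not only direct children; assumes `element` was found in the previous step)
items = element.find_all('li')

# Display the text of each item
for item in items:
    print(item.string)

# Get the parent element
parent = element.parent
print(parent.prettify())

# Get the next sibling element (None if no sibling tag follows)
next_sibling = element.find_next_sibling()
if next_sibling is not None:
    print(next_sibling.prettify())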
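To make the difference between descendants and direct children concrete, here is a minimal sketch on a made-up nested list. Passing recursive=False to find_all() limits the search to direct children. The HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a list with one nested list, followed by a paragraph
sample_html = (
    '<div id="wrapper">'
    '<ul id="example-id"><li>A<ul><li>A1</li></ul></li><li>B</li></ul>'
    '<p>After</p>'
    '</div>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')
menu = sample_soup.find(id='example-id')

# All <li> descendants, including the nested one
all_items = menu.find_all('li')

# Only the <li> tags that are direct children of the <ul>
direct_items = menu.find_all('li', recursive=False)

# .string is None when a tag has more than one child node
first_string = direct_items[0].string
second_string = direct_items[1].string

# Navigation to the parent and to the next sibling tag
parent_id = menu.parent['id']
sibling_name = menu.find_next_sibling().name
last_sibling = menu.find_next_sibling().find_next_sibling()

print(len(all_items), len(direct_items), parent_id, sibling_name)
```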

That's all.

