
Trying out Windows 10 + Python 3 + BeautifulSoup4

Posted at 2022-06-25. Last updated at 2022-06-25.

Purpose

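Notes from trying the following with BeautifulSoup against the main topics on Yahoo! News:
・Output of "Copy selector"
・Output of "Copy XPath"
・Examples using find and find_all
Be careful with articles found in search results that do not state when they were written.
Yahoo! News appears to change its page structure frequently, so verify the tags first.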
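
Since the page structure may change, a quick check that the expected section still exists can save time before running the patterns below. The following is a minimal sketch, not the method used in this memo: `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page, and `check_topics_section` is a helper name introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<section id="uamods-topics">
  <div><div><div><ul>
    <li><a href="https://news.yahoo.co.jp/pickup/1">Topic 1</a></li>
    <li><a href="https://news.yahoo.co.jp/pickup/2">Topic 2</a></li>
  </ul></div></div></div>
</section>
"""


def check_topics_section(html_text, section_id="uamods-topics"):
    """Return the number of <a> tags in the section, or None if the section is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    sect = soup.find("section", attrs={"id": section_id})
    if sect is None:
        # The id was not found, so the selectors below would return nothing.
        return None
    return len(sect.find_all("a"))


print(check_topics_section(SAMPLE_HTML))
print(check_topics_section("<html><body></body></html>"))
```

If the function returns None for the real page, the id or the tag has probably changed and the selectors need to be copied again from the browser.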

Installing the packages used

PS > pip3 install BeautifulSoup4
PS > pip3 install lxml
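
To confirm that both packages were installed, something like the following can be run. This is only a sketch; `check_install` is a helper name introduced here, and it assumes the two `pip3 install` commands above succeeded.

```python
import bs4
from bs4 import BeautifulSoup
from lxml import etree


def check_install():
    """Parse a tiny document with the lxml parser and report the library versions."""
    soup = BeautifulSoup("<p>ok</p>", "lxml")
    return bs4.__version__, etree.LXML_VERSION, soup.p.get_text()


bs4_version, lxml_version, parsed_text = check_install()
print("BeautifulSoup4:", bs4_version)
print("lxml:", lxml_version)
print("parsed text:", parsed_text)
```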

Sample code

# coding: utf-8
# Common code starts here.
# On Windows, add the environment variable PYTHONIOENCODING = UTF-8 and restart VS Code.

from bs4 import BeautifulSoup
import urllib.request
import re
from lxml import html

url = 'https://news.yahoo.co.jp/'

with urllib.request.urlopen(url) as response:
    res = response.read()      # raw page bytes
    url = response.geturl()    # final URL after any redirect
    info = response.info()     # response headers

# Common code ends here. Each pattern below is written separately.

# Display one specified item
soup = BeautifulSoup(res, 'html.parser')
# print(soup.prettify())    # check the output in pretty-printed form
# Either of the following seems to work
# elems = soup.select('#uamods-topics > div > div > div > ul > li:nth-child(1) > a')
elems = soup.select('#uamods-topics > div > div > div > ul > li:nth-of-type(1) > a')

for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])

# Search by tag
# Displays all 8 items, plus "See more" and the "Topics list for all categories" link
soup = BeautifulSoup(res, 'html.parser')
# Extract the main topics section
sect = soup.find("section", attrs={"id": "uamods-topics"})
# print(sect.prettify()) # check the output in pretty-printed form

# Extract the a tags inside the main topics section
elems = sect.find_all("a")

for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])

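# Search for links containing news.yahoo.co.jp/pickup
# Displays all 8 items, plus "See more" and the "Topics list for all categories" link
soup = BeautifulSoup(res, 'html.parser')
# Escape the dots so they match literal "." characters only
elems = soup.find_all(href=re.compile(r"news\.yahoo\.co\.jp/pickup"))

for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])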

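# Display all 8 topic items using XPath
soup = BeautifulSoup(res, 'html.parser')
xmldat = html.fromstring(str(soup))
# Note: li[1 or 8] is evaluated as a boolean (always true) and matches every li,
# so the first eight items are selected explicitly with position() instead.
elems = xmldat.xpath('//*[@id="uamods-topics"]/div/div/div/ul/li[position() <= 8]/a')

for elem in elems:
    print(elem.text)
    print(elem.attrib['href'])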
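
A note on XPath position predicates: in XPath 1.0, a predicate such as `[1 or 8]` is not "the 1st or the 8th item". The `or` operator yields a boolean, and a true boolean predicate keeps every node. The sketch below illustrates this on a hypothetical sample document (`SAMPLE_HTML` and `topic_texts` are names introduced here, and the markup is simplified compared with the real page).

```python
from lxml import html

# Hypothetical sample with 10 list items; the real page structure may differ.
items = "".join(
    f'<li><a href="/pickup/{i}">Topic {i}</a></li>' for i in range(1, 11)
)
SAMPLE_HTML = (
    "<html><body>"
    f'<section id="uamods-topics"><ul>{items}</ul></section>'
    "</body></html>"
)


def topic_texts(xpath_expr):
    """Return the link texts matched by the given XPath expression."""
    doc = html.fromstring(SAMPLE_HTML)
    return [a.text for a in doc.xpath(xpath_expr)]


base = '//*[@id="uamods-topics"]/ul/'

# Boolean predicate: always true, so every li is kept.
all_items = topic_texts(base + 'li[1 or 8]/a')
# Explicit positions: only the 1st and the 8th li.
first_and_eighth = topic_texts(base + 'li[position() = 1 or position() = 8]/a')
# Range of positions: the first eight li elements.
first_eight = topic_texts(base + 'li[position() <= 8]/a')

print(len(all_items))
print(first_and_eighth)
print(len(first_eight))
```

On a page whose list has exactly eight items, `li[1 or 8]`, `li[position() <= 8]`, and plain `li` all return the same result, which is why the difference is easy to miss.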

Sites I referred to

Beautiful Soup 4.9.0 documentation
Beautiful Soup 4.2.0 Doc. 日本語訳
【Python】スクレイピングにおける要素特定は、XPathかCSSセレクターか
図解！Python BeautifulSoupの使い方を徹底解説！(select、find、find_all、インストール、スクレイピングなど)
図解！XPathでスクレイピングを極めろ！(Python、containsでの属性・テキストの取得など)
beautifulsoup4 Xpath指定で要素を取得
