
Python Scraping/Crawling Reference

Last updated at 2018-10-22, posted at 2018-10-21

A reference for the APIs I use most often when scraping and crawling.

Basic scraping operations

from bs4 import BeautifulSoup

# Create the object used to parse the HTML (html is a string holding the HTML source)
soup = BeautifulSoup(html, "html.parser")

# Get every <a> tag in the HTML (returns an empty list if there are none)
soup.find_all("a")

# Get the first <a> tag in the HTML (returns None if there is none)
soup.find("a")

# Get tags that match conditions
soup.find_all("a", class_="link", href="/link")
soup.find_all(class_="link", href="/link")

# Get the first table whose class is "tablesorter"
table = soup.find_all("table", {"class": "tablesorter"})[0]

# Get tags by URL
soup.find_all(href="http://zombie-hunting-club.com")
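The calls above can be tried end to end with a small sketch like the following. The HTML string is made up for illustration and is not from any real site; the variable names are hypothetical as well.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<html><body>
<a class="link" href="/link">first</a>
<a class="link" href="/other">second</a>
<a href="http://zombie-hunting-club.com">club</a>
<table class="tablesorter"><tr><td>1</td><td>2</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")        # list of every <a> tag
first_link = soup.find("a")           # only the first <a> tag
filtered = soup.find_all("a", class_="link", href="/link")

missing_all = soup.find_all("img")    # no match: an empty list
missing_one = soup.find("img")        # no match: None

# Take the first matching table, then read its cells.
table = soup.find_all("table", {"class": "tablesorter"})[0]
cells = [td.get_text() for td in table.find_all("td")]

print(len(all_links), first_link["href"], len(filtered))
print(missing_all, missing_one, cells)
```

Because find_all() returns an empty list when nothing matches, indexing its result with [0] raises IndexError on a page without the table, so checking the list first is usually safer.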

Selecting tags with regular expressions

import re

# Get every tag whose name starts with "b", such as <b> and <body>
soup.find_all(re.compile("^b"))

# Get every tag whose href attribute contains "link"
soup.find_all(href=re.compile("link"))

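# Get every <a> tag whose text contains "hello"
# (the argument is string= since Beautiful Soup 4.4.0; older versions call it text=)
soup.find_all("a", string=re.compile("hello"))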
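A minimal sketch of the three regular-expression searches, run against a made-up HTML string (the markup and variable names are assumptions for illustration). The compiled pattern is applied as a search, so "link" matches anywhere in the attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<html><body>
<b>bold</b>
<a href="/link/1">hello world</a>
<a href="/about">goodbye</a>
<p><a href="/link/2">say hello</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag names that start with "b", in document order.
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Tags whose href contains "link".
link_hrefs = [tag["href"] for tag in soup.find_all(href=re.compile("link"))]

# <a> tags whose text contains "hello".
hello_links = [
    tag["href"] for tag in soup.find_all("a", string=re.compile("hello"))
]

print(b_tags, link_hrefs, hello_links)
```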

Output

Calling prettify() formats the parse tree with indentation and returns it as a string.
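As a small sketch, assuming a one-line HTML fragment invented for this example, prettify() puts each tag and each text node on its own line, indented by nesting depth:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, invented for this example.
soup = BeautifulSoup("<div><p>hello</p></div>", "html.parser")

pretty = soup.prettify()   # a str, one tag or text node per line
print(pretty)
```

The result is an ordinary string, so it can be written to a file or logged while debugging a scraper.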

Other functions

find_all_previous() and find_previous()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

find_parent() and find_parents()
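The following sketch shows how these navigation methods behave on a small list. The HTML and variable names are made up for illustration. The plural methods return a list ordered from the nearest match outward, and the singular methods return only the nearest match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = '<div id="box"><ul><li>one</li><li id="mid">two</li><li>three</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

mid = soup.find("li", id="mid")
last = soup.find_all("li")[-1]

# Siblings share the same parent (<ul> here).
next_one = mid.find_next_sibling("li").get_text()
next_all = [li.get_text() for li in mid.find_next_siblings("li")]
prev_one = mid.find_previous_sibling("li").get_text()
prev_all = [li.get_text() for li in last.find_previous_siblings("li")]

# Parents walk up the tree, nearest ancestor first.
parent_div = mid.find_parent("div")["id"]
ancestors = [tag.name for tag in mid.find_parents(["ul", "div"])]

# "previous" walks backwards through the document, not just siblings.
earlier_one = last.find_previous("li").get_text()
earlier_all = [li.get_text() for li in last.find_all_previous("li")]

print(next_one, next_all, prev_one, prev_all)
print(parent_div, ancestors, earlier_one, earlier_all)
```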

About XPath

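XPath is a language for addressing elements, attribute values, and other parts of an XML document.
XPath treats the XML document as a tree, which makes it possible to specify where an element or attribute sits.
HTML can be treated as a kind of XML, so XPath can also be used to select elements in an HTML document.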
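To illustrate the tree-based addressing, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The article does not name an XPath library, so this choice is an assumption made to keep the example dependency-free, and the markup is invented and deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
markup = """
<html>
  <body>
    <div class="menu">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <div class="footer">
      <a href="/contact">Contact</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(markup)   # root is the <html> element

# ".//a" selects every <a> element anywhere below the root.
all_hrefs = [a.get("href") for a in root.findall(".//a")]

# A path with a predicate: <a> children of the div whose class is "menu".
menu_links = [a.text for a in root.findall("./body/div[@class='menu']/a")]

# find() returns only the first match.
footer_href = root.find(".//div[@class='footer']/a").get("href")

print(all_hrefs, menu_links, footer_href)
```

Real-world HTML is often not well-formed XML, in which case ElementTree raises a parse error. A third-party parser such as lxml, which offers full XPath 1.0 through its xpath() method, is a common choice for such pages.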
