
First Steps in Web Scraping with Python

Last updated at 2017-03-06 / Posted at 2017-03-06

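This is a beginner-oriented article on web scraping with Python 3 and BeautifulSoup4.

I looked at earlier articles on the topic, but some of their code showed warnings or did not run at all,
probably because of version differences, so I have put together an updated summary.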

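Overview

The basic flow of web scraping looks like this:

　① Fetch the web page.
　② Split the fetched page into elements and extract the parts you need.
　③ Save the result to a database.

Step ① uses requests to fetch the page, and step ② uses BeautifulSoup4.
Step ③ depends on your environment, so it is not covered in this article.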
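
To make the three steps concrete, here is a minimal sketch of the whole flow. It rests on several assumptions that are not part of this article: the HTML is a hard-coded string standing in for a fetched page, the built-in "html.parser" is used so the snippet runs without lxml, and step ③ uses an in-memory SQLite database with a made-up table name (links) purely for illustration. Your own storage will likely look different.

```python
import sqlite3
from bs4 import BeautifulSoup

# Step 1 stand-in: in a real run this string would come from requests.get(url).text
html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'

# Step 2: parse the page and pull out the parts of interest
soup = BeautifulSoup(html, "html.parser")
links = [(a.get("href"), a.get_text()) for a in soup.find_all("a")]

# Step 3: store the results (in-memory SQLite, hypothetical table name)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (href TEXT, label TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)", links)
conn.commit()

rows = conn.execute("SELECT href, label FROM links ORDER BY href").fetchall()
print(rows)
```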

Preparation

After installing Python 3,
use the pip command to install three packages: BeautifulSoup4, requests, and lxml.

$ pip install requests 
$ pip install lxml
$ pip install beautifulsoup4
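
If you want to confirm that the installation worked before going further, a small check like the following may help. It is only a sketch: check_packages is a helper name invented here, and it relies on the standard library's importlib to see whether each package can be found. Note that beautifulsoup4 is imported under the name bs4.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}

# beautifulsoup4 is installed under the import name "bs4"
status = check_packages(["requests", "lxml", "bs4"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```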

Running the program

Create the following script file.

sample.py

import requests
from bs4 import BeautifulSoup

target_url = 'http://example.co.jp'  # example.co.jp is a fictitious domain; replace it with any URL
r = requests.get(target_url)         # fetch the page from the web with requests
soup = BeautifulSoup(r.text, 'lxml') # parse the HTML with the lxml parser

for a in soup.find_all('a'):
    print(a.get('href'))             # print the href of each link

Open a command prompt and run the following command.

$ python sample.py

If the links on the page are printed to the console, it worked!
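
The script prints each href exactly as it appears in the page, so relative paths such as /about come out unresolved, and a tags without an href print None. One possible refinement, sketched below, resolves links against the page URL with the standard library's urljoin. The function name extract_links and the sample HTML are assumptions made for this illustration, and "html.parser" is used so the snippet runs without lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        href = a.get("href")
        if href is None:  # skip <a> tags without href, e.g. named anchors
            continue
        links.append(urljoin(base_url, href))
    return links

# Hypothetical page content used only for this example
html = '<a href="/about">About</a><a name="top">Top</a><a href="http://example.com/x">X</a>'
print(extract_links(html, "http://example.co.jp/index.html"))
```

With a real page you could pass r.text and r.url from the requests response instead of the hard-coded values.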

BeautifulSoup methods

Here are a few of BeautifulSoup's handy methods and attributes.

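soup.a.string                  # returns the text of the first a tag
soup.a.attrs                   # returns all of its attributes as a dict
soup.a.parent                  # returns its parent element

soup.find('a')                 # returns the first matching element
soup.find_all(id='log')        # returns every element with id="log"

soup.select('head > title')    # selects elements with a CSS selector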
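
To see what these calls return, it can help to try them on a small document. The HTML below is made up for this example, and "html.parser" is used so the snippet runs without lxml. Treat it as a rough sketch, not a complete tour of the API.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this example
html = """
<html><head><title>Sample page</title></head>
<body><p><a href="/news" id="log" class="nav">News</a>
<a href="http://example.com/">Example</a></p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a                        # shortcut for the first <a> tag
print(first.string)                   # text inside the tag
print(first.attrs)                    # dict of attributes; class is a list
print(first.parent.name)              # name of the enclosing tag

print(soup.find("a") == first)        # find() returns the same first match
print(len(soup.find_all(id="log")))   # number of elements with id="log"
print(soup.select("head > title")[0].string)  # select() returns a list
```

One thing to watch for: .string returns None when a tag contains more than one child, in which case get_text() is usually the safer choice.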

BeautifulSoup has many other useful methods.
See the official documentation for details.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Narrowing down elements

Regular expressions from the re module are handy for narrowing down the target elements.

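import re

soup.find_all('a', href=re.compile("^http"))      # links whose href starts with http
soup.find_all('a', href=re.compile("^(?!http)"))  # links whose href does not start with http (negative lookahead)

# a tags whose text contains N and whose title contains W
# (string= is the current name of the older text= argument)
soup.find_all('a', string=re.compile("N"), title=re.compile("W"))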
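
Here is a small runnable sketch of these filters on a made-up document, so you can see which links each pattern picks up. The HTML and variable names are assumptions for illustration only, and "html.parser" is used so it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this example
html = """
<a href="http://example.com/" title="World News">News</a>
<a href="/about" title="About">About</a>
<a href="https://example.org/" title="Reference">Notes</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href starts with "http" (this also matches "https")
absolute = soup.find_all("a", href=re.compile("^http"))

# href does not start with "http" (negative lookahead)
relative = soup.find_all("a", href=re.compile("^(?!http)"))

# link text contains "N" AND title contains "W"
both = soup.find_all("a", string=re.compile("N"), title=re.compile("W"))

print([a["href"] for a in absolute])
print([a["href"] for a in relative])
print([a["href"] for a in both])
```

Note that "^http" also matches https URLs, and that a tags with no href at all are not returned by either href filter.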

String manipulation

Here are a few string operations that are worth remembering when scraping.

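・Remove whitespace around a string

"  abc  ".strip()
→abc

・Split a string

"a, b, c,".split(',')
→['a', ' b', ' c', '']

・Search within a string

"abcde".find('c') # returns the index if the substring is found, or -1 if it is not
→2

・Replace characters

"abcdc".replace('c', 'x')
→abxdx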
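
In practice these operations are often combined. Splitting on a comma keeps the spaces around each item and leaves an empty string after a trailing comma, so a common pattern is to strip each piece and drop the empty ones. The helper name clean_fields below is invented for this sketch.

```python
def clean_fields(raw, sep=","):
    """Split raw on sep, strip whitespace from each piece, and drop empty pieces."""
    return [part.strip() for part in raw.split(sep) if part.strip()]

print(clean_fields("a, b, c,"))
print(clean_fields("  Tokyo | Osaka |  ", sep="|"))
```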

