
A Beginner Tries Web Scraping with Python (1)

Posted at 2020-09-10, last updated at 2020-09-13

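I knew nothing about the cloud and nothing about Python,
and it has already been a month since I started studying Python + GCP from scratch.
I became interested in web scraping with Python, so
while learning how to use requests, the various attributes of the response object it returns,
and HTML parsing with BeautifulSoup,
I will start by scraping Yahoo News.
　* This article uses Python 3.7.3 installed on macOS Catalina.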

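Roadmap for learning web scraping with Python

(1) Scrape the target successfully on my local machine, for a start.　← you are here
(2) Send the locally scraped results to Google Sheets.
(3) Run the script automatically with cron on my local machine.
(4) Try free automated execution on a cloud server (Google Compute Engine).
(5) Try free, serverless automated execution in the cloud (probably Cloud Functions + Cloud Scheduler).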

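What sample program (1) does

・Uses requests to GET a website's content
・Parses the HTML with BeautifulSoup
・Uses the re library (regular expressions) to search for a specific string, which identifies the headline news
・Prints every news title and link in the resulting list to the console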

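What is requests?

requests is an external library for HTTP communication in Python.
It makes collecting information from websites simple.
You can also fetch a URL with urllib, which is part of Python's standard library,
but requests needs less code and reads more simply.
Because it is a third-party library, though, it has to be installed.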

Installing requests

It can be installed with pip.
Here is the clean state of a virtual environment freshly created with virtualenv.

bash

% virtualenv -p python3.7 env3
% source env3/bin/activate
(env3) % pip list
Package    Version
---------- -------
pip        20.2.3
setuptools 49.2.1
wheel      0.34.2

Install with pip, then use pip list to confirm that it was installed (and which version).
pip also pulls in several dependencies along the way.

bash

(env3) % pip install requests
Collecting requests
  Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting chardet<4,>=3.0.2
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Installing collected packages: idna, chardet, urllib3, certifi, requests
Successfully installed certifi-2020.6.20 chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.10
(env3) % pip list
Package    Version
---------- ---------
certifi    2020.6.20
chardet    3.0.4
idna       2.10
pip        20.2.3
requests   2.24.0
setuptools 49.2.1
urllib3    1.25.10
wheel      0.34.2

requests methods

requests supports the common HTTP request methods
such as get, post, put, and delete.
This article uses get.

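Attributes of the requests response object

The response object returned by requests.get has many attributes.
The sample program prints the following ones to check them.

Attribute	What it gives you
url	The URL that was accessed.
status_code	The status code (HTTP status).
headers	The response headers.
encoding	The encoding that Requests guessed.

There are other attributes as well, such as text and content.

The headers attribute is a dict-like object (a case-insensitive dictionary). For Yahoo News it contains many keys, as shown below, so the sample program prints only the value of the 'Content-Type' key.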

bash

{'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=UTF-8', 'Date': 'Wed, 09 Sep 2020 02:24:04 GMT', 'Set-Cookie': 'B=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp, XB=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp; secure; samesite=none', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Frame-Options': 'DENY', 'X-Vcap-Request-Id': 'd130bb1e-4e53-4738-4b02-8419633dd825', 'X-Xss-Protection': '1; mode=block', 'Age': '0', 'Server': 'ATS', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Via': 'http/1.1 edge2821.img.kth.yahoo.co.jp (ApacheTrafficServer [c sSf ])'}
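
The charset inside the Content-Type value shown above is what an HTTP client typically reads when choosing a text encoding. As a minimal sketch using only the standard library (an illustration of the idea, not a description of how requests works internally), the header value can be split into its media type and charset like this:

```python
from email.message import Message

# Header value copied from the Yahoo News response shown above.
content_type = "text/html;charset=UTF-8"

# email.message.Message can parse MIME-style header parameters.
msg = Message()
msg["Content-Type"] = content_type

media_type = msg.get_content_type()    # media type without parameters
charset = msg.get_content_charset()    # charset parameter, lowercased

print(media_type)
print(charset)
```

The charset found here is consistent with the encoding value that requests reports for the same response.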

Source for the requests.get part

Here is an excerpt of the source that calls requests.get and prints each attribute of the returned response object.

import requests

url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
# print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code)  # HTTP status code, usually 200 (OK)
print('headers[Content-Type]:',response.headers['Content-Type'])  # headers is dict-like, so look up Content-Type by key
print('encoding: ',response.encoding)  # encoding guessed by requests

Here is the result.

bash

(env3) % python requests-test.py
url:  https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding:  UTF-8

What is Beautiful Soup?

Beautiful Soup is a Python library for web scraping
that extracts and parses data from HTML and XML files.
It makes tasks such as pulling out specific HTML tags easy.

Installing beautifulsoup4

Same as requests: it can be installed with pip.

bash

(env3) % pip install beautifulsoup4
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1
(env3) % pip list                  
Package        Version
-------------- ---------
beautifulsoup4 4.9.1
certifi        2020.6.20
chardet        3.0.4
idna           2.10
pip            20.2.3
requests       2.24.0
setuptools     49.2.1
soupsieve      2.0.1
urllib3        1.25.10
wheel          0.34.2

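Arguments to Beautiful Soup

BeautifulSoup takes the markup to parse (HTML or XML) as its first argument
(in the sample, response.text, the body text of the response object fetched with requests)
and the name of the parser to use as its second argument.

Parser	Example	Strengths	Weaknesses
Python’s html.parser	BeautifulSoup(response.text, "html.parser")	Part of the standard library	Slower than lxml; less lenient with broken HTML on Python older than 2.7.3 / 3.2.2
lxml’s HTML parser	BeautifulSoup(response.text, "lxml")	Very fast	Requires installation
lxml’s XML parser	BeautifulSoup(response.text, "xml")	Very fast; the only XML parser	Requires installation
html5lib	BeautifulSoup(response.text, "html5lib")	Handles HTML5 correctly	Requires installation; very slow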

soup = BeautifulSoup(response.text, "html.parser")
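
To see what the two arguments do without touching the network, here is a minimal sketch that parses a small hand-written HTML string. The markup and the example.com URLs are invented for this illustration; in the article's program the string is response.text.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for response.text (an assumption for this sketch).
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://example.com/a">First link</a>
    <a href="https://example.com/b">Second link</a>
  </body>
</html>
"""

# First argument: the markup as a string. Second argument: the parser name.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

Swapping "html.parser" for "lxml" or "html5lib" should only require that the corresponding package is installed.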

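BeautifulSoup has many methods; this article uses find_all.
find_all also accepts a variety of arguments; this article uses keyword arguments.

find_all: keyword arguments

You can specify a tag attribute as a keyword argument to get the tags that match.

The value of a keyword argument can itself be a string, a regular expression, a list, a function, or True. You can also specify multiple keyword arguments.

For example, if you pass a value for href as a keyword argument, Beautiful Soup filters on the href attribute of HTML tags.

Source: https://ai-inter1.com/beautifulsoup_1/#find_all_detail

In other words, by asking find_all on the soup object for
"elements whose href attribute value matches the given regular expression",
the following example extracts every element whose href attribute contains "news.yahoo.co.jp/pickup".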

elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
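
One detail worth knowing about the pattern above: in a regular expression an unescaped . matches any character, and Beautiful Soup applies a compiled pattern with its search() method, so the pattern can match anywhere inside the href value. A small sketch using only the re module (the second and third URLs below are made up for illustration) shows the difference that re.escape makes:

```python
import re

loose = re.compile("news.yahoo.co.jp/pickup")              # "." matches any character
strict = re.compile(re.escape("news.yahoo.co.jp/pickup"))  # "." matches only a literal dot

# Sample href values; the last two are hypothetical.
hrefs = [
    "https://news.yahoo.co.jp/pickup/6370639",
    "https://newsXyahoo.co.jp/pickup/1",
    "https://news.yahoo.co.jp/flash",
]

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```

For the Yahoo News page the loose pattern is unlikely to cause trouble in practice, but escaping the dots makes the intent explicit.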

Final sample source

Finally, a for loop prints the title and link of each extracted news item to the console.
Here is the final sample source.

requests-test.py

import requests
from bs4 import BeautifulSoup
import re

# Download the website's content with requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
# print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code)  # HTTP status code, usually 200 (OK)
print('headers[Content-Type]:',response.headers['Content-Type'])  # headers is dict-like, so look up Content-Type by key
print('encoding: ',response.encoding)  # encoding guessed by requests

# Pass the fetched page text and the parser name "html.parser" to BeautifulSoup()
soup = BeautifulSoup(response.text, "html.parser")

# Extract every element whose href attribute contains "news.yahoo.co.jp/pickup"
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))

# Print the title and link of each extracted news item to the console
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])

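Parts of the program are close to straight copies from the sites listed under References.
They were a great help.

Afterword

Leaving aside the imports and the prints of the response object that are only there for checking,
web scraping takes just 7 lines.
Python and the libraries built by those who came before are impressive.

Here is the result. The scraping worked, for now!
The last item, a news entry with a photo, is unwanted, but I don't know how to deal with it, so I am leaving it as is for now...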

bash

% python requests-test.py
url:  https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding:  UTF-8
ドコモ口座 連携銀の過半停止
https://news.yahoo.co.jp/pickup/6370639
菅氏 自衛隊に関する発言訂正
https://news.yahoo.co.jp/pickup/6370647
3年連続冠水 イチゴ農家苦悩
https://news.yahoo.co.jp/pickup/6370631
海に4人乗った車転落 2人死亡
https://news.yahoo.co.jp/pickup/6370633
新疆でムーラン撮影 再び反発
https://news.yahoo.co.jp/pickup/6370640
親が偏見 パニック障害で苦悩
https://news.yahoo.co.jp/pickup/6370643
平岡卓被告 懲役2年6カ月求刑
https://news.yahoo.co.jp/pickup/6370646
伊勢谷容疑者 巻紙500枚押収
https://news.yahoo.co.jp/pickup/6370638
<span class="topics_photo_img" style="background-image:url(https://lpt.c.yimg.jp/amd/20200909-00000031-asahi-000-view.jpg)"></span>
https://news.yahoo.co.jp/pickup/6370647
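
The last two lines of the output above suggest why the photo item looks odd: for that link, elem.contents[0] is a <span> tag rather than text, and its href repeats one that was already printed. One possible way to handle it is to read the link text with get_text(strip=True) and skip links that have no text. The sketch below runs on a hand-written HTML fragment that only imitates the structure implied by the output; the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure implied by the output above.
html = """
<ul>
  <li><a href="https://news.yahoo.co.jp/pickup/6370639">ドコモ口座 連携銀の過半停止</a></li>
  <li><a href="https://news.yahoo.co.jp/pickup/6370647">菅氏 自衛隊に関する発言訂正</a></li>
</ul>
<a href="https://news.yahoo.co.jp/pickup/6370647"><span class="topics_photo_img"></span></a>
"""

soup = BeautifulSoup(html, "html.parser")
elems = soup.find_all(href=re.compile("news.yahoo.co.jp/pickup"))

headlines = []
for elem in elems:
    title = elem.get_text(strip=True)  # text inside the link, "" if there is none
    if not title:
        continue                       # skip the photo-only link
    headlines.append((title, elem["href"]))

for title, href in headlines:
    print(title)
    print(href)
```

If the photo link did carry its own text, remembering the href values already seen and skipping repeats would be another option.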

References:
https://requests-docs-ja.readthedocs.io/en/latest/
https://ai-inter1.com/beautifulsoup_1/
http://kondou.com/BS4/
