
Scraping a site behind authentication with Python

Last updated at 2020-04-19. Posted at 2020-04-18.

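Introduction

This article shows how to scrape a site protected by Digest authentication with Python. (The code differs little from the Basic authentication case.)
For scraping in general and for other authentication schemes, the following are useful:
[Python Webスクレイピング実践入門 (A practical introduction to web scraping with Python)](https://qiita.com/Azunyan1111/items/9b3d16428d2bcc7c9406)
[【Python】Basic認証があるページへのスクレイピング (Scraping a page behind Basic authentication with Python)](https://aga-note.com/python-scraping-basic-auth/)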
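
To make the comparison between Basic and Digest authentication concrete, here is a minimal sketch of what each scheme puts on the wire, using only the standard library. It assumes the MD5 algorithm with `qop=auth` as described in RFC 2617; the function names are made up for illustration, and the Digest sample values are the worked example from that RFC, not anything taken from the site used in this article.

```python
import base64
import hashlib


def basic_credentials(username, password):
    """Basic auth: the credentials are only base64-encoded, not hashed."""
    token = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(token).decode("ascii")


def _md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, password, realm, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Digest auth (MD5, qop=auth): only a hash derived from the credentials is sent."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")
    ha2 = _md5_hex(f"{method}:{uri}")
    return _md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


if __name__ == "__main__":
    # Value that would follow "Authorization: Basic "
    print(basic_credentials("hoge", "fuga"))
    # "response" field of a Digest Authorization header (RFC 2617 sample values)
    print(digest_response(
        "Mufasa", "Circle Of Life", "testrealm@host.com",
        "GET", "/dir/index.html",
        "dcd98b7102dd2f0e8b11d0f600bfb0c093", "00000001", "0a4f113b",
    ))
```

In practice requests and urllib do this computation for you; the point is only that Digest never sends the password itself, while the calling code stays almost the same.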

Notes
When scraping, you need to respect the applicable terms of service and good manners.
Webスクレイピングの注意事項一覧 (A list of precautions for web scraping)
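
One concrete way to follow such rules is to honor robots.txt and to pause between requests. The following is a rough sketch using the standard library's `urllib.robotparser`; the robots.txt content, the user agent name `my-scraper`, and the example.com URLs are all invented for illustration. Passing this check does not replace reading a site's terms of use.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content. For a real site you would instead call
# RobotFileParser.set_url(".../robots.txt") followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt allows for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "http://example.com/public/page.html",
        "http://example.com/private/secret.html",
    ]
    for url in allowed_urls(parser, "my-scraper", candidates):
        print("would fetch:", url)
        time.sleep(1)  # wait between requests to keep the load low
```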

Requirements

Language: Python 3.7.4
Libraries: requests, requests.auth, bs4, urllib.request

Installing the libraries

Install the following two packages with pip.

pip install requests
pip install beautifulsoup4

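Once installation is complete, we can move on to the hands-on part.

Hands-on

As the example, I used the sample Digest-authenticated web page created by the administrator of the following site.
[HTTP クライアントを作ってみよう(6) - Digest 認証編 - (Let's build an HTTP client (6): Digest authentication)](http://x68000.q-e-d.net/~68user/net/http-auth-2.html)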

import requests
from requests.auth import HTTPDigestAuth
from bs4 import BeautifulSoup

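# 1. Website URL and the user/password for Digest authentication
url = 'http://X68000.q-e-d.net/~68user/net/sample/http-auth-digest/secret.html'
username = 'hoge'
password = 'fuga'

# 2. Fetch the Digest-authenticated URL
res = requests.get(url, auth=HTTPDigestAuth(username, password))
res.raise_for_status()  # stop here if the request failed (e.g. 401)
content = res.content

# 3. Parse the HTML
# Whole document
data = BeautifulSoup(content, 'html.parser')
# Get the title
title = data.title.string
# Get the body text (.string would be None if <body> has several children)
body = data.body.get_text(strip=True)
print(title, body)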
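
One detail worth knowing when reading the body: in Beautiful Soup, `.string` returns `None` when a tag has more than one child, which is common for `<body>`, while `get_text()` collects all the text inside the tag. The small sketch below illustrates this on an in-memory HTML snippet; the HTML is invented for the demonstration and is not the sample page's actual markup.

```python
from bs4 import BeautifulSoup

# Invented HTML with two child elements inside <body>
html = (
    "<html><head><title>Sample</title></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# <title> has a single text child, so .string works
title = soup.title.string
# <body> has two children, so .string gives None
body_string = soup.body.string
# get_text() joins every text fragment under <body>
body_text = soup.body.get_text(" ", strip=True)

print(title, body_string, body_text)
```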

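A slightly more advanced use

For completeness, here is also how to download a file such as an image or an Excel workbook directly from a Digest-authenticated URL.
I could not find an actual file URL behind Digest authentication, so only the method is shown.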

import os
import urllib.request

# 1. Website URL and the user/password for Digest authentication
# (masked in the original; replace with your own values)
url = '******************'
username = '******************'
password = '******************'

# 2. Read the file at the Digest-authenticated URL
# Explanation 1
password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_manager.add_password(None, url, username, password)
# Explanation 2
authhandler = urllib.request.HTTPDigestAuthHandler(password_manager)
opener = urllib.request.build_opener(authhandler)
# Read the file contents
file_content = opener.open(url).read()

# 3. Save the file next to this script (an Excel file is assumed, hence the .xlsx extension)
excel_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'file.xlsx')
with open(excel_path, mode="wb") as f:
    f.write(file_content)
    print("Saved")
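
Note that `read()` loads the entire file into memory before it is written. For large files a streaming variant may be preferable. The sketch below copies from any binary file-like object (such as the response returned by `opener.open(url)`) to disk in chunks. The helper name `save_stream` and the chunk size are assumptions for illustration, and the demo uses an in-memory stream in place of a real download.

```python
import io
import os
import shutil
import tempfile


def save_stream(source, dest_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to dest_path in chunks.

    Returns the number of bytes written to disk.
    """
    with open(dest_path, "wb") as f:
        shutil.copyfileobj(source, f, chunk_size)
    return os.path.getsize(dest_path)


if __name__ == "__main__":
    # Stand-in for a downloaded response; not real Excel data
    fake_response = io.BytesIO(b"dummy file content")
    dest = os.path.join(tempfile.mkdtemp(), "file.xlsx")
    print(save_stream(fake_response, dest), "bytes saved to", dest)
```

With a real opener this would be used as `with opener.open(url) as res: save_stream(res, dest)`.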

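Explanation 1
Register the credentials needed for Digest authentication in a password manager object.
* HTTPPasswordMgrWithDefaultRealm(): creates the password manager object
* add_password: the method that registers the credentials in it (passing None as the realm makes them the default for any realm)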
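
To see what registering with `None` as the realm does, the sketch below queries the password manager directly, without any network access. The example.com and example.org URLs are placeholders, and `find_user_password` is called by hand here only for demonstration; normally the authentication handler calls it.

```python
import urllib.request

password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# None as the realm: use these credentials for any realm under this URL prefix
password_manager.add_password(None, "http://example.com/secret/", "hoge", "fuga")

# Any realm name matches as long as the URL is under the registered prefix
match = password_manager.find_user_password(
    "Some Realm", "http://example.com/secret/file.xlsx"
)
# A different host has no registered credentials
miss = password_manager.find_user_password(
    "Some Realm", "http://other.example.org/file.xlsx"
)
print(match, miss)
```

This is why the code can register credentials without knowing the realm name the server will announce.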

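Explanation 2
Open the Digest-authenticated URL.
* HTTPDigestAuthHandler: creates a handler that performs Digest authentication
* build_opener: creates an opener that can open the authenticated URL
For more detail, see the reference sites.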
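
The handler and opener steps can also be wrapped into small helpers, as in the following sketch. The function names `build_digest_opener` and `fetch`, the timeout value, and the example.com URL are assumptions for illustration. The network call is only defined, not executed, so the snippet runs offline.

```python
import urllib.error
import urllib.request


def build_digest_opener(url, username, password):
    """Return an opener that answers Digest challenges for the given URL."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, username, password)
    handler = urllib.request.HTTPDigestAuthHandler(manager)
    return urllib.request.build_opener(handler)


def fetch(opener, url, timeout=10):
    """Return the response body, or None if the server answers with an HTTP error."""
    try:
        with opener.open(url, timeout=timeout) as res:
            return res.read()
    except urllib.error.HTTPError as e:
        # For example 401 when the credentials are rejected
        print("HTTP error:", e.code)
        return None


opener = build_digest_opener("http://example.com/secret/", "hoge", "fuga")
# build_opener keeps the default handlers and adds the Digest handler to them
print([type(h).__name__ for h in opener.handlers])
```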

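References

Scraping manners
Webスクレイピングの注意事項一覧 (A list of precautions for web scraping)

Scraping in practice
[Python Webスクレイピング実践入門 (A practical introduction to web scraping with Python)](https://qiita.com/Azunyan1111/items/9b3d16428d2bcc7c9406)
[【Python】Basic認証があるページへのスクレイピング (Scraping a page behind Basic authentication with Python)](https://aga-note.com/python-scraping-basic-auth/)

Official documentation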
Authentication — Requests 2.23.0 documentation
Official documentation for urllib.request
