More than 3 years have passed since last update.

The Basics of Scraping with Requests That Any Beginner Can Follow [Python]

Posted at 2020-07-07

Requests basics

import

import requests

This import is always required.

Fetching the source of a website

Fetch with the GET method
Fetch with the POST method

These two are all you need to remember.

Fetching with the GET method (requests.get)

import requests

url = 'https://www.yahoo.co.jp/'
response = requests.get(url)
print(response) # →<Response [200]>

html = response.text
print(html) # → the HTML source as a string

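requests.get(url) returns a Response object, not the status code itself. Printing it shows the HTTP status code, which is also available as response.status_code.
If the request succeeded, the status code is 200.

response.text gives you the HTML source you are after as a string.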
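As a minimal sketch of checking the status before using response.text: the helper name ensure_success is made up for this example, and the Response objects are built by hand as offline stand-ins for what requests.get(url) would return.

```python
import requests


def ensure_success(response):
    """Return the response unchanged, or raise HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response


# Hand-built Response objects stand in for real network replies so the
# snippet runs offline. In real code they come from requests.get(url).
ok = requests.Response()
ok.status_code = 200

not_found = requests.Response()
not_found.status_code = 404

print(ok.status_code, ok.ok)                # .ok is True for a 200 status
print(not_found.status_code, not_found.ok)  # .ok is False for a 404 status

try:
    ensure_success(not_found)
    failed = False
except requests.exceptions.HTTPError:
    failed = True
print(failed)
```

Calling raise_for_status() right after the request is one way to stop early instead of scraping an error page by mistake.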

Fetching with the POST method (requests.post)

Some pages only return the source you want when you use the POST method.

data = {'username': 'tarouyamada', 'password': '4r8q99fiad'}

response = requests.post(url, data=data)

This sends the request with the data included in the request body.
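To see what data= actually puts in the request body, the request can be prepared without being sent. This is only a sketch: the URL is a placeholder, not a real login endpoint.

```python
import requests

# Placeholder URL for illustration; prepare() builds the request but sends nothing.
url = 'https://example.com/login'
data = {'username': 'tarouyamada', 'password': '4r8q99fiad'}

prepared = requests.Request('POST', url, data=data).prepare()

print(prepared.method)                   # POST
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # username=tarouyamada&password=4r8q99fiad
```

A dict passed as data= is form-encoded, just like an HTML form submission. If the server expects a JSON body instead, requests.post also accepts a json= argument.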

Adding request headers

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 
'accept': 'application/json'}

response = requests.get(url, headers=headers)

This sends the request with the given request headers attached.
The syntax is the same for get and post.
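A small sketch for confirming which headers would be sent, again without touching the network. The URL and the user-agent string 'my-scraper/0.1' are made up for the example.

```python
import requests

# Placeholder URL and user-agent; prepare() builds the request without sending it.
url = 'https://example.com/api'
headers = {'user-agent': 'my-scraper/0.1', 'accept': 'application/json'}

prepared = requests.Request('GET', url, headers=headers).prepare()

# Header names are matched case-insensitively.
print(prepared.headers['User-Agent'])  # my-scraper/0.1
print(prepared.headers['Accept'])      # application/json
```

After a real request, response.request.headers shows the headers that were actually sent.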

Fetching an image

Use .content to get the response body as binary data.
An image is one kind of binary data.

response = requests.get(url)

img_data = response.content

print(img_data)
# b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05\n\x07\x07\x06\x08\x0c\n\x0………

print(type(img_data))
# <class 'bytes'>

The output is of type bytes.

Saving the image

To save the image data you fetched:
- You are writing a binary file, so add 'b' to the file mode.

with open('test.jpg', 'wb') as f:
    f.write(response.content)
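A server sometimes returns an HTML error page where an image was expected, so it can help to check the bytes before writing them. The sketch below relies on the fact that JPEG data starts with the bytes FF D8 FF, as in the output shown earlier. The helper name save_jpeg is made up, and a short byte string stands in for response.content.

```python
import os
import tempfile

JPEG_SIGNATURE = b'\xff\xd8\xff'  # every JPEG file starts with these bytes


def save_jpeg(data, path):
    """Write data to path only if it starts with the JPEG signature."""
    if not data.startswith(JPEG_SIGNATURE):
        raise ValueError('data does not look like a JPEG image')
    with open(path, 'wb') as f:  # 'wb' = write, binary
        f.write(data)
    return len(data)


# Stand-in for response.content so the example runs offline.
img_data = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01'

out_path = os.path.join(tempfile.mkdtemp(), 'test.jpg')
written = save_jpeg(img_data, out_path)
print(written, os.path.getsize(out_path))
```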

Specifying query parameters

params = {'q':'qiita', 'date':'2020-7-3'}

response = requests.get(url, params=params)
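To see how params ends up in the URL, the request can be prepared without sending it. The URL here is a placeholder chosen for the example.

```python
import requests

# Placeholder URL for illustration; nothing is sent over the network.
url = 'https://example.com/search'
params = {'q': 'qiita', 'date': '2020-7-3'}

prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)  # https://example.com/search?q=qiita&date=2020-7-3
```

Passing a dict this way lets requests handle the URL encoding, which matters once values contain spaces or non-ASCII characters.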

Inspecting the response headers

The content-type header looks useful for deciding whether the body is text, JSON, or an image.

response = requests.get(
    'https://www.pakutaso.com/shared/img/thumb/nekocyan458A3541_TP_V.jpg')

print(response.headers)

# {'Server': 'nginx', 'Date': 'Tue, 07 Jul 2020 22:39:37 GMT', 'Content-Type': 'image/jpeg', 'Content-Length': '239027', 'Last-Modified': 'Sun, 05 Jul 2020 01:51:48 GMT', 'Connection': 'keep-alive', 'ETag': '"5f013234-3a5b3"', 'Expires': 'Thu, 06 Aug 2020 22:39:37 GMT', 'Cache-Control': 'max-age=2592000', 'X-Powered-By': 'PleskLin', 'Strict-Transport-Security': 'max-age=31536000;  includeSubDomains; preload', 'Accept-Ranges': 'bytes'}
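One possible way to turn content-type into a decision is sketched below. The function name body_kind and its four categories are choices made for this example. response.headers is a case-insensitive dictionary, so a CaseInsensitiveDict holding two of the values from the output above is used as a stand-in.

```python
from requests.structures import CaseInsensitiveDict


def body_kind(headers):
    """Classify a response as 'json', 'image', 'text' or 'other' by Content-Type."""
    content_type = headers.get('Content-Type', '')
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(';')[0].strip().lower()
    if media_type == 'application/json' or media_type.endswith('+json'):
        return 'json'
    if media_type.startswith('image/'):
        return 'image'
    if media_type.startswith('text/'):
        return 'text'
    return 'other'


# Stand-in for response.headers, using values from the output above.
headers = CaseInsensitiveDict({'Content-Type': 'image/jpeg', 'Content-Length': '239027'})

print(body_kind(headers))       # image
print(headers['content-type'])  # image/jpeg (lookup ignores case)
```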

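When there is a redirect

By default, requests follows the redirect and returns the response from the final destination.

To use the intermediate responses along the redirect chain, use .history.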
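The sketch below shows .history in action. To keep it runnable without internet access, it starts a throwaway local server whose /old path redirects to /new. The server, its paths, and the handler name are all made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Throwaway local server: /old redirects to /new."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'final page'
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

response = session.get(base + '/old', timeout=5)
print(response.status_code)                        # status of the final response
print(response.url)                                # final URL, ending in /new
print([r.status_code for r in response.history])   # one entry per redirect hop

# allow_redirects=False returns the redirect response itself.
no_follow = session.get(base + '/old', allow_redirects=False, timeout=5)
print(no_follow.status_code, no_follow.headers['Location'])

server.shutdown()
server.server_close()
```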

Checking the encoding

response = requests.get(
    'https://qiita.com/')

print(response.encoding)

# utf-8
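response.encoding is derived from the charset in the content-type response header. The sketch below calls the helper in requests.utils that performs that lookup, using hand-written header dicts instead of a real response.

```python
from requests.utils import get_encoding_from_headers

# Hand-written headers stand in for response.headers.
with_charset = get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'})
without_charset = get_encoding_from_headers({'content-type': 'text/html'})

print(with_charset)     # utf-8
print(without_charset)  # ISO-8859-1 (fallback for text/* without a charset)
```

When a page declares no charset, the ISO-8859-1 fallback can garble Japanese text in response.text. One common workaround is to set response.encoding = response.apparent_encoding before reading response.text.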

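Fetching JSON data

response.json() parses the response body and returns it as Python objects (a dict when the top-level JSON value is an object).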

response = requests.get(url)

json_dict = response.json()
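response.json() fails when the body is not valid JSON, for example when the server returns an HTML page. The sketch below shows both cases against a throwaway local server. The server, its /data and /page paths, and the sample payload are made up for the example.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Throwaway local server: /data returns JSON, anything else returns HTML."""

    def do_GET(self):
        if self.path == '/data':
            body = json.dumps({'title': 'qiita', 'likes': 3}).encode('utf-8')
            content_type = 'application/json'
        else:
            body = b'<html>not json</html>'
            content_type = 'text/html; charset=utf-8'
        self.send_response(200)
        self.send_header('Content-Type', content_type)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), JsonHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

json_dict = session.get(base + '/data', timeout=5).json()
print(type(json_dict), json_dict['title'])

try:
    session.get(base + '/page', timeout=5).json()
    parsed_html = True
except ValueError:  # raised when the body is not valid JSON
    parsed_html = False
print(parsed_html)

server.shutdown()
server.server_close()
```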
