
Python3 Scraping Basics (JSON)

Last updated at 2018-03-10, posted at 2018-01-05

In this article I would like to cover the basics of web scraping with Python3. The data we scrape will be in JSON format.

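Installing the libraries

Libraries used for scraping:

requests: HTTP library
beautifulsoup4: HTML parser (called from Python)
lxml: fast HTML/XML parser that beautifulsoup4 can use internally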

pip3 install requests
pip3 install beautifulsoup4
pip3 install lxml

Checking the installed libraries

pip3 freeze | grep -e request -e beautiful

   or

pip3 freeze
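
The same check can also be done from inside Python. The following is a minimal sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is hypothetical and not part of any of the libraries above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report each library this article relies on.
for name in ("requests", "beautifulsoup4", "lxml"):
    print(name, installed_version(name))
```

A value of None means the corresponding pip3 install step still needs to be run.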

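About JSON

Since this article scrapes data in JSON format, here is a brief explanation of JSON.
JSON stands for "JavaScript Object Notation" and is a data format based on JavaScript syntax. Only the notation comes from JavaScript, though; it can be used from many other languages. In JSON, a value and the key that names it are paired with a colon, the pairs are separated by commas, and the whole is enclosed in curly braces. Several such objects can be listed inside square brackets as an array, as in the example below.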

[{
 "@type" : "LocalBusiness",
 "name" : "東京ディズニーランド(R)"
},
{
 "@type" : "LocalBusiness",
 "name" : "東京ディズニーシー(R)"
},
{
 "@type" : "LocalBusiness",
 "name" : "草津温泉"
},
{
 "@type" : "LocalBusiness",
 "name" : "みなとみらい21"
},
{
 "@type" : "LocalBusiness",
 "name" : "あしかがフラワーパーク"
}]
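
To see how this maps onto Python, here is a small sketch that parses a shortened copy of the example above with the standard json module. The variable names text, data, and names are only illustrative.

```python
import json

# A shortened copy of the example above, held as a JSON string.
text = """
[{"@type": "LocalBusiness", "name": "東京ディズニーランド(R)"},
 {"@type": "LocalBusiness", "name": "草津温泉"}]
"""

# json.loads turns a JSON array into a list and each JSON object into a dict.
data = json.loads(text)

# Values are looked up by key, just like any other dict.
names = [item["name"] for item in data]
print(names)
```

The outer square brackets become a Python list, so the entries can be indexed or looped over directly.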

Usage

First, import the libraries.

import requests  # fetches pages over HTTP
from bs4 import BeautifulSoup  # parses the fetched HTML
import json  # handles JSON data in Python3

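Next, fetch the HTML content. Note that requests needs the scheme (https://) in the URL.

target_url = 'https://www.*****.com'
r = requests.get(target_url)  # fetch the page from the web with requests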

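Create the object used to parse the HTML. To choose the parser used internally, change "html.parser" to something like "lxml".

soup = BeautifulSoup(r.text, "html.parser")
 or
soup = BeautifulSoup(r.content, 'lxml')  # parse the page so elements can be extracted

lxml is generally the faster parser, so it is recommended.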

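One thing to watch out for here is garbled text (mojibake).
Using Requests' r.encoding or r.text as-is tends to produce garbled text. There are two ways to avoid this:

Install Chardet
Pass r.content to BeautifulSoup() and let Beautiful Soup do the decoding

Installing Chardet avoids garbled text in most cases. It is simple, so I recommend it.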
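
To illustrate why the decoding step matters, here is a sketch that uses only the standard library. It assumes the page is served as UTF-8 and that the client guesses ISO-8859-1 instead, which, as far as I know, is what Requests falls back to for text responses that declare no charset. The variable names are illustrative.

```python
# Bytes as they would arrive over the wire, assuming the page is UTF-8.
raw = "草津温泉".encode("utf-8")

# Decoding with the wrong codec does not raise an error;
# it silently produces unreadable characters.
garbled = raw.decode("iso-8859-1")

# Decoding with the correct codec restores the original text.
restored = raw.decode("utf-8")

print(repr(garbled))
print(restored)
```

Passing the raw bytes (r.content) to Beautiful Soup lets it inspect the document itself when choosing the codec, rather than relying on a guess made from the headers alone.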

Sample code

min_page = 1
max_page = 550


while min_page <= max_page:
    target_url = "https://www.***/page_" + str(min_page) + ".com"
    print(target_url)  # show the URL

    r = requests.get(target_url)  # fetch the page from the web with requests
    soup = BeautifulSoup(r.content, "lxml")  # parse the page so elements can be extracted

    # Get specific tags (script tags whose type is application/ld+json)
    title_part = soup.find_all("script", {"type": "application/ld+json"})

    for i in title_part:
        title = i.get_text()  # take only the text inside the tag

        # json.loads parses the JSON string into Python objects (dict or list),
        # which makes it easy to pull out individual elements.

        a = json.loads(title)
        print(a)  # show the parsed data

    min_page += 1  # move on to the next page
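
Once the JSON is parsed, individual fields can be pulled out. The following is a sketch only: the helper name extract_names is hypothetical, and it assumes the field of interest is "name", as in the JSON example earlier in this article. It accepts either a single object or an array and skips text that is not valid JSON.

```python
import json


def extract_names(json_text):
    """Return every "name" value found in one JSON-LD string."""
    try:
        parsed = json.loads(json_text)
    except json.JSONDecodeError:
        return []  # skip script blocks that do not contain valid JSON

    # The top level may be a single object or an array of objects.
    items = parsed if isinstance(parsed, list) else [parsed]
    return [
        item["name"]
        for item in items
        if isinstance(item, dict) and "name" in item
    ]


print(extract_names('[{"@type": "LocalBusiness", "name": "草津温泉"}]'))
```

In the loop above, this could be called as extract_names(title) in place of the json.loads and print pair.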
