
Saving an admin tool's (web app) JSON download to local disk without the hassle

Last updated at 2019-08-26 / Posted at 2019-08-21

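Background

Addendum on 2019/8/26 (an error can occur when the response is extremely slow)
The operations admin tool for the server that manages our products has a feature along the lines of "download all products (.json)"
Logging in to the admin tool, clicking the download button, saving the file, and moving it to the working directory every time is tedious
With a view to automating the whole flow, I want to save the JSON to a local directory from the command line
On the admin tool side, the download works as long as you can reach the link https://xxxx.yyyy.zz/aaa/bbb/download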

Conclusion

wget did not work (there is no way to point it at the actual file)
curl did not work either (the tool requires login, and I could not get past that)
So: scrape it

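Sticking points

You cannot reach the target URL https://xxxx.yyyy.zz/aaa/bbb/download without getting past the login screen (it returns 404)
- → Log in first, then request that URL
I did not know which URL the login screen POSTs the id/pass to
- → There is a way to look it up, so I used that
Not exactly a sticking point, but running the downloads sequentially was painfully slow (about 730 s)
- → Made it parallel
When parallelizing, there was no simple way to pass two or more parameters
- → Passed them as a list of strings in an agreed order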
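
As a side note on the last point, here is a minimal sketch of two other ways to hand several parameters to a worker. `describe` is a hypothetical stand-in for the real download function, and a thread pool is used only so the sketch runs anywhere; `ProcessPoolExecutor.map` takes the same arguments, as long as the worker is a module-level function.

```python
from concurrent import futures


def describe(item_id, category):
    # Hypothetical stand-in for the download: just builds the output file name
    return category + '_' + item_id + '.json'


def describe_packed(args):
    # Unpack one (item_id, category) tuple into separate parameters
    return describe(*args)


def run_with_map(ids, categories):
    # Executor.map accepts several iterables and passes one element from each
    with futures.ThreadPoolExecutor() as executor:
        return list(executor.map(describe, ids, categories))


def run_with_tuples(jobs):
    # Alternative: one tuple per job, unpacked inside a thin wrapper
    with futures.ThreadPoolExecutor() as executor:
        return list(executor.map(describe_packed, jobs))
```

Either form keeps the worker signature explicit instead of relying on positions inside a list; which one reads better is a matter of taste.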

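Approach

1. Find the URL the login screen POSTs to

Open the developer tools in Chrome
Select the Network tab in the developer tools
Open the login screen
Log in as usual
The login request shows up in the results list, and the Name column shows the URL used for the POST

Reference: Chrome デベロッパーツールの使い方まとめ (a summary of how to use Chrome DevTools)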
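
If the login page is a plain HTML form, the POST target can sometimes also be read from the form's `action` attribute. The sketch below uses only the standard library; `SAMPLE_HTML` is a made-up form (its field names simply mirror the payload used later in this article), not the real login page. A login that is submitted by JavaScript will not show up this way, so the Network tab remains the more reliable check.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collects the first form's action/method and its named input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action')
            self.method = (attrs.get('method') or 'get').lower()
        elif tag == 'input' and self._in_form and 'name' in attrs:
            # Fields without a value attribute are recorded as empty strings
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login form, for illustration only
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login_id">
  <input type="password" name="password">
</form>
"""

parser = LoginFormParser()
parser.feed(SAMPLE_HTML)
print(parser.action, parser.method, sorted(parser.fields))
```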

2. Write the scraping script

dl_json.py

import yaml

from bs4 import BeautifulSoup
import requests

from concurrent import futures


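# The id/pass are defined in a YAML file and loaded from there
with open('config.yaml') as file:
    config_yml = yaml.load(file, Loader=yaml.SafeLoader)

# Build the dict of form values to POST; the keys follow the id/pass input tags
# on the actual login screen (form fields are submitted under their name attribute)
payload = {
    'login_id': config_yml['login']['id'],
    'password': config_yml['login']['pass'],
}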

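# To run in parallel, define a function that does one JSON download per session
def dl_process(args):
    """
    Download work done by one process
    Args:
        args : list of str holding the following two values
            id : 100 | 200
            category : normal | special
    """
    item_id = args[0]  # renamed from id to avoid shadowing the builtin
    category = args[1]

    # Fetch the authenticity_token
    session = requests.Session()
    res = session.get('https://xxxx.yyyy.zz/')  # top page URL of the admin tool
    soup = BeautifulSoup(res.text, 'html.parser')
    auth_token = soup.find(attrs={'name': 'authenticity_token'}).get('value')
    # Copy the shared payload instead of mutating the module-level dict
    login_data = dict(payload, authenticity_token=auth_token)

    # login
    res = session.post('https://xxxx.yyyy.zz/login', data=login_data)  # POST URL found in step 1
    res.raise_for_status()

    # DL
    url = 'https://xxxx.yyyy.zz/aaa/bbb/' + item_id + '/' + category + '/download'
    res = session.get(url)
    res.raise_for_status()  # do not save an error page as if it were the JSON

    # Save under a simple file name; the with block makes sure the file is closed
    path_w = category + '_' + item_id + '.json'
    with open(path_w, 'w') as file:
        file.write(res.text)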


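# Download and save each required JSON in parallel
# The __main__ guard keeps worker processes from re-running this block
# on platforms that start workers by spawning a fresh interpreter
if __name__ == '__main__':
    with futures.ProcessPoolExecutor() as executor:
        f1 = executor.submit(dl_process, ['100', 'normal'])
        f2 = executor.submit(dl_process, ['100', 'special'])
        f3 = executor.submit(dl_process, ['200', 'normal'])
        f4 = executor.submit(dl_process, ['200', 'special'])
        f1.result()
        f2.result()
        f3.result()
        f4.result()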
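
The four hand-written submit/result pairs could also be generated. The following is only a sketch under my own assumptions: `run_jobs` is a hypothetical helper, not part of the script above, and it collects failures per job instead of stopping at the first exception. It defaults to a thread pool so it can be tried without any setup.

```python
import itertools
from concurrent import futures


def run_jobs(worker, ids, categories, executor_cls=futures.ThreadPoolExecutor):
    """Run worker([id, category]) for every combination; return (done, failed)."""
    done, failed = [], {}
    with executor_cls() as executor:
        # Map each future back to the job it belongs to
        future_to_job = {
            executor.submit(worker, [item_id, category]): (item_id, category)
            for item_id, category in itertools.product(ids, categories)
        }
        for future in futures.as_completed(future_to_job):
            job = future_to_job[future]
            try:
                future.result()
                done.append(job)
            except Exception as exc:
                # Remember the failure and keep processing the other jobs
                failed[job] = exc
    return sorted(done), failed
```

With the script above this would be called as `run_jobs(dl_process, ['100', '200'], ['normal', 'special'], executor_cls=futures.ProcessPoolExecutor)`.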

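Thoughts

Going parallel cut the run from about 730 s to about 210 s (more than 3x faster).
I wish there were an easier way to do this without scraping...
The code feels a bit rough

References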

Addendum

When the response takes a very long time, the script sometimes fails with an error (2019/8/26)

The error looks like this (excerpt)

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/Users/xxx/.pyenv/versions/3.7.3/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 309, in recv_into
    raise SocketError(str(e))
OSError: (54, 'ECONNRESET')

During handling of the above exception, another exception occurred:

~ (similar traceback omitted)
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError("(54, 'ECONNRESET')"))

~ (similar traceback omitted)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(54, 'ECONNRESET')"))

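Things I tried

Looking into OSError: (54, 'ECONNRESET'), it seems the server is the one cutting the connection.
The failing case clearly has a longer response than the ones that succeeded, so I tried things like the timeout value and the stream option, but the behavior did not change...

Conclusion

Scraping with selenium avoided the error.
When I tried to scrape without opening a browser window, to match the requests-based flow above, the JSON download never started, so I reluctantly run it with the browser visible
(Maybe it fails because the file is saved by a browser feature...? I have not looked into it in detail)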
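
For reference, a timeout-plus-streaming attempt with requests might look roughly like the sketch below. This is not the exact code that was tried; `save_streamed` is a hypothetical helper, the `(10, 700)` connect/read timeouts are illustrative values, and `session` is assumed to be a logged-in `requests.Session`. As noted above, this kind of change did not make the error go away here.

```python
from contextlib import closing


def save_streamed(session, url, path, timeout=(10, 700), chunk_size=64 * 1024):
    """Stream a response body to disk; returns the number of bytes written."""
    # timeout is (connect, read) in seconds; stream=True defers reading the body
    res = session.get(url, stream=True, timeout=timeout)
    written = 0
    with closing(res):
        res.raise_for_status()
        with open(path, 'wb') as out:
            for chunk in res.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    out.write(chunk)
                    written += len(chunk)
    return written


# Usage, assuming an already logged-in session:
#   import requests
#   session = requests.Session()
#   save_streamed(session, 'https://xxxx.yyyy.zz/aaa/bbb/100/normal/download', 'normal_100.json')
```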

Setup

Install selenium

pip install selenium

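Install ChromeDriver
- Download the version that matches the Chrome on your machine from here: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Unzip it and put it somewhere like /usr/local/bin
- Check that shell completion finds it, just to be safe

The module looks like this

Defined as below so that it follows the same flow as dl_process. (Sharing code between the two is ignored for now.)

dl_json.py (addition)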
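
The PATH check can also be done from Python. A small sketch using the standard library's `shutil.which`; the executable name `chromedriver` is an assumption about how the unzipped binary is named on your machine.

```python
import shutil


def find_driver(name='chromedriver'):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
print(path if path else 'chromedriver not found on PATH')
```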

import os
import glob
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

BIG_REQUEST_TIMEOUT = 700

def dl_process_by_selenium(args):
    """
    Download work done by one process
    Args:
        args : list of str holding the following three values
            id : 100 | 200
            category : normal | special
            file_name_prefix : normal_ | special_
    """
    game_id = args[0]
    category = args[1]
    file_name_prefix = args[2]

    # Setup
    options = Options()
    options.add_experimental_option("prefs", {
        "download.default_directory": os.getcwd(),  # download into the current directory
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    })
    driver = webdriver.Chrome(options=options)

    # login
    driver.get('https://xxxx.yyyy.zz/login/new')  # login screen

    # Kept separate from game_id, which is still needed for the download URL
    id_field = driver.find_element(By.ID, "id")  # based on the id used on the login screen
    id_field.send_keys(config_yml['login']['id'])
    password = driver.find_element(By.ID, "pass")  # based on the id used on the login screen
    password.send_keys(config_yml['login']['pass'])

    time.sleep(1)

    login_button = driver.find_element(By.NAME, "login")  # based on the name used on the login screen
    login_button.click()

    # DL
    url = 'https://xxxx.yyyy.zz/aaa/bbb/' + game_id + '/' + category + '/download'

    timeout_start = time.time()
    driver.set_page_load_timeout(BIG_REQUEST_TIMEOUT)
    try:
        driver.get(url)
    except TimeoutException:  # selenium raises its own exception, not the builtin TimeoutError
        timeout_elapsed_time = time.time() - timeout_start
        print("time_out_elapsed_time:{0}".format(timeout_elapsed_time) + "[sec]")

    # Saving can take a while, so wait a little
    time.sleep(10)

    # Rename on the optimistic assumption that the downloaded file exists
    path_w = category + '_' + game_id + '.json'
    dl_file_path_list = glob.glob("./" + file_name_prefix + game_id + "*.json")
    os.rename(dl_file_path_list[0], "./" + path_w)

    # Quit once everything is done
    driver.quit()
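
The fixed 10-second sleep before the rename is a guess about how long saving takes. One alternative would be to poll the download directory; the sketch below is a hypothetical helper, not something the script above uses. It relies on Chrome keeping in-progress downloads under a `.crdownload` name, and it assumes no other download is writing into the same directory at the same time, which does not hold when several of these workers run in parallel.

```python
import glob
import os
import time


def wait_for_download(directory, pattern, timeout=60.0, poll=0.5):
    """Return the first file matching pattern once no partial download remains.

    Returns None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        matches = sorted(glob.glob(os.path.join(directory, pattern)))
        # Chrome stores unfinished downloads with a .crdownload suffix
        partial = glob.glob(os.path.join(directory, '*.crdownload'))
        if matches and not partial:
            return matches[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the function above, the result of something like `wait_for_download(os.getcwd(), file_name_prefix + game_id + '*.json')` could then replace the first element of the glob list, with a check for None before renaming.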

References for the addendum (very slow responses)
