
Scraping Shizuoka's official Go To Eat site to list the campaign's participating shops in Izu

Posted at 2020-11-04 (last updated at 2020-11-04)

Shizuoka Prefecture sells two kinds of meal vouchers.

Item	Red Fuji voucher (赤富士券)	Blue Fuji voucher (青富士券)
Price	8,000 yen per book (worth 10,000 yen)	10,000 yen per book (worth 12,500 yen)
URL	https://premium-gift.jp/fujinokunigotoeat/	https://gotoeat-shizuoka.com/
robots meta tag	index,follow	noindex, follow

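The reason I wanted to scrape in the first place is that neither site lets you browse the participating shops as a single list. Voucher holders have to narrow the shops down by name or other criteria, or click through the pager links over and over. Judging from the robots meta tags on the official sites, the Blue Fuji voucher site is set to noindex, so I target only the Red Fuji voucher site.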
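The robots meta tag check described above could be sketched roughly as follows, using only the standard library. The two HTML strings are hypothetical stand-ins that mirror the values in the table; they are not the real pages, and `RobotsMetaParser`, `robots_directives`, and `is_indexable` are names made up for this illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, follow" into ["noindex", "follow"]
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def robots_directives(html):
    """Return the list of robots directives declared in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def is_indexable(html):
    """True unless the page declares noindex (or none)."""
    directives = robots_directives(html)
    return "noindex" not in directives and "none" not in directives


# Hypothetical snippets mirroring the table above, not the real pages
red = '<html><head><meta name="robots" content="index,follow"></head></html>'
blue = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(is_indexable(red))   # True
print(is_indexable(blue))  # False
```

In practice the HTML text would come from the fetched page rather than from a literal string.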

Web scraping

urllib

There are many approaches and libraries for web scraping.
I first chose the urllib library, since plenty of information is available for it and it looked simple.

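    # Fetch example
    import urllib.request
    url = "https://premium-gift.jp/fujinokunigotoeat/use_store"  # page to fetch
    html = urllib.request.urlopen(url).read()  # raw bytes of the response body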

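However, the search conditions I specified had no effect, and the same page kept coming back.
Even in a browser, the same URL shows different content on the first visit and on the second. The site appears to be tracking the session.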
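If the session is tracked through cookies (an assumption; the article does not confirm how the site does it), one thing that could be tried with urllib is an opener that keeps a cookie jar across requests. This is only a sketch of that idea, not the approach this article ended up using, and `build_session_opener` is a name invented here.

```python
import http.cookiejar
import urllib.request


def build_session_opener():
    """Return (opener, jar); the opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_session_opener()

# Usage (not executed here because it needs network access):
#   opener.open("https://premium-gift.jp/fujinokunigotoeat/use_store").read()
#   Later opener.open(...) calls would send back whatever cookies ended up in jar.
print(len(jar))  # no cookies are stored before the first request
```

Whether this is enough depends on how the site builds its search results; if the state lives in JavaScript rather than in cookies, a real browser is still needed.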

Selenium

After some research, I got it working with Selenium.
Selenium is a test-automation library for web applications that lets a program drive a browser.

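    # Fetch example (assumes Selenium 4 with chromedriver on the PATH)
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument('--headless')  # run without a visible window
    url = "https://premium-gift.jp/fujinokunigotoeat/use_store"
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    html = driver.page_source.encode('utf-8')
    driver.quit()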
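Once the page source is in hand, the shop names could also be pulled out of it without further browser calls. The sketch below uses the standard library's `html.parser` and matches on the `store-card__title` class that the full script further down searches for. The sample HTML is made up for illustration (the real tag names and markup are assumptions), and `ShopTitleParser` and `extract_shop_names` are hypothetical helpers.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"br", "img", "hr", "input", "wbr"}


class ShopTitleParser(HTMLParser):
    """Collects the text of every element whose class list contains target_class."""

    def __init__(self, target_class="store-card__title"):
        super().__init__()
        self.target_class = target_class
        self.titles = []
        self._depth = 0    # > 0 while inside a matching element
        self._buffer = []  # text fragments of the current title

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested tag inside a title
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_shop_names(html, target_class="store-card__title"):
    """Return the shop names found in html (str or UTF-8 encoded bytes)."""
    if isinstance(html, bytes):
        html = html.decode("utf-8")
    parser = ShopTitleParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up markup; only the class name comes from this article
sample = """
<ul>
  <li><h3 class="store-card__title">Sample Shop A</h3></li>
  <li><h3 class="store-card__title is-new">Sample <span>Shop</span> B</h3></li>
</ul>
"""
print(extract_shop_names(sample))  # ['Sample Shop A', 'Sample Shop B']
```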

On Google Colaboratory, running the following gets it working.

Installation

    !apt-get update
    !apt install chromium-chromedriver
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin
    !pip install selenium

Source

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Run the browser in the background (headless)
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    area_nm = '伊豆市'
    rows = []  # one [area_nm, shop_nm] pair per shop

    # Start the browser (Selenium 4 style; chromedriver must be on the PATH)
    driver = webdriver.Chrome(options=options)
    # The implicit wait is a session-wide setting, so setting it once is enough
    driver.implicitly_wait(10)

    # Initial page
    driver.get("https://premium-gift.jp/fujinokunigotoeat/use_store")
    print(driver.current_url)

    # Run the search
    driver.find_element(By.ID, 'addr').send_keys(area_nm)
    driver.find_element(By.CLASS_NAME, 'store-search__submit').click()
    print(driver.current_url)

    should_loop = True
    while should_loop:
      # Search results on the current page
      current_url = driver.current_url
      shop_nm_list = driver.find_elements(By.CLASS_NAME, "store-card__title")
      for shop_item in shop_nm_list:
        rows.append([area_nm, shop_item.text])
        print(shop_item.text)

      # Go to the next page
      link_list = driver.find_elements(By.CLASS_NAME, 'pagenation__item')
      for link_item in link_list:
        if link_item.text == "次へ":
          link_item.click()
          print(driver.current_url)
          break  # the old link elements are stale after navigation

      # Stop when the URL did not change, i.e. there is no next page
      should_loop = current_url != driver.current_url

    driver.quit()

    # CSV output
    df_all = pd.DataFrame(rows, columns=['area_nm', 'shop_nm'])
    df_all.to_csv('shoplist.csv', index=False)

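Closing thoughts

I hope the Red Fuji and Blue Fuji voucher sites are improved.
The search here uses only 伊豆市 (Izu City) as the keyword, but the condition can be changed.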
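As one way of changing the condition, the search could be repeated for several area names and the results merged. The sketch below keeps the browser part out of the picture: `scrape_area` is a hypothetical callable that would wrap the Selenium search and return the shop names for one area, and `fake_scrape` is a stand-in so the example runs without a browser. The second area name is just an example value.

```python
def collect_shops(area_names, scrape_area):
    """Call scrape_area(area) for each area and return unique (area, shop) rows."""
    seen = set()
    rows = []
    for area in area_names:
        for shop in scrape_area(area):
            key = (area, shop)
            if key not in seen:  # drop duplicates, keep first-seen order
                seen.add(key)
                rows.append(key)
    return rows


def fake_scrape(area):
    """Stand-in for the Selenium search; returns made-up names with one duplicate."""
    return [f"{area} shop 1", f"{area} shop 2", f"{area} shop 1"]


rows = collect_shops(["伊豆市", "伊東市"], fake_scrape)
for area, shop in rows:
    print(area, shop)
```

The resulting rows have the same two columns as the CSV, so they could be passed to `pd.DataFrame(rows, columns=['area_nm', 'shop_nm'])` in the same way.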
