
Scraping Shizuoka's official GoToEat site

Posted at 2020-11-08, last updated at 2020-11-08

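Introduction

The article "Scraping Shizuoka's official GoToEat site and listing the campaign stores in Izu" (静岡のGoToEat公式サイトをスクレイピング、伊豆のキャンペーン対象店をリスト化する)

reports that, when fetching pages like this,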

import urllib.request
html = urllib.request.urlopen(url).read()

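the search condition had no effect and the same page was returned every time.
Even in a browser, the same URL shows different content on the first visit and on the second. Apparently the site is checking the session.

So I looked into it.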

The URL changes

Page right after searching for "伊豆市" (Izu City)
https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=

Next (page 2)
https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=2&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=

Back (page 1)
https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=

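Comparing the page right after the search with "Back (page 1)", the URL differs in two places: "events=search" becomes "events=page", and "id=" becomes "id=1".

Comparing "Back (page 1)" with "Next (page 2)", the only difference is "id=1" versus "id=2", so id is the page number.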
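The comparison above can also be done mechanically. The following is a minimal sketch using only the standard library; the helper names `query_params` and `diff_params` are made up for this illustration, and the three URLs are the ones listed above.

```python
from urllib.parse import parse_qs, urlsplit

BASE = "https://premium-gift.jp/fujinokunigotoeat/use_store"
TAIL = "&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="

# The three URLs observed in the browser.
URLS = {
    "search": f"{BASE}?events=search&id={TAIL}",
    "next": f"{BASE}?events=page&id=2{TAIL}",
    "back": f"{BASE}?events=page&id=1{TAIL}",
}


def query_params(url: str) -> dict:
    """Return the query string as a dict, keeping empty values such as id=."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def diff_params(url_a: str, url_b: str) -> dict:
    """Return only the parameters whose values differ between the two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


search_vs_back = diff_params(URLS["search"], URLS["back"])
back_vs_next = diff_params(URLS["back"], URLS["next"])

print(search_vs_back)  # events and id differ
print(back_vs_next)    # only id differs
```

`parse_qs` also decodes the percent-encoded `addr` value back to 伊豆市, which makes it easy to confirm that the search condition itself is identical in all three URLs.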

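Access

Starting directly from the "Back (page 1)" URL does not apply the search condition, so different content is displayed. Instead, accessing

the URL of the page right after searching for "伊豆市"
the URL of "Back (page 1)"

in that order seems to work.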
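As a rough sketch of that access order with the standard library alone: the code below builds both URLs and visits them through a single cookie-aware opener. That the site tracks the search condition through cookies is my assumption, not something confirmed here, and `build_url`, `make_session_opener` and `fetch_first_page` are hypothetical helper names.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, OpenerDirector, build_opener

BASE_URL = "https://premium-gift.jp/fujinokunigotoeat/use_store"


def build_url(events: str, page_id: str, addr: str) -> str:
    """Build a use_store URL with the parameter order seen in the browser."""
    params = {
        "events": events,
        "id": page_id,
        "store": "",
        "addr": addr,
        "industry": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def make_session_opener() -> OpenerDirector:
    """Return an opener that stores cookies and sends them on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def fetch_first_page(addr: str) -> bytes:
    """Visit the search URL first, then page 1, sharing one cookie jar.

    Assumption: the server remembers the search via a session cookie.
    """
    opener = make_session_opener()
    # Step 1: the search request; its body is not needed.
    opener.open(build_url("search", "", addr)).close()
    # Step 2: page 1, requested with the cookies received in step 1.
    with opener.open(build_url("page", "1", addr)) as response:
        return response.read()


search_url = build_url("search", "", "伊豆市")
page1_url = build_url("page", "1", "伊豆市")
print(search_url)
print(page1_url)
```

`fetch_first_page("伊豆市")` is deliberately not called here, because it sends real requests to the site. The scraping code in this article uses `requests.Session` for the same purpose.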

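The URL of the next page turned up in a link element inside head, so I use that.
Incidentally, if you keep following the next-page link starting from the page right after the search, you get "id=2", then "id=22", then "id=222":
pages with one more 2 each time (lol).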
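To illustrate picking the next-page URL out of the head, here is a small sketch using the standard library's `html.parser` instead of BeautifulSoup. `SAMPLE_HTML` is a made-up fragment, not the site's real markup, and `NextLinkParser` and `find_next_url` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional

# Hypothetical markup for illustration; the real page is much larger.
SAMPLE_HTML = (
    "<html><head><title>stores</title>"
    '<link rel="next" href="https://premium-gift.jp/fujinokunigotoeat/use_store'
    "?events=page&amp;id=2&amp;store=&amp;addr=%E4%BC%8A%E8%B1%86%E5%B8%82"
    '&amp;industry=">'
    "</head><body></body></html>"
)


class NextLinkParser(HTMLParser):
    """Remember the href of the first <link> whose rel contains "next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.next_url is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "next" in rel_tokens:
            self.next_url = attributes.get("href")


def find_next_url(html: str) -> Optional[str]:
    """Return the rel=next URL, or None when the page has no next link."""
    parser = NextLinkParser()
    parser.feed(html)
    return parser.next_url


next_url = find_next_url(SAMPLE_HTML)
last_page = find_next_url("<html><head></head><body></body></html>")
print(next_url)
print(last_page)
```

A return value of `None` is the stop condition: when there is no rel=next link, the current page is treated as the last one. Unlike a `head > link[rel=next]` selector, this sketch does not check that the link sits inside head.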

Scraping

import requests
from bs4 import BeautifulSoup

import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

with requests.Session() as s:

    # No search request is needed when fetching all stores
    # url = "https://premium-gift.jp/fujinokunigotoeat/use_store"

    # When searching, load the search page once before requesting the result pages
    s.get("https://premium-gift.jp/fujinokunigotoeat/use_store?events=search&id=&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry=", headers=headers)
    url = "https://premium-gift.jp/fujinokunigotoeat/use_store?events=page&id=1&store=&addr=%E4%BC%8A%E8%B1%86%E5%B8%82&industry="

    result = []

    while True:

        r = s.get(url, headers=headers)
        r.raise_for_status()

        soup = BeautifulSoup(r.content, "html.parser")

        for store in soup.select("div.store-card__item"):

            data = {}
            data["店舗名"] = store.h3.get_text(strip=True)

            for tr in store.select("table.store-card__table > tbody > tr"):
                data[tr.th.get_text(strip=True).rstrip("：")] = tr.td.get_text(
                    strip=True
                )

            result.append(data)

        tag = soup.select_one("head > link[rel=next]")

        print(tag)

        if tag:

            url = tag.get("href")

        else:
            break

        time.sleep(3)

import pandas as pd

df = pd.DataFrame(result)

# Check the number of records
df.shape

df.to_csv("shizuoka.csv", encoding="utf_8_sig")

# Check for duplicates
df[df.duplicated()]

df
