
[Scraping] Extracting store data for a supermarket chain

Last updated at 2020-02-07, posted at 2019-01-16

Addendum, February 7, 2020:

Sending the request to this URL with requests on Google Colab now fails with the following error.

HTTPSConnectionPool(host='gate.aeonsquare.net', port=443): Max retries exceeded with url: /auth/v1/handover?csid=dotcom_pc&nonce=0a9484ca2bb6e89703a4894e29a71f2d091fea504c171e6b7b250ea39a51bb16 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f571c537fd0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
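
A connection timeout like the one above can be caught so that it does not abort the whole notebook. The following is a minimal sketch, not the code used in this article: `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper; the 10-second default timeout is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx into an exception
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors
        print(f"request failed: {err}")
        return None
    return response.text
```

A caller can then skip a page when `fetch_html` returns None instead of crashing partway through the loop. This does not make an unreachable host reachable; it only makes the failure explicit.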

Addendum, January 18:
If you want latitude and longitude for address data you already have, the following service is handy.
http://ktgis.net/gcode/geocoding.html

Introduction

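I wanted store information for a certain supermarket chain (store name, address, and latitude/longitude), but it was not available as a CSV or similar file, so I scraped it. I am posting this partly as a note to myself.
As usual, I am using Google Colab.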

Source code

main.py


import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.colab import files
import geocoder
from time import sleep

url1 = "https://www.aeon.com/store/list/%E4%B9%9D%E5%B7%9E%E5%9C%B0%E6%96%B9/p_"
url2 = "/?q=aeoncom"
cols = ['store_name','address','latlon']
df = pd.DataFrame(index=[],columns=cols)

for i in range(1,17):
  response = requests.get(url1 + str(i) + url2).text
  soup = BeautifulSoup(response, 'html.parser')
  
  for tag in soup.find_all('div', class_="storeInfo"):
    atag_stname = tag.find('a', class_="storeName")
    atag_adname = tag.find('span', class_="address")
    latlon = geocoder.arcgis(atag_adname.text)  # geocode the address text, not the Tag object
    
    record = pd.Series([atag_stname.text,atag_adname.text,latlon.latlng],index=df.columns)
    df = pd.concat([df, record.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    sleep(2)
df.to_csv("df_ion.csv")
files.download('df_ion.csv')

Source code walkthrough

1.py

import requests
import pandas as pd
from bs4 import BeautifulSoup
from google.colab import files
import geocoder
from time import sleep

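Because I am using Google Colab, I import the Colab-related module (google.colab.files).
I also import geocoder, because I want to do geocoding (getting latitude and longitude from an address or landmark).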

2.py

url1 = "https://www.aeon.com/store/list/%E4%B9%9D%E5%B7%9E%E5%9C%B0%E6%96%B9/p_"
url2 = "/?q=aeoncom"
cols = ['store_name','address','latlon']
df = pd.DataFrame(index=[],columns=cols)

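The supermarket's URL is stored in variables.
A sequential page number goes right after p_, so the URL is split into url1 and url2 to let the for loop below insert it.
The two parts are joined later.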
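
The URL assembly described above can be sketched as follows. `page_url`, `URL_HEAD`, and `URL_TAIL` are hypothetical names introduced only for this illustration; the URL parts themselves are the ones from the article. The percent-encoded path segment is simply the UTF-8 encoding of the region name 九州地方 (Kyushu region), which `urllib.parse.unquote` can confirm.

```python
from urllib.parse import unquote

URL_HEAD = "https://www.aeon.com/store/list/%E4%B9%9D%E5%B7%9E%E5%9C%B0%E6%96%B9/p_"
URL_TAIL = "/?q=aeoncom"


def page_url(page):
    """Build the listing URL for one page number (hypothetical helper)."""
    return f"{URL_HEAD}{page}{URL_TAIL}"


# Decode the percent-encoded path segment back into the region name
region = unquote("%E4%B9%9D%E5%B7%9E%E5%9C%B0%E6%96%B9")
print(region)
print(page_url(1))
```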

3.py

for i in range(1,17):
  response = requests.get(url1 + str(i) + url2).text
  soup = BeautifulSoup(response, 'html.parser')

  for tag in soup.find_all('div', class_="storeInfo"):
    atag_stname = tag.find('a', class_="storeName")
    atag_adname = tag.find('span', class_="address")
    latlon = geocoder.arcgis(atag_adname.text)  # geocode the address text, not the Tag object

    record = pd.Series([atag_stname.text,atag_adname.text,latlon.latlng],index=df.columns)
    df = pd.concat([df, record.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    sleep(2)

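The store list had 17 pages, so the outer for loop walks through the pages. Note that range(1,17) yields 1 through 16; use range(1,18) if page 17 should also be fetched.
The line that assigns response is where the URL parts are joined.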

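The inner for loop picks out the data I want by specifying tags and classes.
The latlon line does the geocoding. Geocoding is offered by a number of providers.
Here the coordinates come from Esri, which is well known in the GIS world.
Note: there is also geocoder.google, but it no longer seems to be usable.
If you use geocoder, I think arcgis or osm is a good choice.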
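
The tag extraction in the inner loop can be tried offline on a small HTML fragment. This is only a sketch: `SAMPLE_HTML` is made-up markup that imitates the class names used above (the real page may be structured differently), and `extract_stores` is a hypothetical helper. It also guards against `find` returning None when a block lacks one of the tags.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only imitates the class names used in the article
SAMPLE_HTML = """
<div class="storeInfo">
  <a class="storeName" href="#">Sample Store A</a>
  <span class="address">1-1 Example Town</span>
</div>
<div class="storeInfo">
  <a class="storeName" href="#">Sample Store B</a>
</div>
"""


def extract_stores(html):
    """Return (name, address) pairs, skipping blocks that lack either tag."""
    soup = BeautifulSoup(html, "html.parser")
    stores = []
    for block in soup.find_all("div", class_="storeInfo"):
        name_tag = block.find("a", class_="storeName")
        address_tag = block.find("span", class_="address")
        if name_tag is None or address_tag is None:
            continue  # find() returns None when nothing matches
        stores.append(
            (name_tag.get_text(strip=True), address_tag.get_text(strip=True))
        )
    return stores


stores = extract_stores(SAMPLE_HTML)
print(stores)
```

The second block has no address, so it is skipped rather than raising an AttributeError on `.text`.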

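The sleep is there because scraping is said to put load on the other party's server,
so I make the script wait between requests. I could not find out how many seconds is an appropriate wait for scraping,
so for now it is 2 seconds. (That may well be too short...)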
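
One way to keep the waiting logic in a single place is a small generator that pauses between items. This is a sketch under assumptions: `throttled` is a hypothetical helper, and the 2-second base delay plus up to 1 second of random jitter are illustrative values, not a figure recommended by the site.

```python
import random
import time


def throttled(items, base_delay=2.0, jitter=1.0, sleep=time.sleep):
    """Yield items, pausing before every item except the first.

    The delay values are illustrative assumptions, not a recommendation
    from the site being scraped.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(base_delay + random.uniform(0, jitter))
        yield item
```

For example, `for i in throttled(range(1, 17)):` would wait between page requests. The `sleep` parameter exists only so that the pause can be swapped out, for instance when trying the loop without actually waiting.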

4.py

df.to_csv("df_ion.csv")
files.download('df_ion.csv')

This writes the resulting df to a CSV file and downloads it to the local machine.
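
Because the latlon column holds a [lat, lng] list, it ends up as a single string cell in the CSV. If separate numeric columns are easier to work with later, they can be split out before saving. The sketch below uses made-up rows in the same shape as the scraped records; the `lat` and `lon` column names are assumptions, and the buffer stands in for a file so the example runs outside Colab.

```python
import io

import pandas as pd

# Made-up rows in the same shape as the scraped records
df = pd.DataFrame(
    {
        "store_name": ["Sample Store A", "Sample Store B"],
        "address": ["1-1 Example Town", "2-2 Example City"],
        "latlon": [[33.59, 130.40], None],  # None stands for a missing coordinate
    }
)

# Split the [lat, lng] list into two numeric columns
df["lat"] = df["latlon"].apply(lambda v: v[0] if v else None)
df["lon"] = df["latlon"].apply(lambda v: v[1] if v else None)

# Write to an in-memory buffer; index=False drops the row-number column
buffer = io.StringIO()
df.drop(columns="latlon").to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```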

Conclusion

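Scraping is unglamorous work, but it is fun all the same, because you can pull down a large amount of data in one go.
Doing machine learning or deep learning on scraped data that interests you sounds enjoyable.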
