Collecting store latitude and longitude data with scrapy + splash (Part 2)

Posted at 2017-02-26

1. Introduction

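Continuing from the previous post, this article collects store latitude and longitude information.
This time the spider gathers store information for chain businesses from the Mapion phone book (Mapion電話帳).
To keep it general-purpose, the following items can be passed as arguments when running scrapy: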

- genre: genre ID
- category: category ID
- chain_store: chain business ID

For example, for Gyoza no Ohsho (餃子の王将) the values are:
genre=M01 (Gourmet), category=002 (Ramen / Gyoza), chain_store=CA01 (Gyoza no Ohsho)
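
The spider in section 3 builds its start URL by concatenating these three IDs into one path segment. As a minimal sketch of that step in isolation (the helper name `build_start_url` is hypothetical and not part of the original spider):

```python
# Hypothetical helper mirroring the URL format used in MapionSpider.__init__.
BASE_URL = 'http://www.mapion.co.jp/phonebook/{0}{1}{2}/'


def build_start_url(genre, category, chain_store):
    """Concatenate the genre, category, and chain store IDs into the list-page URL."""
    return BASE_URL.format(genre, category, chain_store)


# Gyoza no Ohsho: genre=M01, category=002, chain_store=CA01
print(build_start_url('M01', '002', 'CA01'))
```

Keeping the format string in one place makes it easier to adjust if the site's URL scheme changes.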

2. Runtime environment and setup

The runtime environment and setup are the same as in the previous post.

3. scrapy

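The items.py and settings.py configuration is also the same as last time apart from the names and the collected fields, so it is omitted here.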
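
For readers who do not have the previous post at hand, a typical scrapy-splash configuration looks roughly like the following. This is a sketch based on the scrapy-splash README, not the author's actual file; the Splash address in particular is an assumption and depends on where your Splash instance runs.

```python
# settings.py (sketch, not the author's file)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# Middleware entries and priorities as documented in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes duplicate filtering aware of Splash request arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```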

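The chain's top page (example) lists its stores.
However, latitude and longitude cannot be obtained from that page, so the store name and coordinates are taken from each store's linked detail page (example).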

Because every store has to be crawled individually, this is inefficient and takes considerably longer than the previous approach
(about 12 hours for 10,000 stores).
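
To put that figure in perspective, the rough calculation below converts it into a per-store cost and extrapolates to other chain sizes. It assumes the crawl time scales linearly with the number of stores, which is only an approximation; the function name is hypothetical.

```python
# Observed figure from this article: about 12 hours for 10,000 stores.
OBSERVED_HOURS = 12
OBSERVED_STORES = 10000

# Average wall-clock seconds spent per store detail page.
seconds_per_store = OBSERVED_HOURS * 3600 / OBSERVED_STORES


def estimate_hours(store_count):
    """Estimate crawl time in hours, assuming linear scaling (an assumption)."""
    return store_count * seconds_per_store / 3600


print(seconds_per_store)      # about 4.32 seconds per store
print(estimate_hours(2500))   # about 3 hours
```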

Also, the Mapion phone book lists at most 10,000 stores (100 stores/page * 100 pages),
so for a company with more than 10,000 stores, not all of them can be covered.
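
The listing cap can be expressed as a small check. This sketch uses only the two numbers stated above; the function name and the example store counts are hypothetical.

```python
STORES_PER_PAGE = 100
MAX_PAGES = 100
LISTING_CAP = STORES_PER_PAGE * MAX_PAGES  # 10,000 stores


def reachable_stores(total_stores):
    """Return how many of a chain's stores the list pages can expose."""
    return min(total_stores, LISTING_CAP)


for total in (500, 20000):
    reachable = reachable_stores(total)
    print(total, reachable, 'complete' if reachable == total else 'truncated')
```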

MapionSpider.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest
from ..items import MapionspiderItem

class MapionSpider(CrawlSpider):
    name = 'Mapion_spider'
    allowed_domains = ['mapion.co.jp']

    # Receive the genre, category, and chain store IDs as arguments.
    def __init__(self, genre=None, category=None, chain_store=None, *args, **kwargs):
        super(MapionSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.mapion.co.jp/phonebook/{0}{1}{2}/'.format(genre,category,chain_store)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5}, )

    def parse_item(self, response):
        item = MapionspiderItem()
        shop_info = response.xpath('//*[@id="content"]/section/table/tbody')
        if shop_info:
            item['name'] = shop_info.xpath('tr[1]/td/text()').extract()
            url_path = shop_info.xpath('//a[@id="spotLargMap"]/@href').extract()
            # Extract the latitude and longitude from the store's map link (url_path) and pass them to the item as lists.
            url_elements = url_path[0].split(',')
            item['latitude'] = [url_elements[0][4:]]
            item['longitude'] = [url_elements[1]]
            yield item

        # Get the detail-page link for each store from the store list.
        list_size = len(response.xpath('//table[@class="list-table"]/tbody/tr').extract())
        for i in range(2,list_size+1):
            target_url_path = '//table[@class="list-table"]/tbody/tr['+str(i)+']/th/a/@href'
            target = response.xpath(target_url_path)
            if target:
                target_url = response.urljoin(target[0].extract())
                yield SplashRequest(target_url, self.parse_item)

        # Get the link to the next page number at the bottom of the store list.
        next_path = response.xpath('//p[@class="pagination"]/*[contains(@class, "pagination-currnet ")]/following::a[1]/@href')
        if next_path:
            next_url = response.urljoin(next_path[0].extract())
            yield SplashRequest(next_url, self.parse_item)
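
The coordinate extraction in parse_item relies on the map link having a fixed four-character prefix followed by comma-separated values. A slightly more defensive variant could use a regular expression instead, as sketched below. The href shape `/m2/<lat>,<lon>,<zoom>` and the sample coordinates are assumptions inferred from the slicing in the spider, and `parse_lat_lon` is a hypothetical helper. Unlike the spider, it returns floats rather than one-element lists of strings.

```python
import re

# First "number,number" pair in the link is taken as "latitude,longitude".
_LAT_LON = re.compile(r'(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)')


def parse_lat_lon(href):
    """Return (latitude, longitude) as floats, or None if no pair is found."""
    match = _LAT_LON.search(href)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Assumed link shape; the real href may differ.
print(parse_lat_lon('/m2/35.6812,139.7671,16'))
print(parse_lat_lon('/no/coordinates/here'))
```

Returning None for an unexpected link lets the caller skip the store instead of raising an IndexError on an empty or malformed href.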

4. Running

Arguments are specified with the -a option.
To collect Seven-Eleven store information:

scrapy crawl Mapion_spider -o hoge.csv -a genre='M02' -a category='005' -a chain_store='CM01'
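
When crawling several chains in a row, it can be convenient to assemble this command programmatically. The sketch below only builds the argument list and does not run scrapy. `build_crawl_command` is a hypothetical helper, and the spider name passed to `scrapy crawl` is the value of the `name` attribute defined in the spider class.

```python
import shlex


def build_crawl_command(spider_name, output_path, **spider_args):
    """Build a `scrapy crawl` argument list with one -a option per spider argument."""
    cmd = ['scrapy', 'crawl', spider_name, '-o', output_path]
    for key, value in spider_args.items():
        cmd += ['-a', '{0}={1}'.format(key, value)]
    return cmd


cmd = build_crawl_command('Mapion_spider', 'hoge.csv',
                          genre='M02', category='005', chain_store='CM01')
# Quote each part so the printed line can be pasted into a shell.
print(' '.join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` from the project directory if you want to launch the crawl from a script.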
