
Erotic Marketplace Research Series 2: Writing a Crawler with Scrapy

Last updated at 2022-05-23, posted at 2022-05-23

Previously

I got as far as becoming interested in erotic marketplaces

This time

Trying out Scrapy

Installation

$ python -m pip install scrapy

After that, a scrapy command showed up in the virtualenv directory

$ ./venv/bin/scrapy --help

Hmm. Pretty lewd

Now let's create a project

For now I'll name it main_scrapy

$ scrapy startproject main_scrapy

This produced the following directory layout

$ tree main_scrapy 
main_scrapy
├── main_scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Move into the directory and work from there

$ cd main_scrapy

Writing the settings

I only added the following lines to main_scrapy/settings.py

# We want the same content a browser would get
USER_AGENT = '<something suitable>'

# We don't want to put load on the site
DOWNLOAD_DELAY = 3

# Avoid sending the same request over and over
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
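
As a rough sanity check on what DOWNLOAD_DELAY = 3 means for total run time, here is a back-of-the-envelope sketch. It assumes requests to the site go out one after another with the delay in between, and it ignores download time and cache hits. The 0.5x to 1.5x range reflects Scrapy's RANDOMIZE_DOWNLOAD_DELAY behaviour, which is on by default. The function name and the request count of 1000 are made up for illustration.

```python
def estimate_crawl_seconds(n_requests, download_delay=3.0):
    """Return (shortest, typical, longest) total waiting time in seconds."""
    typical = n_requests * download_delay
    # With RANDOMIZE_DOWNLOAD_DELAY on, each wait is 0.5x to 1.5x the delay
    return (typical * 0.5, typical, typical * 1.5)


# Hypothetical crawl of 1000 requests
shortest, typical, longest = estimate_crawl_seconds(1000)
print(f"typical: {typical / 60:.0f} min, "
      f"range: {shortest / 60:.0f} to {longest / 60:.0f} min")
```

Every product page and every XHR call counts as its own request, so the real number grows quickly with the number of products.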

Describing what information each piece of content holds in items.py

Apparently this goes in main_scrapy/items.py

import scrapy

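class Item(scrapy.Item):
    # The spider sets item['id'], so the field has to be declared here;
    # assigning to an undeclared field raises KeyError
    id = scrapy.Field()
    url = scrapy.Field()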
    username = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    genre = scrapy.Field()
    tags = scrapy.Field()
    duration = scrapy.Field()
    play_count = scrapy.Field()
    like_count = scrapy.Field()

That's roughly how I wrote it
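
As an aside, Scrapy 2.2 and later also accept plain dataclass objects as items, which gives you typed fields and default values. The sketch below is an alternative to the scrapy.Item class above, not what this series uses. The class name ProductItem, the field types, and the example.com URL are assumptions on my part.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ProductItem:
    # Field names follow the ones used in this article; the types are guesses
    id: Optional[str] = None
    url: Optional[str] = None
    username: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None
    genre: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    duration: Optional[str] = None
    play_count: Optional[int] = None
    like_count: Optional[int] = None


# Made-up values, just to show attribute access instead of item['...']
item = ProductItem(url="https://example.com/items/1", title="sample")
item.like_count = 12
print(asdict(item))
```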

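Creating a Spider

It looks like a Spider is where you describe which site to crawl?

For now, generate one

$ scrapy genspider main <target site domain>

main_scrapy/spiders/main.py seems to be where you write how the crawler makes its requests and so on

Here's my attempt (parts that could identify the site are redacted)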

import scrapy
import re
import time
import datetime
from main_scrapy.items import Item


class MainSpider(scrapy.Spider):
    name = 'main'
    allowed_domains = ['<domain of the site to crawl>']
    start_urls = ['<URL of the first list page>']

    def parse(self, response):

        # Collect the links from the list page to the product pages
        for href in response.css('.content a::attr(href)'):
            item_url = response.urljoin(href.get())
            # Load the product page
            yield scrapy.Request(item_url, callback=self.parse_item)

        # Get the page number from the current page's URL
        url_regex = r'^<list page URL pattern>&page=(\d+)$'
        url_matches = re.match(url_regex, response.url)
        assert url_matches is not None
        current_page_index_text = url_matches.group(1)
        current_page_index = int(current_page_index_text)
        next_page_index = current_page_index + 1

        # Load the next page
        if next_page_index <= 1000:
            yield scrapy.Request(f'<list page URL>&page={next_page_index}', callback=self.parse)

    def parse_item(self, response):

        # (Site-specific code, omitted)
        # Pull elements out of the response with css or xpath
        # and tidy them up with regular expressions

        # Build the Item
        item = Item()
        item['id'] = item_id
        item['url'] = url
        item['username'] = username
        item['title'] = title
        item['description'] = description
        item['genre'] = genre
        item['tags'] = tags
        item['duration'] = duration
        item['play_count'] = play_count

        # Load the XMLHttpRequest endpoint
        yield scrapy.Request(
            '<XHR endpoint>',
            method='POST',
            body=f'<XHR parameters>',
            headers={
                'X-Requested-With': 'XMLHttpRequest',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            },
            callback=self.parse_item_like_count,
            meta={'item': item},  # pass along the item built here
        )

    # Once the XMLHttpRequest result arrives, pick up the item passed along
    def parse_item_like_count(self, response):
        item = response.meta['item']
        response_body_text = response.body.decode('utf-8')
        assert re.match(r'^\d+$', response_body_text) is not None

        item['like_count'] = int(response_body_text)

        # Emit one item
        yield item

That's it. The directory layout and so on looked like a hassle, but it turned out to be pretty easy

Let's run it right away
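
The pagination and like-count steps in the spider can be pulled out into small pure functions, which makes them easy to try without hitting the site. This is only a sketch: the example.com URL is made up, the function names are mine, and I swapped the regex for urllib.parse so the page parameter does not have to be last in the query string. Instead of using assert, both functions return None when the input is not what was expected.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_page=1000):
    """Return the URL of the next list page, or None if there is none."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query, keep_blank_values=True)
    page_values = query.get("page", [])
    if len(page_values) != 1 or re.fullmatch(r"[0-9]+", page_values[0]) is None:
        return None  # no usable page number in the URL
    next_page = int(page_values[0]) + 1
    if next_page > max_page:
        return None  # reached the page limit
    query["page"] = [str(next_page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def parse_like_count(body):
    """Parse an XHR body that should contain only digits; None otherwise."""
    text = body.decode("utf-8").strip()
    if re.fullmatch(r"[0-9]+", text) is None:
        return None
    return int(text)


# Hypothetical list page URL
print(next_page_url("https://example.com/list?genre=a&page=3"))
print(parse_like_count(b"42\n"))
```

In the spider, parse could call next_page_url(response.url) and yield a new Request only when the result is not None.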

$ scrapy crawl main -o output.csv

Result

I managed to dump the data into a CSV!
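
To check what ended up in the file, the standard csv module is enough. A minimal sketch, assuming the CSV starts with a header row of item field names, which is what Scrapy's CSV export writes by default. The function name and the sample rows are made up; the sample stands in for output.csv.

```python
import csv
import io


def summarize(csv_file):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    return reader.fieldnames, len(rows)


# Made-up sample standing in for output.csv
sample = io.StringIO("id,title,like_count\r\n1,foo,3\r\n2,bar,5\r\n")
columns, count = summarize(sample)
print(columns, count)
```

For the real file, pass open('output.csv', newline='', encoding='utf-8') instead of the sample.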

Next

As groundwork for vectorizing the product data, I'll set up MeCab
