Scraped items are overwritten by the last element when using scrapy-s3pipeline

Posted at 2022-12-23

Environment and prerequisites

python:3.9.15
Scrapy:2.7.1
scrapy-s3pipeline:0.7.0
Note: running in a Docker container

Source code

Following the usual conventions of Scrapy and scrapy-s3pipeline, I wrote the code below.

spider.py

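from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# MyCrawlerItem is the item class defined in items.py (shown below).


class MySpider(CrawlSpider):
    name = "my-spider"
    allowed_domains = ["xxx.jp"]
    # Must be a list or tuple; ("https://xxx.jp") without a trailing comma is a plain string.
    start_urls = ["https://xxx.jp"]
    rules = [
        Rule(LinkExtractor(allow=r'/yyy/'), callback='myparser')
    ]
    # (omitted)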

    def myparser(self, response):
        my_item = MyCrawlerItem()
        items = response.css(".class")
        for item in items:
            item_info = item.css(".class2")
            for item_detail in item_info:
                my_item["name"] = item_detail.css("div::text").get()
                tbodys = item.css("table tbody")

                for tbody in tbodys:
                    my_item["content"] = tbody.css("li::text").get()
                    yield my_item

items.py

import scrapy


class MyCrawlerItem(scrapy.Item):
    name = scrapy.Field()
    content = scrapy.Field()

settings.py

ITEM_PIPELINES = {
    's3pipeline.S3Pipeline': 100
}

S3PIPELINE_URL = 's3://property-info-storage/{name}/{time}/items.{chunk:07d}.jl.gz'

Problem

I used scrapy-s3pipeline to upload the scraped items to Amazon S3 in JSON Lines format, but the items in the uploaded file had been overwritten: every line contained the values of the last item.

sample.jl

# Expected contents
{"a":1,"b":2}
{"c":3,"d":4}

# Actual contents
{"c":3,"d":4}  ← overwritten
{"c":3,"d":4}

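Cause

scrapy-s3pipeline temporarily accumulates the items passed to it in a list so that it can write several items together into one file such as sample.jl.
However, when the item is a mutable type (a dict, for example), the code above passes the same object (same object ID) to the pipeline on every yield.
The list then holds several references to one item, so each time new values are set on it, the entries registered earlier show the new values too and appear to be overwritten.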

s3pipeline.pipeline.py

    def process_item(self, item, spider):
        """
        Process single item. Add item to items and then upload to S3/GCS
        if size of items >= max_chunk_size.
        """
        self._timer_cancel()
        self.items.append(item)  # A problem when this item has the same object ID as a previously appended item.
        if len(self.items) >= self.max_chunk_size:
            self._upload_chunk()
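
The effect of that append can be reproduced without Scrapy or S3. The following is a minimal sketch, assuming a plain dict as the item; BufferingPipeline and produce_shared are hypothetical names made up for this illustration and are not part of scrapy-s3pipeline.

```python
class BufferingPipeline:
    """Hypothetical stand-in for a pipeline that buffers items in a list."""

    def __init__(self):
        self.items = []

    def process_item(self, item):
        # Stores a reference to the object, not a snapshot of its contents.
        self.items.append(item)


def produce_shared(pipeline, rows):
    item = {}  # a single object, created once and reused for every row
    for name, content in rows:
        item["name"] = name
        item["content"] = content
        pipeline.process_item(item)


pipeline = BufferingPipeline()
produce_shared(pipeline, [("a", 1), ("c", 3)])

print(pipeline.items)
# [{'name': 'c', 'content': 3}, {'name': 'c', 'content': 3}]
print(pipeline.items[0] is pipeline.items[1])
# True: both list entries are the same object
```

Because both entries refer to one dict, the buffer only ever shows the most recently assigned values, which matches the duplicated last line seen in the uploaded file.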

Fix

Rewriting spider.py as follows prevented the overwriting. The fix is to yield a copy of the item made with copy.deepcopy().

spider.py

import copy  # newly added

class MySpider(CrawlSpider):
    # (omitted)

    def myparser(self, response):
        my_item = MyCrawlerItem()
        items = response.css(".class")
        for item in items:
            item_info = item.css(".class2")
            for item_detail in item_info:
                my_item["name"] = item_detail.css("div::text").get()
                tbodys = item.css("table tbody")

                for tbody in tbodys:
                    my_item["content"] = tbody.css("li::text").get()
                    copy_item = copy.deepcopy(my_item)  # newly added
                    yield copy_item  # changed
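
As an alternative to copying, a new item can be built on every iteration so that no two yielded items share state. This is not what the code above does; it is only a sketch of the idea, using plain dicts in place of MyCrawlerItem and a hypothetical parse_rows function in place of the spider callback.

```python
def parse_rows(rows):
    """Yield one independent item per (name, contents) pair."""
    for name, contents in rows:
        for content in contents:
            # A new dict is created on each iteration, so a consumer that
            # buffers the yielded items never sees later values leak into earlier ones.
            yield {"name": name, "content": content}


rows = [("a", ["x", "y"]), ("b", ["z"])]
buffered = list(parse_rows(rows))  # plays the role of the pipeline's internal list

print(buffered)
# [{'name': 'a', 'content': 'x'}, {'name': 'a', 'content': 'y'}, {'name': 'b', 'content': 'z'}]
```

In the spider, the equivalent change would presumably be to create the MyCrawlerItem inside the innermost loop instead of once at the top of myparser.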

References

↓ On the overwrite problem with copied objects
https://qiita.com/kokorinosoba/items/e9ab9398af5b44d2ac9a

↓scrapy-s3pipeline
https://github.com/orangain/scrapy-s3pipeline

↓ How to write Scrapy code
https://scrapy-docs-ja.readthedocs.io/ja/latest/index.html
