
How to split Scrapy pipeline code into separate files per spider

Last updated at 2020-01-11. Posted at 2020-01-11.

Introduction

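As the number of scraping targets (spider files) in a Scrapy project grew, all the pipeline code piled up in a single file, which made it hard to follow and hurt maintainability.
I eventually managed to split the pipeline implementation into separate files per spider, so this article describes how.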

Interim workaround

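A Scrapy project has a configuration file called settings.py.
settings.py contains the ITEM_PIPELINES setting, and at first I assumed that, even with multiple spiders,
all pipeline processing had to be consolidated into the classes of the single implementation file specified there.

settings.py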

ITEM_PIPELINES = {
    'example_project.pipelines.DBPipeline': 100,
}
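
Each key in ITEM_PIPELINES is a dotted import path: a module path followed by a class name, so 'example_project.pipelines.DBPipeline' refers to the class DBPipeline in example_project/pipelines.py. The following is a minimal sketch of that kind of lookup using only the standard library. `load_class` is a hypothetical helper written for illustration, not Scrapy's own loader, and a standard-library class stands in for a pipeline class so that the snippet runs outside a Scrapy project.

```python
from importlib import import_module


def load_class(path: str):
    """Resolve 'package.module.ClassName' to the object it names."""
    # Split off the last dotted component as the attribute (class) name.
    module_path, _, attr_name = path.rpartition('.')
    module = import_module(module_path)
    return getattr(module, attr_name)


# A standard-library class stands in for a pipeline class here.
cls = load_class('collections.OrderedDict')
print(cls.__name__)
```

Because the path names a module, pointing a key at a different module is what makes it possible to keep pipeline classes in a different file.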

Pattern: branching on the spider name

I routed on the spider name, but it was obvious that the code would become harder to follow as the number of spiders grew.

pipelines.py

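class DBPipeline(object):
    def process_item(self, item, spider):
        if spider.name in ['example_spider']:
            # Pipeline processing for example_spider
            pass

        if spider.name in ['example_spider2']:
            # Pipeline processing for example_spider2
            pass

        # Return the item so that any later pipeline still receives it
        return item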

Conclusion

Setting ITEM_PIPELINES in each spider's custom_settings, as shown below,
lets each spider have its own pipeline implementation file.¹

example_spider.py

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    # Per-spider settings; they take precedence over settings.py
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines.ValidationPipeline': 100,
            'example_project.example_pipelines.DBPipeline': 200,
        }
    }

example_spider2.py

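import scrapy

class ExampleSpider2(scrapy.Spider):
    name = 'example_spider2'
    # Per-spider settings; they take precedence over settings.py
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines2.DBPipeline': 100,
        }
    }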

As configured in custom_settings, items are routed to the following pipelines for each spider.
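
To make the routing concrete, here is a simplified model in plain Python of which pipelines a spider ends up with. It is a sketch, not Scrapy's implementation. It assumes that a spider-level ITEM_PIPELINES value replaces the project-level one as a whole, and it relies on the documented rule that pipelines run from the lowest order number to the highest. `effective_pipelines` is a hypothetical helper name.

```python
# Project-wide default, as it would appear in settings.py.
project_settings = {
    'ITEM_PIPELINES': {'example_project.pipelines.DBPipeline': 100},
}

# Per-spider override, as in ExampleSpider.custom_settings.
spider_custom_settings = {
    'ITEM_PIPELINES': {
        'example_project.example_pipelines.DBPipeline': 200,
        'example_project.example_pipelines.ValidationPipeline': 100,
    },
}


def effective_pipelines(project: dict, custom: dict) -> list:
    """Return the pipeline paths a spider would use, in run order."""
    # Assumption: spider-level keys replace project-level keys entirely.
    merged = {**project, **custom}
    pipelines = merged.get('ITEM_PIPELINES', {})
    # Lower order numbers run first.
    return sorted(pipelines, key=pipelines.get)


run_order = effective_pipelines(project_settings, spider_custom_settings)
print(run_order)
```

In this model ValidationPipeline (100) comes before DBPipeline (200), and a spider without custom_settings falls back to the project-wide pipeline.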

example_pipelines.py

import scrapy

class ValidationPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Runs when example_spider.py is executed
        return item

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Runs when example_spider.py is executed
        return item

example_pipelines2.py

import scrapy

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Runs when example_spider2.py is executed
        return item
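
As a rough idea of what a pipeline body might look like, the sketch below fills in a ValidationPipeline that drops incomplete items before they reach the next pipeline. The required field 'title' is an assumption made for illustration and does not come from the original project. DropItem is Scrapy's exception for discarding an item; the fallback class exists only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class ValidationPipeline:
    def process_item(self, item, spider):
        # 'title' is a hypothetical required field, used only for illustration.
        if not item.get('title'):
            raise DropItem('missing title')
        # Returning the item hands it on to the next pipeline (here, DBPipeline).
        return item
```

With the order values 100 and 200 from example_spider.py, an item dropped here never reaches DBPipeline.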

With this, the pipeline code stays easy to follow even as the number of scraping targets (spiders) grows.

¹ Other settings such as SPIDER_MIDDLEWARES can presumably be customized per spider in the same way.
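
As an illustration of this note, a custom_settings value that also sets SPIDER_MIDDLEWARES might look like the fragment below. The middleware module and class name are hypothetical and the order value is arbitrary; in a real project the dict would sit inside the spider class body.

```python
# Would appear inside a spider class body as `custom_settings = {...}`.
custom_settings = {
    'ITEM_PIPELINES': {
        'example_project.example_pipelines.DBPipeline': 100,
    },
    # Hypothetical middleware path and arbitrary order value, for illustration only.
    'SPIDER_MIDDLEWARES': {
        'example_project.example_middlewares.ExampleSpiderMiddleware': 543,
    },
}
```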
