How to download image files with Scrapy

Last updated at 2020-07-29. Posted at 2020-07-25.

Purpose

This article describes a minimal implementation for downloading image files with Scrapy.

Languages and software used

Python 3
Scrapy 1.7.3
Pillow
boto3 (only needed when saving files to S3)

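Process overview

The feature (Item) used differs depending on whether the download target is an image or another kind of file, but the overall flow is the same.
Files can be saved to the local file system, and FTP, S3, and Google Cloud Storage can also be specified as the destination.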
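
As a rough illustration of these options, the settings fragment below shows how the choice of pipeline and storage destination might look. The bucket name and credential values are placeholders and do not come from this article. `FilesPipeline`, `file_urls`, and `FILES_STORE` are Scrapy's counterparts for non-image files.

```python
# settings.py fragment (a sketch; bucket name and credentials are placeholders)

ITEM_PIPELINES = {
    # Images: ImagesPipeline reads the item's `image_urls` field.
    "scrapy.pipelines.images.ImagesPipeline": 1,
    # Other files: FilesPipeline reads the item's `file_urls` field instead
    # and stores them under FILES_STORE.
    # "scrapy.pipelines.files.FilesPipeline": 2,
}

# Local directory
IMAGES_STORE = "/tmp/img/"

# S3 alternative (placeholder values, enable instead of the local path)
# IMAGES_STORE = "s3://example-bucket/img/"
# AWS_ACCESS_KEY_ID = "..."
# AWS_SECRET_ACCESS_KEY = "..."
```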

Open issues

Because each image file is named after the SHA-1 hash of its URL,
the source code below alone cannot do things like grouping images by page.
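
To make the naming issue concrete, the sketch below reproduces the default naming rule (`full/<SHA-1 of the URL>.jpg`) with the standard library only, next to one possible per-page alternative. `per_page_image_path` and its slug format are hypothetical helpers written for illustration. They are not part of Scrapy or of this article's code.

```python
import hashlib
from urllib.parse import urlparse


def default_image_path(image_url: str) -> str:
    """Mirror the default naming: full/<SHA-1 of the image URL>.jpg."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return f"full/{digest}.jpg"


def per_page_image_path(page_url: str, image_url: str) -> str:
    """Hypothetical alternative: one directory per page, same hash name."""
    parsed = urlparse(page_url)
    # "https://example.com/a/b" becomes the directory name "example.com_a_b".
    slug = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "root"
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return f"{slug}/{digest}.jpg"


if __name__ == "__main__":
    print(default_image_path("https://example.com/img/cat.jpg"))
    print(per_page_image_path("https://example.com/a/b",
                              "https://example.com/img/cat.jpg"))
```

Using such a rule in a real crawl would presumably mean subclassing `ImagesPipeline`, overriding its `file_path` method, and passing the page URL along with each request. That wiring is left out here.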

Source code

Create the Item used for downloading images.

items.py

import scrapy

class ScrapyImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs that ImagesPipeline will download

Create the Spider for crawling.
It extracts the URLs of the JPG images on the crawled page and sets them on the Item.

spiders/image_dl.py

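import scrapy

from ..items import ScrapyImageItem  # assumes the standard project layout


class ImageDlSpider(scrapy.Spider):
    name = 'image_dl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Collect the src of every <img> whose URL ends in ".jpg",
        # converting relative paths to absolute URLs for the pipeline.
        image_urls = [
            response.urljoin(src)
            for src in response.xpath('//img/@src').getall()
            if src.endswith('.jpg')
        ]

        item = ScrapyImageItem()
        item['image_urls'] = image_urls

        yield item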
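
The filtering step in `parse` can also be tried outside Scrapy. The sketch below is a standalone variant using only the standard library. `extract_jpg_urls` is a hypothetical helper, and unlike the Spider above it also ignores case and query strings and accepts `.jpeg`. Those extras are assumptions about what a real page may need, not behaviour of the original code.

```python
from urllib.parse import urljoin, urlparse


def extract_jpg_urls(page_url, srcs):
    """Return absolute URLs for the src values whose path ends in .jpg/.jpeg."""
    urls = []
    for src in srcs:
        absolute = urljoin(page_url, src)       # resolve relative paths
        path = urlparse(absolute).path.lower()  # ignore query string and case
        if path.endswith((".jpg", ".jpeg")):
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    srcs = ["a.jpg", "/img/B.JPG?v=2", "logo.png"]
    for url in extract_jpg_urls("https://example.com/page/", srcs):
        print(url)
```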

Configure the Pipeline that downloads the images and the directory where they are saved.

settings.py

# Pipelines
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    # 'ScrapySample.pipelines.ScrapysamplePipeline': 300,
}

# Storage directory
IMAGES_STORE = "/tmp/img/"

# Thumbnail settings for downloaded images
IMAGES_THUMBS = {
    'small': (100, 100),
    'big': (200, 200),
}
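
As a small sketch of what these settings produce on disk, the snippet below lists the paths one image URL should end up at. It assumes the pipeline's default layout: `full/` for the image itself and `thumbs/<name>/` for each thumbnail. `expected_paths` is a hypothetical helper for illustration only.

```python
import hashlib

IMAGES_STORE = "/tmp/img/"
IMAGES_THUMBS = {"small": (100, 100), "big": (200, 200)}


def expected_paths(image_url):
    """List where the full image and each thumbnail should be stored."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    paths = [f"{IMAGES_STORE}full/{digest}.jpg"]
    for thumb_id in IMAGES_THUMBS:
        # One subdirectory per key of IMAGES_THUMBS.
        paths.append(f"{IMAGES_STORE}thumbs/{thumb_id}/{digest}.jpg")
    return paths


if __name__ == "__main__":
    for path in expected_paths("https://example.com/img/cat.jpg"):
        print(path)
```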

With these settings in place, running the Spider stores the JPG images found on the crawled pages in the IMAGES_STORE directory.
