Crawling and downloading images with Python + Scrapy

Posted at 2017-12-06, last updated at 2017-12-06

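About this article

This article uses Scrapy, a scraping framework for Python.
It is a memo on how to crawl a website with Scrapy and download and save its images.
Files are saved under their own individual file names rather than SHA1 hashes.
To keep the focus on saving images, no filter settings are configured. Also, only simple sites like the following are supported:
- Links are implemented plainly, as <a> tags pointing to HTML files
- Images are implemented as <img> tags
As an example, this article crawls every page and image reachable from the top page of Yahoo! News.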

Create a Scrapy project

scrapy startproject test_scrapy
cd test_scrapy

Prepare and implement the Spider

The following command generates save_yahoo_image.py.

scrapy genspider save_yahoo_image news.yahoo.co.jp

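Implement it as shown below. Starting from https://news.yahoo.co.jp, the spider follows every link and parses each page it reaches (by passing it to parse_page).
parse_page collects the src attribute of every <img> tag in the fetched page to obtain the image URLs, and puts them, together with the other required information, into an ImageItem.
ImageItem holds the name of the directory where images are saved (image_directory_name; here, the part of the URL one level above the file name) and the list of image URLs (image_urls). The ImageItem class itself is implemented in a later step.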
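
The directory name is simply whatever follows the last slash of the start URL. A minimal standard-library sketch of that derivation shows one fragile spot: a start URL with a trailing slash yields an empty name. The helper names `dir_name_by_split` and `dir_name_by_host` are hypothetical and not part of the spider, and using `urlparse(...).netloc` is only one possible alternative, not what this article's code does.

```python
from urllib.parse import urlparse


def dir_name_by_split(start_url):
    """Mirror the spider's approach: text after the last '/' of the start URL."""
    return start_url.rsplit("/", 1)[1]


def dir_name_by_host(start_url):
    """Hypothetical alternative: use the host part of the URL."""
    return urlparse(start_url).netloc


print(dir_name_by_split("https://news.yahoo.co.jp"))
print(dir_name_by_split("https://news.yahoo.co.jp/"))
print(dir_name_by_host("https://news.yahoo.co.jp/"))
```

As long as the start URL has no trailing slash and no path, both approaches give the same result.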

save_yahoo_image.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from test_scrapy.items import ImageItem

class SaveYahooImageSpider(CrawlSpider):
    name = 'save_yahoo_image'
    allowed_domains = ["news.yahoo.co.jp"]
    start_urls = ["https://news.yahoo.co.jp"]

    rules = (
        Rule(LinkExtractor(allow=( )), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        print("\n>>> Parse " + response.url + " <<<")
        item = ImageItem()
        # Directory name = text after the last "/" of the start URL (the host name here).
        item["image_directory_name"] = self.start_urls[0].rsplit("/", 1)[1]
        # Resolve every <img src> against the page URL so that relative,
        # root-relative ("/a.png") and protocol-relative ("//host/a.png") paths work.
        item["image_urls"] = [
            response.urljoin(image_url)
            for image_url in response.xpath("//img/@src").extract()
        ]
        return item
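
Image `src` values come in several shapes: absolute, root-relative, protocol-relative, and page-relative. As a sketch of how they can be resolved against the page URL, the standard library's `urljoin` handles all of them, whereas joining strings by hand tends to get only the page-relative case right. The page URL and image paths below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page URL and src values, for illustration only.
page_url = "https://news.yahoo.co.jp/topics/index.html"
src_values = [
    "https://example.com/a.jpg",  # absolute
    "/images/b.jpg",              # root-relative
    "//cdn.example.com/c.jpg",    # protocol-relative
    "thumb/d.jpg",                # relative to the page
]

# Resolve each src against the URL of the page it was found on.
resolved = [urljoin(page_url, src) for src in src_values]
for url in resolved:
    print(url)
```

Inside a spider callback, Scrapy's `response.urljoin(src)` performs the same resolution relative to the response URL.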

Implement ImageItem

Add an ImageItem class to items.py. It only defines the fields to be stored.

items.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.item import Item, Field

class ImageItem(Item):
    image_directory_name = Field()
    image_urls = Field()
    images = Field()

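Configure the image pipeline

Configure the project so that images are saved after the parse_page processing implemented above finishes.
First, add the following two settings to settings.py. They specify that the MyImagesPipeline class, created in the next step, is used and that images are saved under ./savedImages.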

settings.py

ITEM_PIPELINES = {'test_scrapy.pipelines.MyImagesPipeline': 1}
# ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = './savedImages'

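Next, add the image-saving logic to pipelines.py. This step is not mandatory; the standard ImagesPipeline can be used instead (in that case, switch which ITEM_PIPELINES line is commented out in the settings.py above). However, ImagesPipeline saves each image under a file name that is a SHA1 hash, which makes the files hard to identify. So we prepare and use a MyImagesPipeline class that saves each image under the specified directory, using the last part of the URL (XXX.jpg) as the file name.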

pipelines.py

# -*- coding: utf-8 -*-

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.misc import md5sum

# refer: https://stackoverflow.com/questions/31779995/how-to-give-custom-name-to-images-when-downloading-through-scrapy
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'image_directory_name': item["image_directory_name"]})

    def image_downloaded(self, response, request, info):
        checksum = None
        for path, image, buf in self.get_images(response, request, info):
            if checksum is None:
                buf.seek(0)
                checksum = md5sum(buf)
            width, height = image.size
            filename = request.url.rsplit("/", 1)[1]  # last URL segment, e.g. XXX.jpg
            path = 'full/%s/%s' % (response.meta['image_directory_name'], filename)
            self.store.persist_file(
                path, buf, info,
                meta={'width': width, 'height': height})
        return checksum
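
To make the difference between the two naming schemes concrete, the following sketch computes both paths for one URL using only the standard library. `sha1_style_path` approximates the stock pipeline's naming (a SHA1 hash of the URL under `full/`) and `named_path` approximates the custom one; both function names are hypothetical, and the URL is a made-up example. Unlike the pipeline code above, `named_path` strips the query string before taking the file name, which is an extra assumption of this sketch.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def sha1_style_path(url):
    """Approximate the stock ImagesPipeline naming: full/<sha1 of URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % digest


def named_path(directory, url):
    """Build full/<directory>/<last URL segment>, ignoring any query string."""
    filename = posixpath.basename(urlparse(url).path)
    return "full/%s/%s" % (directory, filename)


url = "https://example.com/img/photo.jpg?size=large"  # made-up URL
print(sha1_style_path(url))
print(named_path("news.yahoo.co.jp", url))
```

One trade-off of naming by the last URL segment: two different URLs that end in the same file name overwrite each other, which the hash-based scheme avoids.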

Run it

Running the following command should start saving images into savedImages\full\news.yahoo.co.jp.

scrapy crawl save_yahoo_image

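Caution

Yahoo! News is used here purely as an example. This crawl puts a heavy load on the server, so stop it as soon as you have confirmed that it works. The program specifies neither filters nor termination conditions, so it keeps crawling and downloading indefinitely.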
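
If a bounded test run is wanted, Scrapy has built-in settings that throttle and stop a crawl. The lines below are a sketch of what could be added to settings.py; the setting names are standard Scrapy settings, but the specific values are arbitrary assumptions and not part of the original setup.

```python
# settings.py additions: all values are illustrative assumptions.
ROBOTSTXT_OBEY = True               # respect the site's robots.txt
DOWNLOAD_DELAY = 1.0                # wait about a second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
DEPTH_LIMIT = 2                     # follow links at most two hops from the start URL
CLOSESPIDER_PAGECOUNT = 50          # stop after roughly 50 crawled responses
```

Requests already in flight when a limit is reached may still complete, so the final counts can slightly exceed the configured values.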

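Noticed after posting

I wrote and posted this article without much thought, and only afterwards noticed that it overlaps considerably with another existing article. Well, let's say this one specializes in how to save the images... (It seems everyone picks Yahoo! News as their example after all.)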
