Trying out the Scrapy tutorial

Last updated at 2017-11-06, posted at 2017-11-05

I worked through the official Scrapy tutorial. This write-up assumes Scrapy is already installed.

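The tutorial covers the following steps.

Create a new Scrapy project
Write a spider, the code that crawls a website and extracts data from it
Export the scraped data from the command line
Change the spider to follow links recursively
Use spider arguments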

Creating the project

$ scrapy startproject tutorial

This command creates the project and generates the following directory structure.

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

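Writing a spider

This is where the crawler's main logic goes. The tutorial crawls and scrapes a website of famous quotes. The spider is saved as a Python file under tutorial/spiders/ (the official tutorial names it quotes_spider.py).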

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
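
To see why the saved files end up named quotes-1.html and quotes-2.html, it may help to run the URL-splitting step from parse on its own. The following is a minimal sketch using only plain strings; page_from_url is a hypothetical helper name introduced here for illustration, not part of Scrapy or the tutorial.

```python
def page_from_url(url):
    """Return the second-to-last '/'-separated piece of the URL."""
    return url.split("/")[-2]


url = "http://quotes.toscrape.com/page/1/"

# The trailing slash produces an empty last element, which is why
# the spider uses index -2 rather than -1.
parts = url.split("/")
print(parts)  # ['http:', '', 'quotes.toscrape.com', 'page', '1', '']

page = page_from_url(url)
filename = "quotes-%s.html" % page
print(filename)  # quotes-1.html
```

Note that this indexing depends on the URL ending with a slash; a URL without one would need index -1 instead.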

Running the spider

Go to the project's top-level directory and run the following command.

$ scrapy crawl quotes

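Shorthand that omits the start_requests method

Instead of implementing start_requests, you can define a start_urls class attribute. Scrapy then builds the initial requests from that list and uses parse as the default callback.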

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Extracting the quotes and their authors

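$ scrapy shell url

Experimenting in the shell started by this command (url is the page to inspect, for example http://quotes.toscrape.com/page/1/) shows how to get at the text and author of a quote. Given

>>> quote = response.css('div.quote')[0]

the two values can be extracted like this:

>>> text = quote.css('span.text::text').extract_first()
>>> author = quote.css('small.author::text').extract_first()

Doing the same for every quote in a loop gives the following parse method.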

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The spider then yields one dictionary per quote, holding its text, author, and tags.

Saving the crawled and scraped data

The simplest way is the following command.

$ scrapy crawl quotes -o quotes.json
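
The exported file is a JSON array of the dictionaries yielded by parse, so it can be read back with the standard json module. This is a minimal sketch under that assumption: it writes a small stand-in file with placeholder items instead of real crawl output, and load_quotes and authors are hypothetical helper names.

```python
import json
import os
import tempfile

# Placeholder items with the same keys the spider yields (not real crawl output).
sample_items = [
    {"text": "Sample quote one.", "author": "Author One", "tags": ["a", "b"]},
    {"text": "Sample quote two.", "author": "Author Two", "tags": []},
]


def load_quotes(path):
    """Load a JSON export and return it as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def authors(items):
    """Return the distinct author names in sorted order."""
    return sorted({item["author"] for item in items})


# Write the stand-in file to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_items, f)
    loaded = load_quotes(path)

print(len(loaded), authors(loaded))
```

After a real crawl, load_quotes("quotes.json") would be called from the project's top-level directory instead. As the official tutorial notes, -o appends to an existing file, so running the command twice without deleting quotes.json leaves a file that is no longer valid JSON.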

Following links

The goal is to follow the link enclosed in the tags below.

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Only the link in href= is needed, so it is extracted as follows by selecting the href attribute of the a tag.

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

The spider code, changed to follow links recursively, is shown below.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
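
The extracted href is relative ('/page/2/'), so it has to be resolved against the current page URL before it can be requested; that is the job of response.urljoin. The sketch below reproduces this step with the standard library's urllib.parse.urljoin, on the assumption that it resolves these simple cases the same way. next_page_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    if href is None:
        # No "next" link: this mirrors the `if next_page is not None` check.
        return None
    return urljoin(page_url, href)


print(next_page_url('http://quotes.toscrape.com/page/1/', '/page/2/'))
# http://quotes.toscrape.com/page/2/
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
# None
```

The None case is what ends the crawl: on the last page there is no li.next element, so no further request is yielded.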

By changing the callback=self.[method_name] part of the last line to any other method name, you can also handle pages that need to be parsed differently. For example, you could write a parse_child() method.
