
Collecting data with Scrapy and storing it in MongoDB

Last updated at 2016-09-20. Posted at 2015-10-26.

Google uses Googlebot to gather information for its search engine. Starting from a given website, it automatically follows that site's links and collects information.

With Python's Scrapy module, you can do something similar.
Let's use Scrapy to collect information from a site.

Setup

Install Scrapy with pip. The pipeline in this article also imports pymongo, so install it as well.
$ pip install scrapy pymongo

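Usage

Scrapy is organized in projects. After generating a project, you edit the following auto-generated files.

items.py: defines the data to extract
Spider (crawler) files under spiders/: crawl rules and data extraction rules
pipelines.py: where the extracted data goes. Here, MongoDB
settings.py: crawl conditions (frequency, depth, and so on)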

Creating a project

First, create a project.
$ scrapy startproject tutorial
This generates a folder like the following.

tutorial/

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
            ...

Defining the data to extract

Define what you want to collect. In database terms, this is the field definition.

items.py

import scrapy

class WebItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    date = scrapy.Field()

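Creating the spider

This is the star of the show: the file that crawls the web and extracts data. It specifies the start address, the crawl rules, and the data extraction rules.

Generating the spider

Generate a spider. The syntax is $ scrapy genspider [options] <name> <domain>, where <domain> is the domain of the site to crawl.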

commandline

$ scrapy genspider webspider example.com
  Created spider 'webspider' using template 'basic' in module:
  tutorial.spiders.webspider

The generated file:

tutorial/spiders/webspider.py

# -*- coding: utf-8 -*-
import scrapy

class WebspiderSpider(scrapy.Spider):
    name = "webspider"   # Name within the project; used to select the spider when running it
    allowed_domains = ["example.com"]  # Domains the spider is allowed to crawl
    start_urls = (
        'http://www.example.com/',  # Starting point; several URLs can be listed
    )

    def parse(self, response):  # Extraction logic goes here
        pass

A file like this is generated.
Customize it to your liking.

tutorial/spiders/webspider.py

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import WebItem
import re
import datetime
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WebspiderSpider(CrawlSpider):  # The class name itself does not matter much
    name = 'WebspiderSpider'  # This matters: the spider (crawler) is run by this name
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    xpath = {
        'title' : "//title/text()",
    }

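    list_allow = [r'(regular expression)']  # Links matching these patterns are followed
    list_deny = [
                r'/example/hogehoge/hoge/',  # Example of a link not to follow; more patterns can be added to the list
            ]
    list_allow_parse = [r'(regular expression)']  # Links to extract data from
    list_deny_parse = [                # Links not to extract data from
                r'(regular expression)',
                r'(regular expression)',
                ]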

    rules = (
        # Crawl rule
        Rule(LinkExtractor(
            allow=list_allow,
            deny=list_deny,
            ),
            follow=True  # Follow the matched links
        ),
        # Data extraction rule
        Rule(LinkExtractor(
            allow=list_allow_parse,
            deny=list_deny_parse,
            unique=True  # Do not extract data from the same link twice
            ),
            callback='parse_items'  # For matching links, call the extraction method named here
        ),
    )

    # Data extraction method
    def parse_items(self, response):  # response holds the fetched page
        item = WebItem()  # The class defined in items.py
        item['title'] = response.xpath(self.xpath['title']).extract()[0]
        item['link'] = response.url
        item['date'] = datetime.datetime.utcnow() + datetime.timedelta(hours=9)  # Current time, stored as Japan time (UTC+9)

        yield item

See the comments for what each part does.
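
The allow and deny lists in the rules decide which links a LinkExtractor keeps. As a rough illustration, here is a standard-library sketch of that filtering logic, assuming the documented behavior that a URL must match at least one allow pattern (an empty allow list accepts everything) and that deny takes precedence over allow. The function name `is_link_allowed` and the sample patterns and URLs are hypothetical, and this is a simplification rather than Scrapy's actual implementation.

```python
import re


def is_link_allowed(url, allow=(), deny=()):
    """Approximate LinkExtractor's allow/deny decision for one URL.

    Hypothetical helper for illustration only; Scrapy itself also applies
    domain, extension, and other filters that are ignored here.
    """
    # Any deny match rejects the URL, even if an allow pattern also matches.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # An empty allow list accepts every URL that was not denied.
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


# Sample patterns and URLs (assumptions, not taken from a real site).
allow_patterns = [r'/articles/\d+']
deny_patterns = [r'/articles/\d+/print']

for url in [
    'http://www.example.com/articles/123',
    'http://www.example.com/articles/123/print',
    'http://www.example.com/about',
]:
    print(url, is_link_allowed(url, allow_patterns, deny_patterns))
```

Trying patterns against a handful of known URLs like this, before starting a crawl, is a cheap way to check that the rules select the pages you expect.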

Editing pipelines.py

The items yielded by the spider above are stored in MongoDB.

pipelines.py

from pymongo import MongoClient  # Connection to MongoDB
import datetime

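class TutorialPipeline(object):

    # Collection to write to ('items' is an assumed name; change it as needed)
    collection_name = 'items'

    def __init__(self, mongo_uri, mongo_db, mongolab_user, mongolab_pass):
        # Initialize attributes from the arguments passed at instantiation
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongolab_user = mongolab_user
        self.mongolab_pass = mongolab_pass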

    @classmethod  # Receives the class as an argument, so class attributes are accessible
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),  # Read the variables defined in settings.py
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            mongolab_user=crawler.settings.get('MONGOLAB_USER'),
            mongolab_pass=crawler.settings.get('MONGOLAB_PASS')
        )  # These become the arguments of __init__

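    def open_spider(self, spider):  # Runs when the spider starts. Connect to the database
        # Database.authenticate() was removed in PyMongo 4, so the credentials
        # are passed to MongoClient and checked against the target database
        self.client = MongoClient(
            self.mongo_uri,
            username=self.mongolab_user,
            password=self.mongolab_pass,
            authSource=self.mongo_db,
        )
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):  # Runs when the spider finishes. Close the database connection
        self.client.close()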

    def process_item(self, item, spider):
        # update_one replaces Collection.update(), which was removed in PyMongo 4
        self.db[self.collection_name].update_one(
            {u'link': item['link']},
            {"$set": dict(item)},
            upsert=True
        )  # Look up by link: insert if missing, update if present

        return item

It looks like a lot, but all it does is open the database, push the data in, and close it when done.
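
The upsert in process_item can be pictured with an in-memory stand-in. The sketch below uses a plain dict keyed by link in place of a MongoDB collection. The name `upsert_by_link` and the sample items are hypothetical, and the code only mimics the "insert if missing, otherwise update the given fields" behavior of `$set` with `upsert=True`; it does not talk to MongoDB.

```python
def upsert_by_link(collection, item):
    """Insert item under its link, or update the fields of an existing entry.

    `collection` is a plain dict standing in for a MongoDB collection
    (an assumption made for this illustration).
    """
    # setdefault creates an empty document when the link is new.
    document = collection.setdefault(item['link'], {})
    # Like "$set": overwrite only the fields present in the item.
    document.update(item)
    return document


collection = {}
upsert_by_link(collection, {'link': 'http://www.example.com/a', 'title': 'Old title'})
upsert_by_link(collection, {'link': 'http://www.example.com/a', 'title': 'New title'})
upsert_by_link(collection, {'link': 'http://www.example.com/b', 'title': 'Another page'})

print(len(collection))  # 2, because the second call updated the first entry
print(collection['http://www.example.com/a']['title'])  # New title
```

Keying on link is what keeps a repeated crawl from piling up duplicate records for the same page.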

settings.py

First, define the variables that pipelines.py reads.

settings.py

MONGO_URI = 'hogehoge.mongolab.com:(port number)'
MONGO_DATABASE = 'database_name'
MONGOLAB_USER = 'user_name'
MONGOLAB_PASS = 'password'

This is an example for mongolab.
settings.py is also where you configure the crawler's behavior.

settings.py

REDIRECT_MAX_TIMES = 6
RETRY_ENABLED = False
DOWNLOAD_DELAY = 10
COOKIES_ENABLED = False

These settings cap redirects at 6, disable retries, space requests to the website 10 seconds apart, and disable cookies.
If you do not set DOWNLOAD_DELAY, the spider hits the site as fast as it can and puts a heavy load on it. Don't do that.
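
To get a feel for what DOWNLOAD_DELAY = 10 means in practice, the following back-of-the-envelope sketch estimates how long a crawl takes when requests to one site are spaced by a fixed delay. The function name `estimate_crawl_seconds` and the page count of 1,000 are hypothetical. Real crawls also spend time downloading, and Scrapy by default randomizes the actual wait between 0.5 and 1.5 times DOWNLOAD_DELAY, so treat the result as a rough figure.

```python
def estimate_crawl_seconds(page_count, download_delay):
    """Rough estimate: one delay between each pair of consecutive requests."""
    if page_count <= 0:
        return 0
    return (page_count - 1) * download_delay


seconds = estimate_crawl_seconds(page_count=1000, download_delay=10)
print(seconds)                   # 9990
print(round(seconds / 3600, 2))  # the same figure in hours
```

A polite delay makes large crawls slow, which is a good reason to keep the allow and deny rules tight.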

Running

Let's run it.

commandline

$ scrapy crawl WebspiderSpider

The spider follows link after link and extracts data from the links that match the rules.
