
Scraping anime data with Scrapy (Part 2)

Last updated at 2018-12-21, posted at 2018-07-25

What will you learn from this article?

・How to build a basic crawler with scrapy
・How to connect scrapy to MongoDB

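What does this article do?

This article introduces the spider (crawler).
It continues from the previous article, where we got scraping with xpath working. This time we make the scraping run automatically across multiple web pages.
The code is on GitHub.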

Overview of the spider

A spider scrapes pages while moving from one web page to the next.

A spider has two jobs:
① Scrape a web page
② Move to another web page
Let's look at the code.

# parse() is a method of the spider class; Request must be imported,
# e.g. with `from scrapy import Request`.
def parse(self, response):
    # Stop crawling once the ID reaches the upper bound.
    if self.num == self.limit_id:
        return

    anime_id = self.num
    title = response.xpath('//*[@id="clm24"]//h2/a[@class="blk_lnk"]/text()').extract_first()
    point = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[1]/span[2]/text()').extract()
    point_story = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[2]/span[2]/text()').extract()
    point_animation = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[3]/span[2]/text()').extract()
    point_vc = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[4]/span[2]/text()').extract()
    point_music = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[5]/span[2]/text()').extract()
    point_chara = response.xpath('//*[@id="main"]/div[2]/div[2]/div[1]/div[6]/span[2]/text()').extract()
    total_point = response.xpath('//*[@id="main"]/div[2]/div[2]/div[2]/div[1]/span[2]/text()').extract()
    review_num = response.xpath('//*[@id="main"]/div[2]/div[2]/div[2]/div[2]/span[2]/text()').extract()
    fav_num = response.xpath('//*[@id="main"]/div[2]/div[2]/div[2]/div[3]/span[2]/text()').extract()
    ranking = response.xpath('//*[@id="main"]/div[2]/div[2]/div[2]/div[4]/span[2]/text()').extract()
    summary = response.xpath('//*[@id="main"]/div[2]/div[3]/blockquote/text()').extract()

    # Progress output: print the ID that was just scraped.
    print("____________________________________________________")
    print(self.num)

    self.num += 1

    # If the anime page does not exist, the site redirects to the home page,
    # so title is None and nothing is output for that ID.
    if title is not None:
        yield {
            "anime_id": anime_id, "title": title, "point": point,
            "point_story": point_story, "point_animation": point_animation,
            "point_vc": point_vc, "point_music": point_music, "point_chara": point_chara,
            "total_point": total_point, "review_num": review_num, "fav_num": fav_num,
            "ranking": ranking, "summary": summary,
        }

    # Crawl the next anime page whether or not this one existed.
    next_url = 'http://anikore.jp/anime/' + str(self.num) + '/'
    yield Request(next_url, callback=self.parse, dont_filter=True)

Key points of the code
① The URLs follow the pattern https://www.anikore.jp/anime/[anime id]/, so the spider reaches every anime page by incrementing the anime id.
② response.xpath() extracts each field from the page.
③ yield {"anime_id":anime_id,"title":title,...,} outputs the scraped data.
④ yield Request(next_url, callback=self.parse, dont_filter=True) moves on to the next anime page.
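
The control flow behind points ①, ③ and ④ can be tried out without Scrapy. The following is only a minimal sketch: `anime_url`, `crawl` and `fetch_title` are hypothetical names that do not appear in the original code, and `fetch_title` stands in for downloading a page and running the title xpath. It loops with `<` instead of the `==` check in `parse`, so that a start id above the limit simply yields nothing.

```python
def anime_url(anime_id):
    """Build the page URL for one anime id (same pattern as the spider)."""
    return 'http://anikore.jp/anime/' + str(anime_id) + '/'


def crawl(fetch_title, start_id, limit_id):
    """Visit ids from start_id up to (not including) limit_id.

    fetch_title is a hypothetical stand-in for "download the page and run
    the title xpath"; it must return None when the page does not exist.
    """
    num = start_id
    while num < limit_id:
        title = fetch_title(anime_url(num))
        # Missing pages give title None and are skipped, as in parse().
        if title is not None:
            yield {"anime_id": num, "title": title}
        # Move on to the next id whether or not the page existed.
        num += 1


if __name__ == "__main__":
    # Fake site for illustration: only ids 1 and 3 exist.
    fake_site = {anime_url(1): "Title A", anime_url(3): "Title C"}
    for item in crawl(fake_site.get, 1, 5):
        print(item)
```

In the real spider the loop is not explicit: each call to `parse` yields the next `Request`, and Scrapy calls `parse` again when that page arrives.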

Connecting to MongoDB

scrapy has a structure called a pipeline. For scrapy's overall architecture, @checkpoint's article is very easy to follow and well worth reading.
pipeline.py is shown below.


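from pymongo import MongoClient  # connection to MongoDB


class MongoDBPipeline(object):

    def __init__(self, server, port, db_name, collection_name):
        # Initialize the connection from the arguments passed at instantiation.
        connection = MongoClient(server, port)
        db = connection[db_name]
        self.collection = db[collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf.settings is deprecated; read the settings from the crawler.
        s = crawler.settings
        return cls(s['MONGODB_SERVER'], s.getint('MONGODB_PORT'),
                   s['MONGODB_DB'], s['MONGODB_COLLECTION'])

    def process_item(self, item, spider):
        # Collection.insert is deprecated (removed in PyMongo 4); use insert_one.
        self.collection.insert_one(dict(item))
        return item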

There are four MONGODB_* variables; they are defined in setting.py.




# mongoDB settings
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
# db name
MONGODB_DB = 'anikore'
# collection name
MONGODB_COLLECTION = "users"

In this case, a collection named users is created inside a database named anikore, and the data is stored there.
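
Scrapy only runs a pipeline that is listed in the `ITEM_PIPELINES` setting, so the settings file also needs an entry along these lines. This is a sketch, not the author's actual configuration: the dotted path `myproject.pipelines.MongoDBPipeline` is a placeholder and has to be replaced with the real package and module name of the project.

```python
# Settings fragment that enables the MongoDB pipeline.
# Assumption: "myproject.pipelines" is a placeholder module path; use the
# package and file where MongoDBPipeline is actually defined.
ITEM_PIPELINES = {
    # The number is the run order (conventionally 0-1000, lower runs first).
    "myproject.pipelines.MongoDBPipeline": 300,
}
```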

Results

The following command runs the spider.

scrapy crawl anime

You can also write the output to a csv file at the same time.

scrapy crawl anime -o anime.csv

You should then see the spider moving around in the terminal.
Afterwards, if the data has been saved in MongoDB, it worked. Nice.

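Going further

It may seem that the spider pattern shown here can collect any data from any site. In practice it is not that simple. For example, user data cannot be viewed without logging in, so the spider has to log in through the login page. This can be done with Scrapy's FormRequest.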

Thank you for reading.
