
Data entry was a pain, so I scraped it instead

Posted at 2019-08-11, last updated at 2020-10-15

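Overview

A school assignment required some data entry, and since doing it by hand was tedious, I tried scraping instead.
This time I used the Scrapy framework to fetch the player profiles of the Yokohama DeNA BayStars and save them to CSV.
The source website is: http://npb.jp/bis/teams/rst_db.html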

Environment

Windows 10
Ubuntu 18.04
Python 3
(Environment setup is omitted.)

spider

The data is extracted with XPath.


# -*- coding: utf-8 -*-
import scrapy

from baseball.items import BatterItem

class BatterSpider(scrapy.Spider):
    name = 'DeNA'
    allowed_domains = ['npb.jp']
    start_urls = ['http://npb.jp/bis/teams/rst_db.html']

    def parse(self, response):
        # "//" matches nodes at any depth; "@" selects an attribute (here, id)
        for tr in response.xpath('//*[@id="tedivmaintbl"]/div[4]/table').xpath('tr'):
            item = BatterItem()

            item['number'] = tr.xpath('td[1]/text()').extract_first()
            # re_first returns None when nothing matches, so set the name only on a match
            name_str = tr.xpath('td[2]/a/text()').re_first(r'\w+\s*\w+')
            if name_str is not None:
                # Remove full-width spaces
                item['name'] = name_str.replace('　', '')

            item['day'] = tr.xpath('td[3]/text()').extract_first()
            item['height'] = tr.xpath('td[4]/text()').extract_first()
            item['weight'] = tr.xpath('td[5]/text()').extract_first()
            item['pit'] = tr.xpath('td[6]/text()').extract_first()
            item['bat'] = tr.xpath('td[7]/text()').extract_first()

            yield item
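
To make the name handling above easier to follow, here is a minimal standalone sketch of the same idea using only the standard library. It assumes that re_first behaves like a plain regex search on the cell text; normalize_name and the sample name are hypothetical and are not part of the spider.

```python
import re

# Same pattern as in the spider: word characters, optional whitespace, word characters.
NAME_PATTERN = re.compile(r'\w+\s*\w+')


def normalize_name(raw):
    """Return the matched name without full-width spaces, or None if nothing matches."""
    match = NAME_PATTERN.search(raw)
    if match is None:
        return None
    # '\u3000' is the full-width (ideographic) space that separates family and given name.
    return match.group(0).replace('\u3000', '')


print(normalize_name('山田\u3000太郎'))  # hypothetical name containing a full-width space
print(normalize_name(''))               # an empty cell gives no match
```

Because the pattern needs at least two word characters, a cell with no link text (a header row, for example) yields None, which is why the spider checks the result before calling replace.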

items.py

This file defines the object that holds the scraped data.


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

class BatterItem(Item):
    name = Field()      # name
    bat = Field()       # bats right, left, or switch
    pit = Field()       # throws right or left
    number = Field()    # uniform number
    day = Field()       # date of birth
    height = Field()    # height
    weight = Field()    # weight

settings.py

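The settings I added this time are as follows:

Obey robots.txt.
Leave an interval between downloads.
Cache responses, and use the cache when re-running.
The CSV output came out garbled, so set the character encoding to match Windows.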


ROBOTSTXT_OBEY = True  # whether to obey robots.txt
DOWNLOAD_DELAY = 10  # delay between downloads, in seconds
HTTPCACHE_ENABLED = True  # whether to cache responses
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24  # cache lifetime in seconds (24 hours)
HTTPCACHE_DIR = 'httpcache'  # cache directory
FEED_EXPORT_ENCODING = 'shift_jis'  # character encoding of the output file
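
As a rough illustration of why the output encoding matters, the sketch below serializes the same rows as Shift_JIS and as UTF-8, then decodes both the way a reader that assumes Shift_JIS would. It uses only the standard library rather than Scrapy's exporter; rows_to_csv_bytes and the sample row are hypothetical.

```python
import csv
import io


def rows_to_csv_bytes(rows, encoding):
    """Serialize rows to CSV text and encode that text with the given codec."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [["number", "name"], ["0", "山田太郎"]]  # hypothetical sample row

sjis_bytes = rows_to_csv_bytes(rows, "shift_jis")
utf8_bytes = rows_to_csv_bytes(rows, "utf-8")

# A reader that assumes Shift_JIS recovers the text only from the Shift_JIS bytes.
print(sjis_bytes.decode("shift_jis"))
# The UTF-8 bytes decoded the same way come out garbled.
print(utf8_bytes.decode("shift_jis", errors="replace"))
```

One caveat with this choice: characters that have no Shift_JIS representation cannot be encoded at all, so it only suits data that stays within that character set.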

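After that, running the crawl command (for example, scrapy crawl DeNA -o players.csv, where the output filename is arbitrary) saves the data to CSV, and that's it.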

Summary

This time I saved the data to CSV because it was for an assignment, but I'd like to try loading it into a database and doing some data analysis.
