
Building a Crawling App with scrapy and beautifulsoup4 #1

Posted at 2021-02-12

Background

While looking into how to build a crawling app, I heard that scrapy makes it easy,
so I decided to build one as a way to learn both scrapy and scraping.

Goals

・Collect job information from multiple freelance job sites
・Store the collected data in CSV or a DB (still undecided)
・Plot a histogram to see which languages are most in demand

What this article covers

Installing scrapy
Creating a project with scrapy
Creating a spider
Implementation (partial)

Prerequisites

I installed scrapy by following the article below.
https://qiita.com/Chanmoro/items/f4df85eb73b18d902739

Environment

Windows 10 Home, version 20H2
Python 3.7.8

1. Installing scrapy

First, install scrapy with pip.

py -m pip install scrapy

After installing, check the version. If it prints as shown below, the installation succeeded.

py -m scrapy version
Scrapy 2.4.1

2. Creating a project with scrapy

Creating a project is easy: just run the command below.
Since I want to scrape job sites, I named the project career.

py -m scrapy startproject career

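Running it creates the folder structure below.
Once it exists, project creation is done.
(The listing was taken later, so it also includes levtech.py and the __pycache__ folders, which do not exist right after startproject.)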

career
    │  scrapy.cfg
    │
    └─career
        │  items.py
        │  middlewares.py
        │  pipelines.py
        │  settings.py
        │  __init__.py
        │
        ├─spiders
        │  │  levtech.py (generated in step 3)
        │  │  __init__.py
        │  │
        │  └─__pycache__
        │          levtech.cpython-37.pyc
        │          __init__.cpython-37.pyc
        │
        └─__pycache__
                items.cpython-37.pyc
                pipelines.cpython-37.pyc
                settings.cpython-37.pyc
                __init__.cpython-37.pyc

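3. Creating a spider

The project needed for crawling is now in place.
All that remains is to create the file that scrapes the pages.
Apparently this is called a spider.

Run the commands below to create the spider, then write the scraping code in the generated file.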


cd career
py -m scrapy genspider levtech freelance.levtech.jp

levtech can be any name you like.
The final argument, freelance.levtech.jp, is the domain name of the site you want to scrape.
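
The domain passed to genspider ends up in the generated spider's allowed_domains list, which Scrapy uses to drop requests that leave the site. As a rough, stdlib-only illustration of that kind of check (a simplified sketch, not Scrapy's actual implementation; the function name is_allowed and the example.com URL are made up for this example):

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["freelance.levtech.jp"]
# The search page on the target site passes the check.
print(is_allowed("https://freelance.levtech.jp/project/search/", allowed))
# A link to an unrelated site does not.
print(is_allowed("https://example.com/", allowed))
```

This is why the argument should be a bare domain such as freelance.levtech.jp rather than a full URL.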

That is all the files we need. The rest is implementation.

4. Scraping a freelance job site and implementing the code that writes to CSV

The site scraped this time is the following:
https://freelance.levtech.jp/project/search/

Files to edit

There are quite a few files, but only the following four need to be touched.

items.py
settings.py
pipelines.py
levtech.py

items.py

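This file defines the fields to be collected by scraping.
This time I extract the following:

Job title
Fee
Nearest station
Contract type
Languages used
Required skills
Job description

The items.py I wrote is shown below.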

items.py

import scrapy


class CareerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_title = scrapy.Field()
    fee = scrapy.Field()
    nearest_station = scrapy.Field()
    contract = scrapy.Field()
    language = scrapy.Field()
    skill = scrapy.Field()
    job_content = scrapy.Field()

Each field only needs to be declared as **scrapy.Field()**.

settings.py

This file holds the settings used while crawling,
such as the delay between requests and whether to obey robots.txt.
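
To make "obeying robots.txt" concrete, here is a minimal sketch using Python's standard urllib.robotparser. It only illustrates the idea of checking a URL against the rules before fetching it and is not how Scrapy does it internally. The robots.txt content, the example.com URLs and the helper can_crawl are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(url, user_agent="career"):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(can_crawl("https://example.com/project/search/"))  # not under /private/
print(can_crawl("https://example.com/private/page"))     # matches the Disallow rule
```

With ROBOTSTXT_OBEY = True, Scrapy applies this kind of check itself, so the spider code does not have to.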

The settings.py I created:

settings.py

# Scrapy settings for career project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'career'

SPIDER_MODULES = ['career.spiders']
NEWSPIDER_MODULE = 'career.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Name sent as the user agent
USER_AGENT = 'career'

# Obey robots.txt rules
# Whether to obey the access rules in robots.txt
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'career.middlewares.CareerSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'career.middlewares.CareerDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'career.pipelines.CareerPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Log output level
LOG_LEVEL = 'INFO'

ITEM_PIPELINES = {
    'career.pipelines.CareerCsvWriterPipeline': 1,
}

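The ITEM_PIPELINES setting at the end enables the pipeline that writes the scraped data to CSV.
The number 1 determines the order in which pipelines run:
pipelines with smaller numbers run first (values are conventionally in the 0-1000 range).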
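
The real pipelines.py is left for later. Just to give a feel for what a class registered under ITEM_PIPELINES can look like, a minimal sketch might be the following. It is an illustration only, not the code used in this project. It assumes the Scrapy 2.4-style hooks open_spider, process_item and close_spider, and the output path jobs.csv and the FIELDS list are assumptions made for this example.

```python
import csv

# Column order for the CSV; mirrors the fields declared in items.py.
FIELDS = [
    "job_title", "fee", "nearest_station", "contract",
    "language", "skill", "job_content",
]


class CareerCsvWriterPipeline:
    """Hypothetical sketch: write every scraped item as one CSV row."""

    def __init__(self, path="jobs.csv"):
        # "jobs.csv" is an assumed default, not a name from the article.
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write the header.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Called for every item; fields the spider left unset become empty cells.
        self.writer.writerow(dict(item))
        return item  # hand the item on to any pipeline that runs later

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Because process_item returns the item, a second pipeline with a larger number in ITEM_PIPELINES would receive the same item afterwards.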

Closing

That's it for today.
The remaining files, levtech.py and pipelines.py, are the heart of the app,
so I plan to walk through them with explanations.
