
Debugging Scrapy in PyCharm

Posted at 2020-08-04

Purpose

A quick note, since I wanted to debug Scrapy from inside PyCharm.


Environment

PyCharm Professional
Python 3.6
Scrapy 2.2.1


Usage example

Even the Scrapy documentation only describes running a spider from the command line, as shown below, so it cannot be launched under the debugger as-is.

scrapy-crawl
$ scrapy crawl myspider
... myspider starts crawling ... 
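
The scrapy command installed into the virtualenv is just a console-script entry point for scrapy.cmdline.execute(), which is why pointing PyCharm directly at cmdline.py (next section) behaves the same as running scrapy crawl. A rough sketch of the equivalence (simplified, not the actual wrapper source):

scrapy_equivalent.py
# What the "scrapy" console command effectively does:
# hand sys.argv (e.g. ["scrapy", "crawl", "myspider"]) to execute().
from scrapy.cmdline import execute

execute()  # parses sys.argv and dispatches to the requested subcommand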

Configuration

Set the following in PyCharm's Run/Debug Configurations.

Item | Setting
Script path | venv\Lib\site-packages\scrapy\cmdline.py (point to cmdline.py inside the installed Scrapy package)
Parameters | crawl [spider name]
Run with Python Console | checked

(Screenshot of the Run/Debug Configurations dialog: pycharm_config.png)
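
Alternatively, instead of pointing Script path at cmdline.py, you can run or debug a small entry script placed next to scrapy.cfg. This is a minimal sketch; the file name run_spider.py and the spider name myspider are assumptions, not part of the configuration above.

run_spider.py
# Hypothetical helper script: debugging this file in PyCharm
# is equivalent to running "scrapy crawl myspider" in the project root.
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute(["scrapy", "crawl", "myspider"])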


Execution log

Breakpoints can also be set in the spider source, for example inside a parse() callback, as sketched below.
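
With a breakpoint set inside parse(), PyCharm stops as soon as the response arrives. The spider below is a minimal sketch for illustration; it is an assumption, not the actual ScrapySample spider.

example_spider.py
# Minimal illustrative spider (hypothetical, not the project's real spider).
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        # Set a breakpoint here; execution pauses when the response
        # for https://www.example.com is received.
        yield {"title": response.css("title::text").get()}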

log
2020-08-04 22:11:24 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: ScrapySample)
2020-08-04 22:11:24 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.18362-SP0
2020-08-04 22:11:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-04 22:11:24 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ScrapySample',
 'NEWSPIDER_MODULE': 'ScrapySample.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['ScrapySample.spiders']}
2020-08-04 22:11:24 [scrapy.extensions.telnet] INFO: Telnet Password: f615fed52e270d55
2020-08-04 22:11:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-08-04 22:11:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-04 22:11:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-04 22:11:24 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2020-08-04 22:11:24 [scrapy.core.engine] INFO: Spider opened
2020-08-04 22:11:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-04 22:11:24 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-04 22:11:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2020-08-04 22:11:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com> (referer: None)
2020-08-04 22:11:24 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-04 22:11:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9061,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.354765,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 8, 4, 13, 11, 24, 732365),
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 8, 4, 13, 11, 24, 377600)}
2020-08-04 22:11:24 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
