More than 5 years have passed since last update.

For those who could not get icrawler to work


Environment

Under Anaconda, I created a virtual environment with Python 3.6 and installed jupyter and icrawler into it via conda install.
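
Before debugging the crawler itself, it can help to confirm which interpreter and packages the notebook is actually using. The following is a minimal sketch that uses only the standard library. The helper name describe_environment is hypothetical and is not part of icrawler.

```python
import importlib.util
import sys


def describe_environment(packages=("icrawler",)):
    """Return the Python version and whether each top-level package is importable."""
    info = {"python": "{}.{}".format(sys.version_info[0], sys.version_info[1])}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found
        info[name] = importlib.util.find_spec(name) is not None
    return info


if __name__ == "__main__":
    # Run this in the same kernel or environment that runs the crawler
    print(describe_environment())
```

If icrawler is reported as not importable, the notebook kernel is probably not the environment where the package was installed.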

Introduction

In my environment, the standard icrawler setup produced the result below and did not work.
Reference URL (https://icrawler.readthedocs.io/en/latest/)

icrawler.py
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(keyword='cat', max_num=100)
Result
2020-04-17 16:27:56,617 - INFO - icrawler.crawler - start crawling...
2020-04-17 16:27:56,618 - INFO - icrawler.crawler - starting 1 feeder threads...
2020-04-17 16:27:56,618 - INFO - feeder - thread feeder-001 exit
2020-04-17 16:27:56,619 - INFO - icrawler.crawler - starting 1 parser threads...
2020-04-17 16:27:56,620 - INFO - icrawler.crawler - starting 1 downloader threads...
2020-04-17 16:27:56,975 - INFO - parser - parsing result page https://www.google.com/search?q=cat&ijn=0&start=0&tbs=&tbm=isch
2020-04-17 16:27:59,010 - INFO - parser - no more page urls for thread parser-001 to parse
2020-04-17 16:27:59,011 - INFO - parser - thread parser-001 exit
2020-04-17 16:28:01,622 - INFO - downloader - no more download task for thread downloader-001
2020-04-17 16:28:01,623 - INFO - downloader - thread downloader-001 exit
2020-04-17 16:28:01,627 - INFO - icrawler.crawler - Crawling task done!

At this point, no images had been downloaded.
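
Because the log ends with "Crawling task done!" even when nothing was saved, checking the storage directory directly is a more reliable test. Here is a minimal sketch, assuming the images are written under the root_dir passed to the crawler (your_image_dir above). The function name count_images and the extension list are hypothetical choices made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the crawler saves
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root_dir):
    """Count image files under root_dir; a missing directory counts as 0."""
    root = Path(root_dir)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


if __name__ == "__main__":
    print(count_images("your_image_dir"))
```

A count of 0 after a run like the one above confirms that the crawl finished without saving anything.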

The configuration that worked

Reference URL (https://icrawler.readthedocs.io/en/latest/builtin.html)

icrawler.py
from icrawler.builtin import BaiduImageCrawler, BingImageCrawler, GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
filters = dict(
    size='large',
    color='orange',
    license='commercial,modify',
    date=((2017, 1, 1), (2017, 11, 30)))
google_crawler.crawl(keyword='cat', filters=filters, offset=0, max_num=1000,
                     min_size=(200,200), max_size=None, file_idx_offset=0)

bing_crawler = BingImageCrawler(downloader_threads=4,
                                storage={'root_dir': 'your_image_dir'})
bing_crawler.crawl(keyword='cat', filters=None, offset=0, max_num=1000)

baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'your_image_dir'})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=1000,
                    min_size=(200,200), max_size=None)

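This worked.
Once you get it running even once, you can probably tune it from there, so I hope this helps someone who is stuck the way I was to get unstuck, at least for the time being.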
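
Since it is not obvious in advance which search engine will return images, one way to get unstuck is to try several crawlers in turn and stop at the first one that actually saves files. The sketch below is an illustration under assumptions, not part of icrawler: crawl_with_fallback and build_default_crawls are hypothetical helpers, and success is judged only by whether new files appear in the storage directory. The crawler classes and crawl arguments are the ones used in the configuration above.

```python
from pathlib import Path


def count_files(root_dir):
    """Count all files under root_dir; a missing directory counts as 0."""
    root = Path(root_dir)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob("*") if path.is_file())


def crawl_with_fallback(crawls, root_dir):
    """Run (name, callable) pairs in order.

    Returns the name of the first crawl that added files to root_dir,
    or None if none of them did.
    """
    for name, run in crawls:
        before = count_files(root_dir)
        try:
            run()
        except Exception as exc:  # keep going so the next crawler gets a chance
            print("{} failed: {}".format(name, exc))
            continue
        if count_files(root_dir) > before:
            return name
    return None


def build_default_crawls(root_dir, keyword, max_num=100):
    """Build the crawl list from icrawler's built-in crawlers (hypothetical helper)."""
    # Imported here so the helpers above can be used without icrawler installed
    from icrawler.builtin import (
        BaiduImageCrawler,
        BingImageCrawler,
        GoogleImageCrawler,
    )

    storage = {"root_dir": root_dir}
    return [
        ("google", lambda: GoogleImageCrawler(storage=storage).crawl(
            keyword=keyword, max_num=max_num)),
        ("bing", lambda: BingImageCrawler(storage=storage).crawl(
            keyword=keyword, max_num=max_num)),
        ("baidu", lambda: BaiduImageCrawler(storage=storage).crawl(
            keyword=keyword, max_num=max_num)),
    ]
```

For example, crawl_with_fallback(build_default_crawls('your_image_dir', 'cat'), 'your_image_dir') would return the name of the first crawler that saved anything, which gives a starting point for further tuning.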
