icrawler was handy for collecting images for machine learning

Posted at 2018-07-15

hellock/icrawler: A multi-thread crawler framework with many builtin image crawlers provided.

You could write your own crawler, but that is tedious, so I used a library instead.

Installation

pip install icrawler

Usage

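from icrawler.builtin import GoogleImageCrawler

# Downloaded images are saved under the "images" directory.
crawler = GoogleImageCrawler(storage={"root_dir": "images"})
# Search for "猫" (cat) and download up to 100 images.
crawler.crawl(keyword="猫", max_num=100)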

Running it starts the download, with log output like the following.

2018-07-15 13:20:58,410 - INFO - icrawler.crawler - start crawling...
2018-07-15 13:20:58,411 - INFO - icrawler.crawler - starting 1 feeder threads...
2018-07-15 13:20:58,412 - INFO - feeder - thread feeder-001 exit
2018-07-15 13:20:58,412 - INFO - icrawler.crawler - starting 1 parser threads...
2018-07-15 13:20:58,413 - INFO - icrawler.crawler - starting 1 downloader threads...
2018-07-15 13:20:59,566 - INFO - parser - parsing result page https://www.google.com/search?q=%E7%8C%AB&start=0&tbs=&tbm=isch&ijn=0
2018-07-15 13:21:00,030 - INFO - downloader - image #1  https://sociorocketnews.files.wordpress.com/2018/04/mokichi004.jpg?w=640&h=480
2018-07-15 13:21:00,492 - INFO - downloader - image #2  https://i.ytimg.com/vi/cwFzY49meP0/maxresdefault.jpg
...
...
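
The log above shows one feeder, one parser, and one downloader thread, which are the defaults. As a minimal sketch, assuming the `feeder_threads`, `parser_threads`, and `downloader_threads` constructor arguments behave as I remember from the icrawler documentation, the number of downloader threads can be raised like this. `THREAD_SETTINGS` and `make_threaded_crawler` are names made up for this example.

```python
# Thread counts to pass to the crawler; the value 4 is an arbitrary example.
THREAD_SETTINGS = {
    "feeder_threads": 1,
    "parser_threads": 1,
    "downloader_threads": 4,
}


def make_threaded_crawler(root_dir="images"):
    """Create a GoogleImageCrawler that uses THREAD_SETTINGS (hypothetical helper)."""
    # Imported here so the module loads even if icrawler is not installed yet.
    from icrawler.builtin import GoogleImageCrawler

    return GoogleImageCrawler(storage={"root_dir": root_dir}, **THREAD_SETTINGS)
```

Calling `make_threaded_crawler().crawl(keyword="猫", max_num=100)` should then start the same crawl with more downloader threads.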

The download source is not limited to Google; Bing and Baidu are also supported.
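
Switching search engines should only require a different crawler class. The sketch below assumes that `icrawler.builtin` exports `BingImageCrawler` and `BaiduImageCrawler` alongside `GoogleImageCrawler`. `CRAWLER_CLASSES`, `make_crawler`, and `crawl_all` are helper names invented for this example, not part of icrawler.

```python
import importlib

# Search engine name -> crawler class name in icrawler.builtin.
CRAWLER_CLASSES = {
    "google": "GoogleImageCrawler",
    "bing": "BingImageCrawler",
    "baidu": "BaiduImageCrawler",
}


def make_crawler(engine, root_dir):
    """Create the builtin crawler for `engine`, saving images into `root_dir`."""
    # An unknown engine name raises KeyError before icrawler is imported.
    class_name = CRAWLER_CLASSES[engine.lower()]
    # Imported lazily so the module loads even if icrawler is not installed yet.
    builtin = importlib.import_module("icrawler.builtin")
    crawler_cls = getattr(builtin, class_name)
    return crawler_cls(storage={"root_dir": root_dir})


def crawl_all(keyword, max_num=10):
    """Run the same search on every engine, one subdirectory per engine."""
    for engine in CRAWLER_CLASSES:
        crawler = make_crawler(engine, root_dir=f"images/{engine}")
        crawler.crawl(keyword=keyword, max_num=max_num)
```

For example, `crawl_all("猫", max_num=10)` would save results into `images/google`, `images/bing`, and `images/baidu`.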

You can also set filters.
For example, if you only want large images:

from icrawler.builtin import GoogleImageCrawler

crawler = GoogleImageCrawler(storage={"root_dir": "images"})
# Restrict the results to large images only.
crawler.crawl(keyword="猫", max_num=100, filters={
    "size": "large"
})
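
Several filters can be combined in one dict. The following is a rough sketch, assuming the Google crawler also accepts a `type` filter with the value `"photo"`, as I recall from the icrawler documentation; check the accepted names and values against the manual for your installed version. `build_filters` and `crawl_large_photos` are helpers made up for this example.

```python
def build_filters(**options):
    """Return a filters dict without the options that were left as None."""
    return {name: value for name, value in options.items() if value is not None}


def crawl_large_photos(keyword, max_num=100, color=None):
    """Download large photos for `keyword`; `color` is an optional extra filter."""
    # Imported here so the module loads even if icrawler is not installed yet.
    from icrawler.builtin import GoogleImageCrawler

    # "type" and "color" are assumed filter names; only "size" appears above.
    filters = build_filters(size="large", type="photo", color=color)
    crawler = GoogleImageCrawler(storage={"root_dir": "images"})
    crawler.crawl(keyword=keyword, max_num=max_num, filters=filters)
```

Dropping the `None` entries keeps the dict limited to the filters that were actually requested.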

Manual
