Collecting images for machine learning

Posted at 2018-11-21, last updated at 2018-11-21

hellock/icrawler: A multi-thread crawler framework with many builtin image crawlers provided.

I could have written my own crawler, but this time I used a library.

Installing pip

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py --user
export PATH="$HOME/Library/Python/2.7/bin:$PATH"
pip install icrawler --user

Verifying the install

$ pip -V
pip 18.1

$ pip install matplotlib

Settings in ~/.matplotlib/matplotlibrc

backend : TkAgg

plot.py

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 8]

plt.plot(x, y)
plt.show()

$ python plot.py

Usage

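# coding:utf-8

from icrawler.builtin import GoogleImageCrawler

# Save downloaded files under the "images" directory
crawler = GoogleImageCrawler(storage={"root_dir": "images"})
# Search for "猫" (cat) and download up to 100 images
crawler.crawl(keyword="猫", max_num=100)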

Running it starts the download, as shown below.

2018-11-21 13:43:26,450 - INFO - icrawler.crawler - start crawling...
2018-11-21 13:43:26,450 - INFO - icrawler.crawler - starting 1 feeder threads...
2018-11-21 13:43:26,451 - INFO - feeder - thread feeder-001 exit
2018-11-21 13:43:26,451 - INFO - icrawler.crawler - starting 1 parser threads...
2018-11-21 13:43:26,452 - INFO - icrawler.crawler - starting 1 downloader threads...
...
...
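
Once the crawl finishes, it can be handy to check how many files actually landed in the output directory, since the number downloaded may be lower than max_num. The following is a minimal sketch of such a check using only the standard library, assuming Python 3. The helper name count_images and the list of extensions are my own assumptions for illustration and are not part of icrawler.

```python
import os

# Extensions treated as images here; this set is an assumption, adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root_dir):
    """Return the number of image files found under root_dir (recursively)."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Compare the extension case-insensitively (".JPG" counts too)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                count += 1
    return count
```

For the example above, calling count_images("images") after the crawl gives the number of saved image files. A directory that does not exist simply yields 0.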

Manual

A slightly modified version

nogizaka.py

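# coding:utf-8

from icrawler.builtin import GoogleImageCrawler
import sys
import os

# argv[1]: output directory, argv[2]: search keyword
argv = sys.argv

# Create the output directory if it does not exist yet
if not os.path.isdir(argv[1]):
    os.makedirs(argv[1])

# Download up to 1000 images matching the keyword into the output directory
crawler = GoogleImageCrawler(storage={"root_dir": argv[1]})
crawler.crawl(keyword=argv[2], max_num=1000)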
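
The script reads sys.argv by position, so forgetting an argument ends in an IndexError. As a possible variation, here is a sketch that parses the same two arguments with argparse, assuming Python 3. The names parse_args and run and the --max-num option are hypothetical additions of mine. The icrawler calls are the same two used above.

```python
import argparse
import os


def parse_args(argv):
    """Parse [output_dir, keyword] plus an optional --max-num flag."""
    parser = argparse.ArgumentParser(description="Collect images for a keyword.")
    parser.add_argument("output_dir", help="directory passed as root_dir")
    parser.add_argument("keyword", help="search keyword")
    parser.add_argument("--max-num", type=int, default=1000,
                        help="upper bound on downloaded images")
    return parser.parse_args(argv)


def run(args):
    """Create the output directory and start the crawl."""
    # Imported lazily so that argument parsing works without icrawler installed.
    from icrawler.builtin import GoogleImageCrawler

    if not os.path.isdir(args.output_dir):
        os.makedirs(args.output_dir)
    crawler = GoogleImageCrawler(storage={"root_dir": args.output_dir})
    crawler.crawl(keyword=args.keyword, max_num=args.max_num)
```

To use it as a script, call run(parse_args(sys.argv[1:])) at the bottom of the file (with import sys). A missing argument then produces a usage message instead of a traceback.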

Collecting Nogizaka images

$ python nogizaka.py image/nogizaka 乃木坂46

I tried other keywords too

Next time

The collected images could probably be used to train a deep learning model.
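
As a first step in that direction, the collected files would likely need to be divided into training and validation sets. The following is only a rough sketch of one way to do that with the standard library, assuming Python 3. The function name split_train_val, the 80/20 default ratio, and the fixed seed are assumptions of mine, not something decided in this article.

```python
import random


def split_train_val(paths, val_ratio=0.2, seed=0):
    """Shuffle paths reproducibly and split them into (train, val) lists."""
    if not 0.0 <= val_ratio <= 1.0:
        raise ValueError("val_ratio must be between 0 and 1")
    # Sort first so the result does not depend on the input order
    shuffled = sorted(paths)
    # A dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]
```

Passing the list of file paths under image/nogizaka, for example, returns two lists that together contain every path exactly once.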