
[Python] Collect images easily with icrawler!

Posted at 2021-01-07

I used icrawler to collect images for machine learning, so here is a short introduction.

What is icrawler?

icrawler is a Python framework for crawling the web and collecting images.
A few lines of code are enough to collect images.

Installation

pip

$ pip install icrawler

anaconda

$ conda install -c hellock icrawler

Usage

from icrawler.builtin import BingImageCrawler

# Save downloaded images under ./images
crawler = BingImageCrawler(storage={"root_dir": "./images"})
# Search Bing for "cat" and download up to 100 images
crawler.crawl(keyword="cat", max_num=100)

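root_dir specifies the directory where the images are saved.
keyword specifies the search keyword for the images to collect.
max_num specifies the maximum number of images to collect.
You can also replace BingImageCrawler with another built-in image crawler; Google and Flickr are supported as well.
- Available crawlers → https://icrawler.readthedocs.io/en/latest/builtin.html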
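
For a machine learning dataset it is often convenient to store each keyword in its own sub-directory. The following is a minimal sketch of that idea, not part of icrawler itself: `build_jobs` and `crawl_all` are hypothetical helper names, and only the `BingImageCrawler(storage=...)` and `crawl(keyword=..., max_num=...)` calls shown above are assumed from the library.

```python
from pathlib import Path


def build_jobs(keywords, root="./images"):
    """Pair each keyword with its own sub-directory under root (hypothetical helper)."""
    return [(kw, str(Path(root) / kw)) for kw in keywords]


def crawl_all(keywords, max_num=100, root="./images"):
    """Run one Bing crawl per keyword, saving into a per-keyword directory."""
    # Imported lazily so build_jobs() can be used without icrawler installed.
    from icrawler.builtin import BingImageCrawler

    for keyword, out_dir in build_jobs(keywords, root):
        crawler = BingImageCrawler(storage={"root_dir": out_dir})
        crawler.crawl(keyword=keyword, max_num=max_num)
```

Calling `crawl_all(["cat", "dog"], max_num=50)` would then save up to 50 images per keyword under `images/cat` and `images/dog`.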

What to do when `json.decoder.JSONDecodeError` occurs with Google

First, locate google.py.

Example (Anaconda): C:\Users\hoge\anaconda3\envs\env1\Lib\site-packages\icrawler\builtin\google.py
If you installed icrawler with pip, you can look up where the package is installed and navigate from there.
- https://qiita.com/t-fuku/items/83c721ed7107ffe5d8ff
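
As an alternative that works the same way for pip and Anaconda, Python itself can report where a module lives. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `locate_module_file` is made up for illustration.

```python
import importlib.util


def locate_module_file(module_name):
    """Return the source file path of a module, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "icrawler") is not installed.
        return None
    return spec.origin if spec is not None else None


# Prints the path to google.py in the active environment,
# or None if icrawler is not installed there.
print(locate_module_file("icrawler.builtin.google"))
```

Run it with the same interpreter (or conda environment) you use for crawling, otherwise it may report a different installation.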

Change the parse method in google.py to the following.

The parse method is located around line 144.

    def parse(self, response):
        soup = BeautifulSoup(
            response.content.decode('utf-8', 'ignore'), 'lxml')
        # Original: image_divs = soup.find_all('script')
        image_divs = soup.find_all(name='script')
        for div in image_divs:
            # Original: txt = div.text
            txt = str(div)
            # Original: if not txt.startswith('AF_initDataCallback'):
            if 'AF_initDataCallback' not in txt:
                continue
            if 'ds:0' in txt or 'ds:1' not in txt:
                continue
            # Original JSON-based extraction, now disabled:
            # txt = re.sub(r"^AF_initDataCallback\({.*key: 'ds:(\d)'.+data:function\(\){return (.+)}}\);?$",
            #              "\\2", txt, 0, re.DOTALL)
            # meta = json.loads(txt)
            # data = meta[31][0][12][2]
            # uris = [img[1][3][0] for img in data if img[0] == 1]

            # Extract image URLs directly with a regular expression instead
            uris = re.findall(r'http.*?\.(?:jpg|png|bmp)', txt)
            return [{'file_url': uri} for uri in uris]
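
To see what the regular expression in the patched method picks up, it can be tried on its own. The sketch below reuses the same pattern on a made-up sample string (the `example.com` URLs and the `sample` text are assumptions, not real Google output). Removing duplicates is an extra step that the patch above does not perform.

```python
import re

# Same pattern as in the patched parse method.
IMAGE_URL_PATTERN = re.compile(r'http.*?\.(?:jpg|png|bmp)')


def extract_image_urls(txt):
    """Return image URLs found in txt, in order of appearance, without duplicates."""
    seen = set()
    urls = []
    for url in IMAGE_URL_PATTERN.findall(txt):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical script content for illustration only.
sample = (
    "AF_initDataCallback({key: 'ds:1', data:["
    '["https://example.com/a.jpg", 1], '
    '["https://example.com/b.png", 2], '
    '["https://example.com/a.jpg", 3]]});'
)
print(extract_image_urls(sample))
# → ['https://example.com/a.jpg', 'https://example.com/b.png']
```

Note that the pattern only recognizes the .jpg, .png, and .bmp extensions, so URLs ending in .jpeg or .gif, for example, are not captured as such.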

References

https://github.com/hellock/icrawler
https://github.com/hellock/icrawler/issues/65
