
Fetching images from Google Image Search with Python 3

Posted at 2018-02-15

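Machine learning inevitably needs a fairly large amount of data, so you often end up scraping images from the web.
I implemented this in Python, so I'm leaving it here as a note.
As usual, it doesn't really take advantage of Python, and it can only fetch about 100 images, so it isn't all that useful, but it may help when you just want some images quickly.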

Here is the entire source.

import os
from urllib import request as req
from urllib import error
from urllib import parse
import bs4

keyword = '斑れい岩'  # search term ("gabbro"); also used as the output folder name
if not os.path.exists(keyword):
    os.mkdir(keyword)

# Percent-encode the keyword and build the Google Image Search URL
urlKeyword = parse.quote(keyword)
url = 'https://www.google.com/search?hl=ja&q=' + urlKeyword + '&btnG=Google+Search&tbs=0&safe=off&tbm=isch'

# Send a browser-like User-Agent; without it Google rejected the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",}
request = req.Request(url=url, headers=headers)
page = req.urlopen(request)

html = page.read().decode('utf-8')
html = bs4.BeautifulSoup(html, "html.parser")
# Each .rg_meta element holds the metadata for one image as key:value pairs
elems = html.select('.rg_meta.notranslate')
counter = 0
for ele in elems:
    # Crude parse: strip the quotes, split on commas, then split each pair at its first colon
    ele = ele.contents[0].replace('"','').split(',')
    eledict = dict()
    for e in ele:
        num = e.find(':')
        eledict[e[0:num]] = e[num+1:]
    imageURL = eledict['ou']  # 'ou' is the image URL

    # Choose the file extension from the URL, defaulting to .jpg
    pal = '.jpg'
    for ext in ('.jpg', '.JPG', '.png', '.gif', '.jpeg'):
        if ext in imageURL:
            pal = ext.lower()
            break

    try:
        with req.urlopen(imageURL) as img:
            data = img.read()
        with open('./'+keyword+'/'+keyword+str(counter)+pal, 'wb') as localfile:
            localfile.write(data)
        counter += 1
    except (UnicodeEncodeError, error.HTTPError, error.URLError):
        # Skip any image that fails to download
        continue

Nothing the script does is particularly difficult.

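keyword = '斑れい岩'  # search term ("gabbro"); also used as the output folder name
if not os.path.exists(keyword):
    os.mkdir(keyword)

# Percent-encode the keyword and build the Google Image Search URL
urlKeyword = parse.quote(keyword)
url = 'https://www.google.com/search?hl=ja&q=' + urlKeyword + '&btnG=Google+Search&tbs=0&safe=off&tbm=isch'

# Send a browser-like User-Agent; without it Google rejected the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",}
request = req.Request(url=url, headers=headers)
page = req.urlopen(request)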

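First, this part sets the search term and creates a folder to store the images.
It then URL-encodes the search term and builds the Google Image Search URL.
After that it only needs to request the URL and read the result, but Google rejected the request unless I changed the headers.
From what I found, pretending to be Firefox is supposed to work, so that's what I did (I don't fully understand why).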
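
The URL-building and header step can also be sketched with `urllib.parse.urlencode`, which percent-encodes every query value in one go. This is a minimal sketch rather than the code from the listing: `build_search_request` is a hypothetical helper name, and the parameter set is trimmed to `q`, `tbm`, and `safe`. The header probably matters because urllib identifies itself as `Python-urllib/3.x` by default, which is presumably what Google rejects.

```python
from urllib import parse
from urllib import request as req


def build_search_request(keyword):
    """Build a Google Image Search request for the given keyword."""
    # urlencode percent-encodes each value, so non-ASCII keywords are safe.
    query = parse.urlencode({
        'q': keyword,
        'tbm': 'isch',   # image search, as in the article's URL
        'safe': 'off',
    })
    url = 'https://www.google.com/search?' + query
    # Browser-like User-Agent string copied from the article.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) '
                      'Gecko/20100101 Firefox/47.0',
    }
    return req.Request(url=url, headers=headers)


# Building the request does not touch the network; req.urlopen(request) would.
request = build_search_request('斑れい岩')
print(request.full_url)
print(request.get_header('User-agent'))
```

Keeping the request construction in one function makes it easy to inspect the final URL and headers before anything is actually sent.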


html = page.read().decode('utf-8')
html = bs4.BeautifulSoup(html, "html.parser")
# Each .rg_meta element holds the metadata for one image as key:value pairs
elems = html.select('.rg_meta.notranslate')
counter = 0
for ele in elems:
    # Crude parse: strip the quotes, split on commas, then split each pair at its first colon
    ele = ele.contents[0].replace('"','').split(',')
    eledict = dict()
    for e in ele:
        num = e.find(':')
        eledict[e[0:num]] = e[num+1:]
    imageURL = eledict['ou']  # 'ou' is the image URL

    # Choose the file extension from the URL, defaulting to .jpg
    pal = '.jpg'
    for ext in ('.jpg', '.JPG', '.png', '.gif', '.jpeg'):
        if ext in imageURL:
            pal = ext.lower()
            break

    try:
        with req.urlopen(imageURL) as img:
            data = img.read()
        with open('./'+keyword+'/'+keyword+str(counter)+pal, 'wb') as localfile:
            localfile.write(data)
        counter += 1
    except (UnicodeEncodeError, error.HTTPError, error.URLError):
        # Skip any image that fails to download
        continue

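The second half uses the BeautifulSoup package to pick through the HTML, extract only the URLs of the displayed images, and download and save each one.
If a runtime exception occurs, it is ignored completely; the script does nothing and moves on to the next image.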
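
The metadata parsing and extension choice could be tightened as follows. This is a sketch under one assumption: that the text of each `.rg_meta` element (`ele.contents[0]` in the loop) is a JSON object, which the quote, comma, and colon handling in the code suggests. If so, `json.loads` avoids truncating URLs that themselves contain commas. `extract_image_url` and `guess_extension` are hypothetical helper names, and the sample string is made up for illustration.

```python
import json
import os
from urllib import parse


def extract_image_url(meta_text):
    """Return the image URL stored under 'ou' in one metadata string."""
    meta = json.loads(meta_text)
    return meta['ou']


def guess_extension(image_url, default='.jpg'):
    """Pick a file extension from the URL path, falling back to default."""
    # Look only at the path, so query strings such as ?w=640 are ignored.
    path = parse.urlparse(image_url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in ('.jpg', '.jpeg', '.png', '.gif') else default


# Hypothetical metadata string; the URL deliberately contains a comma.
sample = '{"ou": "https://example.com/rocks/gabbro,thin-section.PNG?w=640", "ow": 640}'
url = extract_image_url(sample)
print(url)
print(guess_extension(url))
```

Unlike the substring checks in the listing, `guess_extension` only matches an extension at the end of the URL path, so a `.png` appearing elsewhere in the URL does not affect the result.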

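That's all there is to the source.
As I said at the start, it can currently fetch at most 100 images, and almost always a few are lost to runtime exceptions, so you end up with about 95.
There are plenty of issues, such as whether there's any point in doing this in Python and how redundant the code is, but for anyone who just wants to grab images from Google Image Search, this will probably do... I think? (I'm not confident.)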
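
To see which downloads are being lost to exceptions, the download loop could record failures instead of skipping them silently. The following is only a sketch: `download_all` and its parameters are hypothetical names, a timeout is added that the original code does not use, and the demonstration uses `file://` URLs so it runs without network access.

```python
import os
import pathlib
import tempfile
from urllib import request as req


def download_all(image_urls, out_dir, prefix='img', ext='.jpg', timeout=10):
    """Download each URL into out_dir; return (saved_paths, failures)."""
    os.makedirs(out_dir, exist_ok=True)
    saved, failures = [], []
    for image_url in image_urls:
        try:
            with req.urlopen(image_url, timeout=timeout) as img:
                data = img.read()
        except (OSError, ValueError) as exc:
            # URLError and HTTPError are OSError subclasses, and
            # UnicodeEncodeError is a ValueError subclass, so this covers
            # the exceptions the original code skipped.
            failures.append((image_url, repr(exc)))
            continue
        path = os.path.join(out_dir, prefix + str(len(saved)) + ext)
        with open(path, 'wb') as localfile:
            localfile.write(data)
        saved.append(path)
    return saved, failures


# Offline demonstration: one readable file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / 'source.bin'
    source.write_bytes(b'fake image bytes')
    urls = [source.as_uri(), (pathlib.Path(tmp) / 'missing.bin').as_uri()]
    saved, failures = download_all(urls, os.path.join(tmp, 'out'))
    print(len(saved), 'saved;', len(failures), 'failed')
```

Printing or logging the `failures` list afterwards shows which URLs failed and why, which should make it easier to judge whether the missing images are worth retrying.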

If you can see a less redundant way to write something, spot a mistake, or know a simpler approach altogether, please let me know in the comments.
