【Python】Bulk-downloading original images from Google Image Search

Last updated at 2020-11-25 / Posted at 2020-11-22

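Introduction

I previously wrote about how to download thumbnail images from Google Image Search.
I have since found that once an image is opened in the detail view (the state after clicking a search result once), the link to the original image appears in the page source.
Let's use this behavior to download the original images. (Google is robust, but take care not to put too much load on it.)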
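
As a rough illustration of that idea, the sketch below pulls original-image URLs out of a page-source string using the same imgurl=...&amp; pattern that the script later in this article relies on. The sample markup and the extract_image_urls name are made up for illustration; the real page source is much larger and its format may change at any time.

```python
import re
from typing import List
from urllib.parse import unquote


def extract_image_urls(page_source: str) -> List[str]:
    """Return every value that follows 'imgurl=' up to the next '&amp;'."""
    # The links are percent-encoded in the HTML, so decode them first.
    decoded = unquote(page_source)
    return re.findall(r"imgurl=(.+?)&amp;", decoded)


# Hypothetical fragment shaped like the links seen in the detail view.
sample = (
    '<a href="/imgres?imgurl=https%3A%2F%2Fexample.com%2Fcat.jpg'
    '&amp;imgrefurl=https%3A%2F%2Fexample.com%2F&amp;h=1"></a>'
    '<a href="/imgres?imgurl=https%3A%2F%2Fexample.org%2Fdog.png'
    '&amp;imgrefurl=https%3A%2F%2Fexample.org%2F&amp;h=2"></a>'
)
urls = extract_image_urls(sample)
print(urls)
```

The non-greedy (.+?) stops at the first &amp; after each imgurl=, so each match is a single URL rather than the rest of the line.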

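Program flow

Run a Google image search with Selenium
　　↓
Open the first image in the detail view, then press the right arrow key repeatedly
　　↓
Collect the links to the original images
　　↓
Download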
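
The first step of this flow needs a search URL. The script below builds it by joining the words of the query with '+', which leaves characters such as '+' or '&' inside the query unescaped. A possible alternative, shown here only as a sketch, is to let urllib.parse.urlencode do the escaping. The build_search_url name is hypothetical; the parameter names and values are the ones used in the script.

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Build a Google image search URL with the query safely escaped."""
    params = {
        "q": " ".join(query.split()),  # collapse runs of whitespace
        "safe": "off",
        "hl": "ja",
        "source": "lnms",
        "tbm": "isch",
        "sa": "X",
    }
    # urlencode escapes each value with quote_plus, so spaces become '+'.
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("cute   cat")
print(url)
```

For plain space-separated words this yields the same URL as the string concatenation in the script, while a query such as C++ is escaped instead of being read as "C" followed by two spaces.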

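Implementation

Install Selenium, requests, and the other packages the script imports (such as tqdm) if you have not already.
ChromeDriver is assumed to be in the directory the script is run from (DRIVER_PATH, right below the import statements), so change that path as needed.
The code has not been refactored, so please excuse the mess.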

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time
import requests
import re
import urllib.parse
import urllib.request
import os
from tqdm import tqdm
DRIVER_PATH = 'chromedriver.exe'

options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--proxy-server="direct://"')
options.add_argument('--proxy-bypass-list=*')
options.add_argument('--start-maximized')
# ↓ Better to keep the browser visible so you can step in manually if scrolling etc. fails
# options.add_argument('--headless')


def search():
    global driver, actions
    # `chrome_options` is a deprecated alias; `options` is the current keyword
    driver = webdriver.Chrome(executable_path=DRIVER_PATH,
                              options=options)
    actions = ActionChains(driver)
    url = "https://www.google.com/search?q=" + '+'.join(
        query.split()) + "&safe=off&hl=ja&source=lnms&tbm=isch&sa=X"
    driver.get(url)

    while not driver.find_elements_by_class_name("wXeWr.islib.nfEiy.mM5pbd"):
        time.sleep(.5)
    driver.find_element_by_class_name("wXeWr.islib.nfEiy.mM5pbd").click()


def getLinks():
    global srcs
    more = driver.find_element_by_class_name("mye4qd")
    end = driver.find_element_by_class_name("OuJzKb.Yu2Dnd")
    # Queue 100 right-arrow presses; they are sent on each actions.perform()
    for _ in range(100):
        actions.send_keys(Keys.ARROW_RIGHT)
    cnt = 1
    while not (more.is_displayed() or end.is_displayed()):
        if cnt % 5 == 0:
            if len(
                    re.findall("imgurl=(.+?)&amp;",
                               urllib.parse.unquote(
                                   driver.page_source))) > max_num + 5:
                break
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")
        actions.perform()
        time.sleep(1)
        cnt += 1

    if more.is_displayed(): more.click()
    while not end.is_displayed():
        if cnt % 5 == 0:
            if len(
                    re.findall("imgurl=(.+?)&amp;",
                               urllib.parse.unquote(
                                   driver.page_source))) > max_num + 5:
                break
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")
        actions.perform()
        time.sleep(1)
        cnt += 1
    for _ in range(5):
        actions.perform()
        time.sleep(1)
    srcs = re.findall("imgurl=(.+?)&amp;",
                      urllib.parse.unquote(driver.page_source))
    driver.quit()  # also shuts down the chromedriver process


def download():
    filename = '_'.join(query.split())
    while True:
        if not os.path.exists(filename):
            os.mkdir(filename)
            break
        else:
            filename += "_"

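    for i, src in enumerate(tqdm(srcs[:max_num])):
        ext = src[-4:] if src[-4:] in ('.jpg', '.png', '.gif') else '.png'
        # Fetch first so that a failed download does not leave an empty file
        try:
            data = requests.get(src).content
        except Exception:
            try:
                with urllib.request.urlopen(src) as u:
                    data = u.read()
            except Exception:
                continue  # skip images that cannot be fetched either way
        # os.path.join keeps the path portable across operating systems
        with open(os.path.join(filename, f"{filename}{i}{ext}"), "wb") as f:
            f.write(data)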


if __name__ == "__main__":
    query = input("Search:  ")
    max_num = int(input("How many images to download? (max)  "))
    print("Searching...")
    search()
    print("Done.")
    print("Getting links...")
    getLinks()
    print("Done.")
    print("Now downloading...")
    download()
    print("Done.")
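
In download() above, the file extension is taken from the last four characters of the URL, so a URL ending in .jpeg or one that carries a query string falls back to .png. One possible refinement, offered as a sketch and not as part of the original script, is to look at the path component of the URL instead. The guess_extension helper and the KNOWN_EXTS set are hypothetical names introduced here.

```python
import os
from urllib.parse import urlparse

# Extensions treated as valid image types (an assumption for this sketch).
KNOWN_EXTS = {".jpg", ".jpeg", ".png", ".gif"}


def guess_extension(url: str, default: str = ".png") -> str:
    """Pick a file extension from the URL path, ignoring any query string."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in KNOWN_EXTS else default


for sample_url in (
    "https://example.com/a/cat.JPG?w=100",
    "https://example.com/photo.jpeg",
    "https://example.com/image",
):
    print(sample_url, guess_extension(sample_url))
```

Like the original one-liner, this only inspects the URL; it does not check what the server actually returns.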

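It takes a little while to run.
Naturally, you cannot download more images than the search actually displays, so the maximum is probably around 500 (I have not tried many search terms, so this is only a guess).

In closing

Please use this in moderation.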
