Easy! Scraping Google Image Search in 30 Lines

Posted at 2019-07-28

What we'll do

In a previous article I downloaded image data with the Flickr API;
this time I'll download image data using Google Image Search.
Selenium WebDriver launches Chrome and automates running a search on the Google Image Search page.

Overview of steps

1. Import the libraries
2. Run a Google image search
3. Scrape the search results
4. Download the image data

Environment

macOS Catalina 10.15 beta
Google Chrome 75.0.3770.142
Python 3.6.8
beautifulsoup4 4.8.0
selenium 3.141.0
python-chromedriver-binary 2.38.0
lxml 4.3.4
requests 2.21.0

1. Import the libraries

Import the following libraries.

Libraries

import requests
import os, time, sys
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

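2. Run a Google image search

Launch Chrome and open the Google Image Search page.
Then type the keyword into the search form
and press the Enter key (the keyword is given when the program is run).
Pass "q" to find_element_by_name (q is the value of the search form's name attribute, which can be checked by inspecting the page).

2. Run a Google image search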

# launch chrome browser
driver = webdriver.Chrome()
# google image search
driver.get('https://www.google.co.jp/imghp?hl=ja&tab=wi&ogbl')
# execute search
keyword = sys.argv[1]
driver.find_element_by_name('q').send_keys(keyword, Keys.ENTER)

3. Scrape the search results

Request the URL of the search results page to get its HTML.
Parse it with BeautifulSoup, using lxml as the HTML parser.
Collect 10 img tags.

3. Scrape the search results

# fetch the HTML of the search results page
current_url = driver.current_url
html = requests.get(current_url)
# parse with lxml and collect the first 10 img tags
bs = BeautifulSoup(html.text, 'lxml')
images = bs.find_all('img', limit=10)

The img tags are stored in images as follows:

[<img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ1" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ2" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ3" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ4" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ5" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ6" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ7" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ8" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ9" width="XXX"/>,
 <img alt="「XXX」の画像検索結果" height="XXX" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ10" width="XXX"/>]
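
Not every img tag on a page necessarily carries a downloadable URL: a tag may have no src attribute, or its src may be an inline data: URI. As a minimal sketch, the src values could be filtered before downloading. The helper name downloadable_srcs is hypothetical and not part of the original script, and plain dicts stand in for the img tags here, since both support .get('src').

```python
def downloadable_srcs(tags):
    """Return the src values that look like http(s) URLs.

    `tags` is any iterable of objects with a dict-style .get() method,
    such as the Tag objects returned by BeautifulSoup's find_all().
    """
    urls = []
    for tag in tags:
        src = tag.get('src')
        # skip tags without a src and inline data: URIs
        if src and src.startswith(('http://', 'https://')):
            urls.append(src)
    return urls


# plain dicts stand in for img tags in this example
sample = [
    {'src': 'https://encrypted-tbn0.gstatic.com/images?q=tbn:XXXYYYZZZ1'},
    {'src': 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP'},
    {'alt': 'no src attribute'},
]
print(downloadable_srcs(sample))
```

With a filter like this, the download loop would iterate over URLs that requests.get can actually fetch, rather than failing on a missing or non-HTTP src.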

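4. Download the image data

First, create the directory to save the images in.
The directory is named after the search keyword.
To fetch the image data, get the src attribute of each img tag (see the contents of images in step 3).
Fetch the image data with requests.get.
Use with open to open a file for writing in the keyword directory, named with a sequence number (starting at 1).
Write the response content (the image data) with f.write.
After each file is downloaded, wait 1 second with time.sleep before the next request to keep the load on the server low.
When finished, close the browser.

4. Download the image data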

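# create the save directory, named after the search keyword
os.makedirs(keyword, exist_ok=True)
WAIT_TIME = 1  # seconds to wait between requests

for i, img in enumerate(images, 1):
    # download the image that the src attribute points to
    src = img.get('src')
    response = requests.get(src)
    # save it as <keyword>/<sequence number>.jpg
    with open(os.path.join(keyword, '{}.jpg'.format(i)), 'wb') as f:
        f.write(response.content)
    # pause so as not to overload the server
    time.sleep(WAIT_TIME)

# close the browser
driver.quit()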
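
When saving files, note that os.makedirs without exist_ok=True raises FileExistsError if the directory already exists (for example, when the same keyword is searched twice), and that os.path.join builds the path portably. The following is a sketch of a helper that wraps the save step. The function name save_image is an assumption for illustration, and dummy bytes in a temporary directory stand in for a real download.

```python
import os
import tempfile


def save_image(content, directory, index):
    """Write image bytes to <directory>/<index>.jpg and return the path."""
    # exist_ok=True makes repeated runs with the same keyword safe
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, '{}.jpg'.format(index))
    with open(path, 'wb') as f:
        f.write(content)
    return path


# demonstration with dummy bytes in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'XXXX')
    first = save_image(b'dummy image bytes', target, 1)
    # the directory already exists on the second call, which is fine
    second = save_image(b'more dummy bytes', target, 2)
    print(os.path.basename(first), os.path.basename(second))
```

In the download loop, a call such as save_image(response.content, keyword, i) could replace the with open block.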

Running the code

Specify the search keyword at run time (the XXXX part).

Running the code

$ python py_scrayping.py XXXX
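
Because the script reads the keyword from sys.argv[1], running it with no argument stops with an IndexError. A minimal sketch of a friendlier check follows. The function name read_keyword and the wording of the usage message are illustrative assumptions, not part of the original script.

```python
import sys


def read_keyword(argv):
    """Return the search keyword from an argv-style list.

    Exits with a usage message when no keyword was given.
    """
    prog = argv[0] if argv else 'script.py'
    if len(argv) < 2 or not argv[1].strip():
        sys.exit('usage: python {} KEYWORD'.format(prog))
    return argv[1]


# example with an explicit list instead of the real sys.argv
print(read_keyword(['py_scrayping.py', 'XXXX']))
```

In the script itself, keyword = read_keyword(sys.argv) would take the place of keyword = sys.argv[1].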

Full source code

https://github.com/hiraku00/py_scrayping
