
Saving 100 Google search results to a CSV file

Last updated at 2020-05-22, posted at 2020-05-18

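Goal

Save 100 Google search results to a CSV file.

Approach

Selenium drives Chrome to submit the search, and the resulting page is then scraped with requests and BeautifulSoup.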

Environment

Windows 10
Jupyter Notebook

Where to download chromedriver

ChromeDriver - WebDriver for Chrome
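
For readers on Selenium 4, the downloaded chromedriver is normally handed to Chrome through a Service object, and elements are located with find_element(By.NAME, ...). The following is a minimal sketch under that assumption, not the code used in this article. The function names search_page_url and fetch_result_url, the timeout value, and the wait condition are all hypothetical choices made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GOOGLE_SEARCH = 'https://www.google.co.jp/search'


def search_page_url(num=100):
    """Return the search page URL that asks for `num` results per page."""
    return GOOGLE_SEARCH + '?' + urlencode({'num': num})


def fetch_result_url(driver_path, keywords, num=100, timeout=10):
    """Submit `keywords` in Chrome and return the URL of the result page.

    Hypothetical helper: assumes Selenium 4 and a chromedriver binary
    at `driver_path` that matches the installed Chrome.
    """
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(service=Service(driver_path))
    try:
        driver.get(search_page_url(num))
        # Wait for the search box instead of sleeping for a fixed time.
        box = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.NAME, 'q')))
        box.send_keys(keywords)
        box.submit()
        # Wait until the current URL carries a `q` query parameter.
        WebDriverWait(driver, timeout).until(
            lambda d: 'q' in parse_qs(urlsplit(d.current_url).query))
        return driver.current_url
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

Calling fetch_result_url('C:/Users/user/jupyternb/chromedriver_win32/chromedriver.exe', 'python スクレイピング') would open Chrome, run the search, and return the result URL. The explicit waits are a design choice that replaces fixed sleeps with waits on a concrete condition.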

Implementation

# For Selenium
from selenium import webdriver
import time
# For scraping and writing the CSV file
from bs4 import BeautifulSoup
import requests
import lxml  # parser backend used by BeautifulSoup(..., 'lxml')
import csv

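# Launch Chromedriver (download it and point to its local path)
# Note: newer Selenium 4 releases expect the path to be passed through a Service object instead
driver = webdriver.Chrome('C:/Users/user/jupyternb/chromedriver_win32/chromedriver.exe')

# Open Google search with 100 results per page (?num=100)
driver.get('https://www.google.co.jp/search?num=100')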

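# Wait
time.sleep(3)

# Specify the keywords
keywords = 'python スクレイピング'

# Run the search
# find_element('name', ...) is equivalent to By.NAME; the find_element_by_* helpers
# were removed in later Selenium 4 releases
search_box = driver.find_element('name', 'q')
search_box.send_keys(keywords)
search_box.submit()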

# Wait
time.sleep(6)

# Get the URL of the search results page
url = driver.current_url

# Quit Chromedriver
driver.quit()

print(url)

# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}

# Request the search results URL
get_url = requests.get(url, headers=headers)
get_url.raise_for_status()

# Parse the HTML source of the fetched page
soup = BeautifulSoup(get_url.text, 'lxml')

# Get the title and link of each search result
link_title = soup.select('.r > a')

# Get the description of each search result
link_disc = soup.select('.s > div > .st')

# Count the results (use the shorter list so that both can be indexed safely)
leng = min(len(link_disc), len(link_title))
    
# Open the CSV file for writing, then format and write each row
# (the output/ directory must already exist)
with open('output/' + '[' + keywords + ']_g_output_2.csv', 'w', newline='', encoding='utf-8') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(['Title', 'Description', 'URL'])
    for num in range(leng):
        # Take only the link and strip the extra prefix
        url_text = link_title[num].get('href').replace('/url?q=', '')

        # Take only the text of the title
        title_text = link_title[num].get_text()

        # Take only the text of the description and strip line breaks
        disc_text = link_disc[num].get_text().replace('\n', '').replace('\r', '')
        csvwriter.writerow([title_text, disc_text, url_text])
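
The link cleanup above removes only the '/url?q=' prefix. If an href also carries extra query parameters after the target, or the target is percent-encoded (both are assumptions about Google's markup, which changes over time), parsing the query string is a more robust option. A minimal sketch using the standard library follows; clean_result_href is a hypothetical helper name, not part of the original code.

```python
from urllib.parse import urlsplit, parse_qs


def clean_result_href(href):
    """Return the destination URL from a search result href.

    Hrefs of the form '/url?q=<target>&...' are unwrapped to <target>
    (percent-decoded); any other href is returned unchanged.
    """
    parts = urlsplit(href)
    if parts.path == '/url':
        target = parse_qs(parts.query).get('q')
        if target:
            return target[0]
    return href
```

In the loop, url_text = clean_result_href(link_title[num].get('href')) could take the place of the replace() call.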

Result (first 5 rows)
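
One way to check the first rows of the written file from the notebook is to read it back with the csv module. This is a small sketch. head_rows is a hypothetical helper, and the path passed to it should be whatever file name the script produced.

```python
import csv
from itertools import islice


def head_rows(csv_path, n=5):
    """Return the header row and the first `n` data rows of a CSV file."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None if the file is empty
        return header, list(islice(reader, n))
```

For example, header, rows = head_rows('output/[python スクレイピング]_g_output_2.csv') followed by a loop that prints each row shows the title, description, and URL of the first five results.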

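Reference URLs

Extracting only the title, URL, and description from Google search results / Web scraping with Python
[Python] Scraping search results with Selenium and writing them to CSV
A tour of automating Chrome with Python + Selenium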
