
[Python] Saving Google search result titles and links to a CSV file (with page navigation)

Posted at 2021-08-07

Introduction

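I suspect that Google's search results change from moment to moment.
Whether or not I actually test that hypothesis,
I want to be able to save the titles and links from Google search results to a CSV file.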

Specifications

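・WebDriver: chromedriver
・Browser: Chrome, run in headless mode
・Search engine: Google
・The number of result pages to collect is set in the code (10 for now)
・Pages are navigated by clicking the "次へ" ("Next") link text at the bottom of the results page
・If there are fewer result pages than specified, processing stops at the last page and the titles and links are still saved to CSV (the script does not exit with an error)
・The search term is set in the code: "トノサマバッタのローキック"
・Wait 1 second when moving between pages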
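
The paging rules in the list above can be sketched without a browser. The snippet below is only an illustration under assumptions, not the script from this post: `collect_results` and `fake_pages` are hypothetical names, and each page is modeled as a dict holding its results and a flag for whether a "Next" link exists.

```python
from typing import Any, Callable, Dict, List, Tuple


def collect_results(
    get_page: Callable[[int], Dict[str, Any]], max_pages: int
) -> List[Tuple[str, str]]:
    """Gather (title, link) pairs from up to max_pages pages.

    Stops early, without raising, when a page has no "Next" link.
    """
    results: List[Tuple[str, str]] = []
    for index in range(max_pages):
        page = get_page(index)
        results.extend(page["results"])
        if not page["has_next"]:
            break  # last page reached before max_pages
    return results


# Hypothetical data: three pages, and the third has no "Next" link.
fake_pages = [
    {"results": [("Title A", "https://example.com/a")], "has_next": True},
    {"results": [("Title B", "https://example.com/b")], "has_next": True},
    {"results": [("Title C", "https://example.com/c")], "has_next": False},
]

collected = collect_results(lambda i: fake_pages[i], max_pages=10)
print(len(collected))  # 3
```

Even though `max_pages` is 10, the loop ends after the third page and everything gathered up to that point is returned, which is the behavior the specification asks for.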

Code

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
import pandas as pd
from time import sleep

MAX_PAGES = 10 # Number of Google result pages to collect links from
SLEEP_TIME = 1 # Wait time [seconds]
SEARCH_WORD = "トノサマバッタのローキック" # Search term
SAVE_FILE_NAME = "result.csv"

def get_titles_links(url):
    # Collect the title and link of every search result on the page at `url`.
    # Note: the page is downloaded again with requests, not read from the browser.
    html = requests.get(url)
    soup = BeautifulSoup(html.content, "html.parser")
    links = []
    titles = []
    for element in soup.find_all("a"):
        # Treat an <a> element as a search result only if it contains an <h3> title
        heading = element.find("h3")
        if heading is None:
            continue
        titles.append(heading.text)
        # Resolve relative hrefs against the page URL
        links.append(urljoin(url, element.get("href")))
    return titles, links

# Chrome driver options
options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--proxy-server="direct://"')
options.add_argument('--proxy-bypass-list=*')
options.add_argument('--start-maximized')
options.add_argument('--headless')

# Start the Chrome driver
driver = webdriver.Chrome(options=options)

# Open Google
url = 'https://google.com/'
driver.get(url)
sleep(SLEEP_TIME)

# Type the keyword into the Google search box
# (this selector depends on Google's markup at the time of writing)
selector = "body > div.L3eUgb > div.o3j99.ikrT4e.om7nvf > form > div:nth-child(1) > div.A8SBwf > div.RNNXgb > div > div.a4bIc > input"
element = driver.find_element(By.CSS_SELECTOR, selector)
element.send_keys(SEARCH_WORD)

# Press Enter to run the search
element.send_keys(Keys.ENTER)
sleep(SLEEP_TIME)

all_page_titles = [] # Page titles from all result pages
all_page_links = [] # Page URLs from all result pages

# Move through the specified number of pages, storing titles and URLs in the lists
for page in range(MAX_PAGES):
    # Get the titles and links on the current result page
    titles, links = get_titles_links(driver.current_url)

    # Add them to the lists covering all pages
    all_page_titles.extend(titles)
    all_page_links.extend(links)

    # Click the "次へ" ("Next") link at the bottom of the results to move on
    try:
        next_button = driver.find_element(By.LINK_TEXT, '次へ')
        next_button.click()
        sleep(SLEEP_TIME)
    except Exception as e:
        # No "Next" link (or the click failed): stop paging and keep what we have
        print(e)
        print("Error page: " + str(page+1))
        break

# Close the browser whether or not every page was reached
driver.quit()

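# Put the lists into a dict before building the DataFrame
result ={"title":all_page_titles,
         "link":all_page_links}

# Build the DataFrame and save it as a CSV file
# (cp932 with errors="ignore" drops characters that cp932 cannot represent)
df = pd.DataFrame(result)
with open(SAVE_FILE_NAME,
          mode="w",
          newline="",
          encoding="cp932",
          errors="ignore") as f:
    df.to_csv(f, index=False)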
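
The script writes the CSV as cp932 with `errors="ignore"`, so any character that cp932 cannot represent is silently dropped from the output. The sketch below illustrates that effect using the standard `csv` module instead of pandas; the helper `rows_to_csv_bytes` and the sample row are made up for illustration, and UTF-8 with a BOM (`utf-8-sig`) is shown only as one possible alternative that keeps every character.

```python
import csv
import io


def rows_to_csv_bytes(rows, encoding, errors="strict"):
    """Serialize rows to CSV text, then encode that text as bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["title", "link"])
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding, errors=errors)


# Hypothetical row: the emoji has no cp932 representation.
rows = [("バッタ 🦗 kick", "https://example.com/")]

cp932_bytes = rows_to_csv_bytes(rows, "cp932", errors="ignore")
utf8_bytes = rows_to_csv_bytes(rows, "utf-8-sig")

print(cp932_bytes.decode("cp932"))      # emoji is missing
print(utf8_bytes.decode("utf-8-sig"))   # emoji is preserved
```

If losing such characters matters, changing the `encoding` argument is probably the smallest change to try.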
