Scraping Weblio with Python, Part 3 (Output to Google Spreadsheet)

Posted at 2025-04-24

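Disclaimer
I wrote this article about four years ago.
It was originally published under a different account; I am reposting it here because I am moving accounts.
I wrote it as an undergraduate, so the code is rough in places. Please bear with that.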

Final part
Make the script output the scraping results to a Google Spreadsheet.

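Disclaimer

By the time I started writing this article, my Google Cloud Platform free trial had ended, so I can no longer give a detailed explanation. I am only posting the video I referred to and the code.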

scraping.py

import json
import pprint
import re
import time

import gspread
import requests
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials

N = 250  # number of bookmarks to read

# Load the credentials from the service account JSON key file
SCOPES = ["https://spreadsheets.google.com/feeds",
          'https://www.googleapis.com/auth/spreadsheets',
          "https://www.googleapis.com/auth/drive.file",
          "https://www.googleapis.com/auth/drive"]
SERVICE_ACCOUNT_FILE = 'path to the JSON key file'
# Passed to .worksheet() below, so this is the title of the worksheet (tab)
SHEET = 'name of the worksheet that receives the scraping results'

credentials = ServiceAccountCredentials.from_json_keyfile_name(SERVICE_ACCOUNT_FILE, SCOPES)
gs = gspread.authorize(credentials)

SPREADSHEET_KEY = 'spreadsheet key (the ID found in the sheet URL)'
worksheet = gs.open_by_key(SPREADSHEET_KEY).worksheet(SHEET)


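bookmark_path = '\\Users\\your PC user name\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\Bookmarks'

with open(bookmark_path, encoding='utf-8-sig') as f:
    bookmark_data = json.load(f)

# First N entries of the first folder on the bookmark bar
bookmarks = bookmark_data['roots']['bookmark_bar']['children'][0]['children'][0:N]

def get_weblio_url(bookmark):
    # Match on the bookmark title; note the capital W in 'Weblio'
    if 'Weblio' in bookmark['name']:
        return bookmark.get('url')  # folders have no 'url', so this gives None

# Build a real list so that pprint shows the URLs instead of a filter object
urls = [url for url in map(get_weblio_url, bookmarks) if url is not None]
pprint.pprint(urls)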

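err_urls = []  # URLs that could not be fetched or parsed

for url in urls:
    try:
        r = requests.get(url, timeout=30)
    except requests.RequestException:
        err_urls.append(url)
        print('Invalid URL: ' + url)
        continue  # nothing to parse, so move on to the next URL

    bsObj = BeautifulSoup(r.content, 'lxml')
    origin = bsObj.find("span", {"class": "content-explanation ej"})
    if origin is None:
        err_urls.append(url)
        print('Failed: ' + url)
        print('Invalid HTML')
        continue

    # Keep at most the first three meanings, which are separated by '、'
    meanings = ','.join(origin.get_text(strip=True).split('、')[0:3])

    # The last URL segment is the headword; replace non-alphanumerics with spaces
    eng = url.split('/')[-1]
    eng = re.sub(r"[^a-zA-Z0-9]", " ", eng)

    pprint.pprint(eng + ':' + meanings)
    new_row = [0, eng, meanings]
    worksheet.append_row(new_row, table_range='A1')
    time.sleep(10)  # pause before handling the next URL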
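
Since I cannot walk through the Google Cloud side anymore, here is a small sketch of the parts of the script that do not need credentials or network access, so they can be tried on their own. It is only an illustration under a few assumptions: the helper names (`collect_weblio_urls`, `headword_from_url`, `first_meanings`) and the `sample` dictionary are made up for this example, and the sample only imitates the `type` / `name` / `url` / `children` layout of Chrome's Bookmarks file. Unlike the script above, which reads only the first folder on the bookmark bar, this version walks every nested folder.

```python
import re


def collect_weblio_urls(node, limit=None):
    """Walk a bookmark node recursively and return the Weblio URLs found."""
    found = []

    def walk(item):
        if limit is not None and len(found) >= limit:
            return
        if item.get("type") == "url":
            # Same test as the script: the bookmark title contains 'Weblio'.
            if "Weblio" in item.get("name", ""):
                found.append(item["url"])
        for child in item.get("children", []):
            walk(child)

    walk(node)
    return found


def headword_from_url(url):
    """Turn the last URL segment into a headword, as the script does."""
    return re.sub(r"[^a-zA-Z0-9]", " ", url.split("/")[-1])


def first_meanings(text, count=3):
    """Keep the first `count` meanings of a '、'-separated explanation."""
    return ",".join(text.strip().split("、")[:count])


# Hypothetical data shaped like a bookmark folder (not taken from a real file).
sample = {
    "type": "folder",
    "name": "bookmark_bar",
    "children": [
        {"type": "url", "name": "Example", "url": "https://example.com/"},
        {
            "type": "folder",
            "name": "words",
            "children": [
                {"type": "url", "name": "take off - Weblio",
                 "url": "https://ejje.weblio.jp/content/take+off"},
                {"type": "url", "name": "camera - Weblio",
                 "url": "https://ejje.weblio.jp/content/camera"},
            ],
        },
    ],
}

urls = collect_weblio_urls(sample)
# Placeholder explanation text; the real text comes from the scraped page.
rows = [[0, headword_from_url(u), first_meanings("a、b、c、d")] for u in urls]
```

Each entry in `rows` has the same `[0, eng, meanings]` shape that the script passes to `worksheet.append_row`. If the number of API calls becomes a problem, gspread also has `Worksheet.append_rows`, which takes a list of rows like this in one call; I have not tried it with this script, so treat that as a suggestion rather than a tested change.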
