BeautifulSoup and Selenium Notes

Posted at 2021-01-23

## Introduction

I am posting these basic scraping operations as a personal reference.

Scraping a target page by class

qiita.py


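from bs4 import BeautifulSoup
from selenium import webdriver

# Run Chrome headless; the other two flags help inside containers and CI
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# chromedriver must be discoverable (on PATH, or resolved automatically by Selenium 4.6+)
browser = webdriver.Chrome(options=options)

# URL of the target web page
url = "https://qiita.com/"
try:
    # Load the page and keep the rendered HTML
    browser.get(url)
    html = browser.page_source
finally:
    # Always close the browser, even if loading fails
    browser.quit()

soup = BeautifulSoup(html, 'html.parser')
# Collect every element that has the target class
elements = soup.find_all(class_="css-1laxd2k")

print(elements)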

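In short, put the target URL and the class or id you want into the code above, and it returns the matching elements.
That said, when the page does not need a browser to render it, fetching it with requests.get() is the easier route.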
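
As a rough illustration of the requests.get() route mentioned above, here is a minimal sketch. The helper names `fetch_html` and `find_by_class` are hypothetical, chosen only for this example, and the sketch assumes the target elements are present in the HTML the server returns rather than added later by JavaScript.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise on 4xx/5xx so an error page is not parsed by mistake
    response.raise_for_status()
    return response.text


def find_by_class(html, class_name):
    """Return every element in html that carries the given class (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(class_=class_name)
```

It could be used as `find_by_class(fetch_html("https://qiita.com/"), "css-1laxd2k")`. If the result is empty, the elements are probably rendered in the browser, and the Selenium version is the one to use.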

## What is BeautifulSoup?

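BeautifulSoup is used to parse and tidy up HTML after it has been fetched with requests or a similar tool.
In other words, BeautifulSoup cannot scrape a site on its own, because it does not download pages.
Once the HTML is parsed, it lets you search, extract, and reshape the scraped data in many ways.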
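
To give a feel for what can be done with the data once it is parsed, here is a small sketch. The HTML string is a made-up sample standing in for a page fetched elsewhere, so the ids, classes, and links in it are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML, standing in for a page fetched elsewhere
html = """
<ul id="articles">
  <li class="article"><a href="/items/1">First post</a></li>
  <li class="article"><a href="/items/2">Second post</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up a single element by its id
article_list = soup.find(id="articles")

# CSS selectors work through select(); here, every link inside an article item
links = article_list.select("li.article a")

# Pull the visible text and the href attribute out of each link
titles = [a.get_text(strip=True) for a in links]
hrefs = [a["href"] for a in links]

print(titles)
print(hrefs)
```

find() and find_all() cover lookups by tag, class, or id, while select() is handy when a CSS selector already describes the elements you want.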

## What is Selenium?

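Selenium is a framework for automating operations on web pages.
Because the browser here is already set up through the options, page transitions triggered by click and similar actions are quick to add.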
