
Introduction to Web Scraping with Python

Last updated at 2021-02-22, posted at 2021-01-22

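Introduction

I am currently (January 2021) learning how to scrape web pages with Python, so I have summarized the pieces that are essential when scraping. I hope it serves as a reference for anyone else who is scraping with Python.

Intended audience

Python beginners who want to try scraping with Python
People who have forgotten how to write a scraper in Python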

Modules to install

bash

# Install chromedriver (via Homebrew)
$ brew install --cask chromedriver

# Install selenium
$ pip install selenium

# Install requests
$ pip install requests

# urllib is part of the Python standard library, so there is nothing to install

# Install beautifulsoup4
$ pip install beautifulsoup4

Launching a web browser (Google Chrome) automatically

webdriver.Chrome()

python

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # replace with the URL you want to open

Entering text automatically

python

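from selenium.webdriver.common.by import By

# Look up an element by its HTML id attribute
# (the find_element_by_* helpers are no longer available in recent Selenium 4 releases)
ele = driver.find_element(By.ID, "element-id")
# ...or by its name attribute
ele = driver.find_element(By.NAME, "element-name")

# Type text into the element
ele.send_keys("text to enter")

# Click the element (for example, a button)
ele.click()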

Getting the source of the currently displayed page

page_source

python

html = driver.page_source.encode('utf-8')

Fetching a URL

urllib.request.urlopen

python

import urllib.request

url = urllib.request.urlopen("https://example.com")  # returns a response object; replace with the target URL
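
`urlopen` returns a response object rather than the page text, so the body still has to be read and decoded. A minimal sketch of that step is shown here. The helper name `fetch_html` and the UTF-8 fallback are assumptions made for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10.0, fallback_encoding="utf-8"):
    """Return the body at `url` decoded to str (hypothetical helper)."""
    # The context manager closes the response when the block exits.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body as bytes
        # Prefer the charset declared in the Content-Type header, if any;
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or fallback_encoding
    # errors="replace" keeps going when a byte cannot be decoded.
    return raw.decode(charset, errors="replace")
```

The returned string can be passed to an HTML parser in the same way as the `text` attribute of a requests response.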

or

requests.get

python

import requests

url = requests.get("https://example.com")  # returns a Response object; replace with the target URL

Note: in Python, the latter (requests) is the recommended option.

Parsing HTML and XML files

BeautifulSoup

python

# "html.parser" is the parser bundled with Python ("lxml" and "html5lib" are also available)
from bs4 import BeautifulSoup

# url is the Response object returned by requests.get
soup = BeautifulSoup(url.text, "html.parser")

Extracting text

find, find_all

python

# find returns the first element that matches the tag name given in quotes
elem = soup.find('a').text
# find_all returns every matching element; loop over them with for
elems = soup.find_all('a')
for elem in elems:
    print(elem.text)
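
Two details are easy to trip over here: `find` returns `None` when nothing matches, so calling `.text` on the result can raise `AttributeError`, and link targets are often relative paths. The sketch below collects each link's text together with its absolute URL. The HTML snippet and `base_url` are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<ul>
  <li><a href="/items/1">First article</a></li>
  <li><a href="https://example.com/about">About</a></li>
  <li><a>No link here</a></li>
</ul>
"""
base_url = "https://example.com/list"  # assumed address of the page

soup = BeautifulSoup(html, "html.parser")

# find returns the first match, or None when there is no match.
first = soup.find("a")
if first is not None:
    print(first.text)

links = []
# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all("a", href=True):
    # urljoin turns a relative href into an absolute URL.
    links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))

print(links)
```

Filtering with `href=True` avoids a `KeyError` on anchors that have no `href`.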

You can also use select_one() and select() instead.
* More detailed usage of the find method
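
As a rough illustration of `select_one()` and `select()`, which take CSS selectors, the sketch below pulls a heading and a list of link texts out of a small page. The HTML snippet, ids and class names are invented for the example.

```python
from bs4 import BeautifulSoup

# Invented HTML for the example.
html = """
<div id="main">
  <h1 class="title">Trending</h1>
  <ul class="articles">
    <li class="item"><a href="/a">Alpha</a></li>
    <li class="item"><a href="/b">Beta</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
title = soup.select_one("#main h1.title")

# select returns a list of every element matching the selector.
items = soup.select("ul.articles > li.item a")
titles = [a.get_text() for a in items]

# No <p class="nothing"> exists, so this is None.
missing = soup.select_one("p.nothing")

print(title.get_text(), titles, missing)
```

CSS selectors tend to be more compact than chained `find` calls when the target is identified by a combination of ids, classes and nesting.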

Conclusion

In this article I introduced the modules and methods you need to scrape web pages with Python. Starting with the next article, I plan to actually scrape some real web pages.

Continued in: Scraping Qiita's daily trends (Python)
