Oracle Cloud Infrastructure Advent Calendar 2024

[OCI] Web scraping from Jupyter Notebook on a Compute instance (OL8)

Last updated at 2024-12-11, posted at 2024-12-11

1. Introduction

This article uses Jupyter Notebook installed on Oracle Linux 8 (OL8) to fetch data from a website.

Prerequisites

Jupyter Notebook is already installed on the Compute instance, following this article:
https://qiita.com/twakimura/items/e2b145e2460c388ba0cc

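2. Installing the required libraries

First, install the libraries used to fetch and parse HTML. The notebook in section 3 also imports pandas, so install it the same way if it is not already available.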

$ pip3 install requests
$ pip3 install bs4
$ pip3 install selenium
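
To confirm that the packages are visible to the Python that runs the notebook kernel, something like the following could be run in a cell. This is only a minimal sketch: `installed_versions` is a hypothetical helper name, it assumes Python 3.8 or newer (for `importlib.metadata`), and it checks `beautifulsoup4`, the distribution that the `bs4` package pulls in.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names from the pip commands above (beautifulsoup4 backs bs4).
packages = ["requests", "beautifulsoup4", "selenium"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```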

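Selenium is used for richer parsing, and it needs a browser to drive, so install Google Chrome as well. The matching WebDriver (chromedriver) is installed in the next step.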

$ sudo dnf install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
$ sudo dnf info google-chrome-stable
google-chrome                                                          14 kB/s | 4.4 kB     00:00
Installed Packages
Name         : google-chrome-stable
Version      : 131.0.6778.108
Release      : 1
Architecture : x86_64
Size         : 348 M
Source       : google-chrome-stable-131.0.6778.108-1.src.rpm
Repository   : @System
From repo    : @commandline
Summary      : Google Chrome
URL          : https://chrome.google.com/
License      : Multiple, see https://chrome.google.com/
Description  : The web browser from Google
             :
             : Google Chrome is a browser that combines a minimal design with sophisticated technology
             : to make the web faster, safer, and easier.

Check the version of the installed Chrome, then install the same version of chromedriver.

$ google-chrome --version
Google Chrome 131.0.6778.108
$ pip3 install --user chromedriver-binary==131.0.6778.108
ERROR: Could not find a version that satisfies the requirement chromedriver-binary==131.0.6778.108

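The install failed with an error saying that no chromedriver matching the Chrome version could be found.
Normally the two versions should match exactly, but this time I installed the closest one among the available driver versions listed in the error message.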

$ pip3 install --user chromedriver-binary==131.0.6778.87.0
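
Whether a "close" driver version is acceptable can be checked roughly by comparing major version numbers. The sketch below assumes that a matching major version (131 here) is the relevant compatibility condition, which is a common rule of thumb and not something this article verifies. `major_version` and `same_major` are hypothetical helper names.

```python
import re


def major_version(text):
    """Return the leading number of the first dotted version found in text."""
    match = re.search(r"(\d+)(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1))


def same_major(chrome_version, driver_version):
    """True when both version strings share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Version strings taken from the commands above.
print(same_major("Google Chrome 131.0.6778.108", "131.0.6778.87.0"))
```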

3. Web scraping

This time we fetch the race-day schedule from netkeiba.
https://race.netkeiba.com/top/calendar.html?year=2024&month=11

The URL has the form shown above: passing the year and month as URL parameters returns the calendar for that month.
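
Because only the `year` and `month` parameters change, the URL for any month can be built programmatically. The following is a minimal sketch under that assumption; `BASE_URL` and `calendar_url` are illustrative names introduced here, not part of the site or of any library.

```python
from urllib.parse import urlencode

# Base path taken from the URL above; the query string is appended per month.
BASE_URL = "https://race.netkeiba.com/top/calendar.html"


def calendar_url(year, month):
    """Build the calendar page URL for the given year and month (1-12)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return f"{BASE_URL}?{urlencode({'year': year, 'month': month})}"


# One URL per month of 2024; index 10 is November.
urls = [calendar_url(2024, m) for m in range(1, 13)]
print(urls[10])
```

When fetching several months in a loop, pausing between requests (for example with `time.sleep`) keeps the load on the site low.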

getCalendar.ipynb

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import chromedriver_binary
import pandas as pd

# Options for running Chrome without a display (CUI environment)
options = Options()
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')  # disable the GPU
options.add_argument('--no-sandbox')  # disable the sandbox (only if needed)
options.add_argument('--disable-dev-shm-usage')  # work around the /dev/shm size limit

# Set up the driver and fetch the page
driver = webdriver.Chrome(options=options)
page = 'https://race.netkeiba.com/top/calendar.html?year=2024&month=11'
try:
    driver.get(page)
    source = driver.page_source
finally:
    driver.quit()  # always release the browser process, even on error

# Extract the Calendar_Table element from the HTML source
soup = BeautifulSoup(source, 'html.parser')
table = soup.find(class_="Calendar_Table")

# Collect the header cells
headers = [th.text.strip() for th in table.find_all("th")]

# Collect the data rows of the table
rows = []
for tr in table.find_all("tr")[1:]:  # data rows only (skip the header row)
    cells = [td.text.strip() for td in tr.find_all("td")]
    rows.append(cells)

# Convert to a DataFrame
df = pd.DataFrame(rows, columns=headers)

# Display the DataFrame
df

The calendar information was retrieved successfully.
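
`pd.DataFrame(rows, columns=headers)` expects the rows to fit the number of headers, so a row with a different number of cells (for example an empty row) may raise an error or shift columns. As a hedged sketch, the rows could be normalized before building the DataFrame. `normalize_rows` is a hypothetical helper, and the sample headers and rows below are made-up illustrations, not data from the site.

```python
def normalize_rows(rows, width, fill=""):
    """Drop empty rows, then pad or truncate each row to exactly width cells."""
    normalized = []
    for cells in rows:
        if not cells:
            continue  # a <tr> without any <td> yields an empty list
        # Truncate long rows; pad short rows (a negative count adds nothing).
        padded = list(cells[:width]) + [fill] * (width - len(cells))
        normalized.append(padded)
    return normalized


# Made-up sample: one empty, one exact, one short and one long row.
sample_headers = ["Mon", "Tue", "Wed"]
sample_rows = [[], ["1", "2", "3"], ["4", "5"], ["6", "7", "8", "9"]]
print(normalize_rows(sample_rows, len(sample_headers)))
```

In the notebook above, the result could then be passed on as `pd.DataFrame(normalize_rows(rows, len(headers)), columns=headers)`.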
