
[Scraping] Extracting Data and Creating a CSV File (Fully Automated)

Last updated at 2022-10-16, posted at 2022-10-02

Todo

Extract the tourist-spot ranking data from https://scraping-for-beginner.herokuapp.com/ranking/ and write it to a CSV file.

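Prerequisites

Install JupyterLab beforehand (allow about 5 minutes). The code also imports requests, bs4 (the beautifulsoup4 package), and pandas, so these need to be installed as well.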

Steps

Start JupyterLab
```
$ jupyter lab
```

Write the code

全自動ランキングデータ抽出.ipynb

# import
import requests
from bs4 import BeautifulSoup
import pandas as pd

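# Ranking site
url = 'https://scraping-for-beginner.herokuapp.com/ranking/'
res = requests.get(url)
# Stop here if the request failed (4xx/5xx status)
res.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(res.text, 'html.parser')

# List that collects one dict per tourist spot
data = []

# Ranking boxes for the 10 tourist spots
spots = soup.find_all('div', attrs={'class': 'u_areaListRankingBox'})

# A single spot, handy for trying the steps below one at a time
# spot = spots[0]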

# Extract the information for each tourist spot
for spot in spots:
    # Spot name
    spot_name = spot.find('div', attrs={'class': 'u_title'})
    # Remove the badge span so only the name remains
    spot_name.find('span', attrs={'class': 'badge'}).extract()
    # Strip the newlines left in the text
    spot_name = spot_name.text.replace('\n', '')

    # Overall score
    score = spot.find('div', attrs={'class': 'u_rankBox'}).text.replace('\n', '')
    score = float(score)

    # Category block
    category_items = spot.find('div', attrs={'class': 'u_categoryTipsItem'})
    # Keep only the dl elements
    category_items = category_items.find_all('dl')

    # Map each category name to its rating
    datum = {}
    for category_item in category_items:
        category = category_item.dt.text
        rank = float(category_item.span.text)
        datum[category] = rank

    # Add the spot name and overall score
    datum['観光地名'] = spot_name
    datum['評点'] = score

    # Append to data
    data.append(datum)

df = pd.DataFrame(data)
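# Print the columns before reordering (to work out the new order)
# df.columns

# Reorder the columns: spot name, overall score, fun, crowdedness, scenery, access
df = df[['観光地名', '評点', '楽しさ', '人混みの多さ', '景色', 'アクセス']]

# Write out as a CSV file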
df.to_csv('観光地情報.csv', index=False)
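
The extraction logic above can also be tried offline, which helps when checking what each `find` call returns. The following is a minimal sketch, not the article's exact code: the function name `parse_spots` and the sample markup are hypothetical. Only the class names are taken from the code above, and the real page contains more elements than this trimmed-down sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup for one ranking entry (assumption).
# Only the class names match the article's code.
SAMPLE_HTML = """
<div class="u_areaListRankingBox">
  <div class="u_title">
    <h2><span class="badge">1</span>Spot A</h2>
  </div>
  <div class="u_rankBox"><span>4.7</span></div>
  <div class="u_categoryTipsItem">
    <dl><dt>楽しさ</dt><dd><span>4.6</span></dd></dl>
    <dl><dt>景色</dt><dd><span>4.9</span></dd></dl>
  </div>
</div>
"""


def parse_spots(html):
    """Return one dict per ranking box found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for box in soup.find_all("div", class_="u_areaListRankingBox"):
        title = box.find("div", class_="u_title")
        badge = title.find("span", class_="badge")
        if badge is not None:
            badge.extract()  # drop the badge so only the name remains
        row = {
            "観光地名": title.get_text(strip=True),
            "評点": float(box.find("div", class_="u_rankBox").get_text(strip=True)),
        }
        tips = box.find("div", class_="u_categoryTipsItem")
        for dl in tips.find_all("dl"):
            # dt holds the category name, the first span holds its rating
            row[dl.dt.get_text(strip=True)] = float(dl.span.get_text(strip=True))
        rows.append(row)
    return rows


rows = parse_spots(SAMPLE_HTML)
print(rows)
```

Passing `res.text` instead of `SAMPLE_HTML` should give rows comparable to `data` in the notebook, provided the live page still uses the same class names. The sketch uses `get_text(strip=True)` in place of `replace('\n', '')` so that the indentation in the sample markup does not end up in the values.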

Run
Press command + Enter (Ctrl + Enter on Windows/Linux) to run the cell.
Confirm that 観光地情報.csv has been created.
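
Beyond checking that the file exists, the header and row count can be verified programmatically. This is a rough sketch using only the standard library. The helper name `check_csv` is hypothetical, and the stand-in file with made-up values exists only so the example runs without the scraped data.

```python
import csv
import tempfile
from pathlib import Path

# Column order written by the notebook
EXPECTED_COLUMNS = ["観光地名", "評点", "楽しさ", "人混みの多さ", "景色", "アクセス"]


def check_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Return the number of data rows after confirming the header matches."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        if header != expected_columns:
            raise ValueError(f"unexpected header: {header}")
        return sum(1 for _ in reader)


# Stand-in file with made-up values (assumption), written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    with open(sample, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(["Spot A", 4.7, 4.6, 4.5, 4.9, 4.1])
    row_count = check_csv(sample)

print(row_count)
```

Run against the real output, `check_csv('観光地情報.csv')` would be expected to return 10 if all ten spots were scraped. If the file will be opened in Excel, passing `encoding='utf-8-sig'` to `df.to_csv` may avoid garbled Japanese text; in that case the same encoding should be used when reading the file back.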

Related: [Scraping] Fetching and Saving Images from a Website (Fully Automated)

Reference: https://www.youtube.com/watch?v=VRFfAeW30qE
