[Python] From scraping beginner to saving the J.League standings as a CSV file

Posted at 2020-06-21

Reviewing scraping

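I was curious about scraping and wanted to try pulling some data, so I gave it a go while following the article below.
https://www.atmarkit.co.jp/ait/articles/1910/18/news015_2.html
I am writing this partly as a review for myself, and I hope it helps anyone trying scraping for the first time!
I wrote the code in Python on Google Colab.
As a result, some parts may differ from what you would write in a local environment.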

Scraping basics

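I scraped with the request module from urllib and Beautiful Soup.
request fetches a file from the specified web page, and Beautiful Soup extracts the information you want from the fetched file.
As in the referenced article, the program retrieves the J.League standings.
In addition, I went as far as saving the result to a CSV file.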
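
The second half of that flow, pulling cell text out of HTML, can be tried without any network access. This is a minimal sketch, assuming Beautiful Soup (`bs4`) is installed. The HTML snippet, the helper name `extract_rows`, and the values are made up for illustration and are not the real J.League markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating a header row and one data row of a standings table
SAMPLE_HTML = """
<table>
  <tr><th>順位</th><th>クラブ名</th><th>勝点</th></tr>
  <tr><td>1</td><td>Club A</td><td>10</td></tr>
</table>
"""

def extract_rows(html):
    """Return the text of every td cell, grouped by table row."""
    # Naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = [td.text for td in tr.find_all('td')]
        if cells:  # rows made only of th cells yield no td cells
            rows.append(cells)
    return rows

rows = extract_rows(SAMPLE_HTML)
print(rows)  # [['1', 'Club A', '10']]
```

With a real page, the string passed to `extract_rows` would be the decoded response body instead of the inline sample.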
The code I used is shown below.

qiita.py

from bs4 import BeautifulSoup
from urllib import request

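# Fetch the J1 standings page; the with block closes the response automatically
url = 'https://www.jleague.jp/standings/j1/'
with request.urlopen(url) as response:
    content = response.read()
    # get_content_charset() returns None when the header declares no charset
    charset = response.headers.get_content_charset() or 'utf-8'

# Decode the bytes (skipping undecodable ones) and parse with an explicit parser
html = content.decode(charset, 'ignore')
soup = BeautifulSoup(html, 'html.parser')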

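# Collect every table row on the page
table = soup.find_all('tr')

standing = []
for row in table:
    tmp = []
    for item in row.find_all('td'):
        if item.a:
            # Cells with a link: keep only the first half of the text
            # (this assumes the cell repeats the same text twice)
            tmp.append(item.text[0:len(item.text) // 2])
        else:
            tmp.append(item.text)
    # Drop the first and last cells; slicing is safe even if the row has no td cells
    tmp = tmp[1:-1]
    standing.append(tmp)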

for item in standing:
    print(item)

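import pandas as pd

# Drop the first collected row (presumably the table header) so only club rows remain
del standing[0]
# Columns: rank, club, points, played, won, drawn, lost, goals for, goals against, goal difference
df = pd.DataFrame(standing, columns=['順位', 'クラブ名', '勝点', '試合数', '勝', '分', '負', '得点', '失点', '得失点'])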

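from google.colab import drive

# Mount Google Drive so that /content/drive exists (Colab asks for authorization)
drive.mount('/content/drive')

filename = 'j1league.csv'
path = '/content/drive/My Drive/' + filename

# utf-8-sig writes a BOM at the start of the file; newline='' leaves line endings to pandas
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    df.to_csv(f, index=False)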
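
The `utf-8-sig` encoding used when saving differs from plain `utf-8` only in that it writes a byte order mark (BOM) at the start of the file, which spreadsheet software such as Excel uses to recognize UTF-8. The article does not say why this encoding was chosen, so treat that motivation as an assumption. A minimal sketch using only the standard library follows. The temporary files, the helper name `write_csv`, and the sample row are made up for illustration.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical rows: two of the article's Japanese header labels plus a placeholder data row
rows = [['順位', 'クラブ名'], ['1', 'Club A']]

def write_csv(path, encoding):
    """Write rows to path with the given encoding and return the raw bytes."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding=encoding, newline='') as f:
        csv.writer(f).writerows(rows)
    with open(path, 'rb') as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmpdir:
    with_bom = write_csv(os.path.join(tmpdir, 'with_bom.csv'), 'utf-8-sig')
    without_bom = write_csv(os.path.join(tmpdir, 'without_bom.csv'), 'utf-8')

print(with_bom.startswith(codecs.BOM_UTF8))            # True
print(without_bom.startswith(codecs.BOM_UTF8))         # False
print(with_bom[len(codecs.BOM_UTF8):] == without_bom)  # True
```

Apart from the three BOM bytes, the two files are identical, so tools that expect plain UTF-8 can usually read either one by opening it with `utf-8-sig`.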

While implementing this, I checked each step in detail by putting print() calls in between, but the code here runs straight through to saving the file.
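
As an illustration of that kind of print() check, the slicing step `item.text[0:len(item.text) // 2]` can be tried on a plain string. This sketch assumes the text of a linked cell is the same string written twice. The helper name `first_half` and the sample values are hypothetical and not taken from the actual page.

```python
def first_half(text):
    """Return the first half of text, using the same slice as the listing."""
    return text[0:len(text) // 2]

# Made-up cell text in which the same name appears twice
cell_text = 'Club AClub A'
print(repr(first_half(cell_text)))  # 'Club A'

# If the text is not an exact repetition, the result is simply truncated
print(repr(first_half('Club B')))   # 'Clu'
```

If the page layout changed so that the name was no longer repeated, this step would silently cut names in half, so printing a few rows before saving is a cheap safeguard.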
