[Practice] Scraping baseball-lab with Python

Posted at 2019-07-10 (last updated at 2019-07-10)

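Introduction

After taking the course 【世界で5万人が受講】実践 Python データサイエンス (Practical Python Data Science),
I wanted to get my hands on some baseball data,
so I decided to try scraping it.
The author has about two weeks of programming experience.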

What I tried

Extract the 2018 BayStars batter data from Baseball Lab
Turn the extracted data into a DataFrame

References

Baseball Lab (ベースボールラボ)
【世界で5万人が受講】実践 Python データサイエンス (Practical Python Data Science)

Code

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

# Batter data for the 2018 BayStars
url = 'http://www.baseball-lab.jp/player/batter/3/2018/'

# This part follows the course material
result = requests.get(url)
c = result.content
# Name the parser explicitly; without it BeautifulSoup picks one itself and warns
soup = BeautifulSoup(c, 'html.parser')
summary = soup.find('div', {'class': 'content-holder'})
tables = summary.find_all('table')
data = []
rows = tables[0].find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        # Keep only the first text node of each cell
        cell_text = td.find(string=True)
        data.append(cell_text)


# Import numpy, convert the list to an array, and reshape it into rows of 26 values
import numpy as np
arr1 = np.array(data).reshape(-1, 26)
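
The flat list plus `reshape(-1, 26)` only works while every row has exactly 26 cells. As a possible alternative, here is a minimal sketch that collects one list per `<tr>`, so the column count does not have to be hard-coded. The sample HTML and the helper name `table_to_rows` are made up for illustration and are not taken from the Baseball Lab page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for the real page (assumption).
SAMPLE_HTML = """
<div class="content-holder">
  <table>
    <tr><th>No.</th><th>Name</th><th>G</th></tr>
    <tr><td>1</td><td>Player A</td><td>120</td></tr>
    <tr><td>2</td><td>Player B</td><td>98</td></tr>
  </table>
</div>
"""


def table_to_rows(table):
    """Return one list of cell strings per <tr> that contains <td> cells."""
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # rows that hold only <th> cells produce no <td>, so skip them
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table = soup.find('div', {'class': 'content-holder'}).find('table')
rows = table_to_rows(table)
print(rows)
```

A list of rows like this can be passed straight to `DataFrame(rows)`, which would make the numpy step optional.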

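At this point the array that will become the DataFrame is ready, so next I prepare the names to set as `columns`. (I tried to take them from the `th` tags, but what came back had no content, possibly because of line breaks.)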
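
One possible explanation for the empty results, which has not been checked against the real page, is that each `th` wraps its label in another tag, so the first text node is only the line break in front of that tag. `find(string=True)` returns that first node, while `get_text(strip=True)` joins all text inside the element and trims the whitespace. The header markup below is a made-up stand-in, not the actual Baseball Lab HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: each label sits inside an <a> tag after a line break (assumption).
SAMPLE_HEADER = """
<table>
  <tr>
    <th>
      <a href="#">背番号</a>
    </th>
    <th>
      <a href="#">選手名</a>
    </th>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HEADER, 'html.parser')
ths = soup.find_all('th')

# The first text node of each <th> is just the whitespace before the <a> tag.
first_nodes = [th.find(string=True) for th in ths]

# get_text(strip=True) collects every text node and drops the surrounding whitespace.
labels = [th.get_text(strip=True) for th in ths]

print(first_nodes)
print(labels)
```

If the real page behaves like this sample, `labels` could replace the hand-written column list.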

index_batter.txt
背番号
選手名
試合
打席
打数
得点
安打
二塁打
三塁打
本塁打
塁打
打点
三振
四球
敬遠
死球
犠打
犠飛
盗塁
盗塁刺
併殺打
失策
打率
長打率
出塁率
OPS

# Read the column names from the file (assumes the file is saved as UTF-8)
with open('index_batter.txt', encoding='utf-8') as f:
    index_batter = f.read().split()
print(index_batter)

df = DataFrame(arr1)
df.columns = index_batter

# Show the result
df
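
Because the array is built from scraped text, the values in `df` are text rather than numbers, so sorting or summing them will not behave numerically. The following is a minimal sketch of converting the stat columns with `pd.to_numeric`. The sample rows are invented and only a few of the 26 columns are shown.

```python
import pandas as pd

# Made-up sample rows in the same shape as the scraped data (assumption).
df = pd.DataFrame(
    [['7', 'Player A', '120', '.310'],
     ['25', 'Player B', '98', '.275']],
    columns=['背番号', '選手名', '試合', '打率'],
)

# Convert every column except the player name; cells that cannot be parsed become NaN.
numeric_cols = [c for c in df.columns if c != '選手名']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)
print(df.sort_values('打率', ascending=False))
```

Using `errors='coerce'` is a design choice for this sketch: placeholder cells such as `-` turn into `NaN` instead of raising an error.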

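Future work

Player names contain line breaks and spaces, which makes them awkward to handle, so I want to clean them up
I want to pull the column names directly from the HTML
I want to compare across seasons and with other teams
Data visualization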
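
As a starting point for the first and third items, here is a rough sketch using only the standard library. It rests on two assumptions: that removing all whitespace from a player name is acceptable, and that the `3` and `2018` in the URL used above are a team ID and a season that can be swapped for other values. Neither has been confirmed against the site. The helper names `clean_name` and `batter_url` and the sample name are hypothetical.

```python
import re


def clean_name(raw):
    """Remove all whitespace, including line breaks and full-width spaces."""
    return re.sub(r'\s+', '', raw)


def batter_url(team_id, year):
    """Build a page URL, assuming the pattern of the original URL holds (unverified)."""
    return f'http://www.baseball-lab.jp/player/batter/{team_id}/{year}/'


# Made-up name with a line break, spaces, and a full-width space (U+3000).
print(clean_name('\n 山田\u3000太郎 \n'))

# URLs for three seasons of the same team ID.
for year in (2016, 2017, 2018):
    print(batter_url(3, year))
```

Each generated URL could then go through the same scraping steps as above, with a pause between requests to avoid loading the site.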
