Scraping Data for Minami-Kanto Horse Racing

Last updated at 2024-10-26 / Posted at 2024-10-26

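1. Overview

This started when I decided to try predicting regional horse racing (Minami-Kanto racing) with machine learning.

For JRA races, the data needed for machine learning can be purchased, but for regional racing it is hard to find, so I decided to scrape it with Python.

The data is therefore collected from a horse racing information site.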

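2. URLs used for data collection

The data is scraped from the horse racing information site netkeiba.
Scraping sends a large number of requests to web pages.
Take care not to put too much load on the server.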
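
One way to keep the load down is to enforce the wait between requests in a single place. The sketch below is only an illustration and is not part of the original script: `Throttle` and `min_interval` are hypothetical names, and the class does nothing more than guarantee a minimum gap between calls using the standard library.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the logic can be checked without real waiting
        self._sleep = sleep
        self._last = None    # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(url)` would then space the requests out, however long the parsing in between takes.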

https://nar.netkeiba.com/race/result.html?race_id=202444081211&rf=race_list
Note: this is the result page URL (result.html) for race 11 on August 12, 2024
https://nar.netkeiba.com/race/shutuba_past.html?race_id=202444081211&rf=shutuba_submenu
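Note: this is the past-performance page URL for the same race (a table of each runner's last five races)

These two pages are extracted for a fixed number of years.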
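
The race_id in these URLs concatenates the year, venue code, month, day and race number, which is also how the script in section 4 assembles it. As a minimal sketch, the helpers below build and decode such an ID; the function names are hypothetical and the `rf=race_list` parameter from the first example URL is left out, as it is in the script.

```python
RESULT_URL = 'https://nar.netkeiba.com/race/result.html?race_id='
PAST_URL = 'https://nar.netkeiba.com/race/shutuba_past.html?race_id='


def make_race_id(year, place, month, day, race):
    """Concatenate year(4) + venue code(2) + month(2) + day(2) + race number(2)."""
    return f'{year:04d}{place:02d}{month:02d}{day:02d}{race:02d}'


def parse_race_id(race_id):
    """Split a 12-digit race_id back into its five integer components."""
    if len(race_id) != 12 or not race_id.isdigit():
        raise ValueError(f'unexpected race_id: {race_id!r}')
    return {
        'year': int(race_id[0:4]),
        'place': int(race_id[4:6]),
        'month': int(race_id[6:8]),
        'day': int(race_id[8:10]),
        'race': int(race_id[10:12]),
    }


def build_urls(race_id):
    """Return (result page URL, past-performance page URL) for one race."""
    return RESULT_URL + race_id, PAST_URL + race_id + '&rf=shutuba_submenu'
```

For the example above, `make_race_id(2024, 44, 8, 12, 11)` gives `'202444081211'`, where 44 is the venue code for Oi.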

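3. Scraping procedure

1) The digits after race_id= in the URLs above encode the year, venue, month, day and race number, so iterate over them with five nested loops

Note: venues use the following codes, so loop over them as a list
Monbetsu: 30, Morioka: 35, Mizusawa: 36, Urawa: 42, Funabashi: 43, Oi: 44, Kawasaki: 45, Kanazawa: 46, Kasamatsu: 47, Nagoya: 48, Sonoda: 50, Himeji: 51, Kochi: 54, Saga: 55

2) First fetch the result page URL (result.html) and parse it with BeautifulSoup.
If the result page is empty, the past-performance page is empty too, so leave the loop at that point

3) Extract the conditions the race was run under
(race name, start time, turf or dirt, distance, meeting number, venue, day of the meeting, prize money, race grade)

4) Next, extract the race results as a DataFrame
(finishing position, bracket, horse number, horse name, age, sex, weight carried, jockey, time, margin, popularity rank, win odds, last 3 furlongs (600 m) time, stable, horse weight, horse weight change)

5) Finally, extract the last five races as a DataFrame (horse number, mark, horse name/odds, jockey/weight carried, previous race, 2 races back, 3 races back, 4 races back, 5 races back)
Note: the horse name/odds column repeats information already extracted in 4), but it also holds the interval since the last race and the sire and dam names, so it is kept
Note: each "n races back" cell holds several values, but they are split at training time, so this is not a problem

6) Join the race results and the last-five-races data on horse number (馬番)

7) Turn the race conditions into a two-dimensional DataFrame with one row per runner

8) Join the result of 6) with the race conditions

9) Move on to the next iteration (race)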
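
The five nested loops in step 1 can also be written as a generator. The sketch below is one possible arrangement under stated assumptions, not the script's own code: `iter_race_ids` and `MINAMI_KANTO` are hypothetical names, and `calendar.monthrange` is used so that only real calendar days are produced instead of always trying days 1 to 31.

```python
import calendar
from itertools import product

# Venue codes for the four Minami-Kanto tracks, taken from the list in step 1
MINAMI_KANTO = {42: 'Urawa', 43: 'Funabashi', 44: 'Oi', 45: 'Kawasaki'}


def iter_race_ids(years, places, max_races=12):
    """Yield race_id strings for every real calendar day, venue and race number."""
    for year, place, month in product(years, places, range(1, 13)):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            for race in range(1, max_races + 1):
                yield f'{year}{place}{month:02d}{day:02d}{race:02d}'
```

A caller would loop with `for race_id in iter_race_ids(range(2018, 2019), MINAMI_KANTO)`. Skipping the rest of a day once its first race turns out to be empty, as step 2 describes, would still need to be handled by the caller.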

4. Code

Caveats
Note: the code is not very readable
Note: it worked as of August 13, 2024, but may stop working in the future

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re
import csv
import sys

# Common parts of the URLs
url_header='https://nar.netkeiba.com/race/result.html?race_id='
url_shutuba_header='https://nar.netkeiba.com/race/shutuba_past.html?race_id='
url_shutuba_fooder='&rf=shutuba_submenu'

placenumber=[30,35,36,42,43,44,45,46,47,48,50,51,54,55]

# Column headers
header_list=['this_date','this_place','this_weather','this_race_title','this_race_sum','this_race_shiba','this_dart_dart','this_race_distance','this_race_ryou_truck','this_race_yayaomo_truck','this_race_omo_truck','this_race_huryou_truck',
             'this_result','this_waku','this_ban','this_male','this_female','this_sen','this_kinryo','this_jockey','this_time','this_ninki','this_odds','this_last_three_halons_time','this_stable','this_weight','this_weight_change',
             'once_date','once_place','once_weather','once_race_title','once_race_sum','once_race_shiba','once_dart_dart','once_race_distance','once_race_ryou_truck','once_race_yayaomo_truck','once_race_omo_truck','once_race_huryou_truck',
             'once_result','once_waku','once_ban','once_kinryo','once_jockey','once_time','once_ninki','once_odds','once_last_three_halons_time','once_weight','once_weight_change',
             'twice_date','twice_place','twice_weather','twice_race_title','twice_race_sum','twice_race_shiba','twice_dart_dart','twice_race_distance','twice_race_ryou_truck','twice_race_yayaomo_truck','twice_race_omo_truck','twice_race_huryou_truck',
             'twice_result','twice_waku','twice_ban','twice_kinryo','twice_jockey','twice_time','twice_ninki','twice_odds','twice_last_three_halons_time','twice_weight','twice_weight_change',
             'third_date','third_place','third_weather','third_race_title','third_race_sum','third_race_shiba','third_dart_dart','third_race_distance','third_race_ryou_truck','third_race_yayaomo_truck','third_race_omo_truck','third_race_huryou_truck',
             'third_result','third_waku','third_ban','third_kinryo','third_jockey','third_time','third_ninki','third_odds','third_last_three_halons_time','third_weight','third_weight_change',
             'fourth_date','fourth_place','fourth_weather','fourth_race_title','fourth_race_sum','fourth_race_shiba','fourth_dart_dart','fourth_race_distance','fourth_race_ryou_truck','fourth_race_yayaomo_truck','fourth_race_omo_truck','fourth_race_huryou_truck',
             'fourth_result','fourth_waku','fourth_ban','fourth_kinryo','fourth_jockey','fourth_time','fourth_ninki','fourth_odds','fourth_last_three_halons_time','fourth_weight','fourth_weight_change',
             'fifth_date','fifth_place','fifth_weather','fifth_race_title','fifth_race_sum','fifth_race_shiba','fifth_dart_dart','fifth_race_distance','fifth_race_ryou_truck','fifth_race_yayaomo_truck','fifth_race_omo_truck','fifth_race_huryou_truck',
             'fifth_result','fifth_waku','fifth_ban','fifth_kinryo','fifth_jockey','fifth_time','fifth_ninki','fifth_odds','fifth_last_three_halons_time','fifth_weight','fifth_weight_change',]

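# Fetch a page and return it parsed with BeautifulSoup
def web_access(url):
    r = requests.get(url)
    time.sleep(1.0)  # wait one second after each request to limit server load
    soup = BeautifulSoup(r.content, "html.parser")
    return soup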

# ... existing code ...

# Extract the result table from the given URL as a DataFrame
def extract_race_result_table(url):
    soup = web_access(url)
    table = soup.find('table', {'summary': '全着順', 'class': 'RaceTable01 RaceCommon_Table ResultRefund Table_Show_All ResultMain', 'id': 'All_Result_Table'})
    if table is None:
        # No result table on this page; the caller checks for None and skips the race
        return None

    headers = [header.text.replace('\n', '').strip() for header in table.find_all('th')]
    
    rows = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 0:
            rows.append([cell.get_text(separator=' ').replace('\n',' ').strip() for cell in cells])
    
    df = pd.DataFrame(rows, columns=headers)
    return df

# ... existing code ...
def normalize_spaces(text):
    # Collapse runs of half-width and full-width spaces into a single half-width space
    normalized_text = re.sub(r'[ 　]+', ' ', text)
    return normalized_text
# ... existing code ...

# Extract each horse's last-five-races data
def extract_past5_data(soup):
    past5_table = soup.find('table', class_='Shutuba_Table Shutuba_Past5_Table')

    if past5_table is None:
        print('No past-5-races data found')
        return None
    
    text_line=[]

    data02_elements=past5_table.find_all('div',class_='Data02')
    for element in data02_elements:
        text = element.get_text().replace('　','').replace(' ','').replace('\n','')
        text_line.append(text)
    
    for i,element in enumerate(data02_elements):
        if i<len(text_line):
            element.string = text_line[i]


    headers = [header.text.replace('\n','') for header in past5_table.find_all('th')]
    
    rows = []
    for row in past5_table.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) > 0:
            
            row_data = [re.sub(r'[ 　]+', ' ', cell.get_text(separator=' ').replace('\n',' ').strip()) for cell in cells]
            rows.append(row_data)
    
    if not rows:
        print('No rows found in past_5_table')
        return None
    
    df = pd.DataFrame(rows, columns=headers)
    return df

def extract_race_class(race_title):
    race_class = '0'
    if 'OP' in race_title:
        race_class = 'OP'
    elif '重賞' in race_title:
        race_class = '重賞'
    elif 'Jpn1' in race_title:
        race_class = 'JPN1'
    elif 'Jpn2' in race_title:
        race_class = 'JPN2'
    elif 'Jpn3' in race_title:
        race_class = 'JPN3'
    elif 'G1' in race_title:
        race_class = 'G1'
    elif 'G2' in race_title:
        race_class = 'G2'
    elif 'G3' in race_title:
        race_class = 'G3'
    elif ('デビュー' in race_title) or ('新馬' in race_title):
        race_class = '新馬'
    elif '未勝利' in race_title:
        race_class = '未勝利'
    else:
        match = re.search(r'(C|B|A)(\d+)', race_title)
        if match:
            race_class = match.group(1) + convert_number_to_kanji(match.group(2))
        elif '1勝クラス' in race_title:
            race_class = '一勝クラス'
        elif '2勝クラス' in race_title or '二勝クラス' in race_title:
            race_class = '二勝クラス'
        elif '3勝クラス' in race_title or '三勝クラス' in race_title:
            race_class = '三勝クラス'
    return race_class

def race_class_plus(past5_list, race_classes):
    for race_class in race_classes:
        if re.search(race_class, past5_list[3]):
            # For C/B/A + digits + kanji numerals, drop only the kanji numerals
            match = re.search(r'(C|B|A)(\d+)([一二三四五六七八九十]*)', past5_list[3])
            if match:
                past5_list[3] = match.group(1) + match.group(2)  # drop the kanji numerals
                past5_list.insert(4, match.group(1) + match.group(2))
            else:
                # e.g. for まるまる賞(A1), insert A1
                match = re.search(r'\(([CBA]\d+)\)', past5_list[3])
                if match:
                    past5_list.insert(4, match.group(1))
                else:
                    past5_list.insert(4, race_class)
            break
    return past5_list

# ... existing code ...

# Extract the race information
def extract_race_info(soup):
    race_info = {}
    
    # Race name
    race_name = soup.find('div', class_='RaceName').get_text(strip=True)
    race_info['レース名'] = race_name
    
    # Start time, turf/dirt and distance, meeting number, venue, day of the meeting, field size
    race_data_1 = soup.find('div', class_='RaceData01').get_text().replace('/', ' ').split()
    race_info['発走時刻'] = race_data_1[0] if len(race_data_1) > 0 else '0'
    race_info['芝orダと距離']=race_data_1[1] if len(race_data_1)>1 else '0'
    race_data_2 = soup.find('div', class_='RaceData02').get_text().replace('/', ' ').split()
    race_info['回数'] = race_data_2[0] if len(race_data_2) > 0 else '0'
    race_info['場所'] = race_data_2[1] if len(race_data_2) > 1 else '0'
    race_info['日目'] = race_data_2[2] if len(race_data_2) > 2 else '0'
    race_info['頭数'] = race_data_2[6] if len(race_data_2) > 6 else '0'

    # Extract the race grade
    race_class = '未格付'
    if 'OP' in race_name:
        race_class = 'OP'
    elif '重賞' in race_name:
        race_class = '重賞'
    elif 'Jpn1' in race_name:
        race_class = 'JPN1'
    elif 'Jpn2' in race_name:
        race_class = 'JPN2'
    elif 'Jpn3' in race_name:
        race_class = 'JPN3'
    elif 'G1' in race_name:
        race_class = 'G1'
    elif 'G2' in race_name:
        race_class = 'G2'
    elif 'G3' in race_name:
        race_class = 'G3'
    elif 'デビュー' in race_name or '新馬' in race_name:
        race_class = '新馬'
    elif '未勝利' in race_name:
        race_class = '未勝利'
    else:
        match = re.search(r'(C|B|A)(\d+)', race_name)
        if match:
            race_class = match.group(1) + convert_number_to_kanji(match.group(2))
        elif '1勝クラス' in race_name:
            race_class = '一勝クラス'
        elif '2勝クラス' in race_name:
            race_class = '二勝クラス'
        elif '3勝クラス' in race_name:
            race_class = '三勝クラス'
    
    race_info['レースの格'] = race_class
    

    
    return race_info


# Convert digits to kanji numerals
def convert_number_to_kanji(number):
    kanji_numbers = {
        '0':'零', '1': '一', '2': '二', '3': '三', '4': '四',
        '5': '五', '6': '六', '7': '七', '8': '八', '9': '九'
    }
    return ''.join(kanji_numbers[digit] for digit in number)


# Data extraction function
def extract_data_from_url(url, table_class):
    soup = web_access(url)
    table = soup.find('table', {'class': table_class})
    
    if table:
        print(f"Table with class '{table_class}' found")
        df = extract_race_result_table(url)
        race_info = extract_race_info(soup)  # extract_race_info expects a soup, not a URL
        return df, race_info
    else:
        print(f"Table with class '{table_class}' not found")
        return pd.DataFrame(), None


all_maerged_dfs=[]

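for year in range(2018,2019):# collect data for 2018; widen the range to cover more years
    # Venue codes (range(42,46) below covers the four Minami-Kanto tracks)
    # Monbetsu: 30, Morioka: 35, Mizusawa: 36, Urawa: 42, Funabashi: 43, Oi: 44, Kawasaki: 45, Kanazawa: 46, Kasamatsu: 47, Nagoya: 48, Sonoda: 50, Himeji: 51, Kochi: 54, Saga: 55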
    for place in range(42,46):
     for month in range(1,13):
         for day in range(1,32):
             for race in range(1,13):
                try:
                    url=url_header+str(year)+str(place)+str(month).zfill(2)+str(day).zfill(2)+str(race).zfill(2)
                    print(url)

                    url_shutuba=url_shutuba_header+str(year)+str(place)+str(month).zfill(2)+str(day).zfill(2)+str(race).zfill(2)+url_shutuba_fooder
                    print(url_shutuba)

                    soup=web_access(url)

                    # Race information for the race that provides the ground-truth labels
                    # Get field size, distance, weather, track condition, race name and date
                    race_data_1=soup.find('div',class_='RaceData01').get_text().replace('/',' ').splitlines()
                    race_data_1 = [x for x in race_data_1 if x!='']
                    race_data_2=soup.find('div',class_='RaceData02').get_text().replace('/',' ').splitlines()
                    race_data_2 = [x for x in race_data_2 if x!='']
                    print(race_data_1)
                    print(race_data_2)
                    # Clean-up: drop entries that are only a non-breaking space
                    # (rebuilds the list instead of removing items while iterating over it)
                    race_data_1 = [word for word in race_data_1 if word != '\xa0']
                    
                    if len(race_data_2) >= 3 and race_data_2[2].strip() == '':
                         print("No information for the first race, moving on to the next day")
                         break
                    if '-' in race_data_1:
                         print('The race could not be run, moving on to the next race')
                         continue

                    # For checking
                    #print(year_month_day,this_place,weather,race_title,race_sum,shiba,dart,distance,ryou,yayaomo,omo,huryou)

                    # The race conditions for the latest race are now complete

                    # Get each horse's URL
                    horse_info_cells = soup.find_all('td', class_='Horse_Info')
                    haraimodosi_info=soup.find('tr', class_='Tansho').get_text()
                    if '返還' in haraimodosi_info:
                        print('Race cancelled, moving on to the next day')
                        break
                    horse_url = [a['href'] for cell in horse_info_cells for a in cell.find_all('a', href=True)]
                    print(horse_url)
                    race_sum=len(horse_url)

                    if not (isinstance(race_sum, (int, float)) or (isinstance(race_sum, str) and race_sum.isdigit())):
                         continue

                    race_result_df=extract_race_result_table(url)
                    if race_result_df is None:
                        print('No race result data, skipping this race')
                        continue
                    race_info=extract_race_info(soup)
                    print(race_result_df)
                    print(f"Race Info: {race_info}")
                    
                    # ... existing code ...
                    
                    # Get each horse's last-five-races data
                    soup_2=web_access(url_shutuba)
                    past5_df = extract_past5_data(soup_2)
                    if past5_df is None:
                        print('No past 5 data found,skipping this race')
                        continue
                    # Fill empty cells with 0
                    past5_df.replace('',0,inplace=True)
                    print(past5_df.to_string(index=False))

                    print(past5_df)

                    # Process the cells in the columns '前走' through '5走'
                    race_classes = ['OP','重賞','Jpn1','Jpn2','Jpn3','G1','G2','G3','新馬','デビュー','未勝利','1勝クラス','2勝クラス','3勝クラス',r'A\d+',r'B\d+',r'C\d+',r'[ABC]\d+[一二三四五六七八九十]']
                    for col in ['前走','2走','3走','4走','5走']:
                        for line in range(past5_df.shape[0]):
                            past5_str = str(past5_df.loc[past5_df.index[line],col])
                            # Pad excluded (除) and pulled-up (中) runs with zeros
                            past5_str=normalize_spaces(past5_str)
                            past5_line = past5_str.split(' ')
                            if not len(past5_line)==1:
                                if (past5_line[2]=='除') or (past5_line[2]=='中'):
                                    if len(past5_line)==14:
                                        past5_line.insert(-1,'0')
                                        past5_line.insert(-4,'0')
                                        past5_line.insert(6,'0')
                                    elif len(past5_line)==13:
                                        past5_line.insert(-1,'0')
                                        past5_line.insert(-1,'0')
                                        past5_line.insert(-4,'0')
                                        past5_line.insert(6,'0')
                            # Cases:
                            # 1: no race grade at all; it is not in the title and was not extracted by scraping either
                            if not any(re.search(race_class, past5_str) for race_class in race_classes):
                                past5_str=normalize_spaces(past5_str)
                                past5_line = past5_str.split(' ')
                                past5_line.insert(4,'未格付')
                                past5_new_str=' '.join(past5_line)
                                past5_df.loc[past5_df.index[line],col] = past5_new_str
                            # 2: the race grade appears only in the race name
                            # 3: the race grade is already present in the race information
                            else:
                                if not (past5_str==0 or past5_str=='0'):
                                    past5_str=normalize_spaces(past5_str)
                                    past5_line = past5_str.split(' ')
                                    if not any(re.search(race_class, past5_line[4]) for race_class in race_classes):
                                        # True: the grade appears only in the race name
                                        # False: the grade is already present in the race information
                                        past5_line=race_class_plus(past5_line,race_classes)
                                        past5_new_str = ' '.join(past5_line)
                                        past5_df.loc[past5_df.index[line],col] = past5_new_str
                    # Turn the race information into a DataFrame with one row per runner
                    race_info_df=pd.DataFrame([race_info]*len(race_result_df),columns=race_info.keys())

                    # Join the race information and the race results
                    race_result_df=pd.concat([race_info_df,race_result_df],axis=1)

                    maerged_df=pd.merge(race_result_df,past5_df,left_on='馬番',right_on='馬番',how='inner')
                    print(maerged_df)

                    all_maerged_dfs.append(maerged_df)
                except Exception as e:
                    # Skip this race on any scraping or parsing error, but report why
                    print(f'Skipping race: {e}')
                    continue
final_df=pd.concat(all_maerged_dfs,ignore_index=True)

print(final_df)



final_df.to_csv('2018.csv',index=False,encoding='utf-8-sig')

print('Written to CSV file')
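
The grade detection appears twice in the script, once in `extract_race_class` and once inline in `extract_race_info`. As a rough sketch of how the two could share one table-driven helper, the code below keeps the same check order and labels as the script; `detect_race_class`, `KEYWORD_CLASSES` and `WIN_CLASSES` are hypothetical names, and `str.translate` stands in for `convert_number_to_kanji`.

```python
import re

# Maps each ASCII digit to its kanji numeral, like convert_number_to_kanji in the script
KANJI_DIGITS = str.maketrans('0123456789', '零一二三四五六七八九')

# (substring to look for, label to return), checked in the same order as the script
KEYWORD_CLASSES = [
    ('OP', 'OP'), ('重賞', '重賞'),
    ('Jpn1', 'JPN1'), ('Jpn2', 'JPN2'), ('Jpn3', 'JPN3'),
    ('G1', 'G1'), ('G2', 'G2'), ('G3', 'G3'),
    ('デビュー', '新馬'), ('新馬', '新馬'), ('未勝利', '未勝利'),
]
WIN_CLASSES = [
    ('1勝クラス', '一勝クラス'), ('2勝クラス', '二勝クラス'), ('3勝クラス', '三勝クラス'),
]


def detect_race_class(title, default='未格付'):
    """Return the grade label found in a race title, or default if none matches."""
    for keyword, label in KEYWORD_CLASSES:
        if keyword in title:
            return label
    # Class letter followed by digits, e.g. C2, returned with kanji digits
    match = re.search(r'([ABC])(\d+)', title)
    if match:
        return match.group(1) + match.group(2).translate(KANJI_DIGITS)
    for keyword, label in WIN_CLASSES:
        if keyword in title:
            return label
    return default
```

Passing `default='0'` reproduces the fallback of `extract_race_class`, while the default argument matches the '未格付' fallback used in `extract_race_info`.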

5. Machine learning

I will cover this in a separate article.
