Building a Horse Racing Prediction System from Scratch in Six Months — A Complete Record: A Self-Improving Scoring Engine Built with 17 Python Scripts and an Accumulated Database

Period: September 2025 – March 31, 2026   Author: yuji


Introduction

"Could I predict horse racing outcomes using data?"

That single question sparked a project that, over six months, grew into a system of 17 Python scripts, a 4-pipeline parallel architecture, and a place-bet hit rate of 51%.

This article is the complete record of that journey — from the design philosophy of the algorithms, to coefficient tuning based on real data, to the results of verifying all 36 races. Every step of development is documented here.

Note: This article describes the system's design, algorithms, and development methodology.
Always check the terms of service for any external services before accessing or using their data.


Project Summary

| Item | Value |
|---|---|
| Total scripts created | 25 (V1: 17 + V2 new/revised: 8) |
| horse_data revision count | 35 times (the origin of "v35") |
| jockey_data struggle count | 32 times (single day: Feb. 16) |
| Races verified | 36+ races (cumulative through Mar. 28–29) |
| Place-bet hit rate | 51% (stable benchmark) |
| Development environment transitions | Windows → Lubuntu → Ubuntu (i7) → dual-machine (i7 + Celeron) |

Development Environment and Background

Environment Transitions

September 2025   Fresh Lubuntu setup → basic web scraping research
January 2026     Clean install on Ubuntu/SSD → full-scale development begins
March 2026       Added Ubuntu to Celeron machine → dual setup: i7=production, Celeron=development

Serious coding began the week of January 4, 2026, right after the New Year holidays, with a clean Ubuntu install. The date recorded in the opening comment of horseID_reisugo_v1.py — [No.1] 2026-01-23 — marks the official declaration that this project was here to stay.


Overall System Architecture

4-Pipeline Parallel Architecture

horse_racing_prediction_system/
├── 01_seiki/          # Official entry sheet pipeline (race day; post and horse numbers available)
├── 02_hiseiki/        # Unofficial entry sheet pipeline (pre-race; fundamentally different HTML structure)
├── 03_racekakutei/    # Post-race pipeline (reverse analysis from confirmed results)
├── 04_hikaku/         # 3-pipeline comparison and hit rate verification
├── 05_osushi/         # "Favorite bet" app (fully independent; reads v5 output)
├── 06_batch_yosou/    # Automated batch prediction generation
├── 07_batch_hikaku/   # Automated batch comparison generation
└── result_seiki/      # Auto-save prediction results as xlsx

Data Flow

Entry sheet URL (entered once)
    │
    ▼
STEP 1: Fetch HTML and cache (once only)
    ├─ Extract horse IDs (10-digit) → horseID.xlsx (instruction sheet)
    └─ Extract jockey IDs (5-digit) → jockeyID.xlsx (instruction sheet)
    │
    ▼
STEP 1.5: Fetch last 5 race records → predict final 3F time and horse weight
    │
    ▼
STEP 2: Fetch horse data ─────────────────────────┐
    │         ↓ 3-stage logic                      │
    │   _work.csv (current race data)              │
    │         ↓ merge and deduplicate              │
    └─ _database_accumulated.csv (accumulated DB) ◀─┘
    │
    ▼
STEP 3: Fetch jockey data (same 3-stage logic)
    │
    ▼
STEP 4: Load accumulated DB → calculate scores → output prediction
    │
    ▼
Auto-save xlsx (result_seiki/)
Append to CSV (04_hikaku/yosou_hikaku_data.csv)

Establishing the 3-Stage Logic

The core design philosophy of the system. Established on February 20, 2026, in jockey_data_seiki_v3, and subsequently rolled out across all pipelines.

The Root Problem

The original "fetch immediately, then save" approach had a critical flaw: horses already present in the DB were skipped during fetching, so they never entered the Work file, leaving holes in the lineup at score-calculation time.

Solution: 3-Stage Structure

# ① Fetch phase: collect only unregistered entries (skip those already in DB)
newly_fetched = []
for h_id in h_ids:
    if h_id in accum_ids:
        continue  # skip
    # fetch data...
    newly_fetched.append(data)

# ② Merge phase: combine new + existing, deduplicate
df_merged = pd.concat([df_existing, df_new], ignore_index=True)
df_merged = df_merged.drop_duplicates(
    subset=['HorseID', 'Date', 'Venue', 'R'], keep='last'
)
df_merged.to_csv(ACCUM_CSV, ...)  # save to accumulated DB

# ③ Extract phase: extract only today's entries into Work (guarantees full lineup)
df_work = df_merged[df_merged['HorseID'].isin(h_ids)].copy()
df_work.to_csv(WORK_CSV, ...)     # save to Work

This 3-stage structure guarantees that even skipped horses are always pulled from the accumulated DB into the Work file.
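To see the three stages end to end, here is a minimal self-contained sketch. The `fetch_horse` stand-in and the column names are illustrative only; the real scripts scrape live pages and carry many more columns.

```python
import pandas as pd

def fetch_horse(h_id):
    # Stand-in for the real scraper: returns one record per horse (dummy data)
    return {"HorseID": h_id, "Date": "2026-03-28", "Venue": "Nakayama", "R": 11}

def run_three_stage(h_ids, df_existing):
    # ① Fetch phase: only horses missing from the accumulated DB are fetched
    accum_ids = set(df_existing["HorseID"]) if not df_existing.empty else set()
    df_new = pd.DataFrame([fetch_horse(h) for h in h_ids if h not in accum_ids])

    # ② Merge phase: combine new + existing, deduplicate on the natural key
    df_merged = pd.concat([df_existing, df_new], ignore_index=True)
    df_merged = df_merged.drop_duplicates(
        subset=["HorseID", "Date", "Venue", "R"], keep="last"
    )

    # ③ Extract phase: today's full lineup always comes from the merged DB,
    #    so previously skipped horses are never missing from the Work file
    df_work = df_merged[df_merged["HorseID"].isin(h_ids)].copy()
    return df_merged, df_work
```

Running it with one horse already in the DB ("A") and one new horse ("B") yields a Work file containing both — the guarantee the 3-stage structure exists to provide.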


Scoring Algorithm (v5)

Score Components

total = (
    base        # finish-order score × 0.6 + upset score × 0.4
  + agari       # final 3F time (with distance-range coefficients)
  + choushi     # condition score (weight change and recent trend)
  + tekisei     # track surface and distance aptitude
  + prize       # prize money (log-transformed)
  + waku_b      # post position bonus (distance-range aware)
  + pop_b       # popularity adjustment
  + compat      # jockey × horse combo affinity (reliability-weighted)
  + chakusa     # margin score
  + j_win * 0.08 + j_fuku * 0.03  # jockey rank bonus
) * boost       # jockey rank multiplier (S:1.35 / A:1.18 / B:1.07 / C:1.00)
  * track_f     # venue coefficient
  * class_f     # race class coefficient
  * baba_f      # track condition coefficient
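The structure above — additive components plus a jockey bonus, then multiplicative coefficients — can be sketched as a function. The component values below are made up for illustration, not real v5 inputs.

```python
# Illustrative composition only: dummy component values, not real v5 inputs.
def total_score(parts, j_win, j_fuku, boost, track_f, class_f, baba_f):
    """Sum the additive components and jockey bonus, then apply the
    multiplicative venue / class / track-condition coefficients."""
    additive = sum(parts) + j_win * 0.08 + j_fuku * 0.03
    return additive * boost * track_f * class_f * baba_f

score = total_score(
    parts=[5.4, 1.2, 0.8, 0.5, 0.7, 0.3, -0.2, 0.4, 0.6],  # base .. chakusa
    j_win=12.0, j_fuku=30.0,   # hypothetical jockey win% / place%
    boost=1.18,                # A-rank jockey
    track_f=1.00, class_f=0.94, baba_f=1.00,
)
```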

Key Coefficient Design

Venue Correction Coefficients (TRACK_FACTOR)

TRACK_FACTOR = {
    'Tokyo': 1.00, 'Nakayama': 1.00, 'Hanshin': 1.00, 'Kyoto': 1.00,
    'Chukyo': 0.95,
    'Kokura': 0.90, 'Fukushima': 0.90, 'Niigata': 0.90, 'Sapporo': 0.90, 'Hakodate': 0.90,
}

Design rationale: Real performance data changed the design — going 0 for 8 across two days at Kokura. Coefficients determined by actual data, not theory.

Race Class Reliability Coefficients (CLASS_FACTOR)

CLASS_FACTOR = {
    'G1': 0.98, 'G2': 0.99, 'G3': 1.00,   # Grade races are dampened
    'OP': 1.02, '3-win': 1.00,
    '2-win': 0.97, '1-win': 0.94,
    'Maiden': 0.92, 'Newcomer': 0.92,
}

Why Grade races are down-adjusted: The conclusion from 36-race verification was that "Grade races are volatile and prone to score inflation." Counterintuitive, but data-driven.

Distance-Range Weight Coefficients for Final 3F (DIST_FACTOR)

DIST_FACTOR = {
    1200: 0.50,  # In sprints, final 3F has low differentiating power
    1600: 0.80,
    2000: 1.00,  # Baseline
    2400: 1.10,
    3000: 0.90,  # Long distance is highly pace-dependent
}
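Real race distances (1400 m, 1800 m, …) fall between these keys. One plausible lookup rule — an assumption on my part, since the article does not show the binning — is to snap to the nearest defined distance:

```python
# Assumed lookup rule (not stated in the article): snap the race distance to
# the nearest DIST_FACTOR key. The real v5 binning may well differ.
DIST_FACTOR = {1200: 0.50, 1600: 0.80, 2000: 1.00, 2400: 1.10, 3000: 0.90}

def dist_factor(distance_m: int) -> float:
    nearest = min(DIST_FACTOR, key=lambda k: abs(k - distance_m))
    return DIST_FACTOR[nearest]
```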

The Double-Counting Bug in Base Score (and Its Fix)

The biggest bug fix in v5 was eliminating the "double-counting of the base score."

# ❌ Old design (v5 before No.10): finish order used twice → score inflates to ~40
base = (avg_ninki - avg_chakujun) + (6 - avg_chakujun)
#                                      ↑ finish order counted twice

# ✅ New design (No.10 onward): two-axis separation with upper bound → converges to ~9 max
chakujun_score = max(0.0, 6.0 - avg_chakujun)       # how good the finish was
upset_score    = max(0.0, avg_ninki - avg_chakujun)  # outperforming popularity
base = chakujun_score * 0.6 + upset_score * 0.4

Impact of this fix:

  • Top-pick win rate: 13.9% → 19.4% (+2 races)
  • Complete misses: 11 → 6 (−5 races)
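To make the double-counting concrete, here is a minimal before/after comparison (the averages are illustrative, not real race data):

```python
def base_old(avg_ninki, avg_chakujun):
    # finish order appears in both terms, so it is effectively counted twice
    return (avg_ninki - avg_chakujun) + (6 - avg_chakujun)

def base_new(avg_ninki, avg_chakujun):
    chakujun_score = max(0.0, 6.0 - avg_chakujun)    # quality of finish
    upset_score = max(0.0, avg_ninki - avg_chakujun)  # beat the popularity
    return chakujun_score * 0.6 + upset_score * 0.4

# e.g. a horse averaging 2nd place at 5th popularity:
# old: (5 - 2) + (6 - 2) = 7.0    new: 4.0 * 0.6 + 3.0 * 0.4 = 3.6
```

In the old design, improving the average finish by one place moves the base by two points at once; the new design weights each axis separately and caps both at zero from below.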

Virtual 9-Race History: Self-Improving Design

A design that eliminates the need for paid race history services.

VIRTUAL_PAST_MAX = 9

def get_virtual_past9(horse_id: str, df_accum: pd.DataFrame) -> pd.DataFrame:
    """
    Retrieves up to 9 past records for the same horse ID from the accumulated DB,
    sorted by date descending.
    With each race run, more data accumulates in the DB — the system
    automatically improves over time: a "self-improving" design.
    """
    df_horse = df_accum[df_accum['HorseID'] == horse_id].copy()
    return df_horse.sort_values('Date', ascending=False).head(VIRTUAL_PAST_MAX)

Even with no data on the first run, the system operates — and with each subsequent run, a virtual 9-race history is automatically built up.
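A quick self-contained demonstration of that behavior, with the function re-declared so the snippet runs on its own (the toy DB below is invented):

```python
import pandas as pd

VIRTUAL_PAST_MAX = 9

def get_virtual_past9(horse_id: str, df_accum: pd.DataFrame) -> pd.DataFrame:
    df_horse = df_accum[df_accum["HorseID"] == horse_id].copy()
    return df_horse.sort_values("Date", ascending=False).head(VIRTUAL_PAST_MAX)

# Toy accumulated DB: 11 runs for horse "X", a single run for debutant "Y"
df_accum = pd.DataFrame({
    "HorseID": ["X"] * 11 + ["Y"],
    "Date": [f"2025-{m:02d}-01" for m in range(1, 12)] + ["2026-01-04"],
})
past = get_virtual_past9("X", df_accum)       # capped at 9 rows, newest first
first_run = get_virtual_past9("Z", df_accum)  # unknown horse → empty, no crash
```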


The Wall Hit in the Unofficial Pipeline: Cell-by-Cell Rolling Parse

The HTML structure of the official and unofficial entry sheets is fundamentally different. The biggest obstacle was the discovery that jockey IDs do not exist in the HTML source of the unofficial version.

The Trial Log (February 16 — 32 Attempts in One Day)

Attempts 1–5:   Analyzed mobile version → completely different structure, failed
Attempts 6–12:  Identified class names in desktop version → frequent class name mismatches
Attempts 13–20: Used Selenium for dynamic rendering → data attributes absent
Attempts 21–28: Checkmark encoding issues, stable name contamination, wall after wall
Attempts 29–32: "Just scan every cell" → breakthrough via rolling cell parse

The breakthrough logic:

# BAD: Look for cells by specific class name (breaks when structure changes)
jockey_cell = row.find('td', class_='JockeyName')

# GOOD: Scan all cells exhaustively and identify by pattern matching (rolling method)
for td in row.find_all(['td', 'th']):
    text = td.get_text(strip=True)
    if re.search(r'\d+\.\d+', text):  # pattern unique to the jockey column
        ...  # found
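The rolling idea in isolation, with no scraping library at all: the cell texts below are hypothetical, and the "jockey sits next to the carried-weight column" layout is an assumption for illustration.

```python
import re

# Hypothetical cell texts from one row of the unofficial entry-sheet table
row_cells = ["3", "5", "Some Horse", "B. 4", "57.0", "T. Tanaka", "458(+2)"]

def find_weight_index(cells):
    """Rolling method: ignore class names entirely; scan every cell and locate
    the carried-weight column by its digits-dot-digit pattern."""
    for i, text in enumerate(cells):
        if re.fullmatch(r"\d+\.\d", text):  # e.g. '57.0' carried weight
            return i
    return -1

i = find_weight_index(row_cells)
jockey = row_cells[i + 1] if i >= 0 else None  # assumed: jockey is the neighbour
```

Because the anchor is a content pattern rather than a class name, the same scan keeps working when the site reshuffles its markup.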

Version numbers v35 (horse_data) and v3 (jockey_data) are not mere identifiers — they literally represent the number of actual revisions made in the terminal.


File-Header Comment History as a Development Method

Every script in this project carries a modification history in its opening comment block.

# ==============================================================================
# [FILE]        seiki_keibayosou_app_v5.py
# [CREATED]     2026-02-20
# [VERSION]     v5.3
# [CHANGE LOG]
#   - [No.01] 2026-02-20: Initial creation. Based on the unofficial version's
#                         jockey rolling-parse logic; fully adapted to the
#                         seiki v35 accumulated DB structure.
#   - [No.05] 2026-03-02: Refinements based on full-race result verification
#                         from Feb. 28 and Mar. 1.
#             [Fix ①] Popularity adjustment coefficient: 0.4 → 0.12
#                Reason: Popularity adjustment backfired repeatedly in
#                        Tulip Stakes and Jinnawa S.
#             [Fix ②] Final 3F baseline: 36.0 → 38.0
#                Reason: Turf standard is 34–35s. At 36.0, all horses scored
#                        too low to differentiate.
#             [Fix ③] Added venue correction coefficient (TRACK_FACTOR)
#                Reason: Went 0 for 8 at Kokura.
# ==============================================================================

Why Comments Instead of Git?

In a workflow of rapid terminal-based iteration, the log needs to be right where you can see it when you open the file. The design makes every file its own development journal.

[NG:] comments are intentionally kept — to prevent hitting the same wall twice.


The Verify-Tune Cycle

The philosophy established in V1 (through Mar. 9) — "real data changes the design" — deepened in V2 (Mar. 10 onward) into "reject theory with data."

Using the Post-Race Pipeline as a Testing Machine

Without touching the production environment (seiki) at all, new correction logic is swapped into reisugo_keibayosou_app (the post-race pipeline) and run in parallel comparison against production on the same races.

Verification cycle:
①  Predict with seiki_v5 (production)
②  Implement new logic in reisugo (test machine)
③  Run 5+ parallel comparisons on the same races
④  Judge by data → adopt or reject

Rejected examples (March 29, 2026):

  • "Venue experience correction (compat → per-venue experience weighting)" → inferior to v5 → rejected
  • "Jockey boost S: 1.35 → 1.20 downgrade" → 3-place hits dropped from 7 to 5 → rejected; v5 maintained

The cycle of deciding design changes by data, not intuition, became embedded practice.


Full 36-Race Verification Results (Mar. 28–29)

All 36 races across Mar. 28 (Nakayama, Hanshin, Chukyo) and Mar. 29 were verified using v5 and yosou_hikaku_v3 in mode [3].

Results by Venue

| Venue | Races | Win Hit | Place Hit | Show Hit | Notes |
|---|---|---|---|---|---|
| Nakayama | 2 days × 12R | 2/12 (17%) | 9/12 (75%) | 8/12 (67%) | Most stable over 2 days; the 75% place hit is remarkable. |
| Hanshin | 2 days × 12R | 2/12 (17%) | 7/12 (58%) | 7/12 (58%) | Won the Maika Cup and Hanshin 12R outright. |
| Chukyo | 2 days × 12R | 4/12 (33%) | 6/12 (50%) | 6/12 (50%) | Chukyo 12R: exacta paid 3000%. |
| Total | 36R | 8/36 (22%) | 22/36 (61%) | 21/36 (58%) | The 61% place hit is the system's greatest strength. |

Identified Challenges

| Bet Type | Hit Rate | Assessment |
|---|---|---|
| Place (Fukusho) | 58% | ◎ System's core strength |
| Win (Tansho) | 22% | △ Close to random |
| Quinella (Umaren) | 8% | ✕ Combination explosion |
| Trifecta (3-Rentan) | <5% | ✕ Not recommended at this stage |

Conclusion: "Focus on place bets, and filter which races to play." Selective filtering emerged as the clear next step.


The "Favorite Bet" App (osushi_meiken) Philosophy

A fully independent design that never touches the v5 core.

v5 prediction result CSV (yosou_hikaku_data.csv)
    ↓ loaded by
osushi_meiken_v2.py
    ↓ narrows down "races worth betting" via 4-axis scoring
Outputs ranked "favorite bets"

4-Axis Score Design

  1. Index gap score — large gaps between top and second pick signal higher reliability
  2. Jockey score — favor races where S/A jockeys dominate the top positions
  3. Field size adjustment — larger fields lower hit probability
  4. Venue correction — based on empirical results: Nakayama +0.03, Chukyo −0.05, etc.

Verification result: MIN_TEKICHU_SCORE ≥ 0.50 → 3/3 = 100% hit rate across 42 cumulative races (small sample, but directionally valid)
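The four axes can be sketched as a single scoring function. Only the venue offsets (Nakayama +0.03, Chukyo −0.05) and the 0.50 threshold come from the article; the axis weights and formulas below are placeholder assumptions, not the real osushi_meiken logic.

```python
# Sketch of the 4-axis "favorite bet" filter; weights are assumptions.
VENUE_ADJ = {"Nakayama": 0.03, "Chukyo": -0.05}  # from the article
MIN_TEKICHU_SCORE = 0.50                          # from the article

def tekichu_score(top_idx, second_idx, sa_jockeys_top3, field_size, venue):
    gap = min((top_idx - second_idx) / max(top_idx, 1e-9), 1.0)  # axis 1: index gap
    jockey = sa_jockeys_top3 / 3.0                               # axis 2: S/A jockeys in top 3
    field = max(0.0, 1.0 - (field_size - 8) * 0.04)              # axis 3: field size penalty
    return 0.5 * gap + 0.3 * jockey + 0.2 * field + VENUE_ADJ.get(venue, 0.0)

def worth_betting(score: float) -> bool:
    return score >= MIN_TEKICHU_SCORE
```

A race with a wide index gap and S/A jockeys on top clears the threshold; a tight, crowded Chukyo race does not — which is exactly the kind of selectivity the app is for.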


Designs Born from Failure

DB Corruption Incident (March 8, 2026)

Intentionally emptying a DB file triggered an EmptyDataError crash. A combination that never occurs in normal usage — yet it surfaced a latent bug.

# After fix: added empty file check
def load_horse_db() -> pd.DataFrame:
    if not os.path.exists(HORSE_ACCUM_CSV):
        return pd.DataFrame()
    if os.path.getsize(HORSE_ACCUM_CSV) == 0:  # ← added
        return pd.DataFrame()
    try:
        df = pd.read_csv(HORSE_ACCUM_CSV, dtype=str)
        ...
    except Exception:
        return pd.DataFrame()

Guarding Against DB Corruption from Wrong URL Input

Running the official-entry script with an unofficial URL causes all horse and post numbers to be recorded as 0, corrupting the DB. A pre-check was added in STEP 1.

# Horse/post number check (DB corruption guard)
entries_with_num = [e for e in entries if e['num'] > 0 and e['waku'] > 0]
valid_ratio = len(entries_with_num) / len(entries) if entries else 0
if valid_ratio < 0.8:
    print("❌ This does not appear to be an official entry sheet.")
    sys.exit(1)

Full Development Timeline

V1 Phase (Sep. 2025 – Mar. 9, 2026)

| Date | Milestone | Details |
|---|---|---|
| Sep. 2025 | Lubuntu setup | Basic scraping research; trial and error begins |
| Jan. 4, 2026~ | Migrated to Ubuntu/SSD | Full-scale development phase begins |
| Jan. 23, 2026 | Project continuation declaration | Official start point of serious development |
| Feb. 14, 2026 | Official horseID v1 → v2 | Conceived the night before; promoted to v2 next morning |
| Feb. 16, 2026 | 4 unofficial pipeline scripts created | 32 attempts on jockey_data in a single day |
| Feb. 20, 2026 | 3-stage logic established | Birth of v35 and v3; integration of all 4 scripts begins |
| Feb. 21, 2026 | Unofficial pipeline: unified one-shot mode | Enter URL once; fully automated |
| Feb. 22–23, 2026 | Verification app + post-race pipeline | All 4 pipelines running in parallel |
| Mar. 2–3, 2026 | Verification → coefficient tuning | Kokura shutout → TRACK_FACTOR introduced |
| Mar. 8–9, 2026 | Final polish across all pipelines | Production-ready quality achieved |

V2 Phase (Mar. 10 – Mar. 31, 2026)

| Date | Milestone | Details |
|---|---|---|
| Mar. 10~ | seiki_v5 hit rate improvement | Base score double-counting fixed → 19.4% achieved |
| Mar. 17 | Virtual 9-race history implemented | Self-improving design; paid history services no longer needed |
| Mar. 17 | osushi_meiken_v2 created | "Favorite bet" app; fully independent |
| Mar. 18 | Batch generation tools | make_yosou_batch and make_batch |
| Mar. 28–29 | Full 36-race verification | 51% place hit, 75% Nakayama place hit confirmed |
| Mar. 29 | Two hypotheses rejected | Venue experience correction and boost reduction → rejected by data |
| Mar. 30–31 | Dual-machine setup established | i7 for production, Celeron for development |

Summary: Evolution of Design Philosophy

| Phase | Design Philosophy |
|---|---|
| V1 Early | Getting it to run is the top priority. Record failures as failures and move on. |
| V1 Late | Real data, not theory, changes the design. |
| V2 Phase | Reject theory with data. Numbers over intuition. |

Over six months of development, the most important thing I held onto was never deleting failures.

Version number v35 stands for 35 actual revisions. The jockey_data comment log carries the marks of 32 attempts. TRACK_FACTOR was born from the painful lesson of going 0 for 8 at Kokura. All of these are assets that protect my future self from hitting the same walls again.

A 51% place-bet hit rate is not the goal — it's where I am now. Refining the filtering logic to select only the right races to play is the next step, already underway.


Appendix: Full Script List (25 Scripts)

V1 (through Mar. 9, 2026) — 17 Scripts

| Filename | Pipeline | Final Ver | Description |
|---|---|---|---|
| horseID_seiki_v2.py | Official | v2 | Horse ID extraction; race info stamping |
| horseID_hiseiki_v1.py | Unofficial | v1 | Red-link output support |
| jockeyID_seiki_v1.py | Official | v1 | Exhaustive rolling method |
| jockeyID_hiseiki_v1.py | Unofficial | v1 | 5-digit zero-fill formatting; dedup |
| horse_data_hiseiki_v1.py | Unofficial | v1 | Timeout handling; 3 retries |
| jockey_data_hiseiki_v1.py | Unofficial | v1.1 (No.32) | The crystallization of 32 attempts in one day |
| jockey_data_seiki_v3.py | Official | v3 (No.2) | The version that established 3-stage logic |
| horse_data_seiki_v35.py | Official | v35 (No.3) | The crystallization of 35 terminal revisions |
| hiseiki_keibayosou_app_v8.py | Unofficial | v8 (No.17) | 4-level jockey name fallback |
| seiki_keibayosou_app_v5.py | Official | v5 (No.09) | Official 4-script unified one-shot mode |
| horseID_reisugo_v1.py | Post-race | v1 (No.05) | Filters out noise from past winners |
| jockeyID_reisugo_v1.py | Post-race | v1 (No.02) | Supports both result.html formats |
| horse_data_reisugo_v1.py | Post-race | v1 (No.04) | 3-stage logic applied |
| jockey_data_reisugo_v1.py | Post-race | v1 (No.03) | Post-race accumulated DB standalone |
| reisugo_keibayosou_app_v1_hikaku.py | Post-race | v1 (No.03) | For post-race entry sheets |
| yosou_hikaku_v3.py | Verification | v3.15 (No.21) | 3-pipeline comparison; cumulative hit rate |

V2 (Mar. 10–31, 2026) — New and Major Revisions — 8 Scripts

| Filename | Pipeline | Final Ver | Description |
|---|---|---|---|
| seiki_keibayosou_app_v5.py | Official | v5 (No.18) | Base score double-count fix; virtual 9-race history; etc. |
| hiseiki_keibayosou_app_v9.py | Unofficial | v9 (No.22) | 10 scoring logic items ported from seiki_v5 No.10 |
| reisugo_keibayosou_app_v1_hikaku.py | Post-race | v5 (No.10) | 2 hypotheses rejected; v5 official confirmed |
| yosou_hikaku_v3.py | Verification | v3.20 (No.27) | Master xlsx full overwrite regeneration method |
| osushi_meiken_v2.py | Official | v2.0 (No.05) | [NEW] "Favorite bet" app; fully independent |
| osushi_meiken_hiseiki_v1.py | Unofficial | v1.0 (No.02) | [NEW] Unofficial-pipeline favorite bet |
| make_yosou_batch.py | All pipelines | v1 (No.01) | [NEW] Auto-generate prediction batches |
| make_batch.py | Verification | v1 (No.03) | [NEW] Auto-generate hikaku batches |

Author: yuji  Development machines: Ubuntu 24.04 LTS (i7 for production / Celeron for development)
