Building a Horse Racing Prediction System from Scratch in Six Months — A Complete Record: A Self-Improving Scoring Engine Built with 17 Python Scripts and an Accumulated Database
Period: September 2025 – March 31, 2026 Author: yuji
Introduction
"Could I predict horse racing outcomes using data?"
That single question sparked a project that, over six months, grew into a system of 17 Python scripts, a 4-pipeline parallel architecture, and a place-bet hit rate of 51%.
This article is the complete record of that journey — from the design philosophy of the algorithms, to coefficient tuning based on real data, to the results of verifying all 36 races. Every step of development is documented here.
Note: This article describes the system's design, algorithms, and development methodology.
Always check the terms of service for any external services before accessing or using their data.
Project Summary
| Item | Value |
|---|---|
| Total scripts created | 25 (V1: 17 + V2 new/revised: 8) |
| horse_data revision count | 35 times (the origin of "v35") |
| jockey_data struggle count | 32 times (single day: Feb. 16) |
| Races verified | 36+ races (cumulative through Mar. 28–29) |
| Place-bet hit rate | 51% (stable benchmark) |
| Development environment transitions | Windows → Lubuntu → Ubuntu (i7) → dual-machine (i7 + Celeron) |
Development Environment and Background
Environment Transitions
September 2025 Fresh Lubuntu setup → basic web scraping research
January 2026 Clean install on Ubuntu/SSD → full-scale development begins
March 2026 Added Ubuntu to Celeron machine → dual setup: i7=production, Celeron=development
Serious coding began the week of January 4, 2026, right after the New Year holidays, with a clean Ubuntu install. The date recorded in the opening comment of horseID_reisugo_v1.py — [No1] 2026-01-23 — marks the official declaration that this project was here to stay.
Overall System Architecture
4-Pipeline Parallel Architecture
horse_racing_prediction_system/
├── 01_seiki/ # Official entry sheet pipeline (race day; post and horse numbers available)
├── 02_hiseiki/ # Unofficial entry sheet pipeline (pre-race; fundamentally different HTML structure)
├── 03_racekakutei/ # Post-race pipeline (reverse analysis from confirmed results)
├── 04_hikaku/ # 3-pipeline comparison and hit rate verification
├── 05_osushi/ # "Favorite bet" app (fully independent; reads v5 output)
├── 06_batch_yosou/ # Automated batch prediction generation
├── 07_batch_hikaku/ # Automated batch comparison generation
└── result_seiki/ # Auto-save prediction results as xlsx
Data Flow
Entry sheet URL (entered once)
│
▼
STEP 1: Fetch HTML and cache (once only)
├─ Extract horse IDs (10-digit) → horseID.xlsx (instruction sheet)
└─ Extract jockey IDs (5-digit) → jockeyID.xlsx (instruction sheet)
│
▼
STEP 1.5: Fetch last 5 race records → predict final 3F time and horse weight
│
▼
STEP 2: Fetch horse data ─────────────────────────┐
│ ↓ 3-stage logic │
│ _work.csv (current race data) │
│ ↓ merge and deduplicate │
└─ _database_accumulated.csv (accumulated DB) ◀─┘
│
▼
STEP 3: Fetch jockey data (same 3-stage logic)
│
▼
STEP 4: Load accumulated DB → calculate scores → output prediction
│
▼
Auto-save xlsx (result_seiki/)
Append to CSV (04_hikaku/yosou_hikaku_data.csv)
Establishing the 3-Stage Logic
The core design philosophy of the system. Established on February 20, 2026, in jockey_data_seiki_v3, and subsequently rolled out across all pipelines.
The Root Problem
The original "fetch immediately, then save" approach had a critical flaw: horses already in the accumulated DB were skipped during fetching, so they never entered the Work file, which left entries missing at score-calculation time.
Solution: 3-Stage Structure
# ① Fetch phase: collect only unregistered entries (skip those already in DB)
newly_fetched = []
for h_id in h_ids:
    if h_id in accum_ids:
        continue  # skip
    # fetch data...
    newly_fetched.append(data)

# ② Merge phase: combine new + existing, deduplicate
df_merged = pd.concat([df_existing, df_new], ignore_index=True)
df_merged = df_merged.drop_duplicates(
    subset=['HorseID', 'Date', 'Venue', 'R'], keep='last'
)
df_merged.to_csv(ACCUM_CSV, ...)  # save to accumulated DB

# ③ Extract phase: extract only today's entries into Work (guarantees full lineup)
df_work = df_merged[df_merged['HorseID'].isin(h_ids)].copy()
df_work.to_csv(WORK_CSV, ...)  # save to Work
This 3-stage structure guarantees that even skipped horses are always pulled from the accumulated DB into the Work file.
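A toy run of phases ② and ③ (made-up horse IDs, dates, and venues, not real data) shows why the guarantee holds: a horse skipped in phase ① because it already sits in the accumulated DB still reaches the Work file.

```python
import pandas as pd

# Hypothetical toy data: 'A' is already in the accumulated DB, 'C' is newly fetched.
df_existing = pd.DataFrame({
    'HorseID': ['A', 'B'],
    'Date': ['2026-03-01', '2026-03-01'],
    'Venue': ['Nakayama', 'Nakayama'],
    'R': [11, 11],
})
df_new = pd.DataFrame({
    'HorseID': ['C'],
    'Date': ['2026-03-28'],
    'Venue': ['Hanshin'],
    'R': [5],
})

# ② merge and deduplicate
df_merged = pd.concat([df_existing, df_new], ignore_index=True)
df_merged = df_merged.drop_duplicates(subset=['HorseID', 'Date', 'Venue', 'R'], keep='last')

# ③ extract today's lineup: 'A' was skipped in phase ①, yet still appears in Work
h_ids = ['A', 'C']
df_work = df_merged[df_merged['HorseID'].isin(h_ids)].copy()
print(sorted(df_work['HorseID']))  # ['A', 'C']
```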
Scoring Algorithm (v5)
Score Components
total = (
    base                             # finish-order score × 0.6 + upset score × 0.4
    + agari                          # final 3F time (with distance-range coefficients)
    + choushi                        # condition score (weight change and recent trend)
    + tekisei                        # track surface and distance aptitude
    + prize                          # prize money (log-transformed)
    + waku_b                         # post position bonus (distance-range aware)
    + pop_b                          # popularity adjustment
    + compat                         # jockey × horse combo affinity (reliability-weighted)
    + chakusa                        # margin score
    + j_win * 0.08 + j_fuku * 0.03   # jockey rank bonus
) * boost * track_f * class_f * baba_f
# boost:   jockey rank multiplier (S: 1.35 / A: 1.18 / B: 1.07 / C: 1.00)
# track_f: venue coefficient
# class_f: race class coefficient
# baba_f:  track condition coefficient
Key Coefficient Design
Venue Correction Coefficients (TRACK_FACTOR)
TRACK_FACTOR = {
    'Tokyo': 1.00, 'Nakayama': 1.00, 'Hanshin': 1.00, 'Kyoto': 1.00,
    'Chukyo': 0.95,
    'Kokura': 0.90, 'Fukushima': 0.90, 'Niigata': 0.90, 'Sapporo': 0.90, 'Hakodate': 0.90,
}
Design rationale: Real performance data changed the design — going 0 for 8 across two days at Kokura. Coefficients determined by actual data, not theory.
Race Class Reliability Coefficients (CLASS_FACTOR)
CLASS_FACTOR = {
    'G1': 0.98, 'G2': 0.99, 'G3': 1.00,  # Grade races are dampened
    'OP': 1.02, '3-win': 1.00,
    '2-win': 0.97, '1-win': 0.94,
    'Maiden': 0.92, 'Newcomer': 0.92,
}
Why Grade races are down-adjusted: The conclusion from 36-race verification was that "Grade races are volatile and prone to score inflation." Counterintuitive, but data-driven.
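To make the effect of the multiplier chain concrete, here is a minimal sketch using a subset of the published tables. The raw score of 10.0 and the venue/class combinations are arbitrary illustration values, not figures from the system.

```python
# Subsets of the coefficient tables above; values copied from the article.
TRACK_FACTOR = {'Tokyo': 1.00, 'Kokura': 0.90}
CLASS_FACTOR = {'G1': 0.98, 'Maiden': 0.92}
JOCKEY_BOOST = {'S': 1.35, 'A': 1.18, 'B': 1.07, 'C': 1.00}

def apply_multipliers(total: float, rank: str, venue: str, race_class: str,
                      baba_f: float = 1.00) -> float:
    """Multiplier chain from the v5 formula: boost * track_f * class_f * baba_f."""
    return total * JOCKEY_BOOST[rank] * TRACK_FACTOR[venue] * CLASS_FACTOR[race_class] * baba_f

# The same raw score of 10.0 diverges sharply depending on context:
print(apply_multipliers(10.0, 'S', 'Tokyo', 'G1'))       # ≈ 13.23
print(apply_multipliers(10.0, 'C', 'Kokura', 'Maiden'))  # ≈ 8.28
```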
Distance-Range Weight Coefficients for Final 3F (DIST_FACTOR)
DIST_FACTOR = {
    1200: 0.50,  # In sprints, final 3F has low differentiating power
    1600: 0.80,
    2000: 1.00,  # Baseline
    2400: 1.10,
    3000: 0.90,  # Long distance is highly pace-dependent
}
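The article does not say how distances between the listed bands are handled; one plausible sketch (my assumption, not the actual v5 code) is a nearest-band lookup:

```python
DIST_FACTOR = {1200: 0.50, 1600: 0.80, 2000: 1.00, 2400: 1.10, 3000: 0.90}

def dist_factor(distance_m: int) -> float:
    """Return the final-3F weight of the nearest defined distance band."""
    return DIST_FACTOR[min(DIST_FACTOR, key=lambda k: abs(k - distance_m))]

print(dist_factor(1700))  # 0.80 (closest band: 1600)
print(dist_factor(2600))  # 1.10 (closest band: 2400)
```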
The Double-Counting Bug in Base Score (and Its Fix)
The biggest bug fix in v5 was eliminating the "double-counting of the base score."
# ❌ Old design (v5 before No.10): finish order used twice → score inflates to ~40
base = (avg_ninki - avg_chakujun) + (6 - avg_chakujun)
#                  ↑ finish order counted twice ↑

# ✅ New design (No.10 onward): two-axis separation with upper bound → converges to ~9 max
chakujun_score = max(0.0, 6.0 - avg_chakujun)      # how good the finish was
upset_score = max(0.0, avg_ninki - avg_chakujun)   # outperforming popularity
base = chakujun_score * 0.6 + upset_score * 0.4
Impact of this fix:
- Top-pick win rate: 13.9% → 19.4% (+2 races)
- Complete misses: 11 → 6 (−5 races)
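The magnitude of the fix is easy to see with a worked example. Take a hypothetical horse averaging a 1.5 finish position at an average popularity of 8.0 (the input values here are invented for illustration):

```python
def base_score_old(avg_ninki: float, avg_chakujun: float) -> float:
    # finish order (avg_chakujun) enters the sum twice
    return (avg_ninki - avg_chakujun) + (6 - avg_chakujun)

def base_score_new(avg_ninki: float, avg_chakujun: float) -> float:
    chakujun_score = max(0.0, 6.0 - avg_chakujun)     # quality of the finish
    upset_score = max(0.0, avg_ninki - avg_chakujun)  # outperforming popularity
    return chakujun_score * 0.6 + upset_score * 0.4

print(base_score_old(8.0, 1.5))  # 11.0
print(base_score_new(8.0, 1.5))  # ≈ 5.3  (4.5 * 0.6 + 6.5 * 0.4)
```

The same horse contributes roughly half the base score after the fix, which is what reins in the inflated totals.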
Virtual 9-Race History: Self-Improving Design
A design that eliminates the need for paid race history services.
VIRTUAL_PAST_MAX = 9

def get_virtual_past9(horse_id: str, df_accum: pd.DataFrame) -> pd.DataFrame:
    """
    Retrieves up to 9 past records for the same horse ID from the accumulated DB,
    sorted by date descending.
    With each race run, more data accumulates in the DB — the system
    automatically improves over time: a "self-improving" design.
    """
    df_horse = df_accum[df_accum['HorseID'] == horse_id].copy()
    return df_horse.sort_values('Date', ascending=False).head(VIRTUAL_PAST_MAX)
Even with no data on the first run, the system operates — and with each subsequent run, a virtual 9-race history is automatically built up.
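Reproducing the function with a toy accumulated DB (hypothetical horse IDs and dates) shows the newest-first ordering, and how every additional race row deepens a horse's virtual history:

```python
import pandas as pd

VIRTUAL_PAST_MAX = 9

def get_virtual_past9(horse_id: str, df_accum: pd.DataFrame) -> pd.DataFrame:
    df_horse = df_accum[df_accum['HorseID'] == horse_id].copy()
    return df_horse.sort_values('Date', ascending=False).head(VIRTUAL_PAST_MAX)

# Toy accumulated DB: horse 'X001' has run three times, 'Y002' once.
df_accum = pd.DataFrame({
    'HorseID': ['X001', 'X001', 'X001', 'Y002'],
    'Date': ['2026-01-10', '2026-02-14', '2026-03-01', '2026-03-01'],
})
past = get_virtual_past9('X001', df_accum)
print(list(past['Date']))  # ['2026-03-01', '2026-02-14', '2026-01-10']
```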
The Wall Hit in the Unofficial Pipeline: Cell-by-Cell Rolling Parse
The HTML structure of the official and unofficial entry sheets is fundamentally different. The biggest obstacle was the discovery that jockey IDs do not exist in the HTML source of the unofficial version.
The Trial Log (February 16 — 32 Attempts in One Day)
Attempts 1–5: Analyzed mobile version → completely different structure, failed
Attempts 6–12: Identified class names in desktop version → frequent class name mismatches
Attempts 13–20: Used Selenium for dynamic rendering → data attributes absent
Attempts 21–28: Checkmark encoding issues, stable name contamination, wall after wall
Attempts 29–32: "Just scan every cell" → breakthrough via rolling cell parse
The breakthrough logic:
# BAD: look for a cell by a specific class name (breaks whenever the structure changes)
jockey_cell = row.find('td', class_='JockeyName')

# GOOD: scan all cells exhaustively and identify them by pattern matching (rolling method)
cells = [td.get_text(strip=True) for td in row.find_all(['td', 'th'])]
for i, text in enumerate(cells):
    if re.search(r'\d+\.\d+', text):  # pattern unique to jockeys
        # found: neighboring cells hold the jockey name
        ...
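As a runnable toy (the cell layout below is invented, not the real site's), the rolling idea reduces to scanning extracted cell texts for a signature pattern and taking a neighbor, with no class names involved:

```python
import re

# Invented cell texts for one table row; only the 'NN.N' weight-carried
# pattern is relied on, so class-name or column-order changes don't matter.
cells = ['3', '6', 'Some Horse', 'B4', '57.0', 'T.Jockey', 'Stable X']

jockey = None
for i, text in enumerate(cells):
    if re.fullmatch(r'\d{2}\.\d', text):  # weight carried, e.g. '57.0'
        jockey = cells[i + 1]             # neighboring cell holds the jockey
        break
print(jockey)  # T.Jockey
```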
Version numbers v35 (horse_data) and v3 (jockey_data) are not mere identifiers — they literally represent the number of actual revisions made in the terminal.
File-Header Comment History as a Development Method
Every script in this project carries a modification history in its opening comment block.
# ==============================================================================
# [FILE] seiki_keibayosou_app_v5.py
# [CREATED] 2026-02-20
# [VERSION] v5.3
# [CHANGE LOG]
# - [No.01] 2026-02-20: Initial creation. Based on the unofficial version's
# jockey rolling-parse logic; fully adapted to the
# seiki v35 accumulated DB structure.
# - [No.05] 2026-03-02: Refinements based on full-race result verification
# from Feb. 28 and Mar. 1.
# [Fix ①] Popularity adjustment coefficient: 0.4 → 0.12
# Reason: Popularity adjustment backfired repeatedly in
# Tulip Stakes and Jinnawa S.
# [Fix ②] Final 3F baseline: 36.0 → 38.0
# Reason: Turf standard is 34–35s. At 36.0, all horses scored
# too low to differentiate.
# [Fix ③] Added venue correction coefficient (TRACK_FACTOR)
# Reason: Went 0 for 8 at Kokura.
# ==============================================================================
Why Comments Instead of Git?
In a workflow of rapid terminal-based iteration, the log needs to be right where you can see it when you open the file. The design makes every file its own development journal.
[NG:] comments are intentionally kept — to prevent hitting the same wall twice.
The Verify-Tune Cycle
The philosophy established in V1 (through Mar. 9) — "real data changes the design" — deepened in V2 (Mar. 10 onward) into "reject theory with data."
Using the Post-Race Pipeline as a Testing Machine
The technique: without touching the production environment (seiki) at all, swap only the new correction logic into reisugo_keibayosou_app (the post-race pipeline) and run parallel comparisons on the same races.
Verification cycle:
① Predict with seiki_v5 (production)
② Implement new logic in reisugo (test machine)
③ Run 5+ parallel comparisons on the same races
④ Judge by data → adopt or reject
Rejected examples (March 29, 2026):
- "Venue experience correction (compat → per-venue experience weighting)" → inferior to v5 → rejected
- "Jockey boost S: 1.35 → 1.20 downgrade" → 3-place hits dropped from 7 to 5 → rejected; v5 maintained
The cycle of deciding design changes by data, not intuition, became embedded practice.
Full 36-Race Verification Results (Mar. 28–29)
All 36 races across Mar. 28 and Mar. 29 (Nakayama, Hanshin, Chukyo) were verified using v5 and yosou_hikaku_v3 in mode [3].
Results by Venue
| Venue | Races | Win Hit | Place Hit | Show Hit | Notes |
|---|---|---|---|---|---|
| Nakayama | 2 days × 12R | 2/12 (17%) | 9/12 (75%) | 8/12 (67%) | Most stable over 2 days. 75% place hit is remarkable. |
| Hanshin | 2 days × 12R | 2/12 (17%) | 7/12 (58%) | 7/12 (58%) | Won the Maika Cup and Hanshin 12R outright. |
| Chukyo | 2 days × 12R | 4/12 (33%) | 6/12 (50%) | 6/12 (50%) | Chukyo 12R: exacta paid 3000%. |
| Total | 36R | 8/36 (22%) | 22/36 (61%) | 21/36 (58%) | 61% place hit is the system's greatest strength. |
Identified Challenges
| Bet Type | Hit Rate | Assessment |
|---|---|---|
| Place (Fukusho) | 58% | ◎ System's core strength |
| Win (Tansho) | 22% | △ Close to random |
| Quinella (Umaren) | 8% | ✕ Combination explosion |
| Trifecta (3-Rentan) | <5% | ✕ Not recommended at this stage |
Conclusion: "Focus on place bets, and filter which races to play." Selective filtering emerged as the clear next step.
The "Favorite Bet" App (osushi_meiken) Philosophy
A fully independent design that never touches the v5 core.
v5 prediction result CSV (yosou_hikaku_data.csv)
↓ loaded by
osushi_meiken_v2.py
↓ narrows down "races worth betting" via 4-axis scoring
Outputs ranked "favorite bets"
4-Axis Score Design
- Index gap score — large gaps between top and second pick signal higher reliability
- Jockey score — favor races where S/A jockeys dominate the top positions
- Field size adjustment — larger fields lower hit probability
- Venue correction — based on empirical results: Nakayama +0.03, Chukyo −0.05, etc.
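A minimal sketch of how the four axes might combine is below. The 0.50 threshold and the venue adjustments mirror the article, but every other coefficient and name here is my assumption, not the actual osushi_meiken_v2 implementation.

```python
# Hypothetical 4-axis combination; weights are illustrative assumptions.
def tekichu_score(index_gap: float, sa_jockeys_top3: int,
                  field_size: int, venue_adj: float) -> float:
    gap_score = min(index_gap / 10.0, 0.30)       # axis 1: gap between top two picks
    jockey_score = 0.05 * sa_jockeys_top3         # axis 2: S/A jockeys near the top
    field_adj = -0.01 * max(0, field_size - 12)   # axis 3: big fields lower confidence
    return gap_score + jockey_score + field_adj + venue_adj  # axis 4: venue correction

score = tekichu_score(index_gap=3.5, sa_jockeys_top3=2, field_size=16, venue_adj=0.03)
print(round(score, 2))  # 0.39, below the 0.50 cut, so this race would be skipped
```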
Verification result: with MIN_TEKICHU_SCORE ≥ 0.50, all 3 races that cleared the filter hit (3/3 = 100%) out of 42 cumulative races (a small sample, but directionally valid)
Designs Born from Failure
DB Corruption Incident (March 8, 2026)
Intentionally emptying a DB file triggered an EmptyDataError crash. A combination that never occurs in normal usage — yet it surfaced a latent bug.
# After fix: added empty-file check
def load_horse_db() -> pd.DataFrame:
    if not os.path.exists(HORSE_ACCUM_CSV):
        return pd.DataFrame()
    if os.path.getsize(HORSE_ACCUM_CSV) == 0:  # ← added
        return pd.DataFrame()
    try:
        df = pd.read_csv(HORSE_ACCUM_CSV, dtype=str)
        ...
    except Exception:
        return pd.DataFrame()
Guarding Against DB Corruption from Wrong URL Input
Running the official-entry script with an unofficial URL causes all horse and post numbers to be recorded as 0, corrupting the DB. A pre-check was added in STEP 1.
# Horse/post number check (DB corruption guard)
entries_with_num = [e for e in entries if e['num'] > 0 and e['waku'] > 0]
valid_ratio = len(entries_with_num) / len(entries) if entries else 0
if valid_ratio < 0.8:
    print("❌ This does not appear to be an official entry sheet.")
    sys.exit(1)
Full Development Timeline
V1 Phase (Sep. 2025 – Mar. 9, 2026)
| Date | Milestone | Details |
|---|---|---|
| Sep. 2025 | Lubuntu setup | Basic scraping research; trial and error begins |
| Jan. 4, 2026~ | Migrated to Ubuntu/SSD | Full-scale development phase begins |
| Jan. 23, 2026 | Project continuation declaration | Official start point of serious development |
| Feb. 14, 2026 | Official horseID v1 → v2 | Conceived the night before; promoted to v2 next morning |
| Feb. 16, 2026 | 4 unofficial pipeline scripts created | 32 attempts on jockey_data in a single day |
| Feb. 20, 2026 | 3-stage logic established | Birth of v35 and v3; integration of all 4 scripts begins |
| Feb. 21, 2026 | Unofficial pipeline: unified one-shot mode | Enter URL once; fully automated |
| Feb. 22–23, 2026 | Verification app + post-race pipeline | All 4 pipelines running in parallel |
| Mar. 2–3, 2026 | Verification → coefficient tuning | Kokura shutout → TRACK_FACTOR introduced |
| Mar. 8–9, 2026 | Final polish across all pipelines | Production-ready quality achieved |
V2 Phase (Mar. 10 – Mar. 31, 2026)
| Date | Milestone | Details |
|---|---|---|
| Mar. 10~ | seiki_v5 hit rate improvement | Base score double-counting fixed → 19.4% achieved |
| Mar. 17 | Virtual 9-race history implemented | Self-improving design; paid history services no longer needed |
| Mar. 17 | osushi_meiken_v2 created | "Favorite bet" app; fully independent |
| Mar. 18 | Batch generation tools | make_yosou_batch and make_batch |
| Mar. 28–29 | Full 36-race verification | 51% place hit, 75% Nakayama place hit confirmed |
| Mar. 29 | Two hypotheses rejected | Venue experience correction and boost reduction → rejected by data |
| Mar. 30–31 | Dual-machine setup established | i7 for production, Celeron for development |
Summary: Evolution of Design Philosophy
| Phase | Design Philosophy |
|---|---|
| V1 Early | Getting it to run is the top priority. Record failures as failures and move on. |
| V1 Late | Real data, not theory, changes the design. |
| V2 Phase | Reject theory with data. Numbers over intuition. |
Over six months of development, the most important thing I held onto was never deleting failures.
Version number v35 stands for 35 actual revisions. The jockey_data comment log carries the marks of 32 attempts. TRACK_FACTOR was born from the painful lesson of going 0 for 8 at Kokura. All of these are assets that protect my future self from hitting the same walls again.
A 51% place-bet hit rate is not the goal — it's where I am now. Refining the filtering logic to select only the right races to play is the next step, already underway.
Appendix: Full Script List (25 Scripts)
V1 (through Mar. 9, 2026) — 17 Scripts
| Filename | Pipeline | Final Ver | Description |
|---|---|---|---|
| horseID_seiki_v2.py | Official | v2 | Horse ID extraction; race info stamping |
| horseID_hiseiki_v1.py | Unofficial | v1 | Red-link output support |
| jockeyID_seiki_v1.py | Official | v1 | Exhaustive rolling method |
| jockeyID_hiseiki_v1.py | Unofficial | v1 | 5-digit zero-fill formatting; dedup |
| horse_data_hiseiki_v1.py | Unofficial | v1 | Timeout handling; 3 retries |
| jockey_data_hiseiki_v1.py | Unofficial | v1.1 (No.32) | The crystallization of 32 attempts in one day |
| jockey_data_seiki_v3.py | Official | v3 (No.2) | The version that established 3-stage logic |
| horse_data_seiki_v35.py | Official | v35 (No.3) | The crystallization of 35 terminal revisions |
| hiseiki_keibayosou_app_v8.py | Unofficial | v8 (No.17) | 4-level jockey name fallback |
| seiki_keibayosou_app_v5.py | Official | v5 (No.09) | Official 4-script unified one-shot mode |
| horseID_reisugo_v1.py | Post-race | v1 (No.05) | Filters out noise from past winners |
| jockeyID_reisugo_v1.py | Post-race | v1 (No.02) | Supports both result.html formats |
| horse_data_reisugo_v1.py | Post-race | v1 (No.04) | 3-stage logic applied |
| jockey_data_reisugo_v1.py | Post-race | v1 (No.03) | Post-race accumulated DB standalone |
| reisugo_keibayosou_app_v1_hikaku.py | Post-race | v1 (No.03) | For post-race entry sheets |
| yosou_hikaku_v3.py | Verification | v3.15 (No.21) | 3-pipeline comparison; cumulative hit rate |
V2 (Mar. 10–31, 2026) — New and Major Revisions — 8 Scripts
| Filename | Pipeline | Final Ver | Description |
|---|---|---|---|
| seiki_keibayosou_app_v5.py | Official | v5 (No.18) | Base score double-count fix; virtual 9-race history; etc. |
| hiseiki_keibayosou_app_v9.py | Unofficial | v9 (No.22) | 10 scoring logic items ported from seiki_v5 No.10 |
| reisugo_keibayosou_app_v1_hikaku.py | Post-race | v5 (No.10) | 2 hypotheses rejected; v5 official confirmed |
| yosou_hikaku_v3.py | Verification | v3.20 (No.27) | Master xlsx full overwrite regeneration method |
| osushi_meiken_v2.py | Official | v2.0 (No.05) | [NEW] "Favorite bet" app; fully independent |
| osushi_meiken_hiseiki_v1.py | Unofficial | v1.0 (No.02) | [NEW] Unofficial-pipeline favorite bet |
| make_yosou_batch.py | All pipelines | v1 (No.01) | [NEW] Auto-generate prediction batches |
| make_batch.py | Verification | v1 (No.03) | [NEW] Auto-generate hikaku batches |
Author: yuji Development machines: Ubuntu 24.04 LTS (i7 for production / Celeron for development)