【松尾研LLMコンペ】HLEスコア向上のための高難易度数学データ合成パイプライン構築録 [Matsuo Lab LLM Comp] A Report on Building a High-Difficulty Math Data Synthesis Pipeline for HLE Score Improvement

Posted at 2025-11-11

English follows Japanese.

1. 背景

本記事では、松尾研のLLM開発コンペ2025にて私が担当した、データ合成に取り組んだ際のアプローチや結果について紹介し、得られた知見を共有したいと思います。
所属チーム名:ねこ

2. 目的

我々のチームではHumanity's Last Exam(HLE)のスコアを基盤モデルから上げるために、SFT(教師ありファインチューニング)に利用するデータを合成するというミッションを担っていました(公開データセットを探すチームも別途存在)。

公開データセットだけに頼らなかった理由としては、

  • HLEは数学、化学等の学術分野ごとに難易度が非常に高い問題から成るベンチマークテストであるため、より難易度の高いデータセットを用意する必要があると考えたこと
  • 採用した基盤モデルと公開データセットの作成日付によっては既に公開データセットが利用されており、モデルに二重に与えることになってしまう(効果が薄れる、あるいは過学習する)恐れがあったこと

また、SFTで利用するために、問題とその正答だけでなく、CoT(Chain of Thought)の過程も含めて合成する必要がありました。これは、単なる「問題と答え」のペアだけを学習させる(正解を「暗記」させる)のではなく、「どのように考えてその答えに至ったか」という推論プロセス自体をモデルに学習させるためです。
HLEのような高難易度の問題では、答えを導くための中間的なステップや論理展開が非常に複雑になります。CoTをデータに含めることで、モデルは複雑な問題を解くための思考パターンや論理的なステップの踏み方を模倣して学習することができ、未知の応用問題に対する汎用的な推論能力の向上を期待できます。
今回は時間とリソースの関係で、HLEの中でも分野別で最も割合の高い数学のデータ合成にフォーカスしました。
(図:HLEにおける分野別の出題割合。数学が最も高い割合を占める)

3. 方法

全体像としては、問題生成→回答生成→CoT生成の3つのステップに分けられます。
利用したコードはチームねこのリポジトリのこちらのフォルダにまとめられています。

ステップ全体を通して、deepseek-r1-0528を、OpenRouterのAPI経由で利用しました。
これは無料枠が存在し、かつ我々が選んだ基盤モデル(Qwen3-235B-A22B)よりもHLEのスコアが高いホワイトリスト掲載の唯一のモデルだったためです。
一方で、この手法ではOpenRouterのAPI利用制限があり、かつどのステップでも1問あたりそれなりに時間を要したため、チーム内のメンバーにも協力いただきました。
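
参考までに、OpenRouter経由でモデルを呼び出す部分の最小スケッチを示します(エンドポイントURLやモデルIDはOpenRouterのOpenAI互換APIを想定した仮の値であり、実際に使用したコードそのものではありません)。

OpenRouter呼び出しスケッチ(仮)
# OpenRouter 経由で deepseek-r1-0528 を呼び出す最小スケッチ(説明用の仮実装)。
# エンドポイント・モデルIDは OpenAI 互換 API を想定した仮定値です。
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"  # 仮定
MODEL_ID = "deepseek/deepseek-r1-0528"  # 仮定: 実際のモデルIDは要確認

def call_llm(prompt: str, timeout: int = 600) -> str:
    """単一プロンプトを投げて応答テキストを返す(リトライ処理は省略)。"""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": MODEL_ID,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]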

3.1 問題生成

問題生成では、こちらの論文で紹介されていたMath Evol Instructという手法を、我々の問題に合わせて一部改変して利用しました。
簡潔に申し上げると、プロンプトを用いて元の問題をより難化させるという手法です。下記が実際に用いたプロンプトです。

Math Evol Instruct
You are an expert in creating complex mathematical problems. Your task is to rewrite the given instruction to make it more challenging. 

#Instruction#
{problem}

Follow these steps precisely.
Step 1: Understand the core concept and structure of the "#Instruction#". Identify the key elements such as variables, conditions, participants, actions, or processes that can be manipulated to increase complexity. Also, recognize the theme of the instruction and ensure it remains consistent throughout the evolution. 

Step 2: Formulate a comprehensive plan to increment the complexity of the "#Instruction#" based on the identified elements in Step 1. The plan should involve modifying or expanding at least three components from the list. It is crucial to ensure that all components in the instruction are logically interconnected and that the complexity increase is coherent and justified. The plan should avoid introducing variables or conditions without clear criteria for determining their values or without contributing to the overall complexity. In this step, consider adding more real-world constraints and dependencies between variables to make the problem more challenging. And you can also add more constraints, concretizing, increasing reasoning. 

Step 3: Implement the plan step by step to create the "#Rewritten Instruction#". Ensure the rewritten instruction maintains a logical sequence and avoids ambiguity or confusion. If additional variables or conditions are introduced, provide clear and unambiguous methods or criteria for determining their values. The "#Rewritten Instruction#" should not exceed the original "#Instruction#" by more than 30 words to ensure readability and comprehension.

Step 4: Review the "#Rewritten Instruction#" thoroughly to identify any unreasonable elements or inconsistencies. Make sure the "#Rewritten Instruction#" is a more complex version of the "#Instruction#". and that it accurately reflects the intended increase in complexity. Adjust any part of the instruction that may lead to misunderstanding or ambiguity, and provide the "#Finally Rewritten Instruction#" without any supplementary explanation.
Please reply strictly in the following format:

Step 1
#Elements Identified#:
...
Step 2
#Plan#:
...
Step 3
#Rewritten Instruction#:
...
Step 4
#Finally Rewritten Instruction#:
...

この手法を下記の公開データセットで得られた合計1177問の数学問題に対して適用しました。

  • 対象データセット: OlympiadBenchのSubset
    OE_OT_maths_en_COMP: 674問
    TP_TO_maths_en_COMP: 503問
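
イメージとしては、上記プロンプトの {problem} に元問題を埋め込んでモデルに投げ、応答から「#Finally Rewritten Instruction#」以降を抜き出す、という処理になります。下記はその説明用スケッチです(call_llm は「3. 方法」冒頭のスケッチの関数を想定しており、抽出の正規表現は仮のものです)。

問題進化スケッチ(仮)
# Math Evol Instruct プロンプトを元問題に適用し、最終的な書き換え結果を抽出する説明用スケッチ。
# call_llm は「3. 方法」冒頭のスケッチで定義した関数を想定(正規表現は仮)。
import re

def evolve_problem(problem: str, evol_prompt_template: str) -> str | None:
    response = call_llm(evol_prompt_template.replace("{problem}", problem))
    match = re.search(r"#Finally Rewritten Instruction#:\s*(.+)", response, re.DOTALL)
    if match is None:
        return None  # 「4.1」の "Extraction Failed" に相当
    return match.group(1).strip()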

3.2 回答生成

続いて回答生成ですが、得られた回答が本当に正しい回答であることを担保する必要があるため、こちらの論文で用いられていた堅牢な手法を一部改変して採用しました。
こちらの手法はLLMが一度に処理できる思考量の限界と、しばしば発生する論理的な誤りをカバーするため、生成と検証を複数回繰り返して、最終的に正しい(確率が高い)回答のみを正答として採用するというものです。
具体的には下記のループを繰り返し、Step3で3回連続検証をパスすると正答として受け入れる、という形にしました。

Step1: 初期解の生成(Solver)
Step2: 自己改善(Solver)
Step3: 検証とバグレポート(Verifier)
Step4: バグレポートに基づく修正(Solver)
※Solver, Verifierともに同じモデルを使用し、プロンプトで役割を指定しました。
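
このループをコードのイメージにすると概ね以下のようになります(solve / improve / verify / fix_with_report は各役割のプロンプトを call_llm に渡す関数という想定で、関数名・打ち切り回数は説明用の仮のものです)。

検証ループスケッチ(仮)
# 回答生成ループの概略スケッチ(関数名・打ち切り回数は説明用の仮定)。
MAX_ITERS = 10          # 仮定: 打ち切り回数
REQUIRED_PASSES = 3     # 3回連続で検証をパスしたら正答として採用

def generate_verified_answer(problem: str) -> str | None:
    solution = solve(problem)              # Step1: 初期解の生成 (Solver)
    solution = improve(problem, solution)  # Step2: 自己改善 (Solver)
    consecutive_passes = 0
    for _ in range(MAX_ITERS):
        passed, bug_report = verify(problem, solution)  # Step3: 検証とバグレポート (Verifier)
        if passed:
            consecutive_passes += 1
            if consecutive_passes >= REQUIRED_PASSES:
                return solution
        else:
            consecutive_passes = 0
            solution = fix_with_report(problem, solution, bug_report)  # Step4: 修正 (Solver)
    return None  # "NO SOLUTION FOUND" に相当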

3.3 CoT生成

最後にCoT生成ですが、こちらはプロンプトによる生成→添削という2ステップで行いました。
残された時間も僅かとなったため、プロンプトを自作しました。
数回試行錯誤する中で発見した問題(フォーマットに沿っていない、正答の抽出失敗、英語以外の言語の混入)に対処するため、最終的に下記のプロンプトを採用しました。

CoT Generation Prompt
INITIAL_PROMPT_TEMPLATE = """
You are a brilliant mathematician. Your task is to solve the following math problem step-by-step.
**You MUST conduct your entire thinking process and reasoning in English only.**
First, provide your entire thought process wrapped in <think> and </think> tags.
After the thought process, state the final answer, wrapping the final numerical result by LaTeX expression in $\\boxed{}$.

Please follow these steps within your thinking process:
1.  Analyze the problem and clarify the definitions of the terms used.
2.  List all the constraints of the problem.
3.  Devise a strategy to find the solution.
4.  Execute the strategy with logical reasoning and step-by-step calculations. Explain why each step is necessary.
5.  If there are any key insights or logical leaps that help narrow down the candidates, explain them in detail.
6.  Derive the final conclusion that leads to the answer.

Here is an example of the required format:
Question: What is 1 + 1?
Answer: <think>I need to calculate the sum of 1 and 1. This is a simple addition operation. The result of adding 1 and 1 is 2.</think>1 + 1 is $\\boxed{2}$.

---

Now, solve the following problem.

Question: 
"""

SELF_CORRECTION_PROMPT_TEMPLATE = """
You are a meticulous editor. Your task is to review and finalize the following draft solution.
Ensure the final output strictly follows these three rules:
1.  **Language:** The entire thinking process inside the `<think>`...`</think>` tags MUST be in English only. Remove any non-English characters (e.g., Chinese, Japanese).
2.  **CoT Tags:** The thinking process must be correctly enclosed in one set of `<think>` and `</think>` tags.
3.  **Answer Box:** The final numerical answer must be enclosed in `\\boxed{}`.

If the draft is already perfect, output it as is. If it needs corrections, provide the corrected, final version of the entire response.

### Draft Solution to Review:
"""

4. 結果

4.1 得られたデータ数(パイプラインの歩留まり)

各ステップでのデータ数の遷移(歩留まり)は以下のようになりました。

  1. 問題生成

    • 入力: 1177件
    • 生成成功: 1148件(29件が「Extraction Failed」)
  2. 回答生成

    • 入力: 1148件
    • 生成成功: 518件(601件が「NO SOLUTION FOUND」、29件が空値)
    • 歩留まり: 約45.1%
  3. CoT生成

    • 入力: 526件
    • 品質チェック通過: 21件
    • 歩留まり: 約4.0%

まず、問題生成に関しては数問を目視で確認した結果、確かに問題の難化(抽象度の上昇、制約条件の追加、異なる数学的アプローチの要求)が確認できました。
最後のCoT生成では、下記の観点で品質をチェックし、人力で修正可能な体裁の崩れは修正しました。
・回答生成の正答とCoTの正答が一致していること
・フォーマットに沿っていること(<think></think>タグで囲まれていること、答えが\boxed{}で囲まれていること)
・英語以外の他言語(特に中国語)が混入していないこと
・CoTが十分な長さを確保しており、回答までの過程に明らかに欠落や矛盾がないこと
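
このうちフォーマット面・言語面のチェックは、下記のようなスクリプトで一次フィルタできます(正規表現や判定方法は説明用の仮のもので、正答一致の判定は実際には数式の同値判定などがさらに必要です)。

品質チェックスケッチ(仮)
# CoT品質チェックのうち機械的に判定できる項目の一次フィルタ例(説明用の仮実装)。
import re

def passes_format_check(cot_text: str, reference_answer: str) -> bool:
    # <think>...</think> タグが1組だけ存在すること
    if len(re.findall(r"<think>", cot_text)) != 1 or len(re.findall(r"</think>", cot_text)) != 1:
        return False
    # \boxed{} に最終解が入っていること
    boxed = re.search(r"\\boxed\{(.+?)\}", cot_text)
    if boxed is None:
        return False
    # 中国語・日本語など非英語文字(CJK範囲)が混入していないこと
    if re.search(r"[\u3040-\u30ff\u4e00-\u9fff]", cot_text):
        return False
    # 回答生成ステップの正答と一致していること(単純な文字列比較に簡略化)
    return boxed.group(1).strip() == reference_answer.strip()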

最終的に、1177件の元データから合成できたデータセットは21問でした。
https://huggingface.co/datasets/neko-llm/HLE_RL_Olympiadbench-v2

4.2 各ステップの生成に要した時間

問題生成:1分~5分/1問
回答生成:15分~30分/1問
CoT生成:3分~5分/1問
(参考:もし1000問をこのパイプラインで生成しようとした場合、回答生成だけで最短でも 1000 * 15分 = 15000分 ≒ 250時間 が必要になる計算でした。)

5. 考察・振り返り

当初は1000問のデータセットを構築することを目標に置いておりましたが、結果的には僅か21件のデータセットを得るに留まりました。この経験から得られた考察と、再度同じ取り組みを行う際の改善点をまとめます。

5.1 データ合成は生成より検証コストが遥かに大きい

今回のパイプラインのボトルネックは、処理時間的にも、最終的なデータ数(歩留まり)的にも、圧倒的に「回答生成」と「CoT生成」の検証・品質担保のステップにありました。

回答生成(歩留まり 約45%)

  • 「NO SOLUTION FOUND」が多発した要因として、(1)Evol-Instructで生成した問題が、Solver役のdeepseek-r1-0528の能力を遥かに超える難易度になっていた、(2)1問の解答生成のために、3つのエージェントを同時に接続し、プロンプトも複数回投げるため、APIからの応答が不安定になるリスクがあった、という2点が挙げられます。
  • 特に(2)は本当に解けなかったのかAPIからの応答が不安定だったのかの切り分けからスタートし、リトライすることでいくらかカバーはできたものの、想定よりも本ステップに時間を要する結果となってしまいました。

CoT生成(歩留まり 約4%)

  • 品質チェックで大半のデータが失われました。特に致命的だったのは、「2段階(生成→添削)プロンプトの結果、最終的なCoTが『1段階目の回答をチェックする』内容になってしまい、元の問題に答えていない」というケースが100問以上発生したことです。
  • これは、CoT生成プロンプトの設計ミスであり、LLMに対して「何をすべきか」を明確に定義できていなかったことが原因です。私が数問で検証した際にCoTを具にチェックしていれば防げたミスでしたので、とても悔やまれます。

ここから得られるテイクアウェイは、「LLMは指示通りに生成はするが、その出力品質(フォーマット、論理性、タスクへの準拠)は全く保証されない」という現実です。特にHLEのような高難易度タスクでは、この品質のバラツキが顕著になったと考えられます。

現状のLLMを用いたデータ合成パイプラインでは、このバラツキを前提とし、各ステップでの厳密な自動検証と、最終的な目視チェックが不可欠であり、これを前提としたタイムラインを組む必要がある、という教訓を得られました。

5.2 もし再度取り組むなら

もし再度同様の取り組みを行うならば、以下の点について取り組みたいです。

1. パイプラインの改善(歩留まり向上)

回答生成

  • 検証方法の柔軟化: 「3回連続パス」のようなルールだけでなく、複数の回答を生成させて多数決(アンサンブル)で正答を判断する、あるいは異なるモデル間でクロスチェックさせる、といった手法も先行研究では利用されていたのですが、今回は試せなかったので試してみたいです。

CoT生成

  • プロンプトの高度化: 自作プロンプトを改善し、Few-shot learning(高品質なCoT例をプロンプトに含める)を導入することで、出力フォーマットと論理展開の質を安定させられる可能性が高いと考えております。
  • タスクの分離: 「正答(\boxed{})」と「CoT」を一度に生成させる他に、まず正答を与えた上で「この答えに至るCoTを生成せよ」と指示する手法も確認されたので、CoTの生成に失敗した、或いは回答が一致しなかったケースについては、本手法で一定程度カバーできると考えています。
  • CoT品質の自動評価: 生成されたCoTの論理性を、別のLLMにスコアリングさせ、一次的なフィルタリングを行うことで目視チェックの負担を減らす工夫も行えると考えております。

2. SFTへの影響検証

  • 今回生成した21件のデータは、数が少ないとはいえ、非常に困難なパイプラインを通過した高品質(あるいは超高難易度)なデータである可能性があります。
  • HLEスコア全体への影響は小さくとも、数学カテゴリ(或いはさらにその中の局所カテゴリ)において正答率がわずかに向上する、といった局所的な効果があったかもしれません。
  • まずは少量データでSFTを実行し、モデルのパフォーマンス変化を観察すること自体が、次のデータ合成戦略を立てる上で重要な知見になったはずであり、それを最初のプランに織り込めていなかった点、また結果的にプロジェクトのタイムラインと優先度の都合上、そこまで実行できなかった点が悔やまれます。次回は是非その点を試してみたいです。

6. 参考文献

Huang, Y., & Yang, L. F. (2025). Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2507.15855

Humanity’s Last Exam. (2025). SEAL Leaderboard. https://scale.com/leaderboard/humanitys_last_exam

‌Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., & Zhang, D. (2023, August 18). WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. ArXiv. https://doi.org/10.48550/arXiv.2308.09583

7. 謝辞

本プロジェクトではH100 GPU (80GB) 36基 × 約30日分という大規模な計算資源を利用させていただきました。
このような機会を無償で提供くださった松尾研、並びに国立研究開発法人新エネルギー・産業技術総合開発機構、SAKURA internet、他関係者の皆様には大変感謝しております。
また、全員が初対面でしたが、チームねこに所属して本PJに取り組めてよかったと感じております。結果は惜しくも決勝進出ならずだったことが尚のこと悔しいです。特にチームリーダーである本田さん(@Damin3927)はチームメンバーほぼ全員が初対面でバックグラウンドも異なり、かつオンライン環境の中、見事にチームの心理的安全性を確保しつつ、技術力と判断力に裏打ちされたリーダーシップを発揮してくださったと感じております。この場を借りて、感謝申し上げます。

※本プロジェクトは、国立研究開発法人新エネルギー・産業技術総合開発機構(以下「NEDO」)の「日本語版医療特化型LLMの社会実装に向けた安全性検証・実証」における基盤モデルの開発プロジェクトの一環として行われます。

1. Background

In this article, I would like to introduce the approach and results of my work on data synthesis for the Matsuo Lab's LLM Development Competition 2025, and share the insights I gained.
Team Name: Neko

2. Objective

Our team's mission was to synthesize data for Supervised Fine-Tuning (SFT) to improve the Humanity's Last Exam (HLE) score of our base model (a separate group of team members was in charge of finding public datasets).

The reasons we didn't rely solely on public datasets were:

  • HLE is a benchmark test consisting of highly difficult problems in academic fields such as mathematics and chemistry, so we believed it was necessary to prepare an even more difficult dataset.
  • Depending on the creation dates of our adopted base model and the public datasets, there was a risk that those datasets had already been used in its training, and feeding them again during SFT would be redundant (diminishing returns or overfitting).

Furthermore, for SFT, it was necessary to synthesize not only the problems and their correct answers but also the Chain of Thought (CoT) process. This is not just to make the model "memorize" the correct answers, but to teach the model the inference process itself: "how to arrive at that answer."
In highly difficult problems like those in HLE, the intermediate steps and logical flow to derive the answer are extremely complex. By including CoT in the data, the model can learn to imitate the thought patterns and logical steps required to solve complex problems, and we can expect an improvement in its general reasoning ability for unseen applied problems.
This time, due to time and resource constraints, we focused on data synthesis for mathematics, which has the highest proportion of problems by field within HLE.
(Figure: Breakdown of HLE problems by field; mathematics accounts for the largest share.)

3. Method

The overall process was divided into three steps: problem generation -> answer generation -> CoT generation.
The code used is summarized in this folder in the Team Neko repository.

Throughout all steps, we used deepseek-r1-0528 via the OpenRouter API.
This was because it had a free tier and was the only whitelisted model with a higher HLE score than our chosen base model (Qwen3-235B-A22B).
On the other hand, this approach was constrained by OpenRouter API usage limits, and each step took a considerable amount of time per problem, so other members of the team helped run the pipeline as well.

3.1 Problem Generation

For problem generation, we used a modified version of the Math Evol Instruct method introduced in this paper, adapting it partially to our specific problems.
In short, it is a method that uses prompts to make the original problems more difficult. The following is the prompt we actually used.

Math Evol Instruct
You are an expert in creating complex mathematical problems. Your task is to rewrite the given instruction to make it more challenging. 

#Instruction#
{problem}

Follow these steps precisely.
Step 1: Understand the core concept and structure of the "#Instruction#". Identify the key elements such as variables, conditions, participants, actions, or processes that can be manipulated to increase complexity. Also, recognize the theme of the instruction and ensure it remains consistent throughout the evolution. 

Step 2: Formulate a comprehensive plan to increment the complexity of the "#Instruction#" based on the identified elements in Step 1. The plan should involve modifying or expanding at least three components from the list. It is crucial to ensure that all components in the instruction are logically interconnected and that the complexity increase is coherent and justified. The plan should avoid introducing variables or conditions without clear criteria for determining their values or without contributing to the overall complexity. In this step, consider adding more real-world constraints and dependencies between variables to make the problem more challenging. And you can also add more constraints, concretizing, increasing reasoning. 

Step 3: Implement the plan step by step to create the "#Rewritten Instruction#". Ensure the rewritten instruction maintains a logical sequence and avoids ambiguity or confusion. If additional variables or conditions are introduced, provide clear and unambiguous methods or criteria for determining their values. The "#Rewritten Instruction#" should not exceed the original "#Instruction#" by more than 30 words to ensure readability and comprehension.

Step 4: Review the "#Rewritten Instruction#" thoroughly to identify any unreasonable elements or inconsistencies. Make sure the "#Rewritten Instruction#" is a more complex version of the "#Instruction#". and that it accurately reflects the intended increase in complexity. Adjust any part of the instruction that may lead to misunderstanding or ambiguity, and provide the "#Finally Rewritten Instruction#" without any supplementary explanation.
Please reply strictly in the following format:

Step 1
#Elements Identified#:
...
Step 2
#Plan#:
...
Step 3
#Rewritten Instruction#:
...
Step 4
#Finally Rewritten Instruction#:
...

We applied this method to a total of 1177 math problems obtained from the following public dataset:

  • Target Dataset: Subset of OlympiadBench
    • OE_OT_maths_en_COMP: 674 problems
    • TP_TO_maths_en_COMP: 503 problems

3.2 Answer Generation

Next is answer generation. Since we needed to ensure that the generated answers were truly correct, we adopted a partially modified version of the robust method used in this paper. This method covers the limits of an LLM's single-pass reasoning capacity and the logical errors that frequently occur. It involves repeating generation and verification multiple times, ultimately adopting only the answers that are (highly probable to be) correct. Specifically, we repeated the following loop, and an answer was accepted as correct only if it passed the verification (Step 3) 3 times consecutively.

Step1: Initial solution generation (Solver)
Step2: Self-improvement (Solver)
Step3: Verification and bug report (Verifier)
Step4: Correction based on bug report (Solver)
*We used the same model for both Solver and Verifier, specifying their roles via prompts.

3.3 CoT Generation

Finally, for CoT generation, we performed a two-step process: CoT generation by prompt -> correction. With little time remaining, I created my own prompt. To address problems discovered during several trial-and-error attempts (non-compliance with format, failure to extract the correct answer, inclusion of languages other than English), we finally adopted the following prompt.

CoT Generation Prompt
INITIAL_PROMPT_TEMPLATE = """
You are a brilliant mathematician. Your task is to solve the following math problem step-by-step.
**You MUST conduct your entire thinking process and reasoning in English only.**
First, provide your entire thought process wrapped in <think> and </think> tags.
After the thought process, state the final answer, wrapping the final numerical result by LaTeX expression in $\\boxed{}$.

Please follow these steps within your thinking process:
1.  Analyze the problem and clarify the definitions of the terms used.
2.  List all the constraints of the problem.
3.  Devise a strategy to find the solution.
4.  Execute the strategy with logical reasoning and step-by-step calculations. Explain why each step is necessary.
5.  If there are any key insights or logical leaps that help narrow down the candidates, explain them in detail.
6.  Derive the final conclusion that leads to the answer.

Here is an example of the required format:
Question: What is 1 + 1?
Answer: <think>I need to calculate the sum of 1 and 1. This is a simple addition operation. The result of adding 1 and 1 is 2.</think>1 + 1 is $\\boxed{2}$.

---

Now, solve the following problem.

Question: 
"""

SELF_CORRECTION_PROMPT_TEMPLATE = """
You are a meticulous editor. Your task is to review and finalize the following draft solution.
Ensure the final output strictly follows these three rules:
1.  **Language:** The entire thinking process inside the `<think>`...`</think>` tags MUST be in English only. Remove any non-English characters (e.g., Chinese, Japanese).
2.  **CoT Tags:** The thinking process must be correctly enclosed in one set of `<think>` and `</think>` tags.
3.  **Answer Box:** The final numerical answer must be enclosed in `\\boxed{}`.

If the draft is already perfect, output it as is. If it needs corrections, provide the corrected, final version of the entire response.

### Draft Solution to Review:
"""

4. Results

4.1 Number of Data Obtained (Pipeline Yield)

The data transition (yield) at each step was as follows.

1. Problem Generation

  • Input: 1177 items
  • Generation successful: 1148 items (29 failed with "Extraction Failed")

2. Answer Generation

  • Input: 1148 items
  • Generation successful: 518 items (601 "NO SOLUTION FOUND", 29 empty values)
  • Yield: Approx. 45.1%

3. CoT Generation

  • Input: 526 items
  • Passed quality check: 21 items
  • Yield: Approx. 4.0%

First, regarding problem generation, a visual inspection of a few problems confirmed that the problems had indeed become more difficult (increased abstraction, added constraints, required different mathematical approaches).
In the final CoT generation step, we checked the quality based on the following criteria and manually corrected any stylistic issues that could be fixed.

  • The correct answer from answer generation matches the correct answer in the CoT.
  • It adheres to the format (enclosed in <think></think> tags, answer enclosed in \boxed{}).
  • No other languages (especially Chinese) are mixed in.
  • The CoT is of sufficient length, and there are no obvious omissions or contradictions in the process leading to the answer.

In the end, the dataset we synthesized from the 1177 original data items consisted of 21 problems.
https://huggingface.co/datasets/neko-llm/HLE_RL_Olympiadbench-v2

4.2 Time Required for Each Step

  • Problem Generation: 1-5 minutes/problem
  • Answer Generation: 15-30 minutes/problem
  • CoT Generation: 3-5 minutes/problem

(Reference: If we tried to generate 1000 problems with this pipeline, answer generation alone would require a minimum of 1000 * 15 minutes = 15000 minutes ≈ 250 hours.)

5. Discussion & Retrospective

Although our initial goal was to build a dataset of 1000 problems, we ended up with only 21. This section summarizes the insights gained from this experience and improvements for any future attempts.

5.1 In Data Synthesis, Verification Costs Far More than Generation

The bottleneck in this pipeline, both in terms of processing time and final data count (yield), was overwhelmingly in the verification and quality assurance steps of "Answer Generation" and "CoT Generation".

Answer Generation (Yield: Approx. 45%)

  • The reasons for the high frequency of "NO SOLUTION FOUND" include: (1) The problems generated by Evol-Instruct were far beyond the capability of the deepseek-r1-0528 model acting as the Solver, and (2) To generate one answer, we had to connect three agents simultaneously and send multiple prompts, which increased the risk of unstable responses from the API.
  • Especially for (2), we had to start by distinguishing whether it truly couldn't be solved or if the API response was just unstable. Although retrying covered some of it, this step ended up taking much more time than expected.

CoT Generation (Yield: Approx. 4%)

  • Most of the data was lost during the quality check. The most fatal issue was that, as a result of the 2-step (generation -> correction) prompt, the final CoT ended up "checking the first-step answer" instead of directly answering the original problem. This occurred in over 100 cases.
  • This was a design flaw in the CoT generation prompt, caused by not clearly defining "what the LLM should do." It was a preventable mistake, and I deeply regret not checking the CoT in detail when I verified the first few problems.

The takeaway from this is the reality that "While LLMs will generate content as instructed, their output quality (format, logic, adherence to the task) is not guaranteed at all." This quality variance seems to have been particularly pronounced in high-difficulty tasks like HLE.

We learned a practical lesson: any current data synthesis pipeline using LLMs must assume this variance, incorporate strict automated verification at each step, and include a final manual check. The timeline must be planned with this in mind.

5.2 If I Were to Try Again

If I were to undertake a similar effort again, I would want to address the following points.

1. Pipeline Improvement (Yield Improvement)

Answer Generation

  • Flexible Verification Methods: Beyond rules like "3 consecutive passes," prior work also used techniques such as generating multiple answers and taking a majority vote (ensemble), or cross-checking answers between different models. We couldn't try those this time, so I'd like to attempt them (see the sketch below).
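
As a rough illustration, a majority-vote variant could look like the following (a hypothetical sketch on my part, not something we ran; solve and extract_boxed are assumed helpers that generate a full solution and pull out its final \boxed{} answer).

Majority-vote sketch (hypothetical)
# Hypothetical sketch of a majority-vote (ensemble) alternative to the
# "3 consecutive passes" rule; the sample count and acceptance rule are assumptions.
from collections import Counter

def majority_vote_answer(problem: str, n_samples: int = 5) -> str | None:
    # solve() and extract_boxed() are assumed helpers: generate a full solution,
    # then extract the final \boxed{} answer for comparison.
    answers = [extract_boxed(solve(problem)) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > n_samples // 2 else None  # require a strict majority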

CoT Generation

  • Prompt Advancement: By improving the custom prompt and introducing Few-shot learning (including high-quality CoT examples in the prompt), I believe there's a high probability we can stabilize the output format and logical flow.
  • Task Separation: Besides generating the "answer (\boxed{})" and the "CoT" in one pass, another known approach is to first provide the correct answer and then ask the LLM to "generate the CoT that leads to this answer." I believe this could cover a certain number of cases where CoT generation failed or the answers didn't match.
  • Automated CoT Quality Assessment: We could also try to reduce the burden of manual checking by having another LLM score the logical consistency of the generated CoT for primary filtering.
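
For that last point, a minimal LLM-as-judge filter might look like this (a hypothetical sketch; the rubric, scale, and threshold are assumptions, and call_llm is assumed to be a thin wrapper that sends one prompt to the OpenRouter API and returns the reply text).

CoT scoring sketch (hypothetical)
# Hypothetical LLM-as-judge sketch for primary filtering of generated CoTs.
# The rubric, scale, and threshold are assumptions.
JUDGE_PROMPT = """Rate the following chain-of-thought solution for logical consistency and
completeness on a scale of 1 to 10. Reply with the number only.

{cot}"""

def cot_quality_score(cot_text: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(cot=cot_text))
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0  # fall back to 0 if the judge reply is unparsable

def keep_for_manual_review(cot_text: str, threshold: int = 7) -> bool:
    # Keep only CoTs the judge scores at or above the threshold for manual review.
    return cot_quality_score(cot_text) >= threshold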

2. Verification of Impact on SFT

  • Although the 21 items we generated are few, they might be high-quality (or extremely high-difficulty) data, having passed through a very difficult pipeline.
  • Even if the impact on the overall HLE score is small, there might have been a localized effect, such as a slight improvement in the correct answer rate in the mathematics category (or even a sub-category within it).
  • Just running SFT even with this small amount of data and observing changes in the model's performance would have provided important insights for planning the next data synthesis strategy. It's regrettable that this wasn't woven into the initial plan, and that we ultimately couldn't execute it due to the project's timeline and priorities. I definitely want to try that next time.

6. References

Huang, Y., & Yang, L. F. (2025). Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2507.15855

Humanity’s Last Exam. (2025). SEAL Leaderboard. https://scale.com/leaderboard/humanitys_last_exam

‌Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., & Zhang, D. (2023, August 18). WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. ArXiv. https://doi.org/10.48550/arXiv.2308.09583

7. Acknowledgements

In this project, we were privileged to use massive computational resources: 36 H100 GPUs (80GB) for approximately 30 days.
I am extremely grateful to Matsuo Lab, the New Energy and Industrial Technology Development Organization (NEDO), SAKURA internet, and all other affiliated parties for providing this opportunity free of charge.
Furthermore, I feel truly glad to have been able to work as part of Team Neko, even though almost all of us were meeting for the first time. It is regrettable that we narrowly missed advancing to the finals.
I especially want to thank our team leader, Honda-san (@Damin3927). Despite most team members being strangers with different backgrounds and working in an online environment, he brilliantly fostered psychological safety within the team while demonstrating leadership backed by strong technical skill and judgment. I would like to take this opportunity to express my heartfelt thanks.

*This project is conducted as part of the foundation model development within the "Safety Verification and Demonstration toward the Social Implementation of a Japanese Medical-Specialized LLM" program of the New Energy and Industrial Technology Development Organization (NEDO).
