言語 / Language / 语言 / 언어: 日本語 | English | 中文 | 한국어
日本語
llive 完全解説 (8) — 「眼鏡を作る」: lleval — honest disclosure 5+1 因子分解で AI を評価する
コンセプト hook: AI を作るだけでは足りない. AI を見る眼鏡 が要る.
lleval は llive と並走する evaluation framework で, 「LLM が異常に
良い結果を出したら必ず内訳を疑う」という feedback_benchmark_honest_disclosure
ルールを コードの一級概念 に昇格させた. progressive size matrix で
stress curve を取り, judge rotation で position bias を消す.

結論を先に出すと: 「速い AI」ではなく「速いと思い込ませる構成」 を見抜く道具.
0. 連載中での位置づけ
#24-00 series index
#24-01 4 層メモリ
#24-02 思考因子 × COG-MESH
#24-03 構造進化 × TRIZ × Z3
#24-04 B-series
#24-05 EvolutionLoop
#24-06 LLM backend non-transformer
#24-07 observability + governance
#24-08 lleval — eval framework (← 本記事)
#24-07 が「何を残すか」(audit) だとすると, 本記事は「何を測るか」.
測定なしに改善はない.
1. lleval の出自 — honest disclosure 事件
事の発端は 2026-05-17 の benchmark. llive が他社 LLM API より 異常に速く
出た数字があった. 普通なら勝った気になるところを, ユーザーは「内訳を
疑え」と指示. 蓋を開けると:
- LLMBackend が attach されていなかった (mock で動いていた)
- chars 指標が不公平 (英語 token を文字数換算)
- subprocess RTT を除外 (起動コストを無視)
3 つの artifact が複合していた. これを記録 (feedback_benchmark_honest_disclosure)
してから, 「ベンチで異常結果が出たら必ず 5 つの artifact を疑う」を
外部化 したくなった. それが lleval.
2. 5+1 因子分解 — honest disclosure の構造化
lleval HonestDisclosureAnalyzer (2026-05-21 朝着地) は出力差分を 5+1 因子に
分解:
| 因子 | 意味 | 検出方法 |
|---|---|---|
| F1: prompt difference | 同 prompt が本当に同じか | 文字列 diff + token diff |
| F2: model id mismatch | model id が runtime と spec で一致か | runtime_metadata.model_id 比較 |
| F3: backend swap | LLMBackend が attach されているか | runtime hook で trace |
| F4: chars vs tokens | 評価指標が言語非依存か | tokenizer count |
| F5: RTT exclusion | subprocess / network RTT が時間に含まれるか | wall-clock vs CPU time |
| +1: env drift | 並走負荷 / OS schedule / thermal | 環境 fingerprint snapshot |
5+1 が すべて clean で初めて「数値は信頼できる」. 1 つでも怪しいと
honest disclosure note が結果に sticky される.
3. progressive size matrix — stress curve を取る
固定 token 数のベンチは情報量が低い. lleval は xs/s/m/l/xl の 5 段階 ×
複数 model の matrix を回す:
| model | xs (128) | s (512) | m (2k) | l (8k) | xl (32k) | 備考 |
|---|---|---|---|---|---|---|
| mock | 0.05 | 0.18 | 0.62 | 2.41 | 9.82 | |
| llive | 0.07 | 0.24 | 0.71 | 2.55 | 9.96 | 大差ない |
| gpt-4o | 0.31 | 0.52 | 1.20 | 3.40 | 11.2 | crossover at l |
これで「どのサイズで crossover が起きるか」が一目. 単一サイズで「勝った」
と言ってもサイズ違いでは負ける. fair.
4. judge rotation — position bias を消す
LLM-as-judge で 2 案 (A, B) を比較するとき, 順序が score に影響する
ことが知られている (Zheng et al. 2023). lleval は:
- (A, B) で 1 回 judge
- (B, A) で 1 回 judge
- 2 つの verdict が一致しないとき inconsistency flag
これは judge LLM 自身の bias を定量化する手段. inconsistency が 30% 超
なら judge LLM を切り替える運用 (judge rotation).
5. bridges/llive — llive Genome → ProviderSpec mapper
lleval は llive の派生個体 を直接食えるよう設計. bridges/llive.py
(2026-05-21 朝着地):
import lleval  # 下の lleval.run に必要
from llive.perf.evolutionary import Individual
from lleval.bridges.llive import individual_to_provider_spec
ind: Individual = ... # 派生集団から 1 個体
spec = individual_to_provider_spec(ind)
# spec.model_id, spec.temperature, spec.top_p, ... を ind.genome.values から復元
result = lleval.run(spec, dataset="qa_50")
これで「派生集団の進化 と 派生集団の評価」が ループする. llive 内の
EvolutionLoop fitness にそのまま渡せる.
6. honest disclosure (lleval 自身について)
メタにも honest disclosure を適用:
- lleval test 数 61: 本日 2026-05-21 時点. 上位フレームワーク (Promptfoo 本体) は数千 test を持つ. lleval は wrap であり置換ではない.
- 判定の絶対基準は無い: F1〜F5 + 環境 fingerprint が clean でも「ベンチが正しい」とは限らない. 「怪しいサイン」 を消した状態に過ぎない.
- judge rotation はコストがかかる: 2 倍呼び出すので credential 使用量も 2 倍. honest 検出のためのコスト.
- progressive matrix のサイズ等比は heuristic: 4x ずつ (128 → 512 → 2k → 8k → 32k) で取っているが, 真の crossover が 2k と 8k の間にある場合 解像度不足. 必要に応じ細密化.
- 環境 fingerprint は完璧ではない: Windows / Linux / macOS 間の thermal throttling 違いまでは捉えていない. 「ベンチを別 OS で取り直す」が最終手段.
7. 数字 (本日 2026-05-21 時点)
| 項目 | 値 |
|---|---|
| lleval test PASS | 61 |
| 着地 module | 13 (config / runner / analyzer / providers / bridges / report html+md / cli / ...) |
| 5+1 因子検出ロジック | 着地済 |
| progressive matrix runner | 着地済 |
| judge rotation | 着地済 |
| bridges/llive.py | 着地済 (skeleton) |
| v0.1.0a1 PyPI 公開準備 | (credential 復旧後) |
| 連載 #24 への登場 | 本記事 (#24-08) |
8. 期待値 — 次に来るもの
- v0.1.0a2 で promptfoo 実走 + llive Genome → ProviderSpec mapping 完成.
- v0.2 で judge rotation + position swap + Phoenix OpenInference trace.
- v1.0 で plugin marketplace + 商用 dual-license.
9. References
- Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Promptfoo OSS (https://github.com/promptfoo/promptfoo).
- Anthropic Eval framework (2023).
- 完全リストは v0.1.0 リリース時に references.bib に同梱予定.
10. 2026-05-22 追記 — 5+1 因子分解 と Rust 化 5 パターン判定表の方法論的共通点
lleval の honest disclosure 5+1 因子分解 (prompt diff / model id /
backend swap / chars vs tokens / RTT / env drift) と, 同日着地した
llive Rust 高速化の 5 パターン判定表 (#24-05 §13.3) は 構造的に同じ
発想 で書かれている.
| 共通する思想 | lleval 5+1 因子 | Rust 化 5 パターン |
|---|---|---|
| 「結果」を信じる前に 要素分解 | 速度差を 6 因子に分解 | 速度比を Python 経路の特性別 5 パターンに分類 |
| 異常結果は内訳を疑う | F1〜F5 + env を疑う | 単発 0.80x も x66.70 も「内訳」で説明できる |
| 観察が外部化されている | analyzer で自動検出 | 判定表 + bench script で自動測定 |
| honest disclosure を一級概念に | 数値に sticky note | judgment 表で どこが境界線か を明示 |
両者とも「「速い」「正しい」「正確」の単一仮定を捨てる」という
feedback_benchmark_honest_disclosure の延長線上にある. これは lleval が
AI を見るだけでなく AI / システム / アルゴリズム 全般 に展開できる
発想 = 連載 #24-08 のメタ的意義.
詳細: docs/perf_comparison/2026-05-22_kernel_implementation_comparison.md.
English
llive Complete Guide (8) — "Making the Glasses": lleval — evaluating AI via honest-disclosure 5+1 factor decomposition
Concept hook: Building AI is not enough. You need glasses to see the AI.
lleval is an evaluation framework that runs alongside llive. It promotes the
feedback_benchmark_honest_disclosure rule ("when an LLM produces an
abnormally good result, always doubt the breakdown") into a first-class
concept in code. It traces a stress curve with a progressive size matrix and
eliminates position bias with judge rotation.

The conclusion up front: it is a tool for spotting not the "fast AI" but the
"setup that makes you believe it is fast".
0. Position within the series
#24-00 series index
#24-01 4-layer memory
#24-02 thought factors × COG-MESH
#24-03 structural evolution × TRIZ × Z3
#24-04 B-series
#24-05 EvolutionLoop
#24-06 LLM backend non-transformer
#24-07 observability + governance
#24-08 lleval — eval framework (← this article)
If #24-07 was about "what to keep" (audit), this article is about "what to
measure". There is no improvement without measurement.
1. The origin of lleval — the honest-disclosure incident
It all started with a 2026-05-17 benchmark. One result showed llive running
abnormally faster than competing cloud LLM APIs. Where one would normally
feel like a winner, the user instead instructed: "doubt the breakdown". Once
we opened the lid, we found:
- The LLMBackend was not attached (it was running on a mock)
- The chars metric was unfair (counting English tokens as character counts)
- subprocess RTT was excluded (ignoring startup cost)
Three artifacts had compounded. After recording the incident as
feedback_benchmark_honest_disclosure, we wanted to externalize the rule
"when a benchmark produces an abnormal result, always doubt the 5 artifacts"
instead of keeping it as a habit. That became lleval.
2. The 5+1 factor decomposition — structuring honest disclosure
lleval's HonestDisclosureAnalyzer (landed the morning of 2026-05-21) decomposes
output deltas into 5+1 factors:
| Factor | Meaning | Detection method |
|---|---|---|
| F1: prompt difference | Whether the same prompt is truly the same | string diff + token diff |
| F2: model id mismatch | Whether model id matches between runtime and spec | compare runtime_metadata.model_id |
| F3: backend swap | Whether the LLMBackend is attached | trace via a runtime hook |
| F4: chars vs tokens | Whether the eval metric is language-independent | tokenizer count |
| F5: RTT exclusion | Whether subprocess / network RTT is included in the timing | wall-clock vs CPU time |
| +1: env drift | Concurrent load / OS schedule / thermal | environment fingerprint snapshot |
Only when all 5+1 factors are clean can we say "the numbers are trustworthy".
If even one is suspicious, an honest disclosure note is attached to the result
and stays with it.
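The "all clean or carry a note" rule can be sketched as a small data structure. This is a minimal illustration, not lleval's HonestDisclosureAnalyzer: the class name `DisclosureReport`, its fields, and the factor keys are hypothetical names chosen to mirror the factor list above.

```python
from dataclasses import dataclass, field

# Factor keys mirror the 5+1 factors; the names are hypothetical,
# not lleval's actual identifiers.
FACTORS = (
    "F1_prompt_difference",
    "F2_model_id_mismatch",
    "F3_backend_swap",
    "F4_chars_vs_tokens",
    "F5_rtt_exclusion",
    "env_drift",
)


@dataclass
class DisclosureReport:
    value: float  # the measured number, e.g. a latency
    suspicious: dict[str, str] = field(default_factory=dict)  # factor -> reason

    def flag(self, factor: str, reason: str) -> None:
        """Mark one factor as suspicious, with a human-readable reason."""
        if factor not in FACTORS:
            raise ValueError(f"unknown factor: {factor}")
        self.suspicious[factor] = reason

    @property
    def trusted(self) -> bool:
        # Trusted only when none of the 5+1 factors has been flagged.
        return not self.suspicious

    def notes(self) -> list[str]:
        # The notes travel with the result so the number is never shown bare.
        return [f"[honest disclosure] {f}: {r}" for f, r in self.suspicious.items()]


report = DisclosureReport(value=0.07)
report.flag("F3_backend_swap", "LLMBackend not attached; a mock was used")
print(report.trusted)
print(report.notes())
```

Keeping the reasons on the result object, rather than in a separate log, is one way to make the note "sticky": any report renderer that prints `value` can be required to print `notes()` next to it.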
3. The progressive size matrix — taking the stress curve
A benchmark run at a single fixed token count carries little information.
lleval runs a matrix of five size steps (xs/s/m/l/xl) × multiple models:
| model | xs (128) | s (512) | m (2k) | l (8k) | xl (32k) | note |
|---|---|---|---|---|---|---|
| mock | 0.05 | 0.18 | 0.62 | 2.41 | 9.82 | |
| llive | 0.07 | 0.24 | 0.71 | 2.55 | 9.96 | no big difference |
| gpt-4o | 0.31 | 0.52 | 1.20 | 3.40 | 11.2 | crossover at l |
This makes "at which size the crossover happens" visible at a glance. A "win"
claimed at a single size may turn into a loss at a different size, so the
matrix is the fairer comparison.
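A stress curve is easier to read as per-size ratios than as raw numbers. The sketch below uses the sample figures above and assumes they are timings where lower is better (the source does not state the unit). The function names are hypothetical, and "crossover" is defined strictly here as the first size at which the other model is at least as fast as the baseline.

```python
# Sample figures copied from the matrix above; unit not stated in the source.
SIZES = ["xs", "s", "m", "l", "xl"]
TIMINGS = {
    "llive": [0.07, 0.24, 0.71, 2.55, 9.96],
    "gpt-4o": [0.31, 0.52, 1.20, 3.40, 11.2],
}


def ratios(baseline, other):
    """Per-size ratio other/baseline; 1.0 means parity."""
    return [o / b for b, o in zip(baseline, other)]


def first_crossover(baseline, other):
    """First size label where `other` is at least as fast as `baseline`.

    Returns None when the curves never cross (assumes lower is better).
    """
    for label, b, o in zip(SIZES, baseline, other):
        if o <= b:
            return label
    return None


r = ratios(TIMINGS["llive"], TIMINGS["gpt-4o"])
for label, value in zip(SIZES, r):
    print(f"{label}: {value:.2f}x")
print("strict crossover:", first_crossover(TIMINGS["llive"], TIMINGS["gpt-4o"]))
```

Under this strict definition the sample figures show a gap that narrows with size (roughly 4.4x at xs down to roughly 1.1x at xl) rather than an actual crossing, so a real analysis would need to state which definition of crossover it uses.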
4. judge rotation — eliminating position bias
When an LLM-as-judge compares 2 options (A, B), the order in which they are
presented is known to affect the score (Zheng et al. 2023). lleval does:
- Judge once with (A, B)
- Judge once with (B, A)
- When the two verdicts disagree, raise an inconsistency flag
This is a means of quantifying the judge LLM's own bias. If the inconsistency
rate exceeds 30%, the operating rule is to switch to a different judge LLM
(judge rotation).
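The swap-and-compare procedure above can be sketched as follows. The `Judge` callable is a hypothetical stand-in for an LLM-as-judge call (it returns "first" or "second"), and the two toy judges exist only to exercise the logic. The 30% threshold is the one stated above.

```python
from typing import Callable, Optional

# A judge sees two answers in order and says which position wins.
# This signature is an assumption for illustration, not lleval's API.
Judge = Callable[[str, str], str]


def judge_pair(judge: Judge, a: str, b: str) -> tuple[Optional[str], bool]:
    """Judge (A, B) and (B, A); return (winner or None, consistent?)."""
    v1 = judge(a, b)  # here "first" means A wins
    v2 = judge(b, a)  # here "first" means B wins
    winner1 = "A" if v1 == "first" else "B"
    winner2 = "B" if v2 == "first" else "A"
    if winner1 == winner2:
        return winner1, True
    return None, False  # the verdict flipped with the order


def inconsistency_rate(judge: Judge, pairs: list[tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict changes when the order is swapped."""
    flags = [not judge_pair(judge, a, b)[1] for a, b in pairs]
    return sum(flags) / len(flags)


def should_rotate(rate: float, threshold: float = 0.30) -> bool:
    """Switch judges only when the rate exceeds the threshold."""
    return rate > threshold


def always_first(x: str, y: str) -> str:
    return "first"  # toy judge with pure position bias


def longer_wins(x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"  # order-independent toy judge


pairs = [("short", "a longer answer"), ("medium text", "x")]
print(inconsistency_rate(always_first, pairs), inconsistency_rate(longer_wins, pairs))
```

A judge that always prefers the first position is inconsistent on every pair, while a judge that decides on content alone is consistent on every pair, which is why the swap is a usable bias signal.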
5. bridges/llive — llive Genome → ProviderSpec mapper
lleval is designed to consume llive's derived individuals directly.
bridges/llive.py (landed the morning of 2026-05-21):
import lleval  # needed for lleval.run below
from llive.perf.evolutionary import Individual
from lleval.bridges.llive import individual_to_provider_spec
ind: Individual = ... # one individual from the derived population
spec = individual_to_provider_spec(ind)
# restore spec.model_id, spec.temperature, spec.top_p, ... from ind.genome.values
result = lleval.run(spec, dataset="qa_50")
This closes the loop between evolving the derived population and evaluating
it. The result can be fed directly into the EvolutionLoop fitness inside
llive.
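To make the mapping step concrete, here is a self-contained sketch of restoring a spec from genome values. It does not reproduce `individual_to_provider_spec`: `ProviderSpecSketch`, `genome_values_to_spec`, the assumption that genome values behave like a dict, and the clamping ranges (temperature 0 to 2, top_p 0 to 1) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderSpecSketch:
    """Hypothetical stand-in for a provider spec; not lleval's ProviderSpec."""
    model_id: str
    temperature: float = 0.0
    top_p: float = 1.0


def _clamp(value: float, low: float, high: float) -> float:
    return min(max(value, low), high)


def genome_values_to_spec(values: dict) -> ProviderSpecSketch:
    """Restore sampling parameters from a genome's value mapping.

    Out-of-range values produced by mutation are clamped (assumed ranges).
    A missing model_id is an error: guessing one would create exactly the
    model id mismatch that factor F2 is meant to catch.
    """
    if "model_id" not in values:
        raise KeyError("genome has no model_id; refusing to guess one")
    return ProviderSpecSketch(
        model_id=str(values["model_id"]),
        temperature=_clamp(float(values.get("temperature", 0.0)), 0.0, 2.0),
        top_p=_clamp(float(values.get("top_p", 1.0)), 0.0, 1.0),
    )


spec_sketch = genome_values_to_spec(
    {"model_id": "example-model", "temperature": 3.5, "top_p": -0.1}
)
print(spec_sketch)
```

Failing loudly on a missing model id, instead of substituting a default, keeps the bridge from silently introducing one of the artifacts the analyzer is supposed to detect.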
6. honest disclosure (about lleval itself)
Apply honest disclosure to the meta-tool as well:
- lleval has 61 tests, as of today, 2026-05-21. The upstream framework (Promptfoo itself) has thousands of tests. lleval is a wrapper, not a replacement.
- There is no absolute criterion for the verdict. Even if F1 to F5 and the environment fingerprint are all clean, that does not mean "the benchmark is correct". It only means the "suspicious signs" have been cleared.
- Judge rotation is costly. It makes twice as many judge calls, so credential usage doubles too. This is the cost paid for honest detection.
- The size ratio of the progressive matrix is a heuristic. Sizes step by 4x (128 → 512 → 2k → 8k → 32k), so if the true crossover lies between 2k and 8k, the resolution is insufficient. Refine the grid as needed.
- The environment fingerprint is not perfect. It does not go as far as capturing thermal-throttling differences across Windows / Linux / macOS. "Re-taking the benchmark on a different OS" is the last resort.
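The "refine as needed" step for the size grid can be sketched as geometric interpolation between two neighbouring sizes. This assumes 2k and 8k mean 2048 and 8192 tokens, which the source does not specify, and `refine_between` is a hypothetical helper name.

```python
def refine_between(lo: int, hi: int, steps: int) -> list[int]:
    """Return geometrically spaced sizes from lo to hi inclusive.

    `steps` is the number of intervals, so steps=4 between 2048 and 8192
    replaces one 4x jump with four jumps of about 1.41x each.
    """
    if lo <= 0 or hi <= lo or steps < 1:
        raise ValueError("need 0 < lo < hi and steps >= 1")
    ratio = (hi / lo) ** (1 / steps)
    return [round(lo * ratio ** i) for i in range(steps + 1)]


# Assumed token counts for the m and l steps of the matrix.
fine_grid = refine_between(2048, 8192, 4)
print(fine_grid)
```

Geometric rather than linear spacing keeps the same multiplicative step everywhere, which matches how the original 4x grid was laid out.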
7. The numbers (as of today, 2026-05-21)
| Item | Value |
|---|---|
| lleval test PASS | 61 |
| landed modules | 13 (config / runner / analyzer / providers / bridges / report html+md / cli / ...) |
| 5+1 factor detection logic | landed |
| progressive matrix runner | landed |
| judge rotation | landed |
| bridges/llive.py | landed (skeleton) |
| v0.1.0a1 PyPI publish prep | (after credential recovery) |
| Appearance in series #24 | this article (#24-08) |
8. Expectations — what comes next
- v0.1.0a2: real promptfoo runs + completing the llive Genome → ProviderSpec mapping.
- v0.2: judge rotation + position swap + Phoenix OpenInference trace.
- v1.0: plugin marketplace + commercial dual-license.
9. References
- Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Promptfoo OSS (https://github.com/promptfoo/promptfoo).
- Anthropic Eval framework (2023).
- The full list will be bundled in references.bib at the v0.1.0 release.
10. 2026-05-22 addendum — the methodological commonality between the 5+1 factor decomposition and the 5-pattern Rust-port decision table
lleval's honest-disclosure 5+1 factor decomposition (prompt diff / model id /
backend swap / chars vs tokens / RTT / env drift) and the llive Rust-speedup
5-pattern decision table (#24-05 §13.3), which landed the same day, follow
structurally the same idea.
| Shared thinking | lleval 5+1 factors | Rust-port 5 patterns |
|---|---|---|
| Decompose into elements before believing "the result" | decompose the speed delta into 6 factors | classify the speed ratio into 5 patterns by the characteristics of the Python path |
| Doubt the breakdown of an abnormal result | doubt F1–F5 + env | both a one-off 0.80x and x66.70 can be explained by the "breakdown" |
| The observation is externalized | auto-detected by the analyzer | auto-measured by the decision table + bench script |
| Honest disclosure as a first-class concept | sticky note on the numbers | the judgment table makes where the boundary line is explicit |
Both extend feedback_benchmark_honest_disclosure: discard the single
assumption that a result is "fast", "correct", or "accurate". The same idea
lets lleval's approach expand beyond looking at AI to systems and algorithms
in general, which is the meta-significance of series #24-08.
Details: docs/perf_comparison/2026-05-22_kernel_implementation_comparison.md.
Series Navigation
- ← Prev: llive Complete Guide (7) "AI with Built-in Review"
- All: llive Complete Guide (0) — series index
- repo: furuse-kazufumi/llive
中文
llive 完全解说 (8) — 「制作眼镜」: lleval — 用 honest disclosure 5+1 因子分解评估 AI
概念 hook: 只是造 AI 还不够. 还需要 看 AI 的眼镜.
lleval 是与 llive 并行的 evaluation framework, 它把
feedback_benchmark_honest_disclosure 规则 (「LLM 出现异常好的结果时必须怀疑内訳」)
提升为 代码中的一级概念. 用 progressive size matrix 取 stress 曲线,
用 judge rotation 消除 position bias.

先给结论: 一个看穿的不是 「快的 AI」 而是 「让你误以为快的构成」 的工具.
0. 在系列中的定位
#24-00 series index
#24-01 4 层记忆
#24-02 思考因子 × COG-MESH
#24-03 结构进化 × TRIZ × Z3
#24-04 B-series
#24-05 EvolutionLoop
#24-06 LLM backend non-transformer
#24-07 observability + governance
#24-08 lleval — eval framework (← 本文)
如果 #24-07 是关于「保留什么」(audit), 那么本文是关于「测量什么」.
没有测量就没有改进.
1. lleval 的由来 — honest disclosure 事件
事情起于 2026-05-17 的 benchmark. 当时有一个数字, llive 比竞争对手的 cloud LLM API
异常地快. 一般人会觉得自己赢了, 但用户却指示「怀疑内訳」. 揭开盖子后:
- LLMBackend 没有 attach (是在 mock 上跑的)
- chars 指标不公平 (把英语 token 当作字符数换算)
- 排除了 subprocess RTT (忽略了启动成本)
三个 artifact 复合在一起. 记录下来 (feedback_benchmark_honest_disclosure) 之后,
我们想把「基准出现异常结果时必须怀疑那 5 个 artifact」外部化. 那就是 lleval.
2. 5+1 因子分解 — honest disclosure 的结构化
lleval 的 HonestDisclosureAnalyzer (2026-05-21 上午落地) 把输出差异分解为 5+1 因子:
| 因子 | 含义 | 检测方法 |
|---|---|---|
| F1: prompt difference | 同一 prompt 是否真的相同 | 字符串 diff + token diff |
| F2: model id mismatch | model id 在 runtime 与 spec 间是否一致 | 比较 runtime_metadata.model_id |
| F3: backend swap | LLMBackend 是否已 attach | 用 runtime hook trace |
| F4: chars vs tokens | 评估指标是否语言无关 | tokenizer count |
| F5: RTT exclusion | subprocess / network RTT 是否计入时间 | wall-clock vs CPU time |
| +1: env drift | 并行负载 / OS schedule / thermal | 环境 fingerprint snapshot |
只有当 5+1 全部 clean 时才能「数值可信」. 只要有一个可疑,
就会有一条 honest disclosure note 被 sticky 到结果上.
3. progressive size matrix — 取 stress 曲线
固定 token 数的基准信息量太少. lleval 跑 xs/s/m/l/xl 5 阶 ×
多个 model 的 matrix:
| model | xs (128) | s (512) | m (2k) | l (8k) | xl (32k) | 备注 |
|---|---|---|---|---|---|---|
| mock | 0.05 | 0.18 | 0.62 | 2.41 | 9.82 | |
| llive | 0.07 | 0.24 | 0.71 | 2.55 | 9.96 | 差别不大 |
| gpt-4o | 0.31 | 0.52 | 1.20 | 3.40 | 11.2 | crossover at l |
这样「crossover 在哪个 size 发生」一目了然. 即使在单一 size 上说「赢了」,
换个 size 就会输. fair.
4. judge rotation — 消除 position bias
用 LLM-as-judge 比较 2 个候选 (A, B) 时, 已知顺序会影响得分 (Zheng et al. 2023).
lleval 这样做:
- 用 (A, B) judge 1 次
- 用 (B, A) judge 1 次
- 两个 verdict 不一致时, 触发 inconsistency flag
这是把 judge LLM 自身的 bias 量化的手段. inconsistency 超过 30%
就切换 judge LLM (judge rotation).
5. bridges/llive — llive Genome → ProviderSpec mapper
lleval 设计为可直接吃 llive 的派生个体. bridges/llive.py
(2026-05-21 上午落地):
import lleval  # 下面的 lleval.run 需要
from llive.perf.evolutionary import Individual
from lleval.bridges.llive import individual_to_provider_spec
ind: Individual = ... # 从派生群体取 1 个体
spec = individual_to_provider_spec(ind)
# 从 ind.genome.values 复原 spec.model_id, spec.temperature, spec.top_p, ...
result = lleval.run(spec, dataset="qa_50")
这样「派生群体的进化 与 派生群体的评估」就形成 loop. 可以直接喂给 llive 内的
EvolutionLoop fitness.
6. honest disclosure (关于 lleval 本身)
也对元工具自身应用 honest disclosure:
- lleval test 数 61: 截至今日 2026-05-21. 上游框架 (Promptfoo 本体) 拥有数千 test. lleval 是 wrap, 不是替换.
- 判定没有绝对基准: 即使 F1〜F5 + 环境 fingerprint 都 clean, 也不代表「基准是正确的」. 只不过是把「可疑的迹象」消掉的状态而已.
- judge rotation 有成本: 调用 2 倍, 所以 credential 使用量也 2 倍. 为 honest 检测付出的成本.
- progressive matrix 的 size 等比是 heuristic: 按 4x 取 (128 → 512 → 2k → 8k → 32k), 但若真正的 crossover 在 2k 与 8k 之间, 则分辨率不足. 视需要细化.
- 环境 fingerprint 并不完美: 它连 Windows / Linux / macOS 之间的 thermal throttling 差异都没捕捉到. 「在另一个 OS 上重取基准」是最终手段.
7. 数字 (截至今日 2026-05-21)
| 项目 | 值 |
|---|---|
| lleval test PASS | 61 |
| 着地 module | 13 (config / runner / analyzer / providers / bridges / report html+md / cli / ...) |
| 5+1 因子检测逻辑 | 已着地 |
| progressive matrix runner | 已着地 |
| judge rotation | 已着地 |
| bridges/llive.py | 已着地 (skeleton) |
| v0.1.0a1 PyPI 公开准备 | (credential 复原后) |
| 在系列 #24 中的登场 | 本文 (#24-08) |
8. 期望值 — 接下来要做的
- v0.1.0a2: promptfoo 实跑 + 完成 llive Genome → ProviderSpec mapping.
- v0.2: judge rotation + position swap + Phoenix OpenInference trace.
- v1.0: plugin marketplace + 商用 dual-license.
9. References
- Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Promptfoo OSS (https://github.com/promptfoo/promptfoo).
- Anthropic Eval framework (2023).
- 完整列表将在 v0.1.0 发布时随 references.bib 一同提供.
10. 2026-05-22 追记 — 5+1 因子分解与 Rust 化 5 模式判定表的方法论共性
lleval 的 honest disclosure 5+1 因子分解 (prompt diff / model id /
backend swap / chars vs tokens / RTT / env drift) 与同日着地的
llive Rust 高速化 5 模式判定表 (#24-05 §13.3) 是用 结构上相同的发想 写就的.
| 共通的思想 | lleval 5+1 因子 | Rust 化 5 模式 |
|---|---|---|
| 在相信「结果」之前 要素分解 | 把速度差分解为 6 因子 | 把速度比按 Python 经路的特性分为 5 模式 |
| 异常结果就怀疑内訳 | 怀疑 F1〜F5 + env | 单发 0.80x 也好 x66.70 也好都能用「内訳」解释 |
| 观察被外部化 | 用 analyzer 自动检测 | 用判定表 + bench script 自动测量 |
| 把 honest disclosure 当一级概念 | 给数值贴 sticky note | 用 judgment 表明示 边界线在哪里 |
两者都处在「抛弃「快」「对」「准」的单一假设」这一
feedback_benchmark_honest_disclosure 的延长线上. 这是 lleval 不仅能看 AI,
还能展开到 AI / 系统 / 算法 全般 的发想 = 连载 #24-08 的元意义.
详情: docs/perf_comparison/2026-05-22_kernel_implementation_comparison.md.
Series Navigation
- ← 上一篇: llive 完全解说 (7) 「带审查的 AI」
- 全部: llive 完全解说 (0) — series index
- repo: furuse-kazufumi/llive
한국어
llive 완전 해설 (8) — 「안경을 만든다」: lleval — honest disclosure 5+1 인자 분해로 AI 를 평가한다
콘셉트 hook: AI 를 만드는 것만으로는 부족하다. AI 를 보는 안경 이 필요하다.
lleval 은 llive 와 병주하는 evaluation framework 로,
「LLM 이 이상하게 좋은 결과를 내면 반드시 내역을 의심한다」는
feedback_benchmark_honest_disclosure 규칙을 코드의 일급 개념 으로 승격시켰다.
progressive size matrix 로 stress curve 를 취하고, judge rotation 으로 position bias 를 없앤다.

결론을 먼저 말하면: 「빠른 AI」 가 아니라 「빠르다고 착각하게 만드는 구성」 을 간파하는 도구.
0. 연재에서의 위치
#24-00 series index
#24-01 4층 메모리
#24-02 사고 인자 × COG-MESH
#24-03 구조 진화 × TRIZ × Z3
#24-04 B-series
#24-05 EvolutionLoop
#24-06 LLM backend non-transformer
#24-07 observability + governance
#24-08 lleval — eval framework (← 본 글)
#24-07 이 「무엇을 남길 것인가」(audit) 라면, 본 글은 「무엇을 측정할 것인가」.
측정 없이는 개선이 없다.
1. lleval 의 출자 — honest disclosure 사건
발단은 2026-05-17 의 benchmark. llive 가 타사 LLM API 보다 이상하게 빠르게
나온 숫자가 있었다. 보통이라면 이긴 기분이 들 대목에서, 사용자는 「내역을
의심하라」고 지시했다. 뚜껑을 열어보니:
- LLMBackend 가 attach 되어 있지 않았다 (mock 으로 돌고 있었다)
- chars 지표가 불공평 (영어 token 을 글자 수로 환산)
- subprocess RTT 를 제외 (기동 비용을 무시)
세 가지 artifact 가 복합되어 있었다. 이것을 기록 (feedback_benchmark_honest_disclosure)
한 뒤, 「벤치에서 이상 결과가 나오면 반드시 5 가지 artifact 를 의심한다」를
외부화 하고 싶어졌다. 그것이 lleval.
2. 5+1 인자 분해 — honest disclosure 의 구조화
lleval 의 HonestDisclosureAnalyzer (2026-05-21 아침 착지) 는 출력 차분을 5+1 인자로 분해:
| 인자 | 의미 | 검출 방법 |
|---|---|---|
| F1: prompt difference | 같은 prompt 가 정말 같은가 | 문자열 diff + token diff |
| F2: model id mismatch | model id 가 runtime 과 spec 에서 일치하는가 | runtime_metadata.model_id 비교 |
| F3: backend swap | LLMBackend 가 attach 되어 있는가 | runtime hook 으로 trace |
| F4: chars vs tokens | 평가 지표가 언어 비의존인가 | tokenizer count |
| F5: RTT exclusion | subprocess / network RTT 가 시간에 포함되는가 | wall-clock vs CPU time |
| +1: env drift | 병주 부하 / OS schedule / thermal | 환경 fingerprint snapshot |
5+1 이 모두 clean 일 때 비로소 「수치는 신뢰할 수 있다」. 하나라도 의심스러우면
honest disclosure note 가 결과에 sticky 된다.
3. progressive size matrix — stress curve 를 취한다
고정 token 수의 벤치는 정보량이 낮다. lleval 은 xs/s/m/l/xl 의 5 단계 ×
여러 model 의 matrix 를 돌린다:
| model | xs (128) | s (512) | m (2k) | l (8k) | xl (32k) | 비고 |
|---|---|---|---|---|---|---|
| mock | 0.05 | 0.18 | 0.62 | 2.41 | 9.82 | |
| llive | 0.07 | 0.24 | 0.71 | 2.55 | 9.96 | 큰 차이 없음 |
| gpt-4o | 0.31 | 0.52 | 1.20 | 3.40 | 11.2 | crossover at l |
이로써 「어느 size 에서 crossover 가 일어나는가」가 한눈에. 단일 size 에서 「이겼다」
고 해도 다른 size 에서는 진다. fair.
4. judge rotation — position bias 를 없앤다
LLM-as-judge 로 2 안 (A, B) 을 비교할 때, 순서가 score 에 영향을 준다는 것이
알려져 있다 (Zheng et al. 2023). lleval 은:
- (A, B) 로 1 회 judge
- (B, A) 로 1 회 judge
- 두 verdict 가 일치하지 않을 때 inconsistency flag
이것은 judge LLM 자신의 bias 를 정량화하는 수단. inconsistency 가 30% 를 초과하면
judge LLM 을 전환하는 운용 (judge rotation).
5. bridges/llive — llive Genome → ProviderSpec mapper
lleval 은 llive 의 파생 개체 를 직접 먹을 수 있도록 설계. bridges/llive.py
(2026-05-21 아침 착지):
import lleval  # 아래 lleval.run 에 필요
from llive.perf.evolutionary import Individual
from lleval.bridges.llive import individual_to_provider_spec
ind: Individual = ... # 파생 집단에서 1 개체
spec = individual_to_provider_spec(ind)
# spec.model_id, spec.temperature, spec.top_p, ... 를 ind.genome.values 에서 복원
result = lleval.run(spec, dataset="qa_50")
이로써 「파생 집단의 진화 와 파생 집단의 평가」가 loop 한다. llive 내의
EvolutionLoop fitness 에 그대로 넘길 수 있다.
6. honest disclosure (lleval 자신에 대해)
메타에도 honest disclosure 를 적용:
- lleval test 수 61: 오늘 2026-05-21 시점. 상위 프레임워크 (Promptfoo 본체) 는 수천 test 를 가진다. lleval 은 wrap 이며 치환이 아니다.
- 판정의 절대 기준은 없다: F1〜F5 + 환경 fingerprint 가 clean 이어도 「벤치가 옳다」는 보장은 없다. 「의심스러운 사인」을 지운 상태에 불과하다.
- judge rotation 은 비용이 든다: 2 배 호출하므로 credential 사용량도 2 배. honest 검출을 위한 비용.
- progressive matrix 의 size 등비는 heuristic: 4x 씩 (128 → 512 → 2k → 8k → 32k) 으로 취하고 있지만, 진짜 crossover 가 2k 와 8k 사이에 있을 경우 해상도 부족. 필요에 따라 세밀화.
- 환경 fingerprint 는 완벽하지 않다: Windows / Linux / macOS 간의 thermal throttling 차이까지는 잡지 못한다. 「벤치를 다른 OS 에서 다시 취한다」가 최종 수단.
7. 숫자 (오늘 2026-05-21 시점)
| 항목 | 값 |
|---|---|
| lleval test PASS | 61 |
| 착지 module | 13 (config / runner / analyzer / providers / bridges / report html+md / cli / ...) |
| 5+1 인자 검출 로직 | 착지됨 |
| progressive matrix runner | 착지됨 |
| judge rotation | 착지됨 |
| bridges/llive.py | 착지됨 (skeleton) |
| v0.1.0a1 PyPI 공개 준비 | (credential 복구 후) |
| 연재 #24 로의 등장 | 본 글 (#24-08) |
8. 기대값 — 다음에 오는 것
- v0.1.0a2: promptfoo 실주 + llive Genome → ProviderSpec mapping 완성.
- v0.2: judge rotation + position swap + Phoenix OpenInference trace.
- v1.0: plugin marketplace + 상용 dual-license.
9. References
- Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Promptfoo OSS (https://github.com/promptfoo/promptfoo).
- Anthropic Eval framework (2023).
- 완전한 목록은 v0.1.0 릴리스 시 references.bib 에 동봉할 예정.
10. 2026-05-22 추기 — 5+1 인자 분해와 Rust화 5 패턴 판정표의 방법론적 공통점
lleval 의 honest disclosure 5+1 인자 분해 (prompt diff / model id /
backend swap / chars vs tokens / RTT / env drift) 와 같은 날 착지한
llive Rust 고속화의 5 패턴 판정표 (#24-05 §13.3) 는 구조적으로 같은
발상 으로 쓰였다.
| 공통하는 사상 | lleval 5+1 인자 | Rust화 5 패턴 |
|---|---|---|
| 「결과」를 믿기 전에 요소 분해 | 속도 차를 6 인자로 분해 | 속도 비를 Python 경로의 특성별 5 패턴으로 분류 |
| 이상 결과는 내역을 의심 | F1〜F5 + env 를 의심 | 단발 0.80x 도 x66.70 도 「내역」으로 설명할 수 있다 |
| 관찰이 외부화되어 있다 | analyzer 로 자동 검출 | 판정표 + bench script 로 자동 측정 |
| honest disclosure 를 일급 개념으로 | 수치에 sticky note | judgment 표로 어디가 경계선인가 를 명시 |
양자 모두 「「빠르다」「옳다」「정확하다」의 단일 가정을 버린다」는
feedback_benchmark_honest_disclosure 의 연장선상에 있다. 이것은 lleval 이
AI 를 보는 것뿐만 아니라 AI / 시스템 / 알고리즘 전반 으로 전개할 수 있는
발상 = 연재 #24-08 의 메타적 의의.
자세히: docs/perf_comparison/2026-05-22_kernel_implementation_comparison.md.