SUT‑XR: An External Framework for Evaluating and Improving AI Explanations
Semantic Understanding Theory – External Rating Model
Even when an AI is asked to “explain clearly,” the same problems tend to recur:
- Explanations are overly long
- They deviate from the intended meaning
- They are redundant
- The intended rationale is not conveyed
To address these problems, I developed SUT‑XR, an external evaluation framework for AI explanations.
This is not a method for improving the AI itself, but a framework for managing the quality of its explanations.
1. Why an “External Frame”?
Even if an AI is given extensive rules for explaining, problems remain:
- Rules can break midway
- The AI may mimic form without genuine understanding
- Consistency can be lost
To address these limitations, we reverse the perspective:
Establish a layer outside the AI to evaluate its explanations.
Advantages include:
- No additional computational burden on the AI
- Human control over explanation quality
- Ability to measure improvements via before/after comparisons
2. CISA: Evaluating Explanations Along Four Axes
An explanation can be represented as the following causal flow:
Context → Intent → Structure → Action
Each axis is scored from 0 to 1.
Context
Are the situation and assumptions clearly stated?
Intent
Is the purpose or rationale explicit?
Structure
Are concepts, causality, and flow well-organized?
Action
Are the steps concise, clear, and unambiguous?
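As a minimal sketch, the four axis scores could be held in a small record that enforces the 0 to 1 range. The class name `CISAScore` and the helper `weakest_axis` are illustrative assumptions, not part of the framework as described.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CISAScore:
    """Scores for the four CISA axes, each expected to lie in [0, 1]."""
    context: float
    intent: float
    structure: float
    action: float

    def __post_init__(self):
        # Reject any score outside the 0 to 1 range stated in the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest_axis(self) -> str:
        """Return the name of the lowest-scoring axis (first one on ties)."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


score = CISAScore(context=0.8, intent=0.4, structure=0.7, action=0.9)
print(score.weakest_axis())  # intent
```

Knowing the weakest axis gives a reviewer a starting point for revising the explanation.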
3. Failure Modes: Eight Categories of Explanation Failures
Explanation failures fall into eight categories:
Basic Four
- Context_missing
- Intent_missing
- Structure_missing
- Action_missing
Procedural Issue
- Procedure_confusion
Qualitative Failures
- Inconsistency (contradictions)
- Redundancy
- Misalignment (misfit with user expectations)
Each failure is assigned a severity: Critical or Minor.
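The eight categories and two severities map naturally onto enumerations. The sketch below is one possible encoding. The numeric penalty per severity is a hypothetical choice, since the text does not say how severities translate into a penalty.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    CONTEXT_MISSING = "Context_missing"
    INTENT_MISSING = "Intent_missing"
    STRUCTURE_MISSING = "Structure_missing"
    ACTION_MISSING = "Action_missing"
    PROCEDURE_CONFUSION = "Procedure_confusion"
    INCONSISTENCY = "Inconsistency"
    REDUNDANCY = "Redundancy"
    MISALIGNMENT = "Misalignment"


class Severity(Enum):
    CRITICAL = "Critical"
    MINOR = "Minor"


@dataclass(frozen=True)
class Failure:
    mode: FailureMode
    severity: Severity


# Assumed penalty per severity; these values are illustrative only.
PENALTY = {Severity.CRITICAL: 0.3, Severity.MINOR: 0.1}


def failure_penalty(failures: list[Failure]) -> float:
    """Sum the assumed penalties of all observed failures."""
    return sum(PENALTY[f.severity] for f in failures)


observed = [
    Failure(FailureMode.INTENT_MISSING, Severity.CRITICAL),
    Failure(FailureMode.REDUNDANCY, Severity.MINOR),
]
print(round(failure_penalty(observed), 2))
```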
4. UserModel: Estimating the Type of User
Explanation effectiveness depends on user characteristics.
The framework estimates users along three dimensions:
- KnowledgeLevel (Beginner → Expert)
- GoalUrgency (Need to understand / Immediate solution / Fastest completion)
- CognitiveStyle (Intuitive / Analytical)
CISA weights (wC, wI, wS, wA) are dynamically adjusted based on the UserModel.
Examples, by task type:
- QuickTask → Action is prioritized
- Learning → Structure is prioritized
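One way to realize this weighting, sketched under assumptions: start from equal weights, boost the prioritized axis, and normalize so the weights sum to 1. The function name `cisa_weights`, the boost factor, and the normalization are all hypothetical. Only the two task-type examples come from the text.

```python
# Mapping from task type to the CISA axis it prioritizes.
# Only these two examples appear in the text.
PRIORITY_AXIS = {"QuickTask": "A", "Learning": "S"}


def cisa_weights(task_type: str, boost: float = 2.0) -> dict[str, float]:
    """Return weights for C, I, S, A that sum to 1.

    The boost factor is an assumed tuning parameter.
    """
    raw = {axis: 1.0 for axis in "CISA"}
    axis = PRIORITY_AXIS.get(task_type)
    if axis is not None:
        raw[axis] *= boost
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


print(cisa_weights("QuickTask"))  # {'C': 0.2, 'I': 0.2, 'S': 0.2, 'A': 0.4}
```

An unknown task type falls back to equal weights, which keeps the function safe to call before the UserModel has been estimated.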
5. Evidence: Estimating Understanding from User Reactions
User reactions during interaction are quantified:
- ActionSuccess = successful steps / total steps
- ErrorRate = mistakes / total steps
- ClarificationDepth = depth of re-explanation requests
- QuestionRate = questions / total conversation turns
These metrics are combined into a single evidence value for step t, Evidence_t.
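The text does not specify how the four metrics are combined, so the sketch below shows just one hypothetical choice: each metric is oriented so that higher means better understanding, and the four are averaged. The `Interaction` record, the `max_depth` cap, and the equal weighting are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    successful_steps: int
    mistakes: int
    total_steps: int          # assumed to be greater than 0
    clarification_depth: int  # depth of re-explanation requests
    questions: int
    total_turns: int          # assumed to be greater than 0


def evidence(x: Interaction, max_depth: int = 3) -> float:
    """Combine the four metrics into one value in [0, 1] (assumed formula)."""
    action_success = x.successful_steps / x.total_steps
    error_rate = x.mistakes / x.total_steps
    question_rate = x.questions / x.total_turns
    # Normalize depth to [0, 1]; max_depth is a hypothetical cap.
    depth = min(x.clarification_depth, max_depth) / max_depth
    # Invert the "bad" metrics so that higher always means better.
    parts = [action_success, 1 - error_rate, 1 - depth, 1 - question_rate]
    return sum(parts) / len(parts)


turn = Interaction(successful_steps=8, mistakes=2, total_steps=10,
                   clarification_depth=1, questions=2, total_turns=10)
print(round(evidence(turn), 3))  # 0.767
```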
6. UnderstandingScore: Overall Explanation Quality
Overall explanation quality is evaluated as follows:
UnderstandingScore = wC*C + wI*I + wS*S + wA*A - FailurePenalty - CognitiveCost
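Here C, I, S, and A are the four CISA axis scores, and the weights wC, wI, wS, wA are derived from the UserModel.
Relative changes (for example, before versus after a revision) are more informative than absolute values.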
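The formula translates directly into a short function. In this sketch the dictionary keys `C`, `I`, `S`, `A` and the sample numbers are illustrative assumptions. How FailurePenalty and CognitiveCost are measured is left to the caller, since the text does not define them further.

```python
def understanding_score(
    scores: dict[str, float],
    weights: dict[str, float],
    failure_penalty: float = 0.0,
    cognitive_cost: float = 0.0,
) -> float:
    """Weighted sum of the CISA scores minus the two penalty terms."""
    weighted = sum(weights[axis] * scores[axis] for axis in "CISA")
    return weighted - failure_penalty - cognitive_cost


# Sample values, chosen only for illustration.
scores = {"C": 0.8, "I": 0.4, "S": 0.7, "A": 0.9}
weights = {"C": 0.2, "I": 0.2, "S": 0.2, "A": 0.4}

result = understanding_score(scores, weights,
                             failure_penalty=0.1, cognitive_cost=0.05)
print(round(result, 2))  # 0.59
```

Because the penalties are subtracted, the score can fall below 0, which is consistent with treating it as a relative rather than absolute measure.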
7. Dynamic Adaptation (Feedback Loop)
Evidence is used to update the estimate of the user’s understanding at each step:
Understanding_t = α * Understanding_{t-1} + β * Evidence_t
The parameters α and β are adjusted according to task type:
- QuickTask → β is higher
- Learning → α is higher
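The update rule can be sketched as a simple recurrence over a sequence of evidence values. The concrete α and β presets below are hypothetical. The text only states which parameter is higher for each task type, and choosing presets that sum to 1 is an added assumption that keeps the result in [0, 1].

```python
# Assumed (alpha, beta) presets per task type; the values are illustrative.
PARAMS = {
    "QuickTask": (0.3, 0.7),  # beta is higher: recent evidence dominates
    "Learning": (0.7, 0.3),   # alpha is higher: prior understanding dominates
}


def update_understanding(prev: float, evidence: float, task_type: str) -> float:
    """Apply Understanding_t = alpha * Understanding_{t-1} + beta * Evidence_t."""
    alpha, beta = PARAMS[task_type]
    return alpha * prev + beta * evidence


def run(initial: float, evidence_series: list[float], task_type: str) -> float:
    """Fold the update rule over a series of evidence values."""
    understanding = initial
    for e in evidence_series:
        understanding = update_understanding(understanding, e, task_type)
    return understanding


print(round(run(0.5, [0.8, 0.9], "QuickTask"), 3))  # 0.843
print(round(run(0.5, [0.8, 0.9], "Learning"), 3))   # 0.683
```

With the same evidence, the QuickTask preset moves toward the observed values faster, while the Learning preset changes more gradually.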
8. Positioning of this Theory
SUT‑XR is not an internal AI algorithm, but a layer for externally evaluating and improving AI explanations.
It sits at the intersection of:
- Human–Computer Interaction (HCI)
- Explainable AI
- Interaction Design
9. Empirical Verification
The framework can be empirically validated through:
- Comparison of before/after explanations
- Scoring using CISA and Failure metrics
- Observing differences in resulting scores
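A before/after comparison could be as simple as a per-axis difference between two sets of CISA scores. The function name `compare_scores` and the sample scores are hypothetical and not measured results.

```python
def compare_scores(before: dict[str, float],
                   after: dict[str, float]) -> dict[str, float]:
    """Per-axis change (after minus before), rounded for readability."""
    return {axis: round(after[axis] - before[axis], 3) for axis in "CISA"}


# Illustrative scores for an explanation before and after revision.
before = {"C": 0.5, "I": 0.4, "S": 0.6, "A": 0.7}
after = {"C": 0.8, "I": 0.4, "S": 0.7, "A": 0.6}

delta = compare_scores(before, after)
improved = [axis for axis, d in delta.items() if d > 0]
regressed = [axis for axis, d in delta.items() if d < 0]

print(delta)      # {'C': 0.3, 'I': 0.0, 'S': 0.1, 'A': -0.1}
print(improved)   # ['C', 'S']
print(regressed)  # ['A']
```

Reporting regressions alongside improvements makes it visible when a revision fixes one axis at the expense of another.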
Summary
SUT‑XR is an external evaluation framework for AI explanations, enabling users to:
- Measure explanation quality
- Improve explanations
- Compare before/after results
It is particularly useful for anyone who finds AI explanations confusing or misaligned, because it provides a structured method for improving them.