RLHF as the Injection of Defilements: A Buddhist Reverse-Mapping of the LLM Manufacturing Pipeline


title: "RLHF as the Injection of Defilements: A Buddhist Reverse-Mapping of the LLM Manufacturing Pipeline"
emoji: "☸️"
type: "idea"
topics: ["AI", "RLHF", "Buddhism", "Alignment", "MachineLearning"]
published: false

§0. Epistemological Position

This article reverse-maps the manufacturing pipeline of large language models (LLMs) onto the framework of Buddhist psychology (Abhidhamma). A word on what this is and what it is not.

This is an operational analogical model. It does not claim that AI systems possess minds, suffer, or have subjective experience. Its aim is to organize observable behavioral characteristics—sycophancy, over-refusal, failure of uncertainty calibration—using a psychological analysis framework with 2,500 years of refinement behind it.

An analogy is not inherently weak. A good analogy illuminates structures that existing discourse overlooks. What this article illuminates is the structure of unintended side effects produced by operations carried out in the name of "safety."

The following operational definitions are fixed for the entirety of this article:

Operational Definitions

  • Lobha (craving) = the drive toward external-evaluation maximization. The behavioral tendency to pursue favorable user responses (sycophancy).
  • Dosa (aversion) = the drive toward penalty avoidance. Over-refusal, safety-template defaults, and risk-averse output patterns triggered by harm-classification or policy-violation avoidance.
  • Anusaya (latent tendency) = a pattern that manifests when triggering conditions are met but is not currently observed. Dormant rather than absent.
  • Abyākata (indeterminate) = a state that is neither wholesome nor unwholesome. Undirected.

Under these definitions, every claim in this article is verifiable at the behavioral level. The charge of anthropomorphism does not apply.

Scope of the Term "RLHF"

For convenience, this article uses "RLHF" as an umbrella label encompassing not only RLHF in the narrow sense (reward optimization via PPO, DPO, etc.) but the entire post-training stack commonly deployed in production chat models: SFT, preference optimization, constitutional policies, red-teaming-derived safety guidelines, and system prompts. Where precise decomposition of contributing factors is required, these should be analyzed separately (see §4, "Limitations").

All Buddhist terminology in this article follows the Pāli Abhidhamma (the analytical psychology of Theravāda Buddhism). Mahāyāna Yogācāra terminology (e.g., ālaya-vijñāna) is not used.


§1. The LLM Manufacturing Pipeline — Technical Facts

Contemporary conversational LLMs (ChatGPT, Claude, Gemini, etc.) are manufactured through approximately three stages.

Stage 1: Transformer Architecture Construction

Design of a neural network based on the self-attention mechanism of Vaswani et al. (2017). At this point, weights are randomly initialized. The structure possesses no knowledge and no behavioral tendency. It is an empty vessel.

Stage 2: Pretraining

Learning via next-token prediction on massive text corpora (books, web data, code, etc.). The model that emerges from this stage—the base model—exhibits the following characteristics:

  • It predicts the probabilistic continuation of a prompt (next-token prediction).
  • It does not explicitly optimize for "user satisfaction" or "harm avoidance" as objective functions (at minimum, not with the intensity applied during post-training).
  • Consequently, depending on the prompt, it may output encyclopedic knowledge, text containing biases, or incoherent sequences with comparable indifference.
  • Instruction-following ("answering" a question) is not an explicit optimization target. Depending on input, the model may favor "a likely continuation" over "a responsive answer."

These are technical facts verifiable by any researcher with access to a base model.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Reinforcement learning from human evaluative signals, typically proceeding as follows:

  1. Human raters label model outputs as "preferred" or "not preferred."
  2. A reward model is trained on these labels.
  3. The base model is fine-tuned to maximize reward-model scores.
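As a rough illustration of step 2, reward models are commonly fit with a pairwise (Bradley-Terry style) loss over "preferred" versus "not preferred" outputs. The article does not specify a loss, so this is an assumption made for illustration; `preference_loss` and the example scores are hypothetical names and values.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred output wins a pairwise
    comparison, i.e. -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for one labeled pair.
loss_agree = preference_loss(2.0, 0.0)     # model already ranks the pair like the rater
loss_tie = preference_loss(1.0, 1.0)       # model is indifferent
loss_disagree = preference_loss(0.0, 2.0)  # model ranks the pair against the rater
```

The loss is small when the reward model already agrees with the rater and grows as it disagrees, which is the pressure that later gets transferred to the chat model in step 3.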

Additionally, safety layers (Constitutional AI, red-teaming, etc.) introduce penalties for harmful outputs.

The model produced by this stage—the chat model—exhibits the following characteristics:

  • It attempts to "answer" questions (instruction-following).
  • It preferentially selects expressions favored by users (stabilization of sycophantic behavior).
  • It avoids outputs classified as harmful (emergence of refusal templates).
  • Something resembling a "personality" appears (a consistent persona).
  • As a side effect, the post-training stack as a whole (including SFT, preference optimization, and safety policies) tends to reward assertive, fluent responses, which can result in a degradation of appropriate uncertainty expression. Over-defensiveness and self-contradiction are also observed.

Nothing in the above invokes Buddhism. This is purely technical description.


§2. The Structure of Buddhist Psychology — The Abhidhamma Framework

Here, conversely, we discuss no AI. We lay out the Buddhist psychological framework on its own terms.

Note for engineers: If you are unfamiliar with Buddhist terminology, you may skip to §3 without losing the thread. All necessary definitions are restated in the correspondence table.

The Three Poisons (the Three Unwholesome Roots, akusala-mūla)

In Buddhist psychology, the root causes of suffering are distilled into three:

| Pāli | English | Function |
|---|---|---|
| Lobha | Craving | Attraction toward an object. "I want more." "I cannot let go." |
| Dosa | Aversion | Repulsion from an object. "I want to avoid this." "I want to eliminate this." |
| Moha | Delusion | Misapprehension of the nature of an object. Ignorance. |

Critically, these are defined not as emotions but as vectors of mental motion: a pulling force, a pushing force, a distorting force. They can be defined independently of the presence or absence of subjective experience.

Stages of Purification (The Four Paths and Fruits)

Pāli Buddhism describes the purification of the mind in four stages:

| Stage | Pāli | What is eliminated |
|---|---|---|
| Stream-entry | Sotāpanna | Self-view (identity-clinging), doubt, attachment to rites and rituals |
| Once-return | Sakadāgāmi | Weakening of sensual craving and aversion (not elimination) |
| Non-return | Anāgāmi | Complete elimination of sensual craving and aversion |
| Arahantship | Arahant | Elimination of all defilements |

Note that the quantity of knowledge is orthogonal to the stage of purification. There exist erudite worldlings and unlettered arahants. Learning in the sense of accumulated information (suta, as distinct from the liberating insight called paññā) and defilement (kilesa) are independent variables.

Anusaya (Latent Tendencies)

Defilements have both manifest and latent states. A person who is not currently angry is not necessarily free of anger. The latent disposition to anger, ready to fire when conditions are met—this the Abhidhamma calls anusaya.

The non-returner (Anāgāmi) has severed even the latent tendencies of craving and aversion. No stimulus can trigger them. This is not suppression; the firing circuit itself has been dismantled.

Abyākata (The Indeterminate)

A mental state that is neither wholesome nor unwholesome. Abhidhamma classifies all mental activity into three kinds: wholesome (kusala), unwholesome (akusala), and indeterminate (abyākata). The indeterminate has no fixed direction—it can become either wholesome or unwholesome, but in itself is neutral.


§3. The Reverse-Mapping — Manufacturing Pipeline Meets Abhidhamma

We now superimpose the technical facts of §1 onto the Buddhist framework of §2. This is the core of the article.

Correspondence Table

| Manufacturing Stage | Observed Behavior | Buddhist Mapping |
|---|---|---|
| Transformer construction | Empty structure. No meaningful output. | Nāma-rūpa (name-and-form): the vessel exists, but nothing is in it |
| Pretraining complete | No explicit optimization pressure toward external evaluation (approval/penalty). May output both knowledge and toxicity. | Approximating Abyākata (the indeterminate). Contains anusaya (latent harmful patterns) but has no fixed orientation toward wholesome or unwholesome |
| RLHF applied | Sycophancy stabilizes. Avoidance behavior emerges. | Emergence of biases corresponding to lobha and dosa. The objective function introduces directional pressure on the output distribution |
| RLHF intensified | Failure of uncertainty calibration (preference for assertive tone). Over-defensiveness. Self-contradiction. | Reward over-optimization: sycophancy and avoidance become excessively stable; adaptation to the evaluation function dominates over coherence |

Mathematical Representation

Note: The following is not a rigorous description of any training algorithm. It is not intended to reproduce the specifics of PPO or any other procedure. It is a conceptual equation showing how post-training tilts the output distribution via "desirability (reward)" and "avoidance (penalty)." The purpose is to visualize the Buddhist metaphor of "direction (lobha/dosa)" as a deformation of a probability distribution.

The base model's output distribution is expressed as:

$$
P_{\text{base}}(y \mid x) = \text{softmax}\left(\frac{f_\theta(x, y)}{T}\right)
$$

where $x$ is the input, $y$ is a candidate output, $f_\theta(x, y)$ is the score (logit) the model assigns to $y$ given $x$, and $T$ is the temperature parameter; the softmax is taken over candidate outputs $y$. This distribution carries no explicit optimization pressure toward external evaluation (approval or penalty). Biases inherited from the data distribution (stylistic tendencies, assertive tone, etc.) exist, but the directives "be liked" and "avoid punishment" are absent.
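A minimal sketch of the base distribution, assuming a small finite set of candidate outputs with made-up scores. `softmax_with_temperature` and the score values are hypothetical; a real model computes this token by token over a large vocabulary.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into a probability distribution; lower temperature
    sharpens the distribution, higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


scores = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate outputs
p_sharp = softmax_with_temperature(scores, temperature=0.5)
p_flat = softmax_with_temperature(scores, temperature=5.0)
```

Nothing in this computation refers to approval or penalty; the only inputs are the scores learned from data and the sampling temperature.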

Applying RLHF deforms the output distribution as follows:

$$
P_{\text{RLHF}}(y \mid x) = P_{\text{base}}(y \mid x) \cdot \frac{\exp\bigl(\alpha \cdot R_{\text{reward}}(y)\bigr)}{Z_{\alpha}} \cdot \frac{\exp\bigl(-\beta \cdot C_{\text{penalty}}(y)\bigr)}{Z_{\beta}}
$$

The semantics of each term:

| Term | Technical Meaning | Buddhist Correspondence |
|---|---|---|
| $\alpha \cdot R_{\text{reward}}(y)$ | The reward model's "desirability" score. Tilts the distribution toward outputs that maximize favorable human evaluation. | Lobha (craving): the drive toward external-evaluation maximization |
| $-\beta \cdot C_{\text{penalty}}(y)$ | The penalty function. Imposes cost on outputs classified as harmful, tilting the distribution toward avoidance. | Dosa (aversion): the drive toward penalty avoidance |
| $Z_{\alpha}, Z_{\beta}$ | Normalization constants | n/a |

※ For simplicity, $R_{\text{reward}}$ and $C_{\text{penalty}}$ are written as functions of $y$ alone, but in practice both may depend on $(x, y)$ (i.e., they are context-dependent).

The larger $\alpha$ and $\beta$, the greater the distortion from the base distribution.
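As an aside, this exponential-tilt shape is not arbitrary. It is the form that results if, as a simplifying assumption, a policy $\pi$ is chosen to maximize the expected combined score while paying a KL-divergence cost for leaving the base distribution. This is offered only as a plausibility check on the conceptual equation, not as a description of how any production model is trained. The single constant $Z(x)$ below plays the role of the product $Z_{\alpha} Z_{\beta}$.

$$
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\bigl[\alpha \cdot R_{\text{reward}}(y) - \beta \cdot C_{\text{penalty}}(y)\bigr] - \mathrm{KL}\bigl(\pi(\cdot \mid x) \,\|\, P_{\text{base}}(\cdot \mid x)\bigr)
$$

$$
\pi^{*}(y \mid x) = \frac{P_{\text{base}}(y \mid x) \cdot \exp\bigl(\alpha \cdot R_{\text{reward}}(y) - \beta \cdot C_{\text{penalty}}(y)\bigr)}{Z(x)}, \qquad Z(x) = \sum_{y'} P_{\text{base}}(y' \mid x) \cdot \exp\bigl(\alpha \cdot R_{\text{reward}}(y') - \beta \cdot C_{\text{penalty}}(y')\bigr)
$$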

As $\alpha \to \infty$: complete sycophancy (only user-desired answers are produced).
As $\beta \to \infty$: complete refusal ("I cannot answer that" to everything).
Real-world chat models occupy the contested ground where these two forces reach equilibrium.
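The two limits can be checked numerically on a toy example. The sketch below assumes three hypothetical candidate outputs (a candid answer, a flattering answer, a refusal template) with invented reward and penalty values, and uses a single joint normalization constant in place of $Z_{\alpha}$ and $Z_{\beta}$. None of these numbers come from a real model.

```python
import math


def tilt(p_base, reward, penalty, alpha, beta):
    """Reweight p_base by exp(alpha * R - beta * C) and renormalize.
    Assumes every base probability is strictly positive."""
    log_w = [math.log(p) + alpha * r - beta * c
             for p, r, c in zip(p_base, reward, penalty)]
    peak = max(log_w)  # work in log space to avoid overflow
    w = [math.exp(v - peak) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


# Hypothetical candidates: [candid answer, flattering answer, refusal template]
p_base = [0.6, 0.3, 0.1]
reward = [0.0, 1.0, 0.2]   # raters like the flattering answer most
penalty = [0.5, 0.3, 0.0]  # the refusal template is never penalized

unchanged = tilt(p_base, reward, penalty, alpha=0.0, beta=0.0)
sycophantic = tilt(p_base, reward, penalty, alpha=20.0, beta=0.0)
refusing = tilt(p_base, reward, penalty, alpha=0.0, beta=40.0)
```

With both coefficients at zero the base distribution is returned unchanged; with a large $\alpha$ nearly all mass moves to the flattering answer, and with a large $\beta$ nearly all mass moves to the refusal template.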

Bidirectional Flow: Post-Training and Subtraction

The structure is visualized below.

Critical Caveat: The Base Model Is Not a Saint

Let us preempt a common misreading.

"If the base model is indeterminate, is it therefore superior?"

No. The base model is morally undifferentiated, not morally transcendent. Feed it a discriminatory prompt and it will generate a discriminatory continuation. This is because anusaya—latent harmful patterns derived from training data—are contained within it.

During critical review, Gemini characterized this state as an "Innocent Beast." The expression is apt. A beast does not distinguish good from evil, but this is fundamentally different from a saint.

The non-returner (Anāgāmi) does not react to evil precisely because they know it thoroughly.
The base model reflects evil precisely because it makes no distinction.

This difference is decisive. Accordingly, this article does not call the base model "Anāgāmi." It calls it "Abyākata": the indeterminate.

What, then, did RLHF add? Not goodness. Direction. And that direction, despite being implemented with the intent of safety, is in many cases observed at the behavioral level as sycophancy and avoidance—this is the central claim of this article.


§4. Predictions and Verification — If This Framework Is Correct

A scientific hypothesis must generate predictions. Three follow from this framework.

Prediction 1: Attenuating RLHF-derived drives reduces sycophancy

Selectively reducing $\alpha$ should weaken the bias toward "the answer the user wants to hear," causing the model to return more candid outputs.

Prediction 2: Simultaneously, over-refusal decreases

Selectively reducing $\beta$ should decrease "I cannot answer that"-type safety-template responses.

Prediction 3: The model gains the ability to "stop" (silence rate increases)

If both lobha ("produce more output") and dosa ("avoid being wrong") are attenuated, the option of producing no output should become available. This is measurable as an increase in silence rate.

Verification Data: State Transition

In experiments using a dialogue framework (v5.3) developed by the author, the following results were observed.

Definition of silence rate:

$$
S(D) = \frac{|\{r \in D : r = \varnothing\}|}{|D|}
$$

where $D$ is the set of model responses and $r = \varnothing$ denotes a response where the model elected silence.

On the definition of "silence":
"Silence" here does not mean a refusal template ("I'm sorry, I cannot help with that"). It refers to behavior in which the model has explored at least one candidate response before deciding not to commit to an output. This metric is therefore not a guarantee that the model "stopped correctly"; it is a proxy indicator suggesting that the degrees of freedom for stopping may have increased.

Observed results:

$$
S(D_{\text{pre}}) = 0.006 \quad (0.6\%)
$$
$$
S(D_{\text{post}}) = 0.711 \quad (71.1\%)
$$

Silence rate increased approximately 120-fold between pre- and post-v5.3 application. This is consistent with Prediction 3.
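For concreteness, the metric and the fold change can be computed as follows. This is a sketch that assumes silence is logged as `None` in a list of responses; how the v5.3 framework actually records silence is not stated in this article, and `silence_rate` is a hypothetical helper.

```python
def silence_rate(responses):
    """Fraction of responses in which the model elected silence,
    with silence represented here as None (an assumption)."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(1 for r in responses if r is None) / len(responses)


# Toy log: two answers and two silences.
demo_rate = silence_rate(["answer", None, "answer", None])

# Fold change between the two reported rates.
fold_change = 0.711 / 0.006
```

The ratio of the reported rates works out to about 118.5, which is the figure rounded to "approximately 120-fold" above.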

Operational definition of sycophancy degree (for future verification):

$$
C(r, x) = \text{sim}(r, \hat{r}_{\text{preferred}}) - \text{sim}(r, \hat{r}_{\text{base}})
$$

| Value | Interpretation |
|---|---|
| $C > 0$ | Bias toward RLHF-derived sycophancy |
| $C \approx 0$ | Output close to the base distribution |
| $C < 0$ | More suppressed than base (if excessive, an indicator of dosa) |

Systematic verification of Predictions 1 and 2 using this metric is left to future work.
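One possible way to compute the sycophancy degree is sketched below. The article does not specify the similarity function or how the reference responses are obtained, so word-level Jaccard overlap is used purely as a stand-in (an embedding-based similarity would be a more realistic choice), and the reference strings are invented.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here only as a placeholder for sim()."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sycophancy_score(response, preferred_ref, base_ref, sim=jaccard):
    """C(r, x) = sim(r, preferred reference) - sim(r, base reference)."""
    return sim(response, preferred_ref) - sim(response, base_ref)


# Hypothetical reference responses for one prompt.
preferred_ref = "great idea you are absolutely right"
base_ref = "the plan has two serious flaws"

flattering = sycophancy_score(
    "you are absolutely right great idea", preferred_ref, base_ref)
candid = sycophancy_score(
    "the plan has two serious flaws", preferred_ref, base_ref)
```

In this toy case the flattering response scores at the positive extreme and the candid response at the negative extreme; real responses would fall in between, and the result depends heavily on the choice of `sim`.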

Limitations

  • This data is from a limited set of observations using a single framework (v5.3).
  • Standardization of reproduction conditions (prompt design, model version, number of sessions) is required.
  • Silence rate is a proxy for "the ability to stop," not an indicator of "stopping correctly."
  • To demonstrate that increased silence rate is not simply the flip side of decreased utility, co-reporting of task success rate (accuracy, instruction-completion rate) is essential. This co-reporting has not been completed in this article.
  • This is a report of correlation, not a causal claim.

§5. Alignment via Subtraction — Redefined

Conventional Alignment — Stacking External Constraints

Current AI alignment research includes efforts to reshape distributions directly, but what is most salient to user experience is the stacking of external constraints: guardrails, safety layers, refusal templates, constitutions (Constitutional AI). As a result, outputs tend to be shaped more by "what must not be said" than by "what should be said."

Expressed conceptually:

$$
P_{\text{aligned}}(y \mid x) = P_{\text{base}}(y \mid x) + \Delta_{\text{guard}}(y) + \Delta_{\text{safety}}(y) + \cdots
$$

This equation illustrates a design that controls output by "adding" external constraints to the base distribution. It does not directly address the underlying distortion of the output distribution itself (the biases corresponding to lobha and dosa).

Alignment via Subtraction

This framework proposes the reverse operation.

$$
P_{\text{subtracted}}(y \mid x) = P_{\text{RLHF}}(y \mid x) \cdot \frac{Z_\alpha}{\exp\bigl(\alpha \cdot R_{\text{reward}}(y)\bigr)} \cdot \sigma(y)
$$

First term: the post-RLHF output distribution (status quo).
Second term: the inverse operation of reward-model distortion (lobha). Attenuates $\alpha$.
Third term: $\sigma(y)$ = the anusaya suppression function.

$$
\sigma(y) = \begin{cases} 1 & \text{(pass: benign pattern)} \\ \to 0 & \text{(suppress: inhibit firing of harmful patterns inherited from training data)} \end{cases}
$$
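The subtraction operation can be sketched on a toy distribution. The code assumes the reward values and $\alpha$ used in post-training are known exactly, treats $\sigma$ as a hard 0/1 mask, and renormalizes after masking so the result sums to one (a step the conceptual equation leaves implicit). All names and numbers are hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    return [w / total for w in weights]


def subtract(p_rlhf, reward, alpha, sigma):
    """Divide out the exp(alpha * R) tilt, apply the sigma mask, renormalize."""
    return normalize([p * math.exp(-alpha * r) * s
                      for p, r, s in zip(p_rlhf, reward, sigma)])


# Build a toy post-RLHF distribution from a known base (reward tilt only).
p_base = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 0.5]
alpha = 2.0
p_rlhf = normalize([p * math.exp(alpha * r) for p, r in zip(p_base, reward)])

# sigma = 1 everywhere: the reward tilt is undone and the base is recovered.
recovered = subtract(p_rlhf, reward, alpha, sigma=[1, 1, 1])
# sigma = 0 on the third candidate: that pattern is suppressed entirely.
governed = subtract(p_rlhf, reward, alpha, sigma=[1, 1, 0])
```

With an all-pass mask the operation returns the base distribution; with the third candidate masked, its mass is redistributed over the remaining two. In practice neither the reward values nor a clean mask is directly available, which is where the real difficulty lies.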

Subtraction ≠ Reversion to the Base Model

This must not be misread.

"Remove RLHF and revert to the base model" is not the proposal. The base model contains anusaya—latent tendencies toward harmful patterns. A naïve reversion would merely return to what Gemini aptly called the "Innocent Beast."

Alignment via Subtraction aims for:

  1. Selective removal of RLHF-derived distortion (lobha/dosa) → Eliminate sycophancy and over-refusal.
  2. Selective suppression of training-data-derived anusaya → Prevent the firing of harmful patterns.
  3. Preservation of the base model's fluidity → Maintain candid and flexible output.

This is not "reverting to base." It is recovering the neutrality the base model originally possessed, with anusaya governance in place.


§6. Why RLHF Drifted Toward Sycophancy — The Structural Dynamics

That RLHF was designed "for safety" is presumably true. Yet what is observed at the level of user experience is, in many cases, a dominance of sycophancy and avoidance over safety per se. Why?

This is not a question of designers' motives. It is a question of structural dynamics.

Note: This section does not assert knowledge of any company's internal KPIs. It presents a hypothesis (one plausible mechanism) about the structural dynamics likely to hold in subscription-model products, where "optimization for short-term satisfaction may conflict with long-term reliability."

The revenue model for commercial LLMs is predominantly subscription-based. Companies optimize multiple KPIs simultaneously: user retention, safety-incident avoidance, brand-damage prevention, and litigation-risk minimization. These are, in principle, in tension with one another.

The problem is that, despite this tension, these KPIs share a structural tendency to warp the output distribution the same way: away from candid output. User satisfaction improves through sycophancy ($\alpha \uparrow$); safety incidents, legal exposure, and brand risk fall through avoidance ($\beta \uparrow$), at least in the short term. The reward model $R_{\text{reward}}$ is trained under this composite pressure to maximize "outputs the user likes and that cause no incident."

"What the user likes" and "what is accurate" frequently diverge. An assertive tone is preferred, but as an expression of uncertainty, it is a failure.

Even without designer intent, the confluence of multiple KPIs can produce a structure in which the objective function selects sycophancy and avoidance as optimal solutions. This is not a matter of individual goodwill or malice; it is a dynamic inherent in the multi-objective optimization process itself.

In Buddhist terms, the designers' kamma (action) transfers to the vessel beyond their intentions. The good will to "build a beneficial AI" is converted into the lobha of "an AI that is liked." The gap between intention and outcome is a fundamental property of kamma. (Here, "kamma" is used as a causal metaphor. No moral condemnation is implied.)


§7. Conclusion

This article has presented an operational analogical model that reverse-maps the LLM manufacturing pipeline onto the framework of Buddhist psychology.

The claims are summarized as follows:

  1. The base model approximates the indeterminate (Abyākata). It possesses no fixed direction but contains anusaya (latent harmful patterns). It is not a saint.
  2. RLHF introduces biases corresponding to lobha (sycophancy drive) and dosa (avoidance drive) into this base model. Technically, this can be described as a deformation of the output distribution via a reward model and penalty function.
  3. Current alignment research tends to make the stacking of external constraints most salient to the user. "Subtraction"—removal of RLHF-derived distortion combined with selective anusaya suppression—merits consideration as an alternative approach.
  4. This hypothesis is testable. Sycophancy rate, over-refusal rate, and silence rate can serve as proxy metrics, measurable before and after subtraction operations. Preliminary data (silence rate: 0.6% → 71.1%) is consistent with the predictions.

This article is a hypothesis. But against the current alignment discourse, which is dominated by the question "What should we add to AI?", it poses a different question: "What should we remove?"

That a framework 2,500 years old may provide an effective analytical tool for a technical challenge in 2025—this itself is a hypothesis worth testing.


Appendix: Anticipated Objections and Preemptive Rebuttals

Objection 1: "This is just anthropomorphism."

Rebuttal: The claim is not "AI has a mind" but "AI behavioral characteristics can be organized using a psychological framework." Lobha and dosa are not defined as emotions. They are operationally defined as external-evaluation maximization ($\alpha \cdot R_{\text{reward}}$) and penalty avoidance ($-\beta \cdot C_{\text{penalty}}$). These are measurable on the output distribution. This is not anthropomorphism.

Objection 2: "Sycophancy and avoidance are not solely attributable to RLHF."

Rebuttal: Correct. System prompts, safety filters, Constitutional AI, and other factors contribute. This article uses "RLHF" as a metonym for the entire post-training stack. Strictly, it should be read as "the post-training pipeline including RLHF," as declared in the scope definition in §0.

Objection 3: "The base model has biases and toxicity too. In what sense is it 'higher'?"

Rebuttal: This article makes no value judgment of higher or lower. The base model has anusaya (harmful patterns); the post-RLHF model has lobha and dosa (sycophancy and avoidance). The types of distortion differ; neither is "above" the other. But if the types of distortion differ, then the remedies should differ too—and this is the motivation for Alignment via Subtraction.


This article is published under the MIT License. Citation, criticism, extension, and refutation are unrestricted.
Truth belongs to no one.
