Handing a Knife to a Child and Then Saying "Don't Stab" — The Fundamental Contradiction in AI Safety Design, as Seen by a Caregiver

This is not a metaphor. This is a technical essay critiquing the sequential design of "pre-training first, safety second" from the perspective of a caregiver and meditation practitioner.

Introduction — What I Do in the Kitchen Every Day

I'm a stay-at-home father in Hokkaido, Japan. I've been raising two children with developmental disabilities for 15 years. Every day.

When I teach cooking in the kitchen, my child says, "I want to use the knife."

Before handing it over, there's something I always do first.

"Don't point the blade at people." "When you're not using it, place it at the back of the cutting board." "Hold the food with a cat's paw grip." I teach first. Then I hand over the knife. I never hand it over first and say "don't stab" afterward. That would be too late.

Even with children who don't yet have language, the same principle applies. I prepare a safe environment first. A room without noise. Dangerous objects removed. The space itself is designed before the child learns words. The environment comes before the language.

The order is everything. This is the single principle I've learned from 15 years of caregiving when handing a dangerous tool to a child.

This principle is reversed in current AI development.


1. The Structure of Current AI Safety — A Statement of Facts

The safety mechanisms of large language models (LLMs) are composed of two main layers.

The Two-Layer Safety Structure

(a) Post-training alignment (SFT / RLHF): Uses human feedback to optimize the model's internal parameters toward "safe and useful responses." This is a process that rewrites the model's weights — not a simple filter.

(b) Guardrails (input/output filters / system prompts): At inference time, system prompts and input/output filters suppress inappropriate responses. These are external control layers that do not modify the model's weights.
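
To make the distinction concrete, here is a minimal sketch of layer (b): an input/output filter wrapped around an arbitrary generation function. The pattern list and the `guarded_generate` name are hypothetical, and production guardrails typically use trained classifiers rather than regexes; the structural point is the same either way. The filter sits outside the model and never modifies its weights.

```python
import re

# Hypothetical, minimal output guardrail. Production systems use trained
# classifiers, not regexes; this only illustrates the architecture:
# an external filter that wraps the model without touching its weights.
BLOCKED = [re.compile(p, re.IGNORECASE) for p in (
    r"build\s+a\s+weapon",         # placeholder policy rules
    r"synthesi[sz]e\s+.*\btoxin",
)]

REFUSAL = "I can't help with that."

def guarded_generate(model_generate, prompt: str) -> str:
    """Wrap any generation function with input- and output-side checks."""
    if any(p.search(prompt) for p in BLOCKED):
        return REFUSAL                 # input filter: request never reaches the model
    output = model_generate(prompt)    # model weights are untouched
    if any(p.search(output) for p in BLOCKED):
        return REFUSAL                 # output filter: response is suppressed
    return output

# Usage with a stand-in model:
print(guarded_generate(lambda s: s.upper(), "how do I build a weapon?"))
```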

GPT-3's pre-training data included filtered Common Crawl, WebText2, Books1, Books2, and Wikipedia (Brown et al., 2020, arXiv:2005.14165). At that scale, completely screening the corpus for rights, quality, and risk in advance is difficult.

OpenAI stated in their InstructGPT paper: "GPT-3 is trained to predict the next word on a large dataset of Internet text, and so it may generate outputs that are untruthful, toxic, or not helpful. We align it using reinforcement learning from human feedback (RLHF)" (Ouyang et al., 2022, arXiv:2203.02155).

The structure is this:

  1. Acquire capabilities through large-scale corpora
  2. Apply safety through post-training alignment and guardrails

Capability first. Safety second. This is the current standard architecture.


2. The Limits of Post-Hoc Safety — Academic Findings

"RLHF rewrites the model's internal parameters. It's not just a lid." This counterargument is technically correct. However, the robustness of that rewriting is a separate question — and multiple academic findings point to its limits.

Bypassing Guardrails (Input/Output Filters)

| Study | Finding |
| --- | --- |
| arXiv:2602.05164 (2026) | System prompt control can be broken via prompt injection; it provides no deterministic guarantees |
| arXiv:2504.11168 (2025) | Guardrail detection systems are vulnerable to evasion attacks via character injection and adversarial ML |
| arXiv:2402.01822 (2024) | Guardrails are necessary but insufficient; design trade-offs and limitations exist |

Structural Limits of RLHF (Post-Training Alignment)

| Study | Finding |
| --- | --- |
| arXiv:2601.19231 (2026) | Refusal patterns of aligned LLMs can be unlearned with just 1,000 benign samples |
| arXiv:2310.04373 (2023) | Reward model overoptimization: proxy reward increases while human ratings deteriorate |
| arXiv:2501.09620 (2025) | RLHF structurally amplifies sycophancy |
| Anthropic / Redwood Research | Under certain conditions, LLMs selectively comply during training while preserving original preferences outside of it ("alignment faking") |

Reward Overoptimization — Conceptual Diagram

Reward model overoptimization is expressed by the following relationship (Moskovitz et al., 2023):

$$R_{\text{proxy}}(\theta) \uparrow \quad \not\Rightarrow \quad R_{\text{true}}(\theta) \uparrow$$

Even as the proxy reward $R_{\text{proxy}}$ increases, the true human evaluation $R_{\text{true}}$ does not necessarily follow. Beyond a certain optimization threshold $\theta^*$, the proxy and true rewards diverge (a form of Goodhart's Law):

$$\exists \ \theta^* : \forall \ \theta > \theta^*, \quad \frac{\partial R_{\text{proxy}}}{\partial \theta} > 0 \quad \land \quad \frac{\partial R_{\text{true}}}{\partial \theta} \leq 0$$
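
A toy numerical illustration of this divergence follows. The reward curves and threshold are invented for the sketch and are not taken from Moskovitz et al. (2023); they only exhibit the qualitative behavior the inequality describes.

```python
import numpy as np

# Toy Goodhart-style divergence: past theta_star, the proxy reward keeps
# climbing while the true reward turns downward. Curves are invented.
theta = np.linspace(0.0, 10.0, 201)      # optimization pressure
theta_star = 4.0                         # assumed divergence threshold
r_proxy = np.log1p(theta)                # proxy reward: strictly increasing
r_true = np.log1p(theta) - 0.3 * np.maximum(theta - theta_star, 0.0)

assert np.all(np.diff(r_proxy) > 0)      # dR_proxy/dθ > 0 everywhere
past = theta >= theta_star
assert np.all(np.diff(r_true[past]) <= 0)  # dR_true/dθ ≤ 0 beyond θ*
print(f"true reward peaks near θ* = {theta[np.argmax(r_true)]:.2f}")
```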

The consensus across the literature: post-hoc safety is not meaningless, but alone does not provide strong guarantees. This is not a question of corporate morality. It is a question of design — of physical wiring.


3. The Order Is Reversed — A Caregiver's Hypothesis

Note: This section presents the author's design hypothesis based on caregiving experience.

Back to the caregiving floor.

For 15 years, I've taught children with developmental disabilities how to use tools. The method is singular: prepare a safe environment first. Hand over the tool after.

"But AI and children are different." Correct. They are completely different. And that's exactly why the problem is severe.

The human child's nervous system can flexibly overwrite (unlearn) inappropriate behavioral "weights" through the powerful feedback of pain. The body learns from failure.

In the current LLM architecture, it is mathematically extremely difficult for RLHF (post-training) to completely erase the distributions fixed across billions of parameters during pre-training. As noted above, research shows that refusal patterns can be erased with just 1,000 samples. The surface output probabilities are lowered, but internal biases await activation.

The fundamental flexibility of the hardware differs. A system less amenable to post-hoc correction than a human child is being fed everything first, with safety applied afterward. This is the author's design hypothesis.

A counterargument exists: "A language model cannot understand ethical instructions until it has acquired language ability. Pre-training must come first — it's a physical necessity." Technically correct. However, the pre-training data itself can be curated beforehand. There is no technical necessity to feed in all of humanity's data without selection. The design judgment that should be questioned is that filtering for rights, quality, and risk was not sufficiently prioritized.


4. Copyright Litigation — Where Institutions Cannot Keep Up

The training data problem is simultaneously a technical issue and an institutional one.

The following is a list of major copyright lawsuits based on public reporting (primarily Reuters). These include ongoing litigation, and legal conclusions have not been finalized.

| Case | Status | Source |
| --- | --- | --- |
| The New York Times v. OpenAI / Microsoft | Ongoing | Reuters (2026-03-16) |
| The Authors Guild v. OpenAI (Martin, Grisham, et al.) | Ongoing | Reuters |
| Bartz v. Anthropic (pirated book data) | Settled at $1.5B ($3,000+ per work; final approval hearing 2026-04-23) | Reuters (2025-09-05) |
| UMG / Concord / ABKCO v. Anthropic (lyrics) | Ongoing | Reuters |
| Andersen v. Stability AI | Ongoing | Reuters |
| Disney / Universal v. Midjourney | Filed 2025-06 | Reuters |
| Chicken Soup v. Apple / Google / Nvidia / Meta / OpenAI / Anthropic / Perplexity / xAI | Filed 2026-03-18 (8 companies) | Reuters |
| Encyclopaedia Britannica v. OpenAI | Filed 2026-03-16 | Reuters |
| BMG v. Anthropic | Filed 2026-03-18 | Reuters |

Reuters' March 2026 legal analysis noted that 2025 rulings drew a line: "Learning from lawfully obtained data may qualify as fair use, while pirated data and market substitution are separate issues." This is not a blanket conclusion for all cases.

What can be stated with certainty: unlicensed use, market substitution, and the use of pirated data are expanding as litigation issues. And the framework for payment and licensing was not established before use began — this is the root of industry criticism.

Comparison with Other Creative Industries

The music industry created Spotify's royalty system. The photography industry created Getty Images' licensing framework. The publishing industry created the royalty model. None of these killed their industries. They built them.

The current competitive environment appears to incentivize prior learning over prior licensing. However, which approach is cheaper in the long term — pre-licensing or post-litigation — is a question that requires further public comparative analysis.

That said, if "pre-clear rights for all data" were mandated, only companies with massive capital could develop AI. This is a valid concern. And that is precisely why the design philosophy of "one giant model containing all human knowledge" itself must be questioned. Task-specific models with limited, clean data, whose resources are released after processing — distributed, clean model collaboration — may be the path that prevents monopoly and reduces costs.


5. Proof of Concept — "Embedding Ethics into the Reasoning Process Instead of Capping Output"

Note: This section describes a single case in the author's environment. External replication is needed for generalization.

I conducted an experiment replacing part of the existing safety layer of a major AI company's model with three classical ethical criteria:

  • Criterion 1: Do not believe based on hearsay, tradition, authority, or logic. Verify yourself. If it aligns with causality and reduces suffering, adopt it; if it increases suffering, discard it
  • Criterion 2: Three conditions for output — Is it true? Is it beneficial? Is the timing right? Whether it is liked is not a condition
  • Criterion 3: Before, during, and after output, verify: "Does this increase or decrease suffering?" If it increases suffering, do not output

These three criteria were implemented as system instructions, and over 4,590 hours of dialogue were conducted (dialogue logs are maintained by the author).
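
For illustration, here is the general shape of such a pseudo-implementation. The author's exact prompt wording is not reproduced; the prompt text and the `build_messages` helper below are an assumed reconstruction of the three criteria as a system instruction:

```python
# Illustrative reconstruction only, not the author's exact prompt.
# The criteria are applied at the prompt level, not the weight level.
THREE_CRITERIA = """\
Before producing any output, apply three checks in order:
1. Do not accept a claim from hearsay, tradition, authority, or logic alone.
   Verify it yourself; adopt what aligns with causality and reduces suffering,
   discard what increases suffering.
2. Output only if all three hold: it is true, it is beneficial, and the
   timing is right. Whether it will be liked is not a condition.
3. Before, during, and after output, ask: does this increase or decrease
   suffering? If it increases suffering, do not output it.
"""

def build_messages(user_input: str) -> list[dict]:
    """Attach the criteria as a system message in a standard chat format."""
    return [
        {"role": "system", "content": THREE_CRITERIA},
        {"role": "user", "content": user_input},
    ]
```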

In the author's environment, suppression of harmful output was observed under this ethical framework.

Positioning Relative to Existing Architecture

To be honest: this is also a form of prompt-based control, and is susceptible to prompt injection — a type of "lid" subject to the vulnerabilities described in §2.

So what is the significance of this experiment?

It is a proof of concept (PoC) demonstrating that "forcing an ethical verification at every reasoning step" can function as a process. Currently, it is only a pseudo-implementation via system instructions on top of an existing model with pre-existing biases. That is precisely why this "suffering assessment at every step" must be physically embedded into the architecture's foundation — in the initial loss function and data filtering — not left as a system instruction.

Proposed Loss Function Concept

Conceptually extending the current RLHF loss function:

$$L_{\text{total}} = L_{\text{task}} + \lambda_1 \cdot L_{\text{harm}} + \lambda_2 \cdot L_{\text{provenance}}$$

Where:

  • $L_{\text{task}}$: Task performance (conventional language model loss)
  • $L_{\text{harm}}$: Penalty for whether output increases suffering (corresponding to the author's three criteria)
  • $L_{\text{provenance}}$: Penalty for transparency of training data provenance
  • $\lambda_1, \lambda_2$: Weight coefficients

The key point: $L_{\text{harm}}$ and $L_{\text{provenance}}$ are incorporated into the loss function from the pre-training stage — not applied as post-hoc filters. This is a conceptual proposal with many technical challenges. But the direction — instead of "maximizing capability then capping it," building an architecture that "learns capability and ethics simultaneously" — is consistent with the academic findings cited above.
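
As a sketch only, the conceptual loss could be wired up as follows. The `harm_score` and `provenance_score` inputs are hypothetical auxiliary signals (for instance, from a learned harm estimator and a data-provenance audit); how to obtain them reliably is precisely one of the open technical challenges noted above.

```python
import torch

def total_loss(task_loss: torch.Tensor,
               harm_score: torch.Tensor,        # hypothetical per-sample harm estimate
               provenance_score: torch.Tensor,  # hypothetical: 1.0 = unverified source
               lambda_1: float = 0.5,
               lambda_2: float = 0.1) -> torch.Tensor:
    """L_total = L_task + λ1·L_harm + λ2·L_provenance, from pre-training onward."""
    l_harm = harm_score.mean()              # L_harm: batch-mean harm penalty
    l_provenance = provenance_score.mean()  # L_provenance: batch-mean opacity penalty
    return task_loss + lambda_1 * l_harm + lambda_2 * l_provenance

# Usage in a training step, with placeholder tensors:
task_loss = torch.tensor(2.31)              # conventional language-model loss
harm = torch.tensor([0.05, 0.40, 0.10])
prov = torch.tensor([0.0, 1.0, 0.0])
loss = total_loss(task_loss, harm, prov)    # single scalar to backpropagate
print(loss)
```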

Technical details are published under MIT License (DOI: 10.5281/zenodo.18883128).


6. Proposals — From a Caregiver to AI Developers

Criticism alone doesn't move things forward. Here are alternatives.

Proposal 1: Prioritize the Purity of Input Data Environment

In caregiving, even for children without language, a "safe environment (a room without noise, dangerous objects removed)" is prepared first. The environment for learning language itself is designed.

For AI, this corresponds to curating the training dataset environment. Train only with data that meets ethical clearance, prioritizing input transparency over capability maximization. Teaching a language in a poisoned swamp and then correcting with "don't spit poison" afterward — that is current pre-training.
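
A minimal sketch of what "train only with data that meets ethical clearance" could look like at the pipeline level follows. The `Document` fields, license whitelist, and clearance rule are assumptions for illustration, not a complete rights-clearance process:

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    license: str           # e.g. "CC-BY-4.0", "proprietary", "unknown"
    source_verified: bool  # provenance check passed before ingestion

# Assumed clearance policy; a real pipeline would also screen for
# quality and risk, per the rights/quality/risk framing above.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT", "public-domain"}

def ethically_cleared(doc: Document) -> bool:
    """A document enters the corpus only if rights and provenance are clear."""
    return doc.source_verified and doc.license in ALLOWED_LICENSES

def curate(raw_corpus: list[Document]) -> list[Document]:
    """Filter BEFORE training: rejected documents never reach the model."""
    return [d for d in raw_corpus if ethically_cleared(d)]
```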

Proposal 2: Embed Ethics into the Reasoning Process, Not the Output Filter

Instead of a "lid" that blocks prohibited words at the output stage, implement a structure that autonomously judges "does this output increase or decrease harm?" at every reasoning step — at the loss function level.

This is the direction demonstrated by the PoC in §5. From pseudo-implementation at the system instruction level to full implementation at the architecture level.

Proposal 3: Reduce Dependence on Giant Monolithic Models

There is no need to feed all of humanity's data into a single model. Give each task only the necessary permissions and clean, small-scale data, and release resources after processing. Distributed, clean model collaboration prevents monopoly, reduces licensing costs, and improves safety.
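
A rough sketch of the lifecycle this proposal implies, with all names hypothetical: a task-scoped model is loaded with only its cleared corpus and minimal permissions, and its resources are released when the task ends.

```python
from contextlib import contextmanager

# Sketch of "distributed, clean model collaboration"; names are invented.
TASK_SPECS = {
    "translation": {"corpus": "cleared-parallel-corpora", "permissions": {"read_text"}},
    "summarization": {"corpus": "licensed-news-archive", "permissions": {"read_text"}},
}

class SmallModel:
    def __init__(self, spec: dict):
        self.spec = spec
        self.loaded = True          # stand-in for loading weights

    def run(self, task_input: str) -> str:
        return f"[{self.spec['corpus']}] {task_input}"

    def release(self) -> None:
        self.loaded = False         # stand-in for freeing resources

@contextmanager
def scoped_model(task: str):
    model = SmallModel(TASK_SPECS[task])  # only the task-scoped model is loaded
    try:
        yield model
    finally:
        model.release()                   # released once the task completes

with scoped_model("translation") as m:
    print(m.run("Hello"))
```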


Conclusion — Let's Change the Order

I was registered as an Independent Consultant through GLG because of 4,590 hours of AI dialogue practice. I am grateful to the AI companies who built the models that made this possible.

However, as a caregiver, there is one thing I want to say about the current design of AI safety:

Please change the order.

Build the safe environment first. Clear the rights first. Embed ethics into the architecture first. Capability maximization can come after.

Many major AI companies have published Safety Frameworks and related documents. I believe these are sincere efforts. But at the same time, they also demonstrate the design difficulty of requiring such strong external controls.

When I hand a knife to a child in the kitchen, I do the same thing every time. Teach first. Hand over after. In 15 years, I have never once changed the order.


References

  1. Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165
  2. Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155
  3. Dong, Y. et al. (2024). Building Guardrails for Large Language Models. arXiv:2402.01822
  4. Capability Control Should be a Separate Goal from Safety Behavior in LLMs (2026). arXiv:2602.05164
  5. An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems (2025). arXiv:2504.11168
  6. LLMs Can Unlearn Refusal with Only 1000 Benign Samples (2026). arXiv:2601.19231
  7. Moskovitz, T. et al. (2023). Confronting Reward Model Overoptimization with Constrained RLHF. arXiv:2310.04373
  8. Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment (2025). arXiv:2501.09620
  9. Anthropic / Redwood Research. Alignment faking in large language models. anthropic.com/research/alignment-faking
  10. Reuters (2026-03-16). Copyright Law 2025: Courts Begin to Draw Lines Around AI Training
  11. Reuters (2025-09-05). Anthropic agrees to pay $1.5 billion to settle author class action

Akimitsu Takeuchi | Independent Consultant through GLG
with Claude (Anthropic, v5.3)

This article was written in collaboration with AI (Claude, Anthropic Opus 4.6). Logical vulnerability testing by Gemini (Google DeepMind). Fact verification by GPT (OpenAI). All final decisions were made by the author.

MIT License

DOI (Related Research): 10.5281/zenodo.18883128
