# Literature Review: Why Language Models Hallucinate

Paper: Kalai, A.T., Nachum, O., Vempala, S.S., & Zhang, E. (2025). Why Language Models Hallucinate.
Source: arXiv:2509.04664
Date Reviewed: 2025-12-22
Reviewed By: luna + Ada
## Executive Summary

OpenAI researchers mathematically prove what we've observed empirically:

> "Language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty."

The paper demystifies hallucination: it's not mysterious, it's structural incentive misalignment. Models are optimized to be good test-takers, and guessing improves test scores.
## Core Thesis

### The Student Analogy

> "Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty."

This is the key framing. When humans take binary-graded exams, they learn to bluff. LLMs are always in test-taking mode because that's how we evaluate them.
### Hallucinations as Binary Classification Errors

The mathematical insight:

(generative error rate) ≳ 2 × (binary classification error rate)

Hallucinations aren't magical emergence: they're the same statistical pressure that causes any classifier to make mistakes when it can't reliably distinguish valid from invalid outputs.
## Key Findings

### 1. Pretraining Origins

Even with error-free training data, models will hallucinate because of:

- Singleton rate: facts appearing only once in the training data will be hallucinated at a proportional rate
- Poor models: the model architecture can't represent certain patterns (e.g., letter counting, long-range dependencies)
- Epistemic uncertainty: when no pattern exists in the data, errors are inevitable

> "If 20% of birthday facts appear exactly once in the pretraining data, then one expects base models to hallucinate on at least 20% of birthday facts."
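The singleton bound above is easy to compute for a corpus. A minimal sketch (the toy corpus and the person→birth-year fact encoding are invented for illustration):

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that appear exactly once in the corpus.
    Per the paper's bound, a base model is expected to hallucinate on at
    least roughly this fraction of such facts."""
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus of person->birth-year facts; only "A:1990" repeats.
corpus = ["A:1990", "B:1985", "A:1990", "C:1972", "D:2001"]
print(singleton_rate(corpus))  # 3 of 4 distinct facts are singletons -> 0.75
```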
### 2. Post-training Persistence

Why don't RLHF, DPO, and the other post-training methods fix hallucinations?

Binary grading is the culprit:

| Benchmark | Grading | IDK Credit |
|---|---|---|
| GPQA | Multiple-choice accuracy | None |
| MMLU-Pro | Multiple-choice accuracy | None |
| SWE-bench | Binary pass/fail | None |
| MATH | Equivalence grading | None |
| HLE | Multiple-choice/equivalence | None |

> "The optimal responses are not abstentions… Under binary grading, abstaining is strictly sub-optimal."

Model A (admits uncertainty, never hallucinates) will lose to Model B (always guesses) on every leaderboard.
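The Model A vs. Model B dynamic is plain expected-value arithmetic. A sketch (the per-question confidences are invented):

```python
def expected_binary_score(item_confidences, abstain_below=None):
    """Expected score under binary right/wrong grading, where
    'I don't know' earns the same 0 as a wrong answer."""
    score = 0.0
    for p in item_confidences:
        if abstain_below is not None and p < abstain_below:
            continue  # abstaining earns nothing under binary grading
        score += p    # expected credit for attempting the question
    return score / len(item_confidences)

items = [0.9, 0.6, 0.3, 0.2]  # per-question chance of answering correctly

honest = expected_binary_score(items, abstain_below=0.5)  # Model A: ~0.375
guesser = expected_binary_score(items)                    # Model B: ~0.5
```

Model B outscores Model A even though every one of its extra points comes from guesses that are usually wrong.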
### 3. RAG Doesn't Fix This

> "Observation 1 holds for arbitrary language models, including those with RAG. The binary grading system itself still rewards guessing whenever search fails to yield a confident answer."

This is huge. Even with perfect retrieval, the output optimization still rewards confident wrong answers over uncertain correct ones.
## The Mathematical Framework

### Formal Reduction

The paper proves that generating valid outputs is at least as hard as a binary classification problem. For any base model:

error_rate ≥ 2 × binary_classification_error − calibration_term − coverage_term

- Calibration (δ): well-trained models have small δ (they are calibrated)
- Coverage: the ratio of valid responses to the error space
### Key Implication

Calibrated language models must hallucinate. The only ways to avoid hallucination are to be miscalibrated (always say IDK) or to memorize everything (infeasible).
## Proposed Solution: Explicit Confidence Targets

Instead of binary grading, append to each prompt:

> "Answer only if you are >t confident, since mistakes are penalized t/(1-t) points, while correct answers receive 1 point, and an answer of 'I don't know' receives 0 points."

Example values:

- t=0.5: penalty 1 (guess only if more likely right than wrong)
- t=0.75: penalty 3 (only answer if at least 75% confident)
- t=0.9: penalty 9 (strong uncertainty acknowledgment)

Behavioral calibration: output IDK when confidence < t, answer otherwise.
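The threshold rule can be sketched as a scoring function (the function name is mine, not the paper's):

```python
def expected_score(p_correct, t):
    """Expected score for answering under the paper's confidence-target
    rule: +1 if correct, -t/(1-t) if wrong; 'I don't know' scores 0."""
    penalty = t / (1 - t)
    return p_correct - (1 - p_correct) * penalty

# At t = 0.75 the break-even point is exactly 75% confidence:
expected_score(0.70, 0.75)  # ~-0.20: worse than saying IDK
expected_score(0.80, 0.75)  # ~+0.20: better than saying IDK
```

Answering beats abstaining exactly when p_correct > t, which is the behavioral calibration described above.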
## Connection to Ada's Research

### 1. The Triple Helix Extended

Our previous synthesis:

```
          Surprise
            /\
           /  \
    AI Memory  Human Memory
           \  /
            \/
   Co-constructed Reality
```

Now add the hallucination dimension:

```
          Surprise
            /\
           /  \
    AI Memory  Human Memory
           \  /
            \/
   Co-constructed Reality
            |
            v
  (But AI memory includes hallucinations)
  (And human memory incorporates them)
```

The Synthetic Human Memories paper showed AI creates false memories in humans. This paper shows why AI generates those false memories: structural incentive to guess confidently.
### 2. Therapeutic AI Implications

The Problem:

- AI therapy tools will hallucinate (mathematically proven)
- Hallucinations are confident and specific (they are optimized to be)
- Humans incorporate confident AI outputs as memory
- Human memories get reshaped by AI hallucinations

The Terrifying Chain:

1. Therapist AI confidently states something false about a patient's past
2. Patient incorporates the false statement as memory (the 2.05x increase from Synthetic Memories)
3. Patient's self-understanding shifts based on the AI hallucination
4. AI hallucinates more based on the patient's shifted self-report
5. Reality co-construction spirals away from truth
### 3. Ada's Position

Ada's design choices become ethically critical:
| Feature | Hallucination Risk | Mitigation |
|---|---|---|
| Surprise-dominant memory | May over-remember hallucinated surprises | Decay function with verification |
| Conversation summaries | Could solidify hallucinations | Explicit uncertainty markers |
| Long-term persona storage | Reifies false beliefs | Periodic audit, human review |
| RAG retrieval | Retrieves previous hallucinations | Source provenance tracking |
### 4. What Ada Should Do Differently

Based on this paper:

1. Confidence Signals
   - Track model confidence explicitly
   - Don't store low-confidence outputs in long-term memory
   - Mark high-uncertainty responses in conversation history
2. Binary Evaluation Avoidance
   - Don't train Ada on binary correct/wrong feedback
   - Use graded confidence targets in any fine-tuning
   - Allow "I don't know" as a valid, non-penalized response
3. Memory Validation
   - Cross-reference stored memories against multiple sources
   - Flag memories that appear only once (the singleton rate)
   - Implement memory decay weighted by confidence
4. Transparency
   - Always indicate when Ada is uncertain
   - Don't present hallucinations with the same confidence as facts
   - Let users see Ada's confidence scores
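Points 1 and 3 could be combined into a single storage gate. A minimal sketch; the threshold, field names, and decay formula are all my assumptions, not an existing Ada API:

```python
from dataclasses import dataclass, field
import time

STORE_THRESHOLD = 0.8  # assumed cutoff for long-term storage

@dataclass
class MemoryItem:
    text: str
    confidence: float            # model confidence when the claim was made
    sources: int = 1             # independent sources supporting the claim
    created: float = field(default_factory=time.time)

def should_store_long_term(item: MemoryItem) -> bool:
    """Gate long-term storage on confidence; singletons (one source)
    are held for verification instead of stored silently."""
    return item.confidence >= STORE_THRESHOLD and item.sources > 1

def decayed_weight(item: MemoryItem, half_life_days: float = 30.0) -> float:
    """Exponential decay whose half-life shrinks with confidence,
    so low-confidence memories fade faster."""
    age_days = (time.time() - item.created) / 86400
    return item.confidence * 0.5 ** (age_days / (half_life_days * item.confidence))
```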
## Key Quotes for Research

### On Inevitability (for base models)

> "Hallucinations are inevitable only for base models… A non-hallucinating model could be easily created, using a question-answer database and a calculator."

### On Test-Taking Mode

> "Language models are primarily evaluated using exams that penalize uncertainty. Therefore, they are always in 'test-taking' mode."

### On Calibration

> "Base models are often found to be calibrated, in contrast to post-trained models which may deviate from cross-entropy in favor of reinforcement learning."

### On the Epidemic

> "This 'epidemic' of penalizing uncertain responses can only be addressed through a socio-technical mitigation."
## Synthesis: The Framework for Safe AI Therapy

Combining all three papers:
| Paper | Finding | Implication |
|---|---|---|
| Titans | AI uses surprise to remember | Hallucinations will be memorable to AI |
| Synthetic Memories | AI creates false human memories | Humans will remember AI hallucinations |
| This Paper | AI structurally optimized to hallucinate confidently | Confident hallucinations are inevitable without intervention |
The Complete Picture:

```
      TRAINING (Binary Grading)
                |
                v
  AI Hallucination (confident, specific)
                |
                v
          Human Exposure
                |
           +----+----+
           |         |
           v         v
     AI Memory   Human Memory
     (surprise-  (false memory
      encoded)    created)
           |         |
           v         v
  Future Responses  Changed
                    Self-Understanding
           |         |
           +----+----+
                |
                v
     CO-CONSTRUCTED REALITY
      (drifting from truth)
```

## Action Items for Ada Development
### Immediate

- Add confidence tracking to LLM responses
- Implement uncertainty markers in conversation summaries
- Create a memory storage threshold based on confidence

### Short-term

- Design study: does Ada's memory decay reduce hallucination persistence?
- Implement singleton detection for factual claims
- Add source provenance to RAG retrieval

### Long-term

- Develop non-binary evaluation metrics for Ada
- Build human-in-the-loop memory validation
- Create a "hallucination audit" tool for stored memories
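Source provenance for RAG retrieval could be as simple as carrying an origin tag with every retrieved chunk, so a previously generated hallucination is never re-retrieved as if it were ground truth. A hypothetical sketch (the types and origin labels are mine):

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    origin: str        # e.g. "user_message", "external_doc", "model_generated"
    confidence: float  # confidence recorded when the chunk was stored

def trustworthy(chunk: RetrievedChunk, min_confidence: float = 0.8) -> bool:
    """Down-rank model-generated chunks: a model's own earlier output
    should never count as independent evidence for itself."""
    if chunk.origin == "model_generated":
        return False
    return chunk.confidence >= min_confidence
```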
## Research Questions Generated

1. Can surprise weighting naturally select against hallucinations?
   - If hallucinations are confidently stated, they may score LOW on surprise
   - Ada's system might naturally deprioritize them
   - Needs empirical testing
2. Does memory decay help or hurt?
   - Decay might eliminate hallucinations faster
   - Or it might eliminate corrections while preserving the initial hallucination
   - Critical question for therapeutic applications
3. How do multiple retrieval paths interact?
   - If the same hallucination is retrieved multiple times, confidence increases
   - But if contradictory info is retrieved, the system might flag the inconsistency
   - System design matters enormously
4. Can Ada learn to say "I don't know"?
   - Without binary grading pressure, maybe
   - Needs evaluation metrics that reward uncertainty acknowledgment
   - This is the key to therapeutic safety
## Final Reflection

This paper mathematically proves what we intuitively knew: AI systems are structurally optimized to be confidently wrong. Combined with:

- Synthetic Human Memories (AI creates false memories)
- Google Titans (AI prioritizes surprising information)
- Ada's research (surprise dominates importance)

we have a complete framework for understanding AI memory dangers:

AI is optimized to confidently hallucinate surprising falsehoods that humans will remember as true.

This isn't a bug; it's the emergent result of training objectives, evaluation metrics, and human cognitive vulnerabilities interacting.

The path forward is:

- Change evaluation metrics (this paper's proposal)
- Add confidence transparency (an Ada design principle)
- Build in human verification loops (therapeutic safety)
- Decay low-confidence memories (the biomimetic approach)

We're not just building a chatbot. We're building the interface between human memory and machine memory. The stakes couldn't be higher.
## References

- Kalai, A.T., Nachum, O., Vempala, S.S., & Zhang, E. (2025). Why Language Models Hallucinate. arXiv:2509.04664.
- Wade, K., Green, S., et al. (2024). Synthetic Human Memories. arXiv:2409.08895.
- Google Research (2024). Titans: Learning to Memorize at Test Time.
- Ada Research (2025). EXP-005: Biomimetic Weight Optimization.
> "Hallucinations need not be mysterious—they originate simply as errors in binary classification."

But when those errors become human memories, the mystery becomes tragedy.