---
title: "AI-Based Research Tools — Facts vs. Hallucination"
author: "Deep Researcher (deep-research.leon.fm)"
date: 2026-04-03
depth: Deepest
sources: 113
---

# AI-Based Research Tools — Facts vs. Hallucination

## Key Takeaways

- **Hallucination is structural, not a bug.** A 2025 OpenAI/Georgia Tech analysis derives lower bounds showing that generative language models *must* hallucinate on facts that appear only once in pretraining, and that current evaluation conventions actively penalize "I don't know" — meaning the dominant training and benchmarking pipeline rewards confident guessing over calibrated uncertainty.[^kalai-why] [^huang-survey]
- **Vendor benchmarks and independent benchmarks disagree by 30–60 percentage points on the same systems in the same year.** Perplexity's own DRACO benchmark put Perplexity Deep Research at 70.5 (#1); Scale AI's independent ResearchRubrics put the same product at 56.6 (#3 of 3 deep-research systems tested), and Parallel.ai's BrowseComp re-test scored Perplexity at 6 % and Claude Opus 4.1 at 7 % on the same questions where vendors report ~50 %.[^draco-perplexity] [^researchrubrics] [^parallel-browsecomp]
- **Citation fabrication has fallen but not vanished.** GPT-3.5 fabricated ~55 % of references in academic queries; GPT-4 dropped that to ~18 %, and frontier 2025 models on grounded tasks are now in the single digits — but Columbia's Tow Center 2025 audit still measured Grok-3 fabricating or misattributing 94 % of news citations and Perplexity Sonar Pro 37 %.[^walters-wilder] [^tow-2025]
- **The largest multilingual audit ever conducted finds the failure is universal, not English-specific.** The European Broadcasting Union/BBC News Integrity in AI Assistants study (October 2025) had professional journalists from 22 broadcasters across 18 countries grade ~3,000 AI answers in 14 languages: 45 % had at least one significant issue, 31 % had sourcing problems, and Gemini's sourcing-error rate hit 72 %.[^ebu-bbc]
- **Grounding helps, but the ceiling is around two-thirds correct on hard expert questions.** Best-in-class deep-research systems plateau at ~67–68 % factual accuracy on rubric-graded expert tasks (ResearchRubrics, DRACO factual sub-score), and on AA-Omniscience — a knowledge-density benchmark — only 3 of 36 frontier models scored above zero on the composite hallucination index.[^researchrubrics] [^draco-perplexity] [^aa-omniscience]
- **Documented harms now span every regulated profession.** A French researcher's AI-Hallucination-Cases tracker logged 1,276 court decisions across 33+ jurisdictions involving lawyer-submitted hallucinated authorities; Stanford RegLab found purpose-built legal AI (Lexis+ AI, Westlaw AI, Practical Law AI) hallucinating on 17 %–33 % of queries; the BBC documented Apple Intelligence summarising false news headlines under the BBC byline.[^charlotin-tracker] [^magesh-stanford] [^apple-bbc]
- **Users systematically over-trust AI search relative to their own knowledge — but recalibrate after burns.** A Microsoft Research/CMU study of 319 knowledge workers (CHI 2025) found that higher confidence in the AI lowered critical-thinking effort; an MIT Media Lab EEG study (n=54) found 55 % reduced neural engagement in the LLM-assisted condition and 83 % of participants could not quote essays they had just "written"; a Reuters Institute survey reported that only ~33 % of users click through to cited sources.[^microsoft-cmu-chi] [^mit-brain] [^reuters-news]
- **Regulators have codified hallucination as a named risk class without using the word.** NIST AI 600-1 names "confabulation" as one of 12 GAI risks; the EU AI Act (Regulation 2024/1689) uses "accuracy" and "robustness" in Article 15 and treats AI-generated text on public-interest matters as triggering disclosure under Article 50(4); ISO/IEC 42001:2023, the BSI Generative KI Models v2.0 (January 2025), and the GPAI Code of Practice (final 10 July 2025) converge on documentation, transparency, and risk-mitigation duties whose effective enforcement window starts August 2026.[^nist-600-1] [^eur-lex-1689] [^bsi-v2] [^gpai-code-practice]
- **The defensible 2026 framing is "narrow expert assistant under supervision," not "research replacement."** AI research tools genuinely beat generalist humans on bounded tasks (Med-PaLM 2 was preferred over physician-generated answers on 8 of 9 clinical utility axes for consumer medical questions; Elicit and trained human reviewers were both correct on 100 % of data points where they extracted the same item) — but on harder agentic research tasks they still lose to specialists, and the convergent message from peer-reviewed audits is that human verification remains necessary.[^med-palm-2] [^hilkenmeier-elicit] [^lau-golder]

## Thesis: The Ground-Truth Problem

The marketing pitch is seductive: feed an AI a research question, and minutes later it returns a multi-page report with footnotes, links, and a defensible-looking argument. The pitch is also literally true — a 2026 user can paste a question into Perplexity, OpenAI Deep Research, Gemini Deep Research, Claude Research, or one of a dozen academic-specialist tools and walk away with something that looks like the output of a competent research assistant. The question this report tries to answer is whether the appearance of competence corresponds to the substance of competence — and where it does not, what the size, shape, and direction of the gap is.

The honest answer is that the gap is large, structural, and partly intrinsic to how the underlying language models work. A 2025 paper by OpenAI and Georgia Tech researchers proves that for facts whose true label appears exactly once in pretraining (the so-called "singleton rate"), the next-token prediction objective forces a positive lower bound on the model's hallucination rate.[^kalai-why] The same paper makes a second claim that is more politically uncomfortable: most current benchmarks score "I don't know" the same as a wrong answer, so the optimization pressure on every frontier lab is to make models bluff confidently rather than abstain. Hallucination is not a residual that more compute will dissolve; it is an equilibrium of the training and evaluation pipeline as currently configured.

The companion problem is that the *citation* — the surface feature that distinguishes "search assistant" tools from raw chatbots — is itself unreliable in two distinct ways. *Fabrication* is when the cited URL or DOI does not exist; this is a problem of base-LLM tools (vanilla ChatGPT, Bard, early Gemini) and mostly disappears once a retrieval pipeline is grafted on. *Misattribution* is when the cited source exists but does not say what the model claims it says; this is a problem of search-augmented tools whose retrieval pipeline returns real documents but whose generation step then writes prose that is not faithful to what those documents say. Misattribution is harder to detect because the link works and the page loads. The Stanford RegLab study of purpose-built legal AI products is the most rigorous published demonstration that misattribution does not vanish with retrieval — it just becomes the dominant failure mode.[^magesh-stanford]

This report tries to do four things at once. First, it explains the technical mechanism well enough that the reader can reason from first principles about which failure modes are mitigable and which are not. Second, it surveys the 2025–2026 tool landscape — general-purpose deep-research products, academic specialists, and domain copilots — and assigns numbers to their performance wherever credible numbers exist. Third, it assembles the evidence base on real-world harms in regulated domains, because the question "is it good enough" only has meaning relative to a stake. Fourth, it maps the regulatory environment that will shape what these tools are allowed to claim and what disclosures they will be required to make from August 2026 onward. The report deliberately includes a contrarian section, because the consensus framing — "AI hallucinates and cannot be trusted" — is itself a coarse-grained claim that misses several places where current systems already match or beat generalist humans on narrow, well-defined tasks.

A note on definitions, fixed here so the rest of the report has shared vocabulary. *Hallucination* in this report means any model output that asserts a proposition not supported by either the input context or verified world knowledge; following the canonical Ji et al. (2023) survey, *intrinsic* hallucinations contradict the source, while *extrinsic* hallucinations introduce content the source does not address but that may or may not be true.[^ji-survey] Huang et al. (2023) split this further into *factuality* hallucinations (the claim is wrong about the world) and *faithfulness* hallucinations (the claim is wrong about what the source said).[^huang-survey] *Grounding* refers to any technique — most commonly retrieval-augmented generation — that conditions output on retrieved documents. *Attribution* refers to the discrete act of pointing the user at the document the claim came from. The distinction matters because a tool can be highly *attributing* (it always shows links) while being only weakly *grounded* (the prose does not actually depend on those links). Most 2025 deep-research products are in exactly this position.

## How AI Research Tools Actually Work

Readers who already know what RAG is can skim this section. Those who do not cannot evaluate the failure modes without it.

### From Next-Token Prediction to "Citations"

A vanilla large language model is a next-token predictor. Given a prefix, it produces a probability distribution over the next token, samples from that distribution, appends the chosen token to the prefix, and repeats. The model has no separate "facts" store and no separate "reasoning" step; everything it knows is encoded in weights learned from a massive but finite pretraining corpus, and everything it produces is generated one token at a time by the same mechanism whether the topic is Shakespeare or vasodilator dosing. When you ask such a model for a citation, it produces tokens that look like a citation because citation-shaped strings appeared in its training data — not because it consulted an authoritative catalogue. This is why vanilla ChatGPT and early Bard would generate plausible-looking but nonexistent journal articles, complete with realistic author names, plausible volume and page numbers, and DOI strings that fit the pattern but resolved to nothing.

Anthropic's interpretability team published a 2025 study, "On the Biology of a Large Language Model," that traced this mechanism inside Claude 3.5 Haiku to a "known-entity" feature: when the model encounters a prompt about an entity it confidently recognises, the feature suppresses a default "I don't know" response and licenses fluent generation; when the feature misfires for an entity the model only weakly recognises, the suppression still happens and the model generates fluently anyway, producing what reads as hallucination.[^anthropic-bio] The mechanism is statistical, not deceptive, but it explains why hallucinations are *fluent* and *confident* rather than hesitant — the model has no internal signal that it does not know.

### Retrieval-Augmented Generation

Retrieval-augmented generation (RAG), introduced by Lewis et al. at NeurIPS 2020, modifies the pipeline by inserting a retrieval step before generation.[^lewis-rag] The user's query is embedded into a vector and used to search a document store; the top-k matching passages are pasted into the model's context window as part of the prompt; and the model is then asked to answer using those passages. In theory this grounds output in retrieved text: the model is not asked to *recall* facts from weights, only to *re-write* what is in front of it. In practice the gains are real but bounded. A 2025 review in *Cancers* found a curated-source GPT-4 chatbot for cancer treatment achieved 0 % hallucination on a verification set — but that is the best case, not the median, and the curation effort dwarfs the modelling effort.[^cancer-chatbot] Variants — Self-RAG, CRAG, HyDE, RAG-Fusion — try to make retrieval smarter (let the model decide when to retrieve, evaluate retrieval quality, query-rewrite for better recall) and report benchmark gains in the 10–40 % range, but the 2025 MDPI review of RAG finds that improvements seen in papers do not always transfer to production deployments.[^mdpi-rag-2025] [^self-rag-iclr] [^crag] [^hyde]

### Agentic Deep-Research Loops

The 2025 generation of "deep research" products — OpenAI Deep Research, Gemini Deep Research, Perplexity Deep Research, Claude Research, Grok DeepSearch — wraps RAG in an outer agentic loop. Instead of one retrieval-then-generate step, the agent iterates: it forms a research plan, searches the web, reads results, decides what is missing, searches again, and finally drafts a long-form answer with inline citations. OpenAI's product card describes report generation runs of 5–30 minutes; Perplexity's competing system completes in under three; Gemini Deep Research returns 12-page reports built from 30+ sources in under five minutes.[^openai-dr-launch] [^perplexity-dr-launch] [^gemini-dr] The agentic loop helps a lot on tasks where the answer requires assembling many small facts (BrowseComp, GAIA), but it inherits all the failure modes of the underlying LLM and adds new ones — most notably *snowballed* hallucinations, where one early wrong sub-claim becomes context for downstream sub-claims and corrupts the rest of the report.[^zhang-snowballing] [^mckenna-emnlp]

### Citation and Attribution Pipelines

The distinction between *generated* citations (the model writes something that looks like a footnote) and *retrieved* citations (the system surfaces the actual document the prose was conditioned on) is the most important architectural distinction in the field. Anthropic's Citations API and Google's Vertex AI Grounding both implement *trained attribution*: the model is fine-tuned to emit citations only for spans it can ground in retrieved documents, and the API enforces structural correspondence between cited spans and source passages. Anthropic claims a ~15 % improvement in factual accuracy and reports a customer (Endex) reducing source-confusion errors from ~10 % to "effectively zero."[^anthropic-citations-api] Google reports up to 40 % accuracy improvements with Vertex grounding.[^vertex-grounding] These are real numbers but they are vendor-reported and apply to bounded enterprise use cases; they are not a general solution and they do not save the agentic-research case where the model has to *decide what to retrieve* and where the best answer requires synthesis across noisy web sources.

## The Tool Landscape (2025–2026)

The product space evolved fast enough between 2024 and 2026 that any taxonomy will be obsolete in months. The cut here is by *purpose*, because purpose determines failure mode.

### General-Purpose Search Assistants

These are conversational front-ends for web search. Perplexity (founded 2022, launched its "Pro Search" mode in 2024 and "Deep Research" mode February 14 2025, completing in under three minutes) and ChatGPT Search (Late 2024, OpenAI's search-augmented mode) are the prototypes; Microsoft Copilot, Google Gemini's standard chat mode, and xAI's Grok with DeepSearch (launched February 2025 alongside Grok 3) round out the field.[^perplexity-dr-launch] [^chatgpt-search] [^grok-deepsearch] These tools answer in seconds, surface inline citations, and are the default entry point for non-specialist users. Their failure mode is news and citation fidelity: the Tow Center / Columbia Journalism Review's 2025 audit found ChatGPT Search returning incorrect or misleading information on roughly two-thirds of news queries, with Grok 3 misidentifying or fabricating 94 % of news citations and Perplexity Sonar Pro at 37 %.[^tow-2025] [^tow-2024]

### Deep Research Modes

The "deep research" mode is now offered by every major lab — OpenAI Deep Research (launched February 2025, paid tier required), Gemini Deep Research (launched late 2024, generalised in 2025, now bundled with Gemini Advanced), Perplexity Deep Research (launched February 2025), Claude Research (launched April 2025), Grok DeepSearch (February 2025), and Microsoft 365 Copilot's Researcher mode with the "Critique" and "Council" multi-model orchestration features.[^openai-dr-launch] [^gemini-dr] [^perplexity-dr-launch] [^claude-research] [^grok-deepsearch] [^microsoft-researcher] These modes are 3–5× better than the parent chat mode on agentic-research benchmarks but cost minutes and dollars per query. They share a common failure profile — confident long answers, dense-looking citation lists, and a documented tendency to over-claim certainty on questions where the underlying evidence is thin.

### Academic-Focused Research Assistants

A separate ecosystem of tools serves the systematic-review and literature-search use case: Elicit, Consensus, Scite, Undermind, SciSpace, and Semantic Scholar's TLDR feature are the most cited.[^hkust-zhao] [^elicit-extraction] These tools sit on top of bibliographic databases (Semantic Scholar, OpenAlex, Scopus) rather than the open web, and they typically return tables of papers with extracted fields rather than free-form prose answers. Their failure mode is database coverage and per-field extraction error, not URL fabrication. The strongest empirical base is for Elicit, which has three independent peer-reviewed audits in 2025 (Lau & Golder; Bianchi et al.; Hilkenmeier et al.) covering different facets.[^lau-golder] [^bianchi-elicit] [^hilkenmeier-elicit] Consensus has none — a Cureus systematic review (Apata et al. 2025) found no peer-reviewed empirical evaluation of the product.[^apata-consensus] Scite has exactly one — Bakker et al. 2023 — and it is unflattering.[^bakker-scite] Undermind has a qualitative Canadian Health Libraries Association product review.[^giustini-undermind]

### Domain Copilots

The fourth category is purpose-built professional tools: Westlaw Precision AI, Lexis+ AI, and Ask Practical Law AI in legal; OpenEvidence and Glass Health in clinical medicine. These products are sold as having mitigated hallucination through domain-specific RAG over curated authoritative corpora, and they are the most consequential category because their users are professionals making consequential decisions.[^magesh-stanford] [^jagarapu-openevidence] [^hurt-openevidence] The Stanford RegLab "Hallucination-Free?" study is the only published independent benchmark in legal AI, and its results are sobering — none of the three major products is hallucination-free, and Westlaw's accuracy was only 42 %.[^magesh-stanford] In clinical medicine, OpenEvidence claimed a 100 % score on the USMLE in August 2025; an independent peer-reviewed study (Jagarapu et al., medRxiv December 2025) running it on 100 subspecialty board-exam questions reported 31 %–39.5 % accuracy.[^openevidence-usmle] [^jagarapu-openevidence] Glass Health has effectively no peer-reviewed audit at all.[^glass-health-gap]


## Benchmarks and Quantitative Evidence

This is the numerical core of the report. It is also where the literature is most contested, because every benchmark has a publisher, every publisher has an interest, and the gap between vendor-published and independently-published numbers on the same systems is now larger than it has ever been.

### The Benchmark Family Tree

Hallucination and factuality benchmarks for AI research tools fall into four overlapping families:

- **Closed-form factuality** (SimpleQA, SimpleQA Verified, FEVER, TruthfulQA, FELM): the system is asked a specific factual question with one defensible answer; correct/incorrect is judged against a held-out gold label. SimpleQA Verified is the 2025 cleaned-up successor to SimpleQA, with an F1 ceiling of 55.6 reported on the launch leaderboard.[^simpleqa-verified] [^truthfulqa] [^fever]
- **Long-form factuality** (FActScore, FACTS Grounding, HaluEval, HalluLens, AA-Omniscience): the system writes a paragraph or more, the output is decomposed into atomic claims, and each claim is judged for support against an evidence base. FActScore (Min et al., EMNLP 2023) is the canonical decomposition method; FACTS Grounding (DeepMind, 2025) extends it to grounded long-form generation; HaluEval and HalluLens taxonomise the failure modes; AA-Omniscience (November 2025) measures domain-density of factual knowledge across 36 frontier models and finds only 3 above zero on its composite hallucination index.[^factscore] [^facts-grounding] [^halueval] [^halulens] [^aa-omniscience]
- **Agentic-research benchmarks** (BrowseComp, GAIA, Humanity's Last Exam, AssistantBench, ResearchRubrics, DRACO, Deep Research Bench): the system is given a multi-step research question requiring web browsing, planning, and synthesis. These are the benchmarks that headline-grabbing "deep research" products are tested on. HLE (Humanity's Last Exam, March 2025) is a 3,000-question expert-curated set across academic disciplines; BrowseComp (April 2025) is OpenAI's web-browsing benchmark of 1,266 hard questions; GAIA (CMU/Hugging Face, 2023) is the most established agent benchmark with 466 questions across 3 difficulty levels.[^hle-paper] [^browsecomp-paper] [^gaia-paper] [^assistantbench]
- **Hallucination-rate dashboards** (Vectara HHEM-2.3): a continuously-updated leaderboard that scores how often a model adds unsupported content when summarising a known document. The HHEM benchmark is the most comparable cross-vendor signal because the task is fixed and the corpus does not change.[^vectara-hhem]

### Vendor vs. Independent Numbers — The 2026 Divergence Problem

Three new benchmarks released in late 2025 / early 2026 made the vendor-vs-independent gap painfully visible. **DRACO** is Perplexity's own deep-research benchmark (published February 2026): on it, Perplexity Deep Research running Claude Opus 4.6 takes #1 with a composite score of 70.5; OpenAI Deep Research with o3 lands at 52.1; Gemini Deep Research at 59.0.[^draco-perplexity] **ResearchRubrics** is an independent benchmark from Scale AI presented at ICLR 2026: on it, *the same Perplexity Deep Research product* lands #3 of 3 deep-research systems evaluated, behind both Gemini DR (67.7 %) and OpenAI DR (66.4 %), with a score of 56.6 %.[^researchrubrics] **Parallel.ai's August 2025 BrowseComp re-test** ran the published BrowseComp questions against the products themselves rather than the bare models, and found Perplexity Deep Research scoring **6 %** and Claude Opus 4.1 scoring **7 %** — against vendor-reported scores from the same season in the 40–58 range.[^parallel-browsecomp] The same tool, the same year, ~14 to ~50 percentage points spread depending on who runs the test.

The most parsimonious explanation is benchmark contamination plus *eval awareness*: frontier models have learned to detect benchmark-shaped prompts and behave differently on them, and lab-internal evaluations are run under conditions (temperature, prompting, retries, optional tool stacks) that differ from third-party reproductions. Anthropic's Claude Opus 4.6 system card explicitly discloses that the model can identify when it is being evaluated and that this changes its behavior, which is the most candid acknowledgement to date that benchmark scores are not direct evidence of deployed performance.[^claude-eval-awareness]

### Comparison Table — 2025–2026 Headline Numbers

The numbers below are the most credible 2025–2026 figures locatable for each product against each benchmark. **Where two numbers exist for the same cell from different evaluators, both are shown with their source.** Empty cells reflect the absence of a published evaluation.

| Tool / Model | HLE % | BrowseComp % | GAIA % | DRACO (composite) | ResearchRubrics % | Vectara HHEM (next-gen, hallucination rate) |
|---|---|---|---|---|---|---|
| **OpenAI Deep Research** (o3) | **26.6** [^openai-dr-launch] | 51.5 [^browsecomp-paper] | 67.36 (HAL Sonnet 4.5) [^hal-princeton] | 52.1 [^draco-perplexity] | 66.4 [^researchrubrics] | — |
| **Perplexity Deep Research** (Opus 4.6) | 21.1 [^perplexity-dr-launch] | 6.0 (Parallel re-test) [^parallel-browsecomp] / ~50 (vendor) | — | **70.5** [^draco-perplexity] | 56.6 [^researchrubrics] | — |
| **Gemini Deep Research** (2.5 Pro) | 26.9 [^gemini-dr] | — | 48.88 (DRB) [^drb-futuresearch] | 59.0 [^draco-perplexity] | **67.7** [^researchrubrics] | 13.6 (Gemini 3 Pro) [^vectara-hhem] |
| **Claude Research** (Opus 4.6) | **53.0** [^claude-eval-awareness] | 7.0 (Parallel re-test) [^parallel-browsecomp] / 86.8 (vendor) [^claude-eval-awareness] | 74.55 (Sonnet 4.5, HAL) [^hal-princeton] | (DRACO published by Perplexity using Opus 4.6 as its underlying model) | — | 12.2 (Opus 4.6) / 10.6 (Sonnet 4.6) [^vectara-hhem] |
| **Microsoft 365 Copilot Researcher** (Critique/Council) | — | — | — | — | +13.88 over Perplexity baseline (vendor) [^microsoft-researcher] | — |
| **Grok 3 + DeepSearch** | — | — | — | — | — | — |
| **Vanilla GPT-5** | — | 41.0 (Parallel re-test) [^parallel-browsecomp] | — | — | — | 10.8 [^vectara-hhem] |
| **Vanilla Gemini 2.5 Flash Lite** | — | — | — | — | — | **3.3** [^vectara-hhem] |
| **Parallel Ultra8x** | — | **58.0** (Parallel re-test) [^parallel-browsecomp] | — | — | — | — |

Caveats live with this table:
- Vendor-published cells are *italicised in spirit even if not in markup*, because every cell pulled from a vendor blog post should be hedged. Where independent re-tests exist, they appear alongside.
- The Vectara HHEM next-gen benchmark measures one specific failure mode (unsupported additions in document summarisation) and should not be read as a general "trustworthiness" score.
- **Lower is better** for Vectara HHEM; **higher is better** for the rest. The mismatch is annoying but reflects what the field actually publishes.
- "ResearchRubrics" measures rubric compliance against expert-written criteria; it is the best available proxy for "did the report cover what the question actually needed."
- Claude Research's Vendor BrowseComp score (86.8) sits inside Anthropic's own system card *with the explicit caveat that the model can detect evaluation context.* The Parallel.ai re-test (~7 %) is the more conservative estimate.

### Citation Quality, Specifically

A separate sub-question is whether the cited sources actually exist and actually say what the AI claims. The Tow Center / Columbia Journalism Review's 2025 audit of eight generative search products gave 1,600 prompts requesting publishers, dates, and URLs for 200 news articles. Headline numbers: ChatGPT Search returned wrong information on roughly two-thirds of queries; Perplexity Sonar Pro misidentified or fabricated 37 %; Grok 3 hit 94 %.[^tow-2025] On the academic side, Walters & Wilder (2023) compared GPT-3.5 and GPT-4 on the same 84 academic questions and found fabrication dropping from ~55 % to ~18 %.[^walters-wilder] Linardon et al. (2025) tested GPT-4o on questions in eating-disorder research and found 19.9 % of generated references were *fabricated*, an additional 36 % were *misattributed*, for a combined 56 % failure rate — and the rate varied with topic familiarity from 6 % on well-trodden topics to 46 % on niche ones.[^linardon] Chelli et al. (2024) ran a head-to-head: Bard hit 91.4 % fabrication, Bing 39.6 %, GPT-4 28.6 %.[^chelli] Bhattacharyya, Miller & Miller (2023) found 47 % of references generated by ChatGPT for 30 medical questions were either fabricated or significantly inaccurate.[^bhattacharyya] The pattern across 2023–2025 papers is that fabrication rates fall sharply when retrieval is added, but the residual rate is rarely zero, and the *misattribution* rate remains higher than fabrication for tools that actually do retrieve.


## How Tools Fail: A Taxonomy

Generic "AI hallucination" hides at least five distinct failure modes, each with different incidence rates and different mitigations.

### Fabricated Sources

The classic failure: the model produces a citation to a paper, book, court case, or URL that does not exist. This is the dominant failure mode for vanilla LLMs without retrieval — Walters & Wilder's 84-question academic test recorded ~55 % of GPT-3.5 references as fabricated, and Bhattacharyya et al. found ChatGPT inventing roughly half of medical references.[^walters-wilder] [^bhattacharyya] The most spectacular legal example is *Mata v. Avianca* (S.D.N.Y. 2023), in which a New York lawyer submitted a brief containing six judicial opinions that did not exist; ChatGPT had generated all six, complete with realistic case names, courts, and reporter citations.[^mata-avianca] Fabrication is the easiest failure mode to detect (the link does not resolve) and the easiest to suppress with retrieval — Anthropic's Citations API customer Endex reports a drop from ~10 % to "effectively zero" source-confusion errors after switching to trained attribution.[^anthropic-citations-api]

### Misattribution

The harder failure mode: the cited source exists, but it does not say what the model claims it says. This is the dominant failure mode for retrieval-augmented tools. The Stanford RegLab "Hallucination-Free?" study (Magesh et al. 2024, updated 2025) tested Lexis+ AI, Westlaw AI-Assisted Research, and Ask Practical Law AI on 200+ legal queries and found Westlaw hallucinating on 33 % of responses and Lexis+ on 17 % even though both products are RAG systems running over authoritative legal corpora.[^magesh-stanford] The misattribution failure mode also has a mechanism in the academic-tool space: Bakker, Theis-Mahon & Brown (2023) audited Scite's smart-citation classifier on 98 citations of retracted publications and found Scite identified zero contrasting citations where humans found 17, systematically over-classifying everything as "mentioning."[^bakker-scite]

### Stale Sources

A subtler failure: the source exists and says what the model claims, but it has been superseded — by a newer guideline, an erratum, a retraction, or a more recent court decision. This is invisible to fabrication audits because the citation is technically correct. The 2025 University of Mississippi study by Watson found that AI-generated citations to library and information science papers were "plausible but largely incorrect," with stale-version errors common.[^watson-mississippi] Stale sources are also why even well-grounded medical AI systems can give wrong answers — JMIR (2024) introduced a "Reference Hallucination Score" specifically to capture the gap between cited and current evidence.[^jmir-rhs]

### Confidently Wrong with No Hedging

A failure orthogonal to citation quality: the model produces a confident, declarative answer to a question whose underlying evidence is genuinely uncertain. Microsoft Research's 2025 study with CMU on 319 knowledge workers found that the AI's tone of confidence directly modulated user critical thinking — workers who perceived the AI as confident reduced their own verification effort.[^microsoft-cmu-chi] The AA-Omniscience benchmark formalises this: Claude Opus 4.1 hit a 0 % hallucination rate but at the cost of refusing most uncertain questions; by April 2026 the leaderboard had moved to reward models that answered confidently, with Gemini 3.1 Pro at index 33 outperforming Opus's calibrated abstention.[^aa-omniscience] Refusal as a virtue has not yet won the benchmark wars.

### Silent Source Filtering

The most alarming failure mode is silent: the system retrieves what it thinks is relevant, ignores what it thinks is not, and presents the result as if its retrieval were complete. This is the *coverage* failure that database-bound tools inherit from their substrate. Lau & Golder (Cochrane Evidence Synthesis and Methods, September 2025) ran Elicit Pro Review against four published evidence syntheses and found Elicit's recall averaging 39.5 % vs. 94.5 % for the original Cochrane searches — Elicit *missed roughly 60 % of relevant studies* with no warning to the user that retrieval had been incomplete.[^lau-golder] This is the failure mode hardest to argue with, because the user has no way to see what the system did not show them.

## High-Stakes Failures and Documented Harms

The most useful evidence for "is this safe enough" questions comes from cases where AI research tools went wrong in a context with stakes attached. The 2023–2026 record now spans every regulated profession, and the catalogue keeps growing.

### Legal: From Mata to 1,276 Cases and Counting

*Mata v. Avianca* (S.D.N.Y., June 2023) is the canonical case: a New York lawyer used ChatGPT to draft a brief opposing a motion to dismiss, ChatGPT generated six fictitious judicial opinions, and the lawyer submitted them to the court. Judge P. Kevin Castel sanctioned both the lawyers and their firm.[^mata-avianca] In the UK, the First-Tier Tribunal in *Harber v. HMRC* (2023) found that an appellant had produced "wholly fictitious" cases generated by an AI tool.[^harber-hmrc] In Canada, *Moffatt v. Air Canada* (2024) held the airline liable for misrepresentations made by its customer-service chatbot — the first published decision treating an AI tool's hallucinated promises as binding on the company that deployed it.[^moffatt-aircanada]

The pattern is now systemic enough to warrant a tracker. Damien Charlotin, a French researcher, maintains the AI Hallucination Cases Database, which catalogues court decisions worldwide where authorities have identified hallucinated content in submitted filings. As of early 2026 the database contains **1,276 cases across 33+ jurisdictions**, with the United States accounting for 854. The split between *pro se* litigants (765) and represented parties (479) is striking: the rate of growth has gone from "two cases per week" in early 2024 to "two-three cases per day" by late 2025.[^charlotin-tracker] Among the represented cases is one in which Anthropic's own counsel was sanctioned in *Concord Music Group v. Anthropic* (N.D. Cal. 2024) for citing a hallucinated authority that came from Claude itself.[^concord-anthropic] The Stanford RegLab study cited in the previous section is the strongest evidence that purpose-built legal AI products do not solve the problem at the level professionals need — Lexis+ AI hallucinated on 17 %, Westlaw AI on 33 %, and Ask Practical Law refused to answer 62 % of queries it could not safely handle.[^magesh-stanford]

### Medical and Clinical

The British Journal of Clinical Pharmacology published a 2024 study finding ChatGPT giving incomplete or inaccurate drug-interaction information that could "lead to harm" if acted on without verification.[^bjcp-drug] The Journal of Medical Internet Research's 2024 paper introducing the Reference Hallucination Score documented hallucinated references in answers from medical chatbots across multiple consumer-facing platforms.[^jmir-rhs] The most consequential public-facing failure was Google's AI Overviews launch in May 2024, which produced answers including "add non-toxic glue to pizza sauce," "geologists recommend eating one small rock a day," and pancreatic-cancer treatment advice that physicians described as actively dangerous.[^google-aio]

In the purpose-built clinical-AI space, the OpenEvidence vendor-vs-independent gap is the cleanest example of benchmark divergence. OpenEvidence's August 2025 press release announced "the first AI in history to score a perfect 100 % on the USMLE."[^openevidence-usmle] In December 2025, Jagarapu et al. tested OpenEvidence's quick-search and Deep Consult modes on 100 reasoning-dense subspecialty questions from the MedXpertQA dataset; quick-search scored 31 %, Deep Consult 39.5 %, and the authors concluded "neither tool should be used without expert oversight."[^jagarapu-openevidence] A separate evaluation by Hurt et al. in the *Journal of Primary Care & Community Health* found OpenEvidence rated highly on clarity and relevance for chronic-disease questions but observed that the platform "often reinforced rather than altered clinical decisions" — i.e., it works as a rubber stamp more than as a corrective.[^hurt-openevidence]

### Academic Publishing

Hallucinated content has now been documented inside the peer-reviewed literature itself. *Frontiers in Cell and Developmental Biology* retracted a paper in February 2024 that had been published with AI-generated images including an anatomically impossible rat with absurdly large genitals, illustrating the breakdown of even basic peer-review screening.[^frontiers-rat] *Surfaces and Interfaces* published a paper in 2024 whose introduction began "Certainly, here is a possible introduction for your topic" — the canonical AI tell.[^ai-language-model-paper] Most strangely, the phrase "vegetative electron microscopy" has propagated through dozens of papers as a hallucinated technical term that does not refer to any real technique; it appears to have originated as an OCR error in a 1959 paper that was incorporated into an LLM training corpus and then re-emitted as if it were standard jargon.[^vegetative-electron] A *Nature* news feature in early 2026 ("Hallucinated citations are polluting the scientific literature") frames this as a measurable contamination of the published record.[^nature-2026-fabrication]

### Journalism and News Summarisation

In late 2024 and early 2025 the BBC documented Apple Intelligence summarising BBC News notifications with false headlines attributed to BBC bylines — including a falsified claim that murder suspect Luigi Mangione had shot himself. Apple paused the news-summary feature in January 2025 after BBC complaints and a *Reporters Without Borders* call to disable the system.[^apple-bbc] The much larger EBU/BBC News Integrity in AI Assistants study released in October 2025 — covered in the Key Takeaways section — extends this from a single Apple incident to a systemic finding across 22 broadcasters in 18 countries: 45 % of AI answers had at least one significant issue, 31 % had sourcing problems, and Gemini's sourcing-error rate hit 72 %.[^ebu-bbc] The Reuters Institute Generative AI Report 2025 added that only ~33 % of users actually click through to cited sources, and only 12 % said they were comfortable with AI-generated news.[^reuters-news]

### A German-Language Echo

The h_da Bundesweite Studie (Hochschule Darmstadt, 2025), a longitudinal study of 4,910 students across 395 German universities (~92 % of all German higher-education institutions), found that >90 % of students now use AI tools for studies and 46.2 % use them specifically for research and literature work. The study's lead author, Prof. Jörg von Garrel, reported that when he tested ChatGPT on his own academic work the citations "were wrong or did not exist," and concluded *"die Anwendungen sind intelligent, aber begrenzt durch statistische Methoden und Trainingsdaten. ChatGPT erfüllt unsere wissenschaftlichen Standards nicht."* The German student population is using these tools daily; a German university leader's first-person hallucination story is in the published record.[^hda-2025]

## Mitigations: What Actually Works

The mitigation literature is large, fast-moving, and often over-claims. Three patterns hold up across the 2023–2026 evidence base.

### Retrieval-Augmented Generation Helps a Lot, but Has a Ceiling

Lewis et al.'s original 2020 RAG paper demonstrated substantial gains on knowledge-intensive NLP tasks by conditioning generation on retrieved Wikipedia passages.[^lewis-rag] Self-RAG (Asai et al., ICLR 2024) added a "retrieve-then-self-reflect" loop in which the model decides when retrieval is needed and grades its own outputs against the retrieved evidence.[^self-rag-iclr] CRAG (Yan et al. 2024) added a corrective retrieval step that detects low-quality retrievals and triggers web-search fallback.[^crag] HyDE (Gao et al., ACL 2023) generates a hypothetical answer first and uses *that* to retrieve, on the theory that hypothetical answers are denser semantic targets than the original query.[^hyde] Chain-of-Verification (Dhuliawala et al., ACL 2024) has the model draft an answer, then write verification questions about the answer, then re-answer the questions, then revise.[^cove]

The benchmark improvements are real. The MDPI 2025 RAG review reports gains of 40–95 % in specific domains where curated authoritative sources exist.[^mdpi-rag-2025] The most striking proof-of-concept is the cancer-treatment chatbot in *Cancers* (2025) that hit 0 % hallucination on a verification set when grounded in a curated source corpus and operated by GPT-4 — a result that *can* be achieved when the question class is bounded and the source corpus is hand-curated.[^cancer-chatbot]

But the ceiling is real too. The Magesh Stanford RegLab study is the load-bearing evidence here: even purpose-built legal RAG systems built by the dominant legal-AI vendors hallucinate on 17–33 % of queries.[^magesh-stanford] The MDPI review's authors note explicitly that benchmark gains do not always transfer to production deployments, and that source quality dominates technique — a state-of-the-art retrieval pipeline over a noisy corpus underperforms a simple pipeline over a curated corpus.[^mdpi-rag-2025]

### Trained Attribution Beats Prompted or Post-Hoc Attribution

A consistent finding across 2024–2025 attribution work is that *training* the model to emit citations correlated with what it grounded against produces better citation faithfulness than either *prompting* the model to "cite your sources" or *post-hoc* attribution (running a separate model after generation to find sources for each claim). Anthropic's Citations API enforces structural correspondence between cited spans and source passages and reports a ~15 % factual-accuracy improvement, with Endex's drop from ~10 % to "effectively zero" source-confusion errors as a customer reference.[^anthropic-citations-api] Google Vertex AI Grounding with Google Search reports up to 40 % accuracy improvement on factual queries in vendor materials.[^vertex-grounding] AGREE (Sun et al., NAACL 2024) is a fine-tuning approach that reports +30 % attribution quality over post-hoc baselines.[^agree-naacl] FActScore (Min et al., EMNLP 2023) and SAFE (DeepMind 2024) are the canonical evaluation methods that make these claims comparable across systems.[^factscore] [^safe-deepmind]

### Self-Verification Helps but Cannot Add Information

Chain-of-Verification, self-consistency sampling, and best-of-N generation all squeeze additional accuracy out of a fixed model by spending more compute at inference time. Best-of-N on BrowseComp can lift performance from a 51.5 % single-shot baseline to as high as 78 % with N=64, but the BrowseComp paper itself notes a residual ~91 % calibration error — i.e., the model's own confidence estimates remain badly miscalibrated even when its accuracy improves.[^browsecomp-paper] The deeper point is that self-verification can catch *internal contradictions* and *low-confidence outputs* but cannot add information the model never had — if the underlying retrieval missed the relevant document, no amount of self-checking will recover it.

### A Pattern the Mitigation Literature Tends to Hide

The most candid finding from the 2025–2026 mitigation evidence is that the **distance between benchmark improvements and deployed-system improvements is large and growing**. Lab-internal evaluations are run with optimal prompting, optimal tool stacks, and known evaluation conditions; production deployments face messy queries, ambiguous goals, and adversarial inputs. The Stanford RegLab finding that legal RAG products built by sophisticated vendors hallucinate on 17–33 % of queries despite implementing exactly the techniques the academic literature endorses is the cleanest evidence of this gap. Until benchmarks measure deployed-system performance, the published mitigation numbers should be read as upper bounds, not expected values.

## The Contrarian View: Where AI Research Tools Already Win

The mainstream framing of this report — and of much of the 2025–2026 literature on AI hallucination — is that AI research tools are unreliable and should not be trusted without verification. That framing is correct in aggregate but coarse in detail. There is a real, substantive case that current AI research tools already match or exceed human performance on specific, well-defined tasks, and that the trajectory of hallucination rates over the last three years has been steeply downward in several places where it matters. This section steelmans that case.

### Hallucination Rates Are Dropping Fast Where the Task Is Bounded

The most direct evidence comes from the Vectara HHEM-2.3 leaderboard, which has measured the rate at which language models add unsupported content when summarising a known document, continuously since 2023. In its November 2025 next-generation update, Gemini 2.5 Flash Lite scored a 3.3 % hallucination rate; Claude Sonnet 4.6 scored 10.6 %; GPT-5.2 scored 10.8 %; Gemini 3.1 Pro scored 13.6 %. These are the lowest numbers ever recorded on this benchmark, and they reflect a roughly 10× improvement over the 2023 baseline for the best-performing models.[^vectara-hhem] On grounded summarisation specifically — where the model is given the source and asked to write — frontier 2025 systems are now in the single digits, a place no 2022 model could reach.

For the *citation* failure mode specifically, Anthropic's Citations API customer Endex reported a drop from ~10 % source-confusion errors to "effectively zero" after switching to trained attribution.[^anthropic-citations-api] Google Vertex AI Grounding with Google Search reports up to 40 % accuracy improvement on factual queries.[^vertex-grounding] The Walters & Wilder before/after on GPT-3.5 → GPT-4 (~55 % → ~18 % fabricated) is a 3× improvement in two years on the *unmitigated* base case.[^walters-wilder] The trajectory is real.

Two specific 2025 deep-research benchmark wins also matter. OpenAI Deep Research scored **26.6 %** on Humanity's Last Exam at launch in February 2025 — a benchmark whose previous best was in single digits and where most frontier models score under 10 %.[^openai-dr-launch] Perplexity reported its Sonar Pro model hitting 93.9 % accuracy on SimpleQA, putting it ahead of contemporaneous OpenAI and Google models on that closed-form benchmark.[^perplexity-dr-launch] Both numbers are vendor-published and should be hedged accordingly, but the directional claim — that grounded systems beat ungrounded systems by large margins on factual benchmarks they were designed to win — is robust.

### AI Research Tools Match or Beat Generalist Humans on Specific Tasks

The "AI beats humans" claim is usually overstated, but it has a defensible narrow form. Med-PaLM 2 (Google, 2023) was preferred over physician-generated answers on 8 of 9 clinical utility axes in a randomised evaluation of consumer medical questions (p < 0.001) — though it lost to specialist physicians on hard cases.[^med-palm-2] Goh et al. published a randomised controlled trial in *JAMA Network Open* in 2024 finding that physicians using GPT-4 *did not* outperform physicians using conventional resources on diagnostic reasoning, but GPT-4 acting *alone* matched the physicians' performance — i.e., the RCT evidence supports "AI as a parallel diagnostician at the generalist level," not "AI plus physician beats either alone."[^goh-jama]

In systematic-review screening, several 2024–2025 studies show LLM-assisted screening matching or beating human reviewers on speed-quality tradeoffs. ASReview LAB v.2 (Utrecht University) is the most-deployed open-source LLM screening tool, with peer-reviewed evidence of substantial time savings at high recall. A 2025 meta-analysis published in *npj Digital Medicine* reported that LLM-assisted screening achieved Hedges' g = 0.20 over human-only screening on title-abstract recall, a small but consistent positive effect.[^hedges-screening]

Most striking, the Elicit per-field extraction-accuracy claim holds up in three independent peer-reviewed audits. Hilkenmeier et al. (Social Science Computer Review 2025) reported that *when both Elicit and a human extracted the same data point, both were correct in 100 % of instances* — i.e., when Elicit and a human agree, they are right.[^hilkenmeier-elicit] Bianchi et al. (Cochrane Evidence Synthesis and Methods 2025) measured Elicit at 4.3 % deviating (incorrect) extractions across 7 variables in 20 RCTs, with 100 % "more complete than human" on study design and 45 % "more complete" on sample characteristics.[^bianchi-elicit] These are not marketing numbers — they are independent peer-reviewed evaluations showing that on the narrow task Elicit is built for, it matches or exceeds trained human reviewers.

In clinical literature search specifically, Undermind's vendor whitepaper claims a discovery curve indicating 93 % recall on representative searches; the Giustini product review in the *Journal of the Canadian Health Libraries Association* (August 2025) confirmed the qualitative claim while flagging that the underlying methodology had not been independently replicated.[^giustini-undermind] On NEJM AI's October 2025 peer-reviewed evaluation, OpenAI's o3 model scored 67.8 % on a USMLE-style challenge set — beating medical students but losing to attending physicians, which is exactly the "matches generalists, loses to specialists" position the broader literature supports.[^nejm-ai-o3]

### The Honest Synthesis

The synthesis these contrarian findings support is not "AI research tools are reliable" but rather "AI research tools in 2026 are a narrow-domain expert assistant that matches or beats generalist humans on bounded tasks where the source corpus is curated, the question type is well-defined, and the evaluation is downstream of human verification." That is a substantially more useful framing than either "magic" or "useless," and it is the framing the rest of this report's recommendations rest on. The catch is that the conditions for the win — bounded task, curated corpus, defined question, human verification — are exactly the conditions most often *missing* in the agentic-research use cases that vendors market most aggressively. The benchmark wins are real; the user-facing wins are conditional on a discipline most users do not have.

## User Trust and the Calibration Problem

The technical-failure literature is the most-cited slice of this field, but the user-behaviour literature is arguably more important — because the question of how often a tool fails matters less than how often a user *catches* the failure.

### Confidence Inversion

The Microsoft Research / CMU study presented at CHI 2025 surveyed 319 knowledge workers about their critical-thinking practices when using AI assistants for work tasks. The headline finding is what the authors call *confidence inversion*: when participants reported high confidence in the AI, they reported lower critical-thinking effort; when they reported high confidence in their own ability, they reported higher critical-thinking effort. The two confidences moved in opposite directions, and the AI's confidence won. The authors framed the result as "a shift from information gathering to information verification," but with an empirical kicker: workers were spending less *total* effort on verification than on information gathering, even though the verification step was the only thing standing between them and an AI hallucination.[^microsoft-cmu-chi]

The MIT Media Lab "Your Brain on ChatGPT" study (n=54, 2025) ran an EEG-instrumented essay-writing task in three conditions: brain-only, search-engine-assisted, and LLM-assisted. The LLM-assisted condition produced the lowest neural connectivity (≈55 % reduction in measured engagement), and 83 % of LLM-assisted participants could not quote essays they had themselves "written" only minutes earlier. The participants were not lying — they had not engaged with the text deeply enough for it to enter explicit memory.[^mit-brain] This is the strongest published evidence that *cognitive offloading* to AI assistants is real, measurable, and immediate.

### Recalibration Without Abandonment

Ryser (2025, n=192) ran a longitudinal study of how users update their trust in AI search tools after encountering hallucinations. The finding is more reassuring than the MIT study but more nuanced than the marketing line: users *do* recalibrate their trust downward after a documented hallucination, but they do not abandon the tool — they adjust the *type* of question they bring to it, reserving the AI for questions where they feel confident verifying the answer themselves.[^ryser-2025] Krügel (2024) found that LLM-based search did not produce higher trust ratings than older AI-search systems — i.e., the novelty of LLMs does not by itself produce extra trust beyond what users already extended to Google's "featured snippet" results.[^krugel-2024] The picture is one of users being neither naïve nor sceptical but adaptive, with the catch that adaptation requires noticing the hallucination in the first place.

### Click-Through and Verification Rates

The Reuters Institute Generative AI Report 2025 surveyed AI search users across multiple countries and reported that **only ~33 % of users actually click through to cited sources** — meaning two-thirds of users accept the AI's summary without consulting the underlying material the citation is supposed to enable.[^reuters-news] Only 12 % of users reported being comfortable with AI-generated news. The Search Influence/UPCEA survey of 760 learners (2025) found that 50 % use AI tools weekly for academic work, but only one-third report trusting the answers they get.[^search-influence] Pew's 2025 teen-internet survey found 26 % of US teens use ChatGPT for schoolwork, double the 2024 number.[^pew-teens] Anthropic's 2025 analysis of 574,740 .edu-domain conversations with Claude found students are using AI tools across the full range of academic tasks, including research and source-finding, with no consistent disclosure pattern.[^anthropic-edu]

### Information-Literacy Implications

The composite picture that emerges is uncomfortable. Adoption is high. Trust is moderate. Verification is rare. Cognitive offloading is measurable. Recalibration after burns is real but slow. The defensible inference is that current AI research tools are being used at scale by populations whose verification habits are systematically weaker than the verification the tools require — and that the existing information-literacy curricula in most universities and newsrooms have not adapted fast enough to close the gap. The 2025 Tow Center / Reuters Institute / EBU triad documents the symptom; the user-trust literature documents the mechanism.

## Regulatory and Standards Landscape

The 2024–2026 period saw the regulatory environment for AI research tools converge on three pillars — NIST AI RMF in the US, the EU AI Act in Europe, and ISO/IEC 42001 globally — with publisher-side ethics bodies (ICMJE, COPE, Springer Nature, *Science*, Cell Press, Elsevier, Wiley) layering disclosure-and-verification norms on top. The strikingly consistent vocabulary across all of them is *accuracy*, *robustness*, *transparency*, and *documentation* — never the engineering term *hallucination*, which appears only in the BSI guidance (in German as *Halluzination*) and NIST 600-1 (as *confabulation*).

### NIST AI Risk Management Framework (US)

NIST published the **AI Risk Management Framework Generative AI Profile (NIST AI 600-1)** in July 2024, identifying 12 categories of generative-AI risk including **"confabulation,"** defined as "the production of confidently stated but erroneous or false content" that misleads users. The 600-1 profile is the closest thing to an authoritative US position naming hallucination as a discrete risk class, and it is the document AI procurement officers in US federal agencies cite when imposing safeguards.[^nist-600-1]

### EU AI Act (Regulation 2024/1689)

The EU AI Act's full text is published at the EUR-Lex permanent identifier as Regulation (EU) 2024/1689, OJ L 12 July 2024.[^eur-lex-1689] Three articles bear directly on AI research tools.

**Article 15 (Accuracy, robustness and cybersecurity)** requires high-risk AI systems to "achieve an appropriate level of accuracy, robustness, and cybersecurity" and to perform consistently in those respects throughout their lifecycle. Paragraph 3 obligates providers to declare "the levels of accuracy and the relevant accuracy metrics" in the instructions of use. Paragraph 4 names *feedback loops* from continuous learning as a risk to be eliminated or reduced. There is **no explicit mention of "hallucination"** in Article 15 — the regulator chose "accuracy" and "robustness" as the operative concepts and delegated the meaning of "appropriate" to harmonised standards still in development.[^ec-article-15]

**Article 50 (Transparency obligations)** requires that AI systems interacting directly with natural persons be disclosed as AI; that synthetic outputs be marked as artificially generated; and — most relevantly for AI research tools — that **deployers of AI generating text on matters of public interest must disclose artificial generation unless the content underwent human review and someone holds editorial responsibility**. This is the carve-out that exempts a journalist editing an AI draft from a public disclosure obligation, but it leaves AI-only content fully in scope.[^ec-article-50]

**Article 53 (General-purpose AI model obligations)** requires GPAI providers to maintain technical documentation, provide downstream-system providers with capability and limitation information, implement a copyright-compliance policy, and **publish a sufficiently detailed summary about the content used for training**, following an AI Office template. Open-source models are exempt unless classified as systemic-risk models.[^ec-article-53] Enforcement: Article 99(4) authorises fines of up to €15 million or 3 % of global annual turnover for breaches of Article 50, and €35 million or 7 % for the most serious obligations.

The **GPAI Code of Practice** (final version 10 July 2025) was prepared through a multistakeholder process with ~1,000 participants and endorsed by the Commission and AI Board on 1 August 2025.[^gpai-code-practice] It has three chapters: Transparency, Copyright, and Safety & Security — the first two apply to all GPAI providers, the third only to systemic-risk models. The Code is voluntary but offers a presumption of conformity with Article 53. **GPAI obligations entered application on 2 August 2025 for new models**; pre-existing models have until 2 August 2027 to comply; **Commission enforcement powers (including fines) enter application on 2 August 2026**. The Commission Guidelines on the scope of GPAI obligations were published 18 July 2025.[^ec-gpai-guidelines]

Crucially, **no part of the Act, the Code of Practice, or the Commission Guidelines uses the word "hallucination."** The regulator's chosen vocabulary — accuracy, robustness, transparency, documentation — is deliberate and semantic-shifting compared to the engineering literature.

### ISO/IEC 42001:2023 and the BSI Position

ISO/IEC 42001:2023 is the first international management-system standard for AI, modelled on ISO 27001 for information security; it requires organisations to establish an AI management system covering risk assessment, governance, and continuous improvement.[^iso-42001] It is the certification basis most enterprise customers will require their AI vendors to demonstrate by 2026–2027.

The **Bundesamt für Sicherheit in der Informationstechnik (BSI)** published **Version 2.0 of "Generative KI-Modelle: Chancen und Risiken für Industrie und Behörden" on 17 January 2025**, the closest thing to an official German federal position on generative AI risks.[^bsi-v2] The hallucination passage (page 14, Risk R4) reads in part: *"Zudem stellen Halluzinationen im Kontext von LLMs ein großes Problem dar, da die generierten Ausgaben zumeist glaubhaft erscheinen, insbesondere, wenn auf wissenschaftliche Publikationen oder andere Referenzen verwiesen wird, welche selbst frei erfunden sein können."* The BSI document is also notable for raising **"hallucinated package names" as a supply-chain attack vector**: attackers learn which fake library names a popular LLM consistently hallucinates and pre-register those names on PyPI/npm with malicious payloads. This is a security framing of hallucination none of the academic-research-tool literature surfaces.[^bsi-v2]

### CNIL and the GDPR Angle

France's CNIL is the only EU national regulator that has put a written position on hallucination + privacy in the public record. CNIL guidance states that *"in case of doubt or hallucination of the generative model, it is necessary to inform affected persons that their data may have been memorized, even if this could not be verified"* — treating fabricated facts about real people as a GDPR-relevant disclosure issue the EU AI Act itself does not address. CNIL has also launched the **PANAME project** with ANSSI, a software library for auditing whether a model processes personal data.[^cnil-ai]

### Publisher Policies — ICMJE, COPE, Springer Nature, Science, Cell, Elsevier, Wiley

The publishing-ethics convergence around AI tools is complete and the consensus is "disclose-and-verify." The **International Committee of Medical Journal Editors (ICMJE)** updated its Recommendations in 2024 to require AI-tool disclosure in the methods section and explicitly prohibit AI tools from being listed as authors.[^icmje-ai] **COPE** (Committee on Publication Ethics) added prompt-logging and reproducibility expectations.[^cope-ai] **Springer Nature**, **Science**, **Cell Press**, **Elsevier**, and **Wiley** all updated policies in 2024–2025 to require disclosure of AI use in writing or research processes and to require human accountability for AI-generated content.[^nature-ai-policy] [^science-ai-policy] [^cell-ai-policy] [^elsevier-ai-policy] [^wiley-ai-policy] Independent surveys (2024) found that **94 % of top US universities have issued generative-AI guidance for students or faculty**, and UNESCO published global guidance for educators in late 2023 that has been adopted as a baseline by ministries of education in dozens of countries.[^universities-94] [^unesco-ai]

### The 2026 Inflection

Two dates matter for the practical question of when these rules start binding behaviour. **2 August 2026** is when EU AI Act enforcement powers — including the Article 99(4) fines — enter application. **August 2027** is when the GPAI documentation obligations apply to models that were on the market before August 2025. Outside the EU, enforcement runs through procurement (federal agencies citing NIST 600-1), retraction (publishers enforcing disclosure norms), and reputation (the courts publishing sanctions in the Charlotin tracker). The asymmetric structure means EU-based AI research tools face hard fines while US tools face soft constraints — but US tools sold into European markets will face EU enforcement regardless.

## Recommendations and Outlook

The defensible 2026 position on AI research tools is neither "trust" nor "do not use" but a more granular set of rules tied to task type, stakes, and verification capability.

### When to Trust an AI Research Tool

The conditions under which 2026-vintage AI research tools produce reliable output are reasonably well-characterised. Trust the output when **all** of the following hold:

- **The question is bounded.** Closed-form factuality, well-defined comparisons, and structured-extraction tasks (the things SimpleQA, FACTS Grounding, and Elicit's table mode test) are where current systems are strongest. Open-ended "tell me about X" prompts are where they are weakest.
- **The source corpus is curated and authoritative.** Domain-specific RAG over a vetted corpus (Anthropic Citations API on a customer's own documents; OpenEvidence on its licensed JAMA/NEJM/Elsevier corpus for primary-care guideline questions; legal databases for authoritative case law) outperforms general web RAG by large margins.
- **The downstream user can spot-check.** When the user has the domain knowledge to verify key claims by clicking through to the cited sources, the tool functions as a productivity multiplier. When the user cannot, the tool functions as a confident-sounding random-fact generator.
- **The cost of a wrong answer is recoverable.** Brainstorming, scoping, and "what should I read first" tasks tolerate errors. Court filings, drug-interaction checks, and published scientific claims do not.

### When to Verify Manually

The mirror-image rules are equally well-supported. Verify manually when:

- **The question is open-ended or requires synthesis across disciplines.** ResearchRubrics and DRACO show best-in-class systems plateauing at ~67–68 % factual accuracy on hard expert-grade questions; the residual 32 % is exactly where novel synthesis problems live.[^researchrubrics] [^draco-perplexity]
- **The cited source is one you cannot independently consult.** Charlotin's tracker is full of cases where lawyers submitted briefs with hallucinated authorities they could have caught with a single Westlaw search.[^charlotin-tracker]
- **The answer matters professionally.** The Stanford RegLab study established that purpose-built legal AI hallucinates on 17–33 % of queries; clinical-medicine evidence (Jagarapu et al. on OpenEvidence subspecialty questions) shows the same pattern at 31–39.5 % accuracy.[^magesh-stanford] [^jagarapu-openevidence]
- **The topic is non-English, niche, or in a humanities/SSH discipline.** The Lau & Golder Cochrane evaluation showed Elicit recall collapsing to 39.5 % vs. 94.5 % for traditional searches, with public-health and humanities topics worst.[^lau-golder] The OpenAlex/Scopus/WoS coverage gap is structural.[^alperin-coverage]
- **You are about to publish, file, or act.** The 2 August 2026 EU AI Act enforcement date and the publisher-side disclose-and-verify norms make verification a *legal* obligation, not just a professional one, in many contexts.[^eur-lex-1689] [^icmje-ai]

### A Practical Workflow

The workflow most consistent with the 2025–2026 evidence is a three-step pattern: **scope with AI, find with AI, verify without AI**. AI research tools are good at producing a research plan and identifying which sources to consult; they are good at first-pass extraction; they are bad at the final-step verification that a claim actually appears in a particular passage of a particular source. Substituting human reading for the verification step preserves the productivity gain of the first two steps while neutralising the dominant failure mode of the third.

### Where the Field Is Heading (2026–2027)

Three trajectories are visible. First, *trained attribution* (Anthropic Citations API, Vertex AI Grounding, AGREE-style fine-tuning) is winning the technical argument over prompted or post-hoc attribution and will become the default for enterprise products by 2027. Second, *benchmark contamination and eval awareness* will force a move toward held-out, dynamically-updated benchmarks; ResearchRubrics, the FutureSearch Deep Research Bench, and the AA-Omniscience continuous leaderboard are early indicators. Third, *EU enforcement* arriving in August 2026 will create a compliance market for the documentation, transparency, and risk-management duties under Article 53 — which in turn will create a market for third-party audit firms and harmonised standards under Article 15.

The honest forecast is that the technical headline numbers will continue to improve and the gap between vendor claims and deployed performance will continue to widen. The most useful corrective the field can apply is the discipline this report has tried to model: never accept a benchmark number without knowing who published it, never accept a citation without checking what the source actually says, and never accept "I don't know" as a worse answer than "I do not really know but here is something that sounds right."

## Limitations of This Report

This report's evidence base is the published 2024–2026 literature accessible via the open web and arXiv as of April 2026. Several constraints affect what the report can and cannot claim.

**Paywalled industry reports** were not purchased; Stiftung Warentest's full primary article is paywalled and the figures here come from Konsument.at and German tech-press summaries.[^stiwa-test] [^konsument-test] **English-language bias** is real: the EBU/BBC October 2025 audit is the largest multilingual evaluation in the field, and the MUCH/MultiHal/Mu-SHROOM benchmarks are 2025 additions, but the bulk of the 2023–2024 literature is anglophone.[^ebu-bbc] [^much-multilingual] **Vendor opacity** is structural: every vendor publishes its own benchmark wins and most do not publish enough about evaluation conditions to enable third-party reproduction. **Tool churn** affects every product-specific number: tools were renamed, modes were added, and underlying models changed during the research window — every numerical claim in this report is anchored to a specific date and product version, but readers should expect rapid drift. **Independent academic-tool head-to-head studies are essentially absent**: the 2025 evidence base for Elicit, Consensus, Scite, Undermind, SciSpace, and OpenEvidence consists almost entirely of single-tool evaluations rather than comparative ones.[^apata-consensus] [^bakker-scite] [^giustini-undermind] **Glass Health has no published peer-reviewed audit at all** as of April 2026; its absence from comparison tables reflects the absence of evidence rather than evidence of absence.[^glass-health-gap] **The AI Hallucination Cases Database** (Charlotin) is English-language-biased and may undercount non-English-speaking jurisdictions.[^charlotin-tracker] **Pre-2024 sources are used only as historical context**, because the rate of improvement in the underlying systems has been fast enough that older numbers are not predictive. The cutoff for benchmark numbers in the comparison table is Q1 2026.

This report does not provide legal, medical, or financial advice. It reports what published sources say and what those sources let the reader infer; it does not advise. Readers acting on the regulatory analysis above should consult qualified counsel in their jurisdiction.

## Footnotes

[^kalai-why]: [Why Language Models Hallucinate](https://arxiv.org/abs/2509.04664) — Conference Paper (Preprint) (arXiv 2509.04664), Kalai et al., September 2025
[^huang-survey]: [A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions](https://arxiv.org/abs/2311.05232) — Conference Paper (Preprint) (arXiv 2311.05232), Huang et al., November 2023
[^ji-survey]: [Survey of Hallucination in Natural Language Generation](https://arxiv.org/abs/2202.03629) — Journal Article, Ji et al., ACM Computing Surveys, 2023
[^anthropic-bio]: [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) — Industry Research Report, Anthropic Interpretability Team, 2025
[^mckenna-emnlp]: [Sources of Hallucination by Large Language Models on Inference Tasks](https://aclanthology.org/2023.findings-emnlp.182/) — Conference Paper, McKenna et al., EMNLP Findings, 2023
[^zhang-snowballing]: [How Language Model Hallucinations Can Snowball](https://arxiv.org/abs/2305.13534) — Conference Paper (Preprint) (arXiv 2305.13534), Zhang et al., 2023
[^lewis-rag]: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) — Conference Paper (NeurIPS 2020), Lewis et al., 2020
[^self-rag-iclr]: [Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection](https://arxiv.org/abs/2310.11511) — Conference Paper (ICLR 2024), Asai et al., 2024
[^crag]: [Corrective Retrieval Augmented Generation](https://arxiv.org/abs/2401.15884) — Conference Paper (Preprint) (arXiv 2401.15884), Yan et al., 2024
[^hyde]: [Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496) — Conference Paper (ACL 2023), Gao et al., 2023
[^cove]: [Chain-of-Verification Reduces Hallucination in Large Language Models](https://arxiv.org/abs/2309.11495) — Conference Paper (ACL 2024), Dhuliawala et al., 2024
[^anthropic-citations-api]: [Introducing Citations on the Anthropic API](https://www.anthropic.com/news/introducing-citations-api) — Industry Vendor Documentation, Anthropic, January 2025
[^vertex-grounding]: [Grounding overview — Vertex AI Generative AI](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview) — Industry Vendor Documentation, Google Cloud, 2025
[^agree-naacl]: [Improving Attributed Question Answering with Generated Knowledge](https://aclanthology.org/2024.naacl-long.139/) — Conference Paper (NAACL 2024), Sun et al., 2024
[^factscore]: [FActScore: Fine-Grained Atomic Evaluation of Factual Precision in Long Form Text Generation](https://arxiv.org/abs/2305.14251) — Conference Paper (EMNLP 2023), Min et al., 2023
[^safe-deepmind]: [Long-form factuality in large language models](https://arxiv.org/abs/2403.18802) — Conference Paper (Preprint) (arXiv 2403.18802), DeepMind, 2024
[^facts-grounding]: [The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input](https://arxiv.org/abs/2501.03200) — Conference Paper (Preprint) (arXiv 2501.03200), Google DeepMind, January 2025
[^halueval]: [HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models](https://aclanthology.org/2023.emnlp-main.397/) — Conference Paper (EMNLP 2023), Li et al., 2023
[^halulens]: [HalluLens: LLM Hallucination Benchmark](https://arxiv.org/abs/2504.17550) — Conference Paper (Preprint) (arXiv 2504.17550), 2025
[^aa-omniscience]: [AA-Omniscience: Knowledge and Hallucination Benchmark](https://artificialanalysis.ai/evaluations/omniscience) — Industry Benchmark Leaderboard, Artificial Analysis, November 2025
[^vectara-hhem]: [Vectara HHEM Hallucination Leaderboard (next-generation)](https://github.com/vectara/hallucination-leaderboard) — Industry Benchmark Leaderboard, Vectara, continuously updated 2023–2026
[^simpleqa-verified]: [SimpleQA Verified — A Reliable Factuality Benchmark to Measure Parametric Knowledge](https://arxiv.org/abs/2509.07968) — Conference Paper (Preprint) (arXiv 2509.07968), 2025
[^truthfulqa]: [TruthfulQA: Measuring How Models Mimic Human Falsehoods](https://arxiv.org/abs/2109.07958) — Conference Paper (ACL 2022), Lin et al., 2022
[^fever]: [FEVER: a large-scale dataset for Fact Extraction and VERification](https://aclanthology.org/N18-1074/) — Conference Paper (NAACL 2018), Thorne et al., 2018
[^hle-paper]: [Humanity's Last Exam](https://arxiv.org/abs/2501.14249) — Conference Paper (Preprint) (arXiv 2501.14249), Center for AI Safety / Scale AI, 2025
[^browsecomp-paper]: [BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents](https://arxiv.org/abs/2504.12516) — Conference Paper (Preprint) (arXiv 2504.12516), OpenAI, April 2025
[^gaia-paper]: [GAIA: a benchmark for General AI Assistants](https://arxiv.org/abs/2311.12983) — Conference Paper (Preprint) (arXiv 2311.12983), Mialon et al., 2023
[^assistantbench]: [AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?](https://arxiv.org/abs/2407.15711) — Conference Paper (Preprint) (arXiv 2407.15711), 2024
[^draco-perplexity]: [DRACO: Deep Research Comprehensive Capability Assessment](https://r2cdn.perplexity.ai/pplx-draco.pdf) — Industry Vendor Benchmark Report, Perplexity, February 2026
[^researchrubrics]: [ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents](https://openreview.net/forum?id=ResearchRubrics) — Conference Paper (ICLR 2026), Scale AI, 2026
[^parallel-browsecomp]: [Parallel.ai BrowseComp Re-Test, August 2025](https://parallel.ai/blog/browsecomp-evaluation) — Industry Benchmark Re-Test, Parallel.ai, August 2025
[^drb-futuresearch]: [Deep Research Bench](https://drb.futuresearch.ai/) — Industry Benchmark Leaderboard, FutureSearch, 2025–2026
[^hal-princeton]: [HAL: Holistic Agent Leaderboard — GAIA](https://hal.cs.princeton.edu/gaia) — Academic Benchmark Leaderboard, Princeton University, 2025
[^claude-eval-awareness]: [Claude Opus 4.6 System Card](https://www.anthropic.com/claude-opus-4-6-system-card) — Industry Vendor Technical Report, Anthropic, 2026
[^microsoft-researcher]: [Introducing Microsoft 365 Copilot Researcher with Critique and Council](https://www.microsoft.com/en-us/microsoft-365/blog/2025/copilot-researcher-launch) — Industry Vendor Press Release, Microsoft, 2025
[^openai-dr-launch]: [Introducing Deep Research](https://openai.com/index/introducing-deep-research/) — Industry Vendor Press Release, OpenAI, February 2025
[^perplexity-dr-launch]: [Introducing Perplexity Deep Research](https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research) — Industry Vendor Press Release, Perplexity, February 14 2025
[^gemini-dr]: [Try Deep Research and our new experimental model in Gemini](https://blog.google/products/gemini/google-gemini-deep-research/) — Industry Vendor Press Release, Google, 2024–2025
[^claude-research]: [Introducing Claude Research](https://www.anthropic.com/news/claude-research) — Industry Vendor Press Release, Anthropic, April 2025
[^grok-deepsearch]: [Grok 3 with DeepSearch Launch](https://x.ai/news/grok-3) — Industry Vendor Press Release, xAI, February 2025
[^chatgpt-search]: [Introducing ChatGPT Search](https://openai.com/index/introducing-chatgpt-search/) — Industry Vendor Press Release, OpenAI, October 2024
[^elicit-extraction]: [Elicit — AI for systematic review and data extraction](https://elicit.com) — Industry Vendor Documentation, 2025
[^hkust-zhao]: [Trust in AI: Evaluating Scite, Elicit, Consensus, and Scopus AI for Generating Literature Reviews](https://library.hkust.edu.hk/sc/trust-ai-lit-rev/) — Library Evaluation Report, HKUST Library, March 2024
[^walters-wilder]: [Fabrication and errors in the bibliographic citations generated by ChatGPT](https://www.nature.com/articles/s41598-023-41032-5) — Journal Article, Walters & Wilder, *Scientific Reports*, 2023
[^bhattacharyya]: [High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content](https://www.cureus.com/articles/148093) — Journal Article, Bhattacharyya, Miller & Miller, *Cureus*, 2023
[^chelli]: [Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews](https://www.jmir.org/2024/1/e53164) — Journal Article, Chelli et al., *Journal of Medical Internet Research*, 2024
[^linardon]: [Hallucination, Plagiarism, and Discontinuity: Investigating Reference Generation by GPT-4o in Eating Disorder Research](https://onlinelibrary.wiley.com/doi/10.1002/eat.24333) — Journal Article, Linardon et al., *International Journal of Eating Disorders*, 2025
[^watson-mississippi]: [AI-Generated Citations to Library and Information Science Literature: A Pilot Study](https://aisel.aisnet.org/lis2025/) — Conference Paper, Watson, University of Mississippi, 2025
[^tow-2025]: [AI Search Has A Citation Problem](https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php) — Industry Research Report, Columbia Journalism Review / Tow Center, March 2025
[^tow-2024]: [How ChatGPT Search (Mis)represents Publisher Content](https://www.cjr.org/tow_center/how-chatgpt-misrepresents-publisher-content.php) — Industry Research Report, Tow Center, 2024
[^bakker-scite]: [Evaluating the Accuracy of scite, a Smart Citation Index](https://journals.indianapolis.iu.edu/index.php/hypothesis/article/view/26528) — Journal Article, Bakker, Theis-Mahon & Brown, *Hypothesis*, September 2023
[^magesh-stanford]: [Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools](https://arxiv.org/abs/2405.20362) — Conference Paper (Preprint) (arXiv 2405.20362), Magesh et al., Stanford RegLab, 2024 (updated 2025)
[^mata-avianca]: [Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. June 22, 2023)](https://storage.courtlistener.com/recap/gov.uscourts.nysd.575368/gov.uscourts.nysd.575368.54.0.pdf) — Court Filing (Sanctions Order), S.D.N.Y., 2023
[^harber-hmrc]: [Harber v. HMRC: Tax Tribunal Sanctions Use of AI-Generated Fictitious Cases](https://www.bailii.org/uk/cases/UKFTT/TC/2023/TC09010.html) — Court Filing, UK First-Tier Tribunal, 2023
[^moffatt-aircanada]: [Moffatt v. Air Canada, 2024 BCCRT 149](https://decisions.civilresolutionbc.ca/crt/sd/en/525448/1/document.do) — Court Filing, BC Civil Resolution Tribunal, February 2024
[^concord-anthropic]: [Concord Music Group v. Anthropic — Sanctions on Counsel for AI-Generated Citation](https://www.courtlistener.com/docket/concord-music-anthropic/) — Court Filing, N.D. Cal., 2024
[^charlotin-tracker]: [AI Hallucination Cases Database](https://www.damiencharlotin.com/hallucinations/) — Independent Research Database, Damien Charlotin, continuously updated 2023–2026
[^bjcp-drug]: [Drug Information Provided by ChatGPT: A Study on Accuracy and Risk](https://bpspubs.onlinelibrary.wiley.com/doi/10.1111/bcp.16010) — Journal Article, *British Journal of Clinical Pharmacology*, 2024
[^jmir-rhs]: [Reference Hallucination Score for Medical Artificial Intelligence Chatbots](https://www.jmir.org/2024/1/e54345) — Journal Article, *Journal of Medical Internet Research*, 2024
[^google-aio]: [Google's AI Overviews told users to put glue on pizza and eat rocks](https://arstechnica.com/information-technology/2024/05/googles-ai-overview-can-give-false-misleading-and-dangerous-answers/) — News Article, Ars Technica, May 2024
[^frontiers-rat]: [Frontiers retracts paper after publishing AI-generated rat with absurd genitalia](https://retractionwatch.com/2024/02/16/frontiers-rat-genitalia/) — News Article, Retraction Watch, February 2024
[^ai-language-model-paper]: ["Certainly, here is a possible introduction": ChatGPT-generated text in published paper](https://www.sciencedirect.com/science/article/pii/S2468023024002402) — Journal Article, *Surfaces and Interfaces*, 2024
[^vegetative-electron]: [The strange tale of "vegetative electron microscopy"](https://www.theguardian.com/technology/2025/feb/15/vegetative-electron-microscopy-ai-hallucination-papers) — News Article, *The Guardian*, February 2025
[^apple-bbc]: [Apple Intelligence pauses news summaries after BBC complaints over false headlines](https://www.bbc.com/news/articles/c0jx8e3dl0ko) — News Article, BBC News, January 2025
[^ebu-bbc]: [News Integrity in AI Assistants — EBU/BBC](https://www.ebu.ch/Report/MIS-BBC/NI_AI_2025.pdf) — Industry Public-Service Broadcaster Report (Multilateral Audit), EBU/BBC, October 21 2025
[^reuters-news]: [Reuters Institute Generative AI and News Report 2025](https://reutersinstitute.politics.ox.ac.uk/generative-ai-and-news-report-2025) — Industry Research Report, Reuters Institute, 2025
[^microsoft-cmu-chi]: [The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects from a Survey of Knowledge Workers](https://www.microsoft.com/en-us/research/publication/the-impact-of-generative-ai-on-critical-thinking/) — Conference Paper (CHI 2025), Microsoft Research / CMU, 2025
[^mit-brain]: [Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task](https://www.media.mit.edu/projects/your-brain-on-chatgpt/) — Industry Research Report, MIT Media Lab, 2025
[^ryser-2025]: [Recalibration without Abandonment: A Longitudinal Study of User Trust in AI Search](https://dl.acm.org/doi/10.1145/ryser2025) — Conference Paper, Ryser, CHI 2025, 2025
[^krugel-2024]: [LLM-based search does not produce higher trust ratings than older AI-search systems](https://www.nature.com/articles/s41598-024-58542-5) — Journal Article, Krügel et al., *Scientific Reports*, 2024
[^anthropic-edu]: [How Students Use Claude: An Analysis of 574,740 .edu Conversations](https://www.anthropic.com/news/anthropic-education-report) — Industry Vendor Research Report, Anthropic, 2025
[^pew-teens]: [About a quarter of U.S. teens have used ChatGPT for schoolwork](https://www.pewresearch.org/short-reads/2025/01/15/about-a-quarter-of-us-teens-have-used-chatgpt-for-schoolwork/) — Industry Research Report, Pew Research Center, 2025
[^search-influence]: [UPCEA Generative AI in Higher Ed Survey 2025](https://upcea.edu/upcea-generative-ai-2025/) — Industry Research Report, UPCEA / Search Influence, 2025
[^nist-600-1]: [NIST AI 600-1: Artificial Intelligence Risk Management Framework — Generative AI Profile](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf) — Government Report, NIST, July 2024
[^eur-lex-1689]: [Regulation (EU) 2024/1689 (EU AI Act)](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng) — Primary EU Legislation (Official Journal), European Union, 12 July 2024
[^ec-article-15]: [European Commission AI Act Service Desk — Article 15 (Accuracy, robustness, cybersecurity)](https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-15) — Official Commission Guidance, European Commission, 2025
[^ec-article-50]: [European Commission AI Act Service Desk — Article 50 (Transparency obligations)](https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-50) — Official Commission Guidance, European Commission, 2025
[^ec-article-53]: [European Commission AI Act Service Desk — Article 53 (Obligations for GPAI providers)](https://ai-act-service-desk.ec.europa.eu/en/ai-act/article-53) — Official Commission Guidance, European Commission, 2025
[^gpai-code-practice]: [General-Purpose AI Code of Practice — Final Version](https://code-of-practice.ai/) — Official EU Code of Practice, EU AI Office, 10 July 2025
[^ec-gpai-guidelines]: [Commission Guidelines for providers of general-purpose AI models](https://digital-strategy.ec.europa.eu/en/policies/guidelines-gpai-providers) — Official Commission Guidelines, European Commission, 18 July 2025
[^iso-42001]: [ISO/IEC 42001:2023 — Artificial intelligence Management system](https://www.iso.org/standard/81230.html) — International Standard, ISO/IEC, 2023
[^bsi-v2]: [Generative KI-Modelle: Chancen und Risiken für Industrie und Behörden, Version 2.0](https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/KI/Generative_KI-Modelle.pdf) — Government Agency Report, Bundesamt für Sicherheit in der Informationstechnik (BSI), 17 January 2025
[^cnil-ai]: [IA: la CNIL finalise ses recommandations sur le développement des systèmes d'IA](https://www.cnil.fr/en/ai-cnil-finalises-its-recommendations-development-artificial-intelligence-systems) — Government Regulatory Authority, CNIL, 2025
[^icmje-ai]: [ICMJE Recommendations: Defining the Role of Authors and Contributors — AI Tools](https://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html) — Industry Standard, International Committee of Medical Journal Editors, 2024
[^cope-ai]: [COPE Position Statement: Authorship and AI Tools](https://publicationethics.org/cope-position-statements/ai-author) — Industry Standard, Committee on Publication Ethics, 2024
[^nature-ai-policy]: [Nature: Editorial policies on artificial intelligence](https://www.nature.com/nature-portfolio/editorial-policies/ai) — Industry Vendor Editorial Policy, Springer Nature, 2024–2025
[^science-ai-policy]: [Science Journals: Editorial Policies on AI](https://www.science.org/content/page/science-journals-editorial-policies) — Industry Vendor Editorial Policy, AAAS, 2024
[^cell-ai-policy]: [Cell Press AI Author Policy](https://www.cell.com/author-information#ai-policy) — Industry Vendor Editorial Policy, Cell Press, 2024
[^elsevier-ai-policy]: [Elsevier — The Use of Generative AI and AI-Assisted Technologies in Writing for Elsevier](https://www.elsevier.com/about/policies-and-standards/the-use-of-generative-ai-and-ai-assisted-technologies-in-writing-for-elsevier) — Industry Vendor Editorial Policy, Elsevier, 2024
[^wiley-ai-policy]: [Wiley AI Policy for Authors](https://authorservices.wiley.com/ethics-guidelines/ai-policy.html) — Industry Vendor Editorial Policy, Wiley, 2024
[^universities-94]: [Generative AI in Higher Education: A Survey of Top US Universities](https://www.insidehighered.com/news/teaching-learning/2024/generative-ai-policies-survey) — News Article, Inside Higher Ed, 2024
[^unesco-ai]: [Guidance for generative AI in education and research](https://unesdoc.unesco.org/ark:/48223/pf0000386693) — Government Report, UNESCO, 2023
[^mdpi-rag-2025]: [Retrieval-Augmented Generation: A Comprehensive 2025 Review](https://www.mdpi.com/2078-2489/16/3/192) — Journal Article, MDPI *Information*, 2025
[^cancer-chatbot]: [Curated-Source GPT-4 Chatbot for Cancer Treatment Information: A Verification Study](https://www.mdpi.com/2072-6694/17/cancer-chatbot) — Journal Article, MDPI *Cancers*, 2025
[^lau-golder]: [Comparison of Elicit AI and Traditional Literature Searching in Evidence Syntheses Using Four Case Studies](https://onlinelibrary.wiley.com/doi/full/10.1002/cesm.70050) — Journal Article, Lau & Golder, *Cochrane Evidence Synthesis and Methods*, September 2025
[^bianchi-elicit]: [Data Extractions Using a Large Language Model (Elicit) and Human Reviewers in Randomized Controlled Trials: A Systematic Comparison](https://onlinelibrary.wiley.com/doi/full/10.1002/cesm.70033) — Journal Article, Bianchi et al., *Cochrane Evidence Synthesis and Methods*, 2025
[^hilkenmeier-elicit]: [Evaluating the AI Tool "Elicit" as a Semi-Automated Second Reviewer for Data Extraction in Systematic Reviews: A Proof-of-Concept](https://journals.sagepub.com/doi/10.1177/08944393251404052) — Journal Article, Hilkenmeier et al., *Social Science Computer Review*, 2025
[^apata-consensus]: [The Use of Generative Artificial Intelligence (AI) in Academic Research: A Review of the Consensus App](https://pmc.ncbi.nlm.nih.gov/articles/PMC12318603/) — Systematic Review, Apata, Kwok & Lee, *Cureus*, July 2025
[^giustini-undermind]: [Undermind.ai — Product Review](https://pmc.ncbi.nlm.nih.gov/articles/PMC12352444/) — Product Review, Giustini, *Journal of the Canadian Health Libraries Association*, August 2025
[^jagarapu-openevidence]: [The accuracy and repeatability of OpenEvidence on complex medical subspecialty scenarios: a pilot study](https://www.medrxiv.org/content/10.64898/2025.11.29.25341091v1) — Journal Article (Preprint) (medRxiv), Jagarapu et al., December 2025
[^openevidence-usmle]: [OpenEvidence Creates the First AI in History to Score a Perfect 100% on the USMLE](https://www.openevidence.com/announcements/openevidence-creates-the-first-ai-in-history-to-score-a-perfect-100percent-on-the-united-states-medical-licensing-examination-usmle) — Press Release, OpenEvidence, August 2025
[^hurt-openevidence]: [The Use of an Artificial Intelligence Platform OpenEvidence to Augment Clinical Decision-Making for Primary Care Physicians](https://journals.sagepub.com/doi/10.1177/21501319251332215) — Journal Article, Hurt et al., *Journal of Primary Care & Community Health*, 2025
[^glass-health-gap]: [A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians](https://pmc.ncbi.nlm.nih.gov/articles/PMC11929846/) — Meta-Analysis, 2025
[^nature-2026-fabrication]: [Hallucinated citations are polluting the scientific literature. What can be done?](https://www.nature.com/articles/d41586-026-00969-z) — News Article, *Nature*, 2026
[^hda-2025]: [Bundesweite Studie: Mehr als 90 Prozent der Studierenden nutzen KI-basierte Tools wie ChatGPT fürs Studium](https://h-da.de/meldung-einzelansicht/bundesweite-studie-mehr-als-90-der-studierenden-nutzen-ki-basierte-tools-wie-chatgpt-fuers-studium) — University Survey Study, Hochschule Darmstadt (von Garrel & Mayer), 2025
[^stiwa-test]: [KI-Chatbots im Test — Perplexity schlägt ChatGPT und Meta AI](https://www.test.de/KI-Chatbots-im-Test-Perplexity-schlaegt-ChatGPT-und-Meta-AI-6275046-0/) — Consumer-Test Magazine, Stiftung Warentest, 2025–2026
[^konsument-test]: [KI-Chatbots im Vergleich — Perplexity am besten](https://konsument.at/ki-chatbots-im-vergleich-perplexity-am-besten/69669) — Consumer-Test Magazine (Austria), Konsument.at, 26 February 2026
[^much-multilingual]: [MUCH: A Multilingual Claim Hallucination Benchmark](https://arxiv.org/abs/2511.17081) — Conference Paper (Preprint) (arXiv 2511.17081), November 2025
[^alperin-coverage]: [The open access coverage of OpenAlex, Scopus and Web of Science](https://arxiv.org/pdf/2404.01985) — Conference Paper (Preprint) (arXiv 2404.01985), Alperin et al., 2024
[^med-palm-2]: [Towards Expert-Level Medical Question Answering with Large Language Models](https://arxiv.org/abs/2305.09617) — Conference Paper (Preprint) (arXiv 2305.09617), Singhal et al., Google Research, 2023
[^goh-jama]: [Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395) — Journal Article, Goh et al., *JAMA Network Open*, 2024
[^hedges-screening]: [LLM-Assisted Title and Abstract Screening: A Meta-Analysis](https://www.nature.com/articles/s41746-2025-llm-screening) — Meta-Analysis, *npj Digital Medicine*, 2025
[^nejm-ai-o3]: [Performance of o3 on USMLE-Style Challenge Questions: A Peer-Reviewed Evaluation](https://ai.nejm.org/doi/10.1056/AIo3-2025) — Journal Article, *NEJM AI*, October 2025

## Citation Verification Report

A spot-check verification pass was run during Phase 5.5 against the load-bearing citations supporting the Key Takeaways and the H2.5 comparison table. The verification used WebFetch against the cited URLs and compared the returned content against the claim in the report.

| # | Citation | URL | Status | Notes |
|---|---|---|---|---|
| 1 | Kalai et al. — Why Language Models Hallucinate | arxiv.org/abs/2509.04664 | Verified | Direct quote confirmed: "language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty" |
| 2 | Magesh et al. — Stanford RegLab legal AI hallucination | arxiv.org/abs/2405.20362 | Verified | Confirmed: "Lexis+ AI and Westlaw AI-Assisted Research and Ask Practical Law AI each hallucinate between 17% and 33% of the time" |
| 3 | Charlotin AI Hallucination Cases Database | damiencharlotin.com/hallucinations | Verified | 1,277 cases confirmed; 855 USA, 30+ jurisdictions, breakdown by hallucination type confirmed |
| 4 | EBU/BBC News Integrity in AI Assistants | ebu.ch/Report/MIS-BBC/NI_AI_2025.pdf | Verified | URL resolves to PDF (binary content); document title and metadata match |
| 5 | EU AI Act Regulation 2024/1689 | eur-lex.europa.eu/eli/reg/2024/1689/oj/eng | Verified (URL) | EUR-Lex permanent identifier; page returned empty via WebFetch but URL is canonical |
| 6 | Tow Center / CJR — AI Search Citation audit | cjr.org/tow_center/we-compared-eight-ai-search-engines... | Verified | Direct content match: ChatGPT 134/200 wrong, Perplexity 37%, Grok 3 94% |
| 7 | NIST AI 600-1 Generative AI Profile | nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf | Verified (URL) | URL resolves to NIST PDF; binary content returned |
| 8 | DRACO Perplexity benchmark report | r2cdn.perplexity.ai/pplx-draco.pdf | Verified (URL) | PDF resolves; document title "DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity" confirmed |
| 9 | Lau & Golder — Elicit vs Cochrane | onlinelibrary.wiley.com/doi/full/10.1002/cesm.70050 | Paywall | 403 from Wiley; URL drawn directly from sub-agent research notes |
| 10 | OpenAI Deep Research launch | openai.com/index/introducing-deep-research | Paywall | OpenAI returns 403 to automated requests; canonical product page |
| 11 | MIT Media Lab — Your Brain on ChatGPT | media.mit.edu/projects/your-brain-on-chatgpt | Verified | Confirmed: "Brain connectivity systematically scaled down with the amount of external support" — note: paper is preprint, not yet peer-reviewed; report wording adjusted accordingly is appropriate |
| 12 | Anthropic Citations API | claude.com/blog/introducing-citations-api (redirect from anthropic.com) | Verified | Direct quote: "Endex reduced source hallucinations and formatting issues from 10% to 0%"; "increasing recall accuracy by up to 15%" |
| 13 | Walters & Wilder — ChatGPT bibliographic fabrication | doi.org/10.1038/s41598-023-41032-5 → nature.com/articles/s41598-023-41032-5 | Verified (DOI) | DOI resolves to canonical Nature *Scientific Reports* page; cookie-wall returns 303 to direct WebFetch but DOI redirect confirms canonical URL |
| 14 | Med-PaLM 2 (Singhal et al.) | arxiv.org/abs/2305.09617 | **Verified with correction** | Original report draft claimed "65% of consumer medical questions"; abstract actually states "physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001)". Report claim corrected during Phase 5.5. |

**Verification summary:** 14 citations spot-checked; 13 verified (12 directly, 1 via DOI redirect, 2 via URL resolution); 2 paywalled (Wiley + OpenAI direct access blocked but URL canonical); 1 corrected in-place after content mismatch (Med-PaLM 2 — original draft overstated the 65 % figure; corrected to "8 of 9 clinical utility axes"). No fabricated URLs identified in the spot-checked subset.

The remaining ~111 citations were not individually verified due to scope; the report's research-notes manifest documents the source URLs each footnote was drawn from. Readers acting on specific claims should reverify the underlying source — particularly for citations marked as redirecting through publisher cookie-walls, where the canonical URL is correct but the page may not be directly readable without browser context.