How Often Does AI Make Up Citations? What the Research Shows
Often enough that you cannot trust a single reference a general chatbot hands you without checking it. The exact rate depends on the model and the topic, and the published studies put real numbers on it: a 2023 study found 55% of GPT-3.5 citations were fabricated, and even a 2025 test of GPT-4o still found about 1 in 5 made up. The honest answer is a range, not a number, so this page collects every measurement we could verify from 2023 to 2026 and lays out what they add up to. The numbers below are other researchers' findings, not ours; we have linked each one to the primary source so you can check it yourself.
The short answer, by model
Fabrication rates have fallen as models improved, but they have never reached zero in a published test of a general chatbot. The pattern across the studies is consistent: the older and the more general the model, the more it invents, and the narrower the topic, the worse any model does. We have written separately about why AI makes up citations at all. This piece is about how often, with the actual figures.
- GPT-3.5 (2023): around half of all citations fabricated. Two studies put it at 47% and 55%.
- GPT-4 (2023): a sharp drop, to 18% fabricated, but still close to one in five.
- GPT-4o (2025): 19.9% fabricated on average, rising to nearly 1 in 3 on obscure topics.
- In the published record (2026): roughly 1 in 277 PubMed-indexed papers now carries a fabricated reference, up sharply from a few years ago.
The studies, side by side
Here is every measurement we could verify against the primary source. "Fabrication rate" means the share of generated references that point to a source that does not exist. Several studies also report that many of the references that do exist still contain errors, which is a separate and additional problem.
| Study | Model | Sample | Fabrication rate | Source |
|---|---|---|---|---|
| Bhattacharyya et al., 2023 (Cureus) | ChatGPT (GPT-3.5) | 115 medical references | 47% fabricated; only 7% were both real and accurate | Cureus |
| Walters & Wilder, 2023 (Scientific Reports) | GPT-3.5 / GPT-4 | 636 references, 42 topics | 55% (GPT-3.5), 18% (GPT-4) | Scientific Reports |
| Mugaanyi et al., 2024 (JMIR) | ChatGPT (GPT-3.5) | 102 references, two fields | 27% (natural sciences), 23% (humanities) did not exist | JMIR |
| Linardon et al., 2025 (JMIR Mental Health) | GPT-4o | 176 references, 3 disorders | 19.9% overall; 6% to 29% by topic familiarity | JMIR Mental Health |
| GPTZero analysis, 2026 (NeurIPS 2025) | Mixed (published papers) | 4,841 accepted papers screened | At least 100 confirmed hallucinated citations across 53 papers | GPTZero |
| Retraction Watch analysis, 2026 (PubMed) | Mixed (published papers) | ~2.5 million papers, 97.1M references | 1 in 277 papers in 2026, up from 1 in 2,828 in 2023 | Retraction Watch |
2023: roughly half of all citations were fake
The first wave of measurements landed hard. Bhattacharyya and colleagues asked ChatGPT for medical references and checked 115 of them. Forty-seven percent did not exist at all. Another 46% pointed to real papers but got the details wrong. Just 7%, eight references out of 115, were both real and accurate. Put differently, you had better odds flipping a coin than trusting a citation from that model unchecked.
The most thorough early study came from Walters and Wilder, published in Scientific Reports. They generated 636 citations across 42 topics and verified each one. The headline split is the one worth memorizing: 55% of the GPT-3.5 citations were fabricated, falling to 18% with GPT-4. That drop between model generations is the clearest signal in the whole literature that newer models invent less. The same paper found that even among the references that were real, 43% of the GPT-3.5 ones and 24% of the GPT-4 ones still contained substantive citation errors, so "real" did not mean "correct."
A 2024 cross-disciplinary study in the Journal of Medical Internet Research filled in the middle. Mugaanyi and colleagues tested GPT-3.5 across 102 references and found that about a quarter did not exist: 27% in the natural sciences and 23% in the humanities. Both figures are lower than the 47% in the 2023 medical study, a reminder that the rate moves with the field and the prompt, not just the model.
2025: better models, still 1 in 5
The natural hope was that GPT-4o and its peers had fixed this. They had not, though they had improved. In 2025, Linardon and colleagues ran a careful experiment in JMIR Mental Health, generating six literature reviews with GPT-4o and verifying all 176 citations. Of those, 19.9% were entirely fabricated, roughly one in five. Among the 141 that did exist, 45% still had errors.
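The figures in that paragraph can be combined into a single breakdown of the whole sample. The sketch below is a rough illustration only: it uses the rounded 45% error share quoted above, so the derived shares are approximate, and the function and constant names are our own.

```python
# Figures quoted above from the 2025 GPT-4o study.
TOTAL_CITATIONS = 176
REAL_CITATIONS = 141
ERROR_SHARE_AMONG_REAL = 0.45  # rounded figure, so derived shares are approximate


def citation_breakdown(total, real, error_share_among_real):
    """Split a citation sample into fabricated, real-with-errors, and clean shares."""
    fabricated = total - real
    real_with_errors = real * error_share_among_real
    clean = real - real_with_errors
    return {
        "fabricated": fabricated / total,
        "real_with_errors": real_with_errors / total,
        "clean": clean / total,
    }


shares = citation_breakdown(TOTAL_CITATIONS, REAL_CITATIONS, ERROR_SHARE_AMONG_REAL)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Under that rounding, fewer than half of the 176 citations come out both real and error-free, which is the number that matters if you plan to paste a reference in unchecked.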
The more useful finding in that study is what moved the rate. Fabrication tracked topic familiarity. For major depressive disorder, a heavily studied condition, only 6% of citations were fake. For binge eating disorder it rose to 28%, and for body dysmorphic disorder, the least-studied of the three, to 29%. The model invented most where there was the least real text for it to have learned from, which is exactly the situation a student in a niche subfield is usually in. The narrower your thesis question, the higher your personal fabrication rate runs above the headline average.
That study also nailed down why the fakes are so hard to catch. When GPT-4o attached a DOI to a fabricated citation, 64% of those DOIs resolved to real but unrelated papers. A working link is not proof the source is genuine, which is the single most counterintuitive thing on this page and the reason a lot of students stop checking too early. If you have ever clicked a DOI, landed on a real paper, and assumed you were safe, this is the trap. We cover the reliable checks in "how to check if a citation is real."
2026: the problem reached the published record
Until recently this was a chatbot problem, measured by researchers prompting models on purpose. In 2026 it crossed into the formal literature. A large analysis covered by Retraction Watch verified 97.1 million references across nearly 2.5 million PubMed-indexed papers and found fabricated references in roughly one in 277 papers published in the first seven weeks of 2026. The comparable rate was one in 458 in 2025 and one in 2,828 in 2023. In all, the audit flagged 4,406 fabricated references across 2,810 papers, with generative AI the most likely cause.
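"One in N" figures are hard to compare by eye, so here is a minimal sketch that converts the three rates quoted above into percentages and a fold increase. It only restates the quoted numbers; the helper names are ours, and the 2026 figure covers just the first seven weeks of that year.

```python
# "1 in N papers" figures quoted above, keyed by publication year.
# The 2026 figure covers only the first seven weeks of that year.
ONE_IN_N = {2023: 2828, 2025: 458, 2026: 277}


def rate_percent(one_in_n):
    """Convert a '1 in N' figure to a percentage of papers."""
    return 100.0 / one_in_n


def fold_increase(earlier_one_in_n, later_one_in_n):
    """How many times more common the problem became between two periods."""
    return earlier_one_in_n / later_one_in_n


for year, n in ONE_IN_N.items():
    print(f"{year}: 1 in {n} = {rate_percent(n):.3f}% of papers")

growth = fold_increase(ONE_IN_N[2023], ONE_IN_N[2026])
print(f"2023 to 2026: about {growth:.1f}x more common")
```

On these figures the share of affected papers grew roughly tenfold between 2023 and early 2026, while still sitting well under 1% of papers in absolute terms.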
It even reached the top of the field. A January 2026 analysis by GPTZero screened 4,841 accepted papers from NeurIPS 2025, a leading machine-learning conference, and found at least 100 confirmed hallucinated citations across 53 of them, each having passed review by three or more expert reviewers. If the people who build these models miss invented citations in their own venue, the takeaway for a student is not subtle: nobody downstream is reliably catching fakes for you. We trace what happens after that in "hallucinated citations are now getting papers retracted."
Across every verified study from 2023 to 2026, no general chatbot reached a fabrication rate of zero. Even in a careful 2025 test of GPT-4o, about 1 in 5 citations overall were invented, and the rate climbs toward 1 in 3 on narrow or obscure topics, exactly where students often work.
The state of the evidence in 2026: a plain verdict
Read together, the studies support a few claims with real confidence, and rule out a few comforting ones.
Fabrication is real, measured, and reproducible. This is not anecdote. Independent teams across medicine, the natural sciences, the humanities, and mental health have all generated citations from chatbots and counted the fakes, and they all found a meaningful share. The phenomenon is settled.
Newer models invent less, but "less" is not "safe." The drop from 55% on GPT-3.5 to about 20% on GPT-4o is large and consistent. It is also nowhere near zero. One in five fabricated means that in a short bibliography of ten references, you should expect roughly two invented sources from a general model, and you will not know which two without checking.
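The "roughly two in ten" arithmetic can be made explicit. The sketch below assumes each reference is fabricated independently at a fixed rate, which is a simplification of ours and not something the studies claim; the rates plugged in are the GPT-4o figures quoted earlier on this page.

```python
def expected_fabricated(n_references, fabrication_rate):
    """Expected number of invented references in a bibliography."""
    return n_references * fabrication_rate


def prob_at_least_one_fabricated(n_references, fabrication_rate):
    """Chance that a bibliography contains one or more invented references.

    Assumes each reference is fabricated independently (a simplification).
    """
    return 1.0 - (1.0 - fabrication_rate) ** n_references


# GPT-4o rates quoted on this page: best-studied topic, overall, least-studied topic.
RATES = {
    "major depressive disorder": 0.06,
    "overall": 0.199,
    "body dysmorphic disorder": 0.29,
}

for label, rate in RATES.items():
    expected = expected_fabricated(10, rate)
    at_least_one = prob_at_least_one_fabricated(10, rate)
    print(f"{label}: expect {expected:.1f} fakes in 10, "
          f"{at_least_one:.0%} chance of at least one")
```

At the overall rate, a ten-item bibliography carries about two expected fakes and, under the independence assumption, close to a nine-in-ten chance of containing one or more.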
The rate is not one number, it is a curve. It rises as the topic gets narrower, the field gets smaller, and the question gets more recent. Headline averages understate the risk for the exact kind of specific, original question a thesis or dissertation is built on.
The DOI is not a safety check. A formatted, resolving DOI tells you almost nothing, because most fabricated DOIs in the 2025 study led to real but unrelated work. Only opening the source and confirming it says what it was cited for actually protects you.
Better prompting does not fix it. Asking a model to "use only real sources" changes its tone, not its ability to verify. The thing that reliably lowers fabrication is retrieval, the model searching for and reading a real paper before it writes the citation. There are honest ways to push a chatbot toward that, which we cover in "how to get ChatGPT to cite real sources."
What this means for a student
The studies were run by researchers who know exactly what to look for and still found the fakes only by checking every reference one at a time. You have to do the same, because the alternative is putting your name on a source that was never written. A fabricated citation in a graded paper can mean a failed assignment, a delayed thesis, or an academic-integrity meeting, and "the AI gave it to me" is not a defence when the bibliography is yours.
The practical floor is simple. Verify every reference before it stays in your draft: search the exact title, confirm the source exists in a database you trust, and read enough of it to know it actually supports your claim. The DOI resolving is not enough on its own. None of this is hard, and it takes seconds per entry, but it is not optional anymore.
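One way to automate the first of those checks is to compare the title you were given with the title actually registered for the DOI. The sketch below is an illustration under stated assumptions, not a complete verifier: it assumes the DOI is registered with Crossref and uses Crossref's public `works` endpoint, the 0.85 similarity threshold is an arbitrary choice of ours, and the example titles are made-up placeholders.

```python
import json
import re
import urllib.parse
import urllib.request
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(cited_title, registered_title):
    """Return a 0..1 similarity score between two titles."""
    return SequenceMatcher(
        None, normalize(cited_title), normalize(registered_title)
    ).ratio()


def doi_matches_citation(cited_title, registered_title, threshold=0.85):
    """True when the title registered for a DOI is close to the cited title.

    The 0.85 threshold is an arbitrary illustrative value, not a tested one.
    """
    return title_similarity(cited_title, registered_title) >= threshold


def fetch_registered_title(doi):
    """Look up the title Crossref holds for a DOI (needs network access).

    Assumes the DOI is registered with Crossref; other registrars are not covered.
    """
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=10) as response:
        record = json.load(response)
    titles = record["message"].get("title", [])
    return titles[0] if titles else ""


# Offline usage with made-up placeholder titles (no real papers implied).
cited = "Placeholder Study of Sleep and Memory in Adults"
print(doi_matches_citation(cited, "Placeholder study of sleep and memory in adults."))
print(doi_matches_citation(cited, "Unrelated placeholder report on soil chemistry"))
```

A mismatch is a strong sign of the resolving-but-unrelated DOI trap. A match only shows that a paper with that title exists; you still have to read it to confirm it supports your claim.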
Where CiteOwl fits
The studies all measure the same root cause: a general chatbot writes a sentence first and then generates a citation to match, so the reference is invented text like any other. The structural fix is to flip the order. Find and read a real paper first, then write only the claim you can attach to it, and there is nothing left to fabricate.
That is how CiteOwl works. It searches real literature, reads what it finds, and cites only what it actually retrieved, showing you the verbatim quote behind each reference so you can confirm the source says what the sentence claims before you accept it. Every edit arrives as a reviewable diff, so nothing enters your draft unseen. It is the same retrieval-first principle behind an AI research writer that cites real sources: the fabrication rate is not lowered by a better prompt, it is removed by never guessing in the first place.