CiteOwl

How Often Does AI Make Up Citations? What the Research Shows

Often enough that you cannot trust a single reference a general chatbot hands you without checking it. The exact rate depends on the model and the topic, and the published studies put real numbers on it: a 2023 study found 55% of GPT-3.5 citations were fabricated, and even a 2025 test of GPT-4o still found about 1 in 5 made up. The honest answer is a range, not a number, so this page collects every measurement we could verify from 2023 to 2026 and lays out what they add up to. The numbers below are other researchers' findings, not ours; we have linked each one to the primary source so you can check it yourself.

The short answer, by model

Fabrication rates have fallen as models improved, but they have never reached zero in a published test of a general chatbot. The pattern across the studies is consistent: the older and the more general the model, the more it invents, and the narrower the topic, the worse any model does. We have written separately about why AI makes up citations at all. This piece is about how often, with the actual figures.

The studies, side by side

Here is every measurement we could verify against the primary source. "Fabrication rate" means the share of generated references that point to a source that does not exist. Several studies also report that many of the references that do exist still contain errors, which is a separate and additional problem.

| Study | Model | Sample | Fabrication rate | Source |
| --- | --- | --- | --- | --- |
| Bhattacharyya et al., 2023 (Cureus) | ChatGPT (GPT-3.5) | 115 medical references | 47% fabricated; only 7% were both real and accurate | Cureus |
| Walters & Wilder, 2023 (Scientific Reports) | GPT-3.5 / GPT-4 | 636 references, 42 topics | 55% (GPT-3.5), 18% (GPT-4) | Scientific Reports |
| Mugaanyi et al., 2024 (JMIR) | ChatGPT (GPT-3.5) | 102 references, two fields | 27% (natural sciences), 23% (humanities) did not exist | JMIR |
| Linardon et al., 2025 (JMIR Mental Health) | GPT-4o | 176 references, 3 disorders | 19.9% overall; 6% to 29% by topic familiarity | JMIR Mental Health |
| GPTZero analysis, 2026 (NeurIPS 2025) | Mixed (published papers) | 4,841 accepted papers screened | At least 100 confirmed hallucinated citations across 53 papers | GPTZero |
| Retraction Watch analysis, 2026 (PubMed) | Mixed (published papers) | ~2.5 million papers, 97.1M references | 1 in 277 papers in 2026, up from 1 in 2,828 in 2023 | Retraction Watch |

2023: roughly half of all citations were fake

The first wave of measurements landed hard. Bhattacharyya and colleagues asked ChatGPT for medical references and checked 115 of them. Forty-seven percent did not exist at all. Another 46% pointed to real papers but got the details wrong. Just 7%, eight references out of 115, were both real and accurate. Put differently, a coin flip offered far better odds than the chance that an unchecked citation from that model was both real and accurate.

The most thorough early study came from Walters and Wilder, published in Scientific Reports. They generated 636 citations across 42 topics and verified each one. The headline split is the one worth memorizing: 55% of the GPT-3.5 citations were fabricated, falling to 18% with GPT-4. That drop between model generations is the clearest signal in the whole literature that newer models invent less. The same paper found that even among the references that were real, 43% of the GPT-3.5 ones and 24% of the GPT-4 ones still contained substantive citation errors, so "real" did not mean "correct."

A 2024 cross-disciplinary study in the Journal of Medical Internet Research filled in the middle. Mugaanyi and colleagues tested GPT-3.5 across 102 references and found that about a quarter did not exist: 27% in the natural sciences and 23% in the humanities. That is lower than the 47% in the 2023 medical study, a reminder that the rate moves with the field and the prompt, not just the model.

2025: better models, still 1 in 5

The natural hope was that GPT-4o and its peers had fixed this. They had not, though they had improved. In 2025, Linardon and colleagues ran a careful experiment in JMIR Mental Health, generating six literature reviews with GPT-4o and verifying all 176 citations. Of those, 19.9% were entirely fabricated, roughly one in five. Among the 141 that did exist, 45% still had errors.
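Fabrication and citation errors stack, so the share of citations that are both real and error-free is smaller than either figure suggests on its own. The sketch below multiplies the two rates quoted above for each study. The function name `usable_share` is ours, not the researchers', and because the inputs are rounded published percentages, the results are approximations.

```python
def usable_share(fabricated: float, errors_among_real: float) -> float:
    """Share of all generated citations that are real AND error-free.

    fabricated: fraction of citations pointing to a source that does not exist.
    errors_among_real: fraction of the real citations that still contain errors.
    """
    if not (0.0 <= fabricated <= 1.0 and 0.0 <= errors_among_real <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return (1.0 - fabricated) * (1.0 - errors_among_real)


# Rounded percentages as quoted in the text above (fabricated, errors among real).
STUDIES = {
    "GPT-3.5 (Walters & Wilder, 2023)": (0.55, 0.43),
    "GPT-4 (Walters & Wilder, 2023)": (0.18, 0.24),
    "GPT-4o (Linardon et al., 2025)": (0.199, 0.45),
}

for label, (fabricated, errors) in STUDIES.items():
    share = usable_share(fabricated, errors)
    print(f"{label}: about {share:.0%} real and error-free")
```

On these inputs the usable share works out to roughly 26% for GPT-3.5, 62% for GPT-4, and 44% for GPT-4o. The studies used different topics and error definitions, so the three figures are not a like-for-like ranking.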

The more useful finding in that study is what moved the rate. Fabrication tracked topic familiarity. For major depressive disorder, a heavily studied condition, only 6% of citations were fake. For binge eating disorder it rose to 28%, and for body dysmorphic disorder, the least-studied of the three, to 29%. The model invented most where there was the least real text for it to have learned from, which is exactly the situation a student in a niche subfield is usually in. The narrower your thesis question, the further above the headline average your own fabrication rate is likely to run.

That study also nailed down why the fakes are so hard to catch. When GPT-4o attached a DOI to a fabricated citation, 64% of those DOIs resolved to real but unrelated papers. A working link is not proof the source is genuine, which is the single most counterintuitive thing on this page and the reason a lot of students stop checking too early. If you have ever clicked a DOI, landed on a real paper, and assumed you were safe, this is the trap. We cover the reliable checks in how to check if a citation is real.

2026: the problem reached the published record

Until recently this was a chatbot problem, measured by researchers prompting models on purpose. In 2026 it crossed into the formal literature. A large analysis covered by Retraction Watch verified 97.1 million references across nearly 2.5 million PubMed-indexed papers and found fabricated references in roughly one in 277 papers published in the first seven weeks of 2026. The comparable rate was one in 458 for 2025 and one in 2,828 for 2023. In all, the audit flagged 4,406 fabricated references across 2,810 papers, with generative AI the most likely cause.
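"One in N" figures are hard to compare at a glance. A small sketch, using only the numbers quoted above, converts them to papers per 10,000 and to a fold change. The helper names are ours and purely illustrative.

```python
# "One in N papers" figures as quoted in the text above.
ONE_IN_N = {2023: 2828, 2025: 458, 2026: 277}


def per_10k(one_in_n: float) -> float:
    """Convert a '1 in N' rate into affected papers per 10,000."""
    return 10_000 / one_in_n


def fold_change(earlier_one_in_n: float, later_one_in_n: float) -> float:
    """How many times more common the later rate is than the earlier one."""
    return earlier_one_in_n / later_one_in_n


for year, n in ONE_IN_N.items():
    print(f"{year}: about {per_10k(n):.1f} per 10,000 papers")

print(f"2023 to 2026: about {fold_change(ONE_IN_N[2023], ONE_IN_N[2026]):.1f}x")
print(f"Fabricated references per flagged paper: {4406 / 2810:.2f}")
```

On these figures the rate rose roughly tenfold between 2023 and early 2026. The 2026 value covers only the first seven weeks of the year, so it may shift as more papers are indexed.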

It even reached the top of the field. A January 2026 analysis by GPTZero screened 4,841 accepted papers from NeurIPS 2025, a leading machine-learning conference, and found at least 100 confirmed hallucinated citations across 53 of them, each having passed review by three or more expert reviewers. If the people who build these models miss invented citations in their own venue, the takeaway for a student is not subtle: nobody downstream is reliably catching fakes for you. We trace what happens after that in hallucinated citations are now getting papers retracted.

Across every verified study from 2023 to 2026, no general chatbot reached a fabrication rate of zero. Even in a careful 2025 test of GPT-4o, about 1 in 5 citations overall was invented, and the rate climbed toward 1 in 3 on narrow or obscure topics, exactly where students often work.

The state of the evidence in 2026: a plain verdict

Read together, the studies support a few claims with real confidence, and rule out a few comforting ones.

Fabrication is real, measured, and reproducible. This is not anecdote. Independent teams across medicine, the natural sciences, the humanities, and mental health have all generated citations from chatbots and counted the fakes, and they all found a meaningful share. The phenomenon is settled.

Newer models invent less, but "less" is not "safe." The drop from 55% on GPT-3.5 to about 20% on GPT-4o is large and consistent. It is also nowhere near zero. One in five fabricated means that in a short bibliography of ten references, you should expect roughly two invented sources from a general model, and you will not know which two without checking.
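The "two in ten" figure is a simple expectation, and the same arithmetic shows how likely a bibliography is to contain at least one fake. The sketch below assumes every reference is fabricated independently with the same probability, which is a simplification: real fabrications cluster by topic, as the 2025 study showed. The function names are ours.

```python
def expected_fabricated(n_refs: int, rate: float) -> float:
    """Expected number of fabricated references in a list of n_refs."""
    return n_refs * rate


def prob_at_least_one(n_refs: int, rate: float) -> float:
    """Chance that a list of n_refs contains at least one fabricated reference.

    Assumes each reference is fabricated independently with probability `rate`.
    """
    return 1.0 - (1.0 - rate) ** n_refs


RATE = 0.20  # about one in five, the overall GPT-4o figure quoted above

for n in (5, 10, 20):
    print(
        f"{n} references: expect {expected_fabricated(n, RATE):.1f} fabricated, "
        f"{prob_at_least_one(n, RATE):.0%} chance of at least one"
    )
```

Under that assumption, a ten-reference bibliography has about an 89% chance of containing at least one invented source, and even a five-reference list has about a 67% chance.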

The rate is not one number; it is a curve. It rises as the topic gets narrower, the field gets smaller, and the question gets more recent. Headline averages understate the risk for the exact kind of specific, original question a thesis or dissertation is built on.

The DOI is not a safety check. A formatted, resolving DOI tells you almost nothing, because most fabricated DOIs in the 2025 study led to real but unrelated work. Only opening the source and confirming it says what it was cited for actually protects you.

Better prompting does not fix it. Asking a model to "use only real sources" changes its tone, not its ability to verify. The thing that reliably lowers fabrication is retrieval, the model searching for and reading a real paper before it writes the citation. There are honest ways to push a chatbot toward that, which we cover in how to get ChatGPT to cite real sources.

What this means for a student

The studies were run by researchers who know exactly what to look for and still found the fakes only by checking every reference one at a time. You have to do the same, because the alternative is putting your name on a source that was never written. A fabricated citation in a graded paper can mean a failed assignment, a delayed thesis, or an academic-integrity meeting, and "the AI gave it to me" is not a defence when the bibliography is yours.

The practical floor is simple. Verify every reference before it stays in your draft: search the exact title, confirm the source exists in a database you trust, and read enough of it to know it actually supports your claim. The DOI resolving is not enough on its own. None of this is hard, and it takes seconds per entry, but it is not optional anymore.
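One part of that check can be made mechanical: comparing the title the chatbot gave you with the title of whatever the DOI or database search actually returned. The sketch below uses Python's standard `difflib` for a rough similarity score. Looking up the resolved title is left out, the 0.9 threshold is an arbitrary assumption, and the titles in the usage example are invented placeholders rather than real papers. A passing score only shows the titles match; it does not show the paper supports your claim.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def title_similarity(cited_title: str, resolved_title: str) -> float:
    """Rough similarity between two titles, from 0.0 (different) to 1.0 (same)."""
    return SequenceMatcher(
        None, normalize(cited_title), normalize(resolved_title)
    ).ratio()


def doi_matches_citation(
    cited_title: str, resolved_title: str, threshold: float = 0.9
) -> bool:
    """True when the resolved title is close enough to the cited one.

    The threshold is an illustrative assumption, not a validated cutoff.
    """
    return title_similarity(cited_title, resolved_title) >= threshold


# Invented placeholder titles, for illustration only.
cited = "Example Title About Sleep and Memory: A Review"
same_paper = "example title about sleep and memory - a review"
other_paper = "Soil Nitrogen Dynamics in Placeholder Grasslands"

print(doi_matches_citation(cited, same_paper))
print(doi_matches_citation(cited, other_paper))
```

A mismatch is the signal described earlier: the DOI resolves, but to a different paper than the one cited.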

Where CiteOwl fits

The studies all measure the same root cause: a general chatbot writes a sentence first and then generates a citation to match, so the reference is invented text like any other. The structural fix is to flip the order. Find and read a real paper first, then write only the claim you can attach to it, and there is nothing left to fabricate.

That is how CiteOwl works. It searches real literature, reads what it finds, and cites only what it actually retrieved, showing you the verbatim quote behind each reference so you can confirm the source says what the sentence claims before you accept it. Every edit arrives as a reviewable diff, so nothing enters your draft unseen. It is the same retrieval-first principle behind an AI research writer that cites real sources: the fabrication rate is not lowered by a better prompt, it is removed by never guessing in the first place.

A fabrication rate of zero, by design

CiteOwl looks up real papers and cites only what it found, with the quote behind each one, so there is no invented reference to catch later.


Things worth knowing.

How often does AI make up citations?

It depends on the model, the topic, and the prompt. Older models were extreme: a 2023 study found 55% of GPT-3.5 citations were fabricated. Newer general models are better but not safe: a 2025 study of GPT-4o found about 1 in 5 citations were completely invented. The rate also climbs for narrow or obscure topics, where the same model fabricated closer to 1 in 3. No general chatbot tested in the published research reached a fabrication rate of zero.

Do newer AI models still fabricate references?

Yes, just less often. Across the studies, fabrication dropped from roughly half of citations on GPT-3.5 in 2023 to about one in five on GPT-4o in 2025. That is real progress, but at one in five, most short bibliographies will still contain at least one fabricated reference. The improvement reduces the odds; it does not remove the need to verify every source.

Does telling the AI to only use real sources fix it?

No. A prompt like that makes the model sound more careful but gives it no way to check whether a paper exists. Studies that varied prompt specificity still found high fabrication on unfamiliar topics. What actually lowers the rate is retrieval: a tool that searches for and reads a real paper before it writes the citation, rather than generating one from memory.

Are the fake citations easy to spot?

No, that is the danger. Fabricated references carry plausible authors, real-sounding journals, and correctly formatted DOIs. In one 2025 study, 64% of the DOIs attached to fake citations resolved to real but unrelated papers, so a working link is not proof the source is genuine. The only reliable check is to open the source and confirm it exists and says what it was cited for.
