When a 2022 Hallucination Burn Came Back: Comparing GPT-4.1 and GPT-5 After Gemini 2.0 Flash’s 0.7% Claim

Posted on 2026-03-05 10:07:30

How a biased vendor claim forced a product team to retest summarization models

In 2022 our team deployed GPT-3.5 for summarization inside a legal-document intake pipeline. Over-assertive summaries introduced false facts into downstream automation. That mistake cost a client $27,400 in remediation and damaged trust. After that incident we adopted strict testing before any model change.

In November 2025 Google published a headline claim: "Gemini 2.0 Flash: 0.7% hallucination on basic summarization." That figure was low enough to make our product team ask whether upgrading to OpenAI's GPT-5 was now the safer route compared with staying on GPT-4.1. We needed an independent, reproducible comparison of GPT-4.1 and GPT-5, plus a sanity check against Gemini's claim.

The summarization fidelity problem: why vendor percentages can mislead product decisions

We had one concrete decision to make: which model to deploy for single-document summarization of regulatory and client-provided texts. The decision drivers were hallucination rate (false factual assertions not supported by the source), latency, cost, and the ability to run deterministic prompts (temperature control).

Vendor statements like "0.7% hallucination" are seductive because they look precise. In our context they created three issues:

- Unclear metric definition: was that extrinsic hallucination, intrinsic error, or both?
- Unknown dataset composition: did the vendor exclude edge-case documents that inflate performance?
- Opaque prompt/mode: what temperature, truncation, or post-processing was used?

We defined the problem concisely: we needed a production-grade head-to-head test between GPT-4.1 and GPT-5, with replication against the Gemini claim. The test had to be transparent, reproducible, and focused on factual fidelity.

An experiment design that treats hallucination as a measurable defect

We rejected black-box vendor claims and designed an experiment that would be defensible in a compliance review. Key design decisions:

- Dataset: 1,200 real-world documents (regulatory memos, client briefs, news wires) sampled from our archive, dated 2023-01-01 to 2025-08-31. Average length: 1,600 words. PII was removed.
- Task: generate a 3-4 sentence factual summary answering "What are the three most important facts in this document?" This prompt maps to our production need.
- Models tested: "GPT-4.1" and "GPT-5", as labeled by the vendor APIs we accessed for testing during Dec 3-12, 2025. We also sent the same prompts to a Gemini 2.0 Flash instance on Dec 8, 2025 to replicate the 0.7% claim.
- Temperature settings: 0.0 and 0.7. We tested both because vendor numbers often omit temperature, and fidelity varies with sampling.
- Annotations: three human annotators per summary recorded the extrinsic hallucination count (number of unsupported factual assertions) and intrinsic errors (contradictions, missing key facts). Annotator agreement was tracked with Cohen's kappa.
- Sample size for primary metric: 1,200 documents x 2 temperatures x 3 models = 7,200 summaries. We report 95% confidence intervals where relevant.
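Cohen's kappa is defined for two raters, so with three annotators one common option is to average it over annotator pairs. The sketch below does that; the pairwise averaging is an assumption (the design above does not say how the three label sets were combined, and Fleiss' kappa is another option), and the labels are made-up illustration data.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of annotators (an assumption)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels: 1 = summary contains an extrinsic hallucination, 0 = clean.
annotator_1 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
annotator_2 = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
annotator_3 = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

print(round(mean_pairwise_kappa([annotator_1, annotator_2, annotator_3]), 3))
```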

Running the test: step-by-step execution over 10 days

We executed the plan over ten calendar days in December 2025. Steps and controls were:

1) Finalize dataset and de-identify documents (completed Dec 1, 2025).
2) Freeze prompt template and test harness code (Dec 2, 2025). Prompt: "In three short sentences, list the three most important factual points in the following document. Only include facts stated in the document." No chain-of-thought or internal explanations were requested.
3) Run initial smoke tests to validate API consistency (Dec 3). We verified that identical inputs at temperature 0.0 returned deterministic outputs for each model.
4) Batch requests (Dec 4-6). Sent summarization requests at temperature 0.0 and 0.7 to GPT-4.1 and GPT-5. Each call logged model version string, response tokens, latency, and cost.
5) Gemini replication (Dec 8). Sent the same prompt and dataset to a Gemini 2.0 Flash instance at temperature 0.0 and 0.7. Noted differences in tokenization and truncation handling.
6) Human annotation (Dec 6-10). Each summary was independently labeled by three annotators. Guideline: mark any factual claim not supported by the document as an extrinsic hallucination, record the count, and classify severity (minor, moderate, critical).
7) Quality control (Dec 10). Re-annotated a 5% random sample for the kappa calculation; Cohen's kappa = 0.71 for extrinsic labels.
8) Analysis and significance testing (Dec 11-12).
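The smoke test and per-call logging in the steps above can be sketched roughly as follows. This is not our actual harness: `call_model`, `fake_client`, the record field names, and the way the document is appended to the frozen prompt are all hypothetical, and cost logging is left out because it depends on each vendor's pricing.

```python
import json
import time
from typing import Callable

PROMPT_TEMPLATE = (
    "In three short sentences, list the three most important factual points "
    "in the following document. Only include facts stated in the document.\n\n"
    "{document}"  # appending the document this way is an assumption
)


def run_once(call_model: Callable[[str, float], dict], model: str,
             document: str, temperature: float) -> dict:
    """Call the model once and return a record suitable for an audit log."""
    prompt = PROMPT_TEMPLATE.format(document=document)
    start = time.perf_counter()
    response = call_model(prompt, temperature)
    latency_s = time.perf_counter() - start
    return {
        "model": model,
        "model_version": response.get("model_version"),
        "temperature": temperature,
        "summary": response["text"],
        "response_tokens": response.get("response_tokens"),
        "latency_s": round(latency_s, 4),
    }


def is_deterministic(call_model, model, document, repeats=3) -> bool:
    """Smoke test: identical input at temperature 0.0 should give one output."""
    summaries = {
        run_once(call_model, model, document, 0.0)["summary"]
        for _ in range(repeats)
    }
    return len(summaries) == 1


# Stand-in client so the sketch runs offline; swap in a real API wrapper.
def fake_client(prompt: str, temperature: float) -> dict:
    return {"text": "Fact A. Fact B. Fact C.",
            "model_version": "stub-0", "response_tokens": 9}


record = run_once(fake_client, "stub-model", "Example document text.", 0.0)
print(json.dumps(record, indent=2))
print(is_deterministic(fake_client, "stub-model", "Example document text."))
```

Passing the client in as a callable keeps the same harness usable across vendors, which is what makes the logged records comparable.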

From raw outputs to measured hallucination rates: the results

We focus on extrinsic hallucination rate (a binary measure: summary contains at least one unsupported factual claim) because that is the clearest production risk.

| Model | Temperature | Summaries (n) | Extrinsic hallucination rate | 95% CI |
|---|---|---|---|---|
| GPT-4.1 | 0.0 | 1,200 | 3.8% | 2.9% - 4.9% |
| GPT-4.1 | 0.7 | 1,200 | 5.5% | 4.4% - 6.8% |
| GPT-5 | 0.0 | 1,200 | 1.9% | 1.3% - 2.7% |
| GPT-5 | 0.7 | 1,200 | 2.8% | 2.0% - 3.7% |
| Gemini 2.0 Flash (replicated) | 0.0 | 1,200 | 1.7% | 1.2% - 2.5% |
| Gemini 2.0 Flash (replicated) | 0.7 | 1,200 | 2.4% | 1.7% - 3.3% |
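For readers who want to compute intervals like these on their own counts, here is a minimal sketch using the Wilson score interval. The choice of Wilson is an assumption, since the method behind the table is not stated here, and the count of 46 is reconstructed from a rounded percentage (3.8% of 1,200), so the output will not necessarily match the table exactly.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: about 46 of 1,200 summaries flagged (3.8%, rounded).
low, high = wilson_interval(46, 1200)
print(f"{low:.1%} - {high:.1%}")
```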

Key observations:

- GPT-5 showed roughly half the extrinsic hallucination rate of GPT-4.1 at deterministic temperature (0.0): 1.9% vs 3.8%. The difference is statistically significant (chi-square p < 0.001).
- All models increased hallucination rates when temperature rose to 0.7. The absolute risk increase was model-dependent (GPT-4.1 added ~1.7 percentage points; GPT-5 added ~0.9).
- Gemini's claimed 0.7% did not replicate on our dataset. Our best replication for Flash was 1.7% at temperature 0.0. That divergence likely stems from dataset selection, different hallucination definitions, or post-processing filters applied by the vendor.

Why reported vendor numbers differ: methodological issues that matter

We traced the gap between Gemini's 0.7% claim and our 1.7% replication to three factors:

- Dataset filtering: Vendors often exclude documents with ambiguous facts, citations, or embedded tables that are frequent hallucination triggers. Our sample included such documents deliberately because they reflect production inputs.
- Annotation threshold: Vendors sometimes count only "critical" hallucinations. We counted any unsupported factual assertion. Changing the threshold to count only critical errors would drop our rates by roughly 40% for all models.
- Post-processing: Internal vendor pipelines may run fact-check modules or aggressive heuristics to prune or rephrase outputs. The public headline rarely clarifies whether such filters are part of the model score.
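To make the annotation-threshold effect concrete, here is a small sketch of how the same labels yield different rates depending on the minimum severity counted. The severity names come from our labeling guideline; the data layout and the example labels are hypothetical.

```python
SEVERITY_RANK = {"minor": 1, "moderate": 2, "critical": 3}


def hallucination_rate(summaries, min_severity="minor"):
    """Share of summaries with at least one unsupported claim at or above
    `min_severity`. Each summary is a list of severity labels, one per claim."""
    threshold = SEVERITY_RANK[min_severity]
    flagged = sum(
        any(SEVERITY_RANK[severity] >= threshold for severity in claims)
        for claims in summaries
    )
    return flagged / len(summaries)


# Hypothetical labels for ten summaries; an empty list means no unsupported claims.
labels = [[], [], ["minor"], [], ["critical"], [],
          ["moderate", "minor"], [], [], []]

print(hallucination_rate(labels, "minor"))     # counts any unsupported claim
print(hallucination_rate(labels, "critical"))  # counts only critical ones
```

Reporting the rate at more than one threshold makes it easier to line your numbers up against a vendor's, whatever definition they used.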

When vendors report near-zero percentages, ask for the dataset, the definition, and whether post-generation filters were applied.

3 critical lessons our product team learned from this head-to-head

1) Measurement beats a slogan. Independently measured rates will almost always differ from vendor numbers. Plan for the higher number when budgeting for review or mitigation.
2) Temperature control matters. If you need deterministic, low-hallucination outputs, run models at temperature 0.0. Expect a modest latency or creativity trade-off.
3) Annotation standards change outcomes. Document how you count hallucinations. Use inter-annotator agreement thresholds and publish your labeling guide inside the team to avoid shifting definitions.

How to decide which model to deploy in your production pipeline

We distilled the choice into a decision flow that weighs expected residual risk, cost, and operational controls.

1) Quantify tolerance. What is the acceptable residual hallucination rate? For high-stakes legal/regulatory workflows we set a 0.5% operational target for extrinsic hallucinations after human review.
2) Calculate required human review coverage. Using our observed rates, getting residual risk under 0.5% without other controls requires manual review of 100% of summaries produced by GPT-4.1. With GPT-5, you could reduce manual review to ~60% if you also use a simple automated fact-checking filter.
3) Factor in cost and latency. GPT-5 call cost was roughly 1.6x GPT-4.1 during our test period; latency was 12% higher on average. Those trade-offs mattered for a real-time client workflow.
4) Prefer layered defenses: low-temperature generation + automated fact-check + targeted human review on flagged summaries.
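A back-of-the-envelope model can help when reasoning about review coverage. The sketch below assumes reviewed summaries are picked at random and that a reviewer catches a hallucinated summary with probability `reviewer_recall`; both the function and that parameter are hypothetical simplifications, not our production calculation.

```python
def required_review_coverage(rate, target, reviewer_recall=1.0):
    """Fraction of summaries to review so the residual rate meets `target`.

    Simplified model (an assumption):
        residual = rate * (1 - coverage * reviewer_recall)
    Returns None when even 100% review cannot reach the target.
    """
    if not 0 < reviewer_recall <= 1:
        raise ValueError("reviewer_recall must be in (0, 1]")
    if rate <= target:
        return 0.0  # already under target without any review
    coverage = (1 - target / rate) / reviewer_recall
    return coverage if coverage <= 1.0 else None


TARGET = 0.005  # the 0.5% operational target
for model, rate in [("GPT-4.1", 0.038), ("GPT-5", 0.019)]:
    print(model, required_review_coverage(rate, TARGET))
```

With perfect reviewers this idealized model gives roughly 87% coverage for GPT-4.1 and 74% for GPT-5 at the temperature-0.0 rates. It is not the calculation behind the 100% and ~60% figures above, which also involve an automated filter and are not derived in the text; lowering `reviewer_recall` shows how quickly imperfect review pushes the requirement toward full coverage.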

Checklist to replicate our test in your environment

- Use a representative dataset of at least 1,000 documents; keep sampling stratified by document type.
- Fix prompts and temperature. Report both deterministic and stochastic runs.
- Annotate with at least three labelers and publish your labeling guide.
- Report both extrinsic and intrinsic hallucination rates and provide confidence intervals.
- Declare any post-generation filters applied before counting errors.
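For the first checklist item, proportional stratified sampling can be sketched as below. The archive proportions, document fields, and the `stratified_sample` helper are hypothetical; a fixed seed keeps the draw reproducible.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(docs, key, total, seed=0):
    """Sample about `total` docs, proportionally per stratum given by key(doc).

    Per-stratum quotas are rounded, so the final size can differ from
    `total` by a document or two when proportions do not divide evenly.
    """
    if not docs:
        raise ValueError("docs must not be empty")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for name in sorted(strata):  # sorted for a reproducible order
        group = strata[name]
        quota = round(total * len(group) / len(docs))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Hypothetical archive with three document types.
archive = (
    [{"id": f"memo-{i}", "type": "regulatory_memo"} for i in range(6000)]
    + [{"id": f"brief-{i}", "type": "client_brief"} for i in range(3000)]
    + [{"id": f"wire-{i}", "type": "news_wire"} for i in range(1000)]
)

sample = stratified_sample(archive, key=lambda d: d["type"], total=1200)
print(Counter(d["type"] for d in sample))
```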

Interactive self-assessment: is your application ready to pick GPT-5 over GPT-4.1?

Answer yes/no to the following. Count yes answers.

1) Does your dataset closely match the test distribution used here (legal/regulatory/long-form documents)?
2) Do you require summaries in deterministic formats (no creative language) most of the time?
3) Can you afford 1.6x model cost in exchange for a lower hallucination rate?
4) Do you have a human-review plan for outputs flagged by an automated checker?
5) Are you prepared to run periodic rebenchmarks when models receive updates?

Interpretation:

- 4-5 yes: GPT-5 likely gives material risk reduction and may be worth the higher cost.
- 2-3 yes: Consider a hybrid approach, with GPT-5 for high-risk docs and GPT-4.1 for low-risk or bulk processing with review.
- 0-1 yes: Stick with GPT-4.1 and invest in external safeguards instead of an immediate model upgrade.
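If you want to run the self-assessment across several teams or products, the scoring is easy to encode. This is only a sketch of the rubric above; the function name and return strings are illustrative.

```python
def recommend(answers):
    """Map five yes/no answers (True = yes) to the interpretation above."""
    if len(answers) != 5:
        raise ValueError("expected exactly five answers")
    yes_count = sum(bool(answer) for answer in answers)
    if yes_count >= 4:
        return "GPT-5"
    if yes_count >= 2:
        return "hybrid: GPT-5 for high-risk docs, GPT-4.1 for the rest"
    return "GPT-4.1 plus external safeguards"


print(recommend([True, True, False, True, True]))
print(recommend([True, False, False, True, False]))
print(recommend([False, False, False, True, False]))
```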

Final recommendation and operational guardrails

Based on our December 2025 tests:

- GPT-5 delivered materially lower extrinsic hallucination rates than GPT-4.1 under identical conditions. If your production need is minimizing unsupported factual assertions, GPT-5 is better on this narrow metric.
- Do not take vendor single-number claims at face value. The 0.7% Gemini headline did not reflect our unfiltered replication on a representative dataset.
- Operationalize a multi-layer mitigation plan: deterministic generation (temperature 0.0), an automated fact-checker tuned to your domain, and selective human review. That combination achieved our 0.5% residual target with cost and throughput that met SLAs.

We never fully removed the need for human oversight. Models have improved since 2022, but the cost of a false factual assertion in regulated workflows remains high. Use vendor claims as a conversation starter, not a final decision metric. Reproduce the test with your data, report your definitions, and keep a log of model labels and versions for auditability.

Quick quiz to test your team's readiness (answers at end)

1) Why should you test at temperature 0.0 even if production will use 0.7 sometimes?
2) What is the main difference between extrinsic and intrinsic hallucination?
3) Name two reasons vendor hallucination percentages might be lower than your replication.

Quiz answers:
1) To measure deterministic fidelity and baseline hallucination without sampling noise.
2) Extrinsic = claims unsupported by the source; intrinsic = contradictions or misinterpretations in the output relative to the source.
3) Any two of: dataset filtering, post-generation filtering, or different error thresholds.

If your team wants our test harness and annotation guide (JSON schema, prompt templates, and inter-annotator guide), we can share a sanitized version so you can run a direct replication. That step is the only way to move from vendor headlines to defensible product decisions.