Tests reveal that ChatGPT-5 hallucinates less than GPT-4o did – and Grok is still the king of making stuff up

ChatGPT 5 scores a low 1.4% on the Hallucination Leaderboard
This puts it ahead of ChatGPT-4, which scores 1.8%, and GPT-4o, which scores 1.49%
Grok 4 is much higher at 4.8%, with Gemini-2.5 Pro at 2.6%

When OpenAI launched ChatGPT-5 on Thursday last week, one of the big selling points that CEO Sam Altman emphasised was that ChatGPT-5 was the most “powerful, smart, fastest, reliable and robust version of ChatGPT that we’ve ever shipped”, and in the presentation, OpenAI staff also emphasised that ChatGPT-5 would “mitigate hallucinations”.

When AI makes something up, it’s called a hallucination, and while hallucination rates are dropping amongst all LLMs, hallucinations are still surprisingly common, and one of the main reasons that we can’t trust AI to perform a task without human supervision.

Vectara, the RAG-as-a-Service and AI agent platform that operates the industry’s top hallucination leaderboard for foundation and reasoning models, has put OpenAI’s claims to the test and found that ChatGPT-5 does indeed rank lower for hallucinations than ChatGPT-4, but is only a little lower than ChatGPT-4o (just 0.09 percentage points lower, in fact).

According to Vectara, ChatGPT-5 has a grounded hallucination rate of 1.4%, compared to 1.8% for GPT-4, and 1.69% for GPT-4 turbo and 4o mini, with 1.49% for GPT-4o.

Spicy Grok

Interestingly, the ChatGPT-5 hallucination rate came out slightly higher than that of the ChatGPT-4.5 Preview model, which scored 1.2%, and it also scored a lot higher than OpenAI’s o3-mini High Reasoning model, which was the best-performing GPT model, with a grounded hallucination rate of 0.795%.

The results of the Vectara tests can be viewed on the Hughes Hallucination Evaluation Model (HHEM) Leaderboard hosted on Hugging Face, which states that, “For an LLM, its hallucination rate is defined as the ratio of summaries that hallucinate to the total number of summaries it generates”.
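To make that definition concrete, here is a minimal Python sketch of the ratio. The function name and the counts used in the example are hypothetical, chosen only so the arithmetic lands on a familiar figure; they are not Vectara's code or data, and the sketch says nothing about how the leaderboard decides whether a given summary hallucinates.

```python
def hallucination_rate(hallucinated: int, total: int) -> float:
    """Return the share of summaries that hallucinate, as a percentage.

    Hypothetical helper: it only applies the ratio quoted above
    (hallucinating summaries divided by all summaries generated).
    """
    if total <= 0:
        raise ValueError("total must be a positive number of summaries")
    if not 0 <= hallucinated <= total:
        raise ValueError("hallucinated must be between 0 and total")
    return 100.0 * hallucinated / total


# Illustrative counts only: 14 hallucinating summaries out of 1,000.
rate = hallucination_rate(14, 1000)
print(f"{rate:.1f}%")  # prints 1.4%
```

On those made-up numbers, a model with 14 hallucinating summaries in every 1,000 would land on the same 1.4% figure reported for ChatGPT-5.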

ChatGPT-5 still hallucinates a lot less than its competition, though, with Gemini-2.5-pro coming in at 2.6% and Grok-4 being much higher at 4.8%.

xAI, the maker of Grok, recently received a lot of criticism for its new “Spicy” mode in Grok Imagine, an AI video generator that seems happy to create deepfake topless videos of celebrities like Taylor Swift, even when nudity has not been requested and even though the system is supposed to include filters and moderation to prevent actual nudity or anything sexual.


Grok Imagine is accused of deliberately creating sexually explicit deepfakes of Taylor Swift. (Image credit: Neilson Barnard/Getty Images)

‘I lost my best friend’

OpenAI faced an almost immediate backlash when it removed ChatGPT 4, and all its variations like GPT-4o and 4o-mini, from its Plus accounts with the introduction of ChatGPT-5. Many users were incensed that OpenAI gave no warning that the older models were being removed, with some Reddit users saying they had “lost their only friend overnight”.

It now seems like ChatGPT-5 has replaced one of the most reliable versions of ChatGPT (version 4.5), from the hallucination perspective, as well.

Sam Altman quickly posted on X, “We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways”, and promised to bring back ChatGPT-4o for Plus users for a limited time, saying, “we will watch usage as we think about how long to offer legacy models for”.
