OpenAI’s Deep Research smashes records for the world’s hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake

The accuracy achieved by the top-scoring AI in the world’s hardest benchmark has improved by 183% in just two weeks
ChatGPT o3-mini now scores up to 13% accuracy, depending on the reasoning setting
OpenAI Deep Research obliterates competition with 26.6% accuracy result

The world’s hardest AI exam, Humanity’s Last Exam, was launched less than two weeks ago, and we’ve already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI’s Deep Research topping the leaderboard.

The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man. It’s so hard that when I previously wrote about Humanity’s Last Exam, I couldn’t even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated only on text (not multi-modal). Now, OpenAI’s o3-mini, which launched earlier this week, has scored 10.5% accuracy at the o3-mini setting and 13% accuracy at the o3-mini-high setting, which is more intelligent but takes longer to generate answers.

More impressive, however, is the score posted by OpenAI’s new AI agent, Deep Research: 26.6%, a whopping 183% increase over DeepSeek R1’s 9.4% in less than 10 days. It’s worth noting that Deep Research can search the web and the other AI models can’t, which makes the comparison slightly unfair. The ability to search the web is helpful for a test like Humanity’s Last Exam, as it includes some general knowledge-based questions.
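For anyone who wants to check the maths, here is a minimal sketch of how the 183% figure appears to work out, assuming the baseline is DeepSeek R1’s 9.4% text-only score quoted above. The function and variable names are my own, purely for illustration. Note that this is a relative increase; in percentage points, the jump is 17.2.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, expressed as a percentage of old."""
    return (new - old) / old * 100


# Accuracy scores quoted in this article (percent of questions answered correctly).
deepseek_r1 = 9.4      # previous leader, text-only evaluation
o3_mini_high = 13.0    # o3-mini at the o3-mini-high setting
deep_research = 26.6   # OpenAI Deep Research, which can search the web

# Assumed baseline: DeepSeek R1. This reproduces the headline figure.
print(round(percent_increase(deepseek_r1, deep_research)))   # 183

# Measured against o3-mini-high instead, the relative jump is smaller.
print(round(percent_increase(o3_mini_high, deep_research)))  # 105
```

Either way, Deep Research roughly doubles or nearly triples the previous best, depending on which model you treat as the starting point.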

That said, the accuracy of models taking Humanity’s Last Exam is steadily improving, and it does make you wonder just how long we’ll need to wait to see an AI model come close to completing the benchmark. Realistically, AI shouldn’t be able to come close any time soon, but I wouldn’t bet against it.

“It looks like the latest OpenAI model is doing very well across many topics. My guess is that Deep Research particularly helps with subjects including medicine, classics, and law.” pic.twitter.com/x8Ilmq1aQS (February 3, 2025)

Better, but 26.6% never got me any SATs

OpenAI Deep Research is an incredibly impressive tool, and I’ve been blown away by the examples that OpenAI showed off when it announced the AI agent. Deep Research is able to work as your personal analyst, taking time to conduct intense research and come up with reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity’s Last Exam is seriously impressive, especially considering how far the benchmark’s leaderboard has come in just a couple of weeks, it’s still a low score in absolute terms – no one would claim to have passed a test with anything less than 50% in the real world.

Humanity’s Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they’ve come. How long will we have to wait to see an AI pass the 50% mark? And which model will be the first to do so?
