Claude just beat GPT-5, Gemini, and Grok in real-world job tasks, according to OpenAI’s own study

OpenAI has released GDPval, a new evaluation system to test how AI performs at work-related tasks
Claude Opus 4.1 comes out in the lead, with ‘ChatGPT-5 high’ in second place
Tasks include things like emailing a response to a dissatisfied customer

We’re all familiar with AI benchmarks, which measure performance at certain tasks, but often these tasks don’t reflect the real world and how people actually use AI, especially at work.

To combat this problem, OpenAI, the maker of ChatGPT, is introducing GDPval, a new way of measuring AI model performance. It sets models real-world work tasks and compares the results against those of human experts across 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers.

Surprisingly, the OpenAI study shows that the best performing model was Anthropic’s Claude Opus 4.1, which outpaced not only OpenAI’s GPT-5 but also Gemini and Grok.

GDPval win rate

This graph shows the overall GDPval win rate (the share of tasks on which the AI did better than an industry expert). Claude Opus 4.1 is out in the lead with a win rate of 47.6%, with ‘ChatGPT-5 high’ coming second at 38.8% and ‘ChatGPT o3 high’ third at 34.1%. ChatGPT-4o scores the lowest, with a win rate of 12.4%, well behind both Grok 4 and Gemini 2.5 Pro.

The study found that Claude was the highest-performing model in eight of the nine industry sectors it tested, including government, health care, and social assistance. In other words, Claude Opus 4.1's lead holds across a diverse range of work-related tasks, not just one niche.

Claude win rates by sector

(Image credit: OpenAI)

Examples of the tasks include emailing a response to a dissatisfied customer requesting a return, optimizing a table layout for a spring vendor fair, and auditing price inconsistencies in purchase orders.

What’s in a name?

The name used by OpenAI, GDPval, comes from the concept of Gross Domestic Product (GDP) as a key economic indicator. OpenAI wants GDPval to be widely adopted to help ground conversations about future AI improvements in evidence rather than guesswork.

Releasing results that show a competitor out in front looks like an exercise in radical transparency by OpenAI, but it fits the company’s stated philosophy. “Our mission is to ensure that artificial general intelligence benefits all of humanity. As part of our mission, we want to transparently communicate progress on how AI models can help people in the real world,” reads a statement from OpenAI.

The GDPval paper, which is available to read in its entirety online, comes a week after OpenAI released a more consumer-focused paper showing that the majority of ChatGPT users (70%) were actually using it at home, rather than at work.

That usage study was conducted by OpenAI’s Economic Research team and Harvard economist David Deming for the National Bureau of Economic Research (NBER). The results surprised a lot of people, as new ChatGPT releases have previously focused heavily on work-related tasks like coding, making presentations, and being a good research tool.

The news that Claude Opus 4.1 is better at actual work-related tasks, not just benchmarks, than even ‘ChatGPT-5 high’ could mean a renewed focus by OpenAI towards its changing user base.
