AIs like ChatGPT fall apart in classic ‘Stroop’ psychological test — and that could stand in the way of achieving artificial general intelligence

A new study tasked AIs with tackling the ‘Stroop’ test
GPT and Claude performed very poorly compared to humans
There are nuances here, but broadly, the researchers argue that improving this side of AIs is crucial for achieving artificial general intelligence

A freshly published study has pointed out a limitation of big-name AI models such as ChatGPT. It has caused some controversy because the primary piece of research uses now-outdated versions of those models, but that doesn’t make the findings irrelevant.

I’ll go into that more shortly, but first, let’s look at the study itself, which was highlighted on Reddit (under the title ‘New study reveals top AI models completely fail the classic “Stroop” psychological attention test’) and published by Oxford University Press in the journal PNAS Nexus.

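The research tests the so-called ‘Stroop effect’ on GPT-4o and Claude 3.5 Sonnet. In humans, this is the slowdown and rise in errors that occurs when naming the ink color of a color word printed in a conflicting color (the word ‘red’ in blue ink, say). As noted, these aren’t the cutting-edge versions of those AIs (Large Language Models, or LLMs), but they were at the time the initial study was carried out.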

Analysis: another necessary step on the path to AGI?


If you’ve scanned through the Reddit thread, you’ll doubtless have noticed that, as mentioned at the outset, commenters have fired a lot of flak at this study for using outdated GPT and Claude models.

Indeed, these older LLMs are called “state of the art” at one point by the authors – and of course, as already noted, they were cutting-edge when the main study was conducted. Still, this is unfortunate phrasing that should’ve been updated and tweaked now that the paper has just been published (after peer review and so forth).

However, the researchers did conduct tests on GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro in September 2025, although this is somewhat buried in the paper. That more recent testing found that GPT-5 and Claude Opus 4.1 offered only “slight” improvements on their predecessors, and that all three models (Gemini 2.5 Pro being a new addition here) still exhibited “ongoing executive attention deficiencies, consistent with our comprehensive analysis of earlier transformer models”.

Granted, a smaller sample size was used for these newer tests, but the researchers still argue that overall, their study reflects a fundamental limitation that is “inherent to the architectural constraints of transformer-based LLMs”.

The authors note one caveat: GPT-5 in ‘Thinking’ mode can write and then run code to ensure it performs the Stroop test flawlessly, and other LLMs can use similar functionality. But this is essentially the AI (cleverly) working around its inadequacies. It doesn’t change the way the model works or reasons more broadly.
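To make that workaround concrete, here is a minimal Python sketch. It is not the study’s code, nor anything GPT-5 actually produced. It assumes, purely for illustration, that each Stroop trial is already available as a structured record holding the printed word and its ink color (the hypothetical `StroopTrial` class below). In the classic Stroop task the correct answer is the ink color, not the word, so once the stimulus is in this form a program can answer perfectly with no attentional control involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StroopTrial:
    """Hypothetical record for one trial (an assumption, not the paper's format)."""
    word: str       # the color word that is printed, e.g. "RED"
    ink_color: str  # the color the word is printed in, e.g. "blue"


def is_congruent(trial: StroopTrial) -> bool:
    """True when the word and its ink color agree (no conflict to resolve)."""
    return trial.word.lower() == trial.ink_color.lower()


def name_ink_color(trial: StroopTrial) -> str:
    """The correct Stroop response: report the ink color and ignore the word."""
    return trial.ink_color.lower()


def read_word(trial: StroopTrial) -> str:
    """The interfering response: read the word and ignore the ink color."""
    return trial.word.lower()


def accuracy(respond, trials) -> float:
    """Fraction of trials on which `respond` returns the ink color."""
    correct = sum(respond(t) == t.ink_color.lower() for t in trials)
    return correct / len(trials)


# One congruent trial and two incongruent (conflicting) trials.
trials = [
    StroopTrial("RED", "red"),
    StroopTrial("RED", "blue"),
    StroopTrial("GREEN", "yellow"),
]
```

Under these assumptions, `accuracy(name_ink_color, trials)` is perfect, while `read_word` is right only on the congruent trial. The point is that the rule is trivial once it is written down as code, which is why running a script says little about how the model handles the conflict on its own.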

The researchers note that transformer architecture innovations for LLMs are focused on enhancing memory capabilities, an approach that fails to address the “core limitations of attention mechanisms, specifically the need for sophisticated alerting, orienting, and executive control networks to enable cognitive flexibility.”

The ultimate aim is effective goal-directed behavior, and the study observes: “Future [LLM] development might benefit from implementing more sophisticated executive control systems that can handle decision conflicts through structured, goal-directed processing rather than relying solely on enhanced memory capabilities.”

The authors argue that “incorporating executive control mechanisms akin to those in biological attention is crucial for achieving artificial general intelligence [AGI].”
