‘Current LLMs introduce substantial errors when editing work documents’: Microsoft scientists find most AI models struggle with long-running tasks — so maybe don’t trust them completely just yet

Microsoft researchers determine that current LLMs aren’t good at long-running tasks
More interactions and less structure significantly reduce benchmark performance
“Python is the only domain where most models are ready”

New research from a trio of Microsoft researchers has uncovered a fundamental issue that could be blocking effective agentic AI: most AI models can’t reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark to provide metrics across 52 sectors, including coding, accounting, science and more.

AI isn’t that good at long-running tasks, yet

The study examines some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even these “corrupt an average of 25% of document content by the end of long workflows,” with lesser models even more likely to get things wrong.

The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and applies 5-10 complex editing tasks through a “round-trip relay simulation” that asks the AI to perform a transformation and then reverse it. This allows the researchers to measure how faithfully each model restores the documents to their original forms.
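The round-trip idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the toy editor functions stand in for model calls, and the similarity measure (`difflib.SequenceMatcher.ratio`) is a placeholder for whatever metric the paper actually uses. All function names here are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# An "edit" is any function that takes a document and returns a new one.
# In a real benchmark this would be a model call; here it is a toy function.
Edit = Callable[[str], str]


def similarity(original: str, reconstructed: str) -> float:
    """Return a 0..1 similarity score (assumed stand-in for the paper's metric)."""
    return SequenceMatcher(None, original, reconstructed).ratio()


def round_trip_relay(document: str, steps: List[Tuple[Edit, Edit]]) -> List[float]:
    """Apply each (forward, backward) pair in turn, carrying the result forward.

    After every round trip the current text is compared with the ORIGINAL
    document, so any damage accumulates across steps.
    """
    scores = []
    current = document
    for forward, backward in steps:
        current = backward(forward(current))
        scores.append(similarity(document, current))
    return scores


# Toy editors (hypothetical). The first pair is perfectly reversible,
# the second loses capitalisation on the way back.
def reverse_lines(text: str) -> str:
    return "\n".join(reversed(text.split("\n")))


def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


if __name__ == "__main__":
    doc = "Alpha\nbeta\nGamma"
    results = round_trip_relay(doc, [(reverse_lines, reverse_lines), (to_upper, to_lower)])
    for step, score in enumerate(results, start=1):
        print(f"after round trip {step}: similarity {score:.3f}")
```

A lossless pair leaves the score at 1.0, while a lossy pair lowers it, and that loss persists into every later step because each round trip starts from the previous output.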

Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that “Python is the only domain where most models are ready.” Conversely, models struggled with natural language workflows, creative areas and semi-structured documents.

The paper also finds that the longer a document is in tokens, the more likely an AI model is to struggle.

Frontier models stood out not by eliminating errors but by delaying them. The other models tested by Microsoft’s researchers included a number of GPT-5 and GPT-4 generations, Claude options, Gemini models and one each from Mistral, xAI and Moonshot – totalling 19 different models from six families.

Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, and GPT 5 Nano (10.0%) falls into last place.
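To get a feel for what these scores imply per interaction, here is a rough back-of-the-envelope sketch. It assumes two things the article does not state: that the benchmark score can be read as the fraction of content preserved, and that each interaction preserves the same fraction of what came before. The second assumption is a simplification, since the researchers report that frontier models tend to delay errors rather than lose content at a steady rate. The function name is hypothetical.

```python
def per_step_retention(final_score: float, interactions: int) -> float:
    """Constant per-interaction retention r such that r ** interactions == final_score.

    Assumes uniform geometric decay, which is an illustrative model only.
    """
    if not 0.0 < final_score <= 1.0:
        raise ValueError("final_score must be in (0, 1]")
    if interactions < 1:
        raise ValueError("interactions must be at least 1")
    return final_score ** (1.0 / interactions)


# Scores after 20 interactions, as reported in the article.
SCORES_AFTER_20 = {
    "Gemini 3.1 Pro": 0.809,
    "Claude 4.6 Opus": 0.731,
    "GPT-5.4": 0.715,
    "GPT 5 Nano": 0.100,
}

if __name__ == "__main__":
    for name, score in SCORES_AFTER_20.items():
        r = per_step_retention(score, 20)
        print(f"{name}: about {r:.1%} retained per interaction, {1 - r:.1%} lost")
```

Under these assumptions, even a model that keeps roughly 99% of a document intact on each interaction ends up well short of the original after 20 of them, which is why small per-step error rates matter in long workflows.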

In short, the paper concludes that today’s AI models are not reliable enough to be trusted with long-running, autonomous workflows, highlighting key areas on which model developers must focus in the future and offering up yet another benchmark to determine model capability.

The Register

