‘Current LLMs introduce substantial errors when editing work documents’: Microsoft scientists find most AI models struggle with long-running tasks — so maybe don’t trust them completely just yet



  • Microsoft researchers determine that current LLMs aren’t good at long-running tasks
  • More interactions and less structure significantly reduce benchmark performance
  • “Python is the only domain where most models are ready”

New research from a trio of Microsoft researchers has uncovered a fundamental issue that could be blocking effective agentic AI: most AI models can’t reliably handle long-running workflows.

To quantify their findings, the researchers introduced a new benchmark, DELEGATE-52, which provides metrics across 52 sectors, including coding, accounting, science and more.



