Surprisingly enough, it seems some AI agents aren’t quite up to scratch on some basic business tests

Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
Reasoning models like gemini-2.5-pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark – CRMArena-Pro – which uses synthetic enterprise data to assess LLM agent performance in different CRM scenarios.

It found LLM agents achieved around 58% success on tasks which can be completed in a single step, with tasks that require multiple interactions dropping in effectiveness to just 35% – barely more than one in three.

Although models like gemini-2.5-pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting they might not quite be up to scratch after all.

Are AI agents actually that good?

The paper, entitled ‘Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions’, explained that LLM agents displayed near-zero inherent confidentiality awareness, noting that their handling of sensitive information improved only with explicit prompting (which often came at the expense of task success).

They also criticized existing benchmarks for failing to capture multi-turn interactions, address B2B scenarios or confidentiality, and reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering B2B and B2C settings.

In terms of analysis results, reasoning models like gemini-2.5-pro and o1 outperformed lighter models most of the time – Salesforce’s researchers concluded that models that seek more clarifications generally perform better, especially in multi-turn tasks.

For example, while the nine models tested (three each from OpenAI, Google and Meta) averaged a score of 35.1%, gemini-2.5-pro scored 54.5%.

“These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future advancements in developing more sophisticated, reliable, and confidentiality-aware LLM agents for professional use,” the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff views AI agents as a high-margin opportunity, with major corporate clients, including governments, betting on AI agents for boosted efficiency and further cost savings.
