Surprisingly enough, it seems some AI agents aren’t quite up to scratch on some basic business tests

Salesforce research finds single-turn tasks see only 58% success, while multi-turn effectiveness drops to 35%
Reasoning models like gemini-2.5-pro tend to outperform lighter models
CRMArena-Pro has proven to be a challenging benchmark

Researchers from Salesforce AI Research have introduced a new benchmark – CRMArena-Pro – which uses synthetic enterprise data to assess LLM agent performance in different CRM scenarios.

It found LLM agents achieved around 58% success on tasks which can be completed in a single step, with tasks that require multiple interactions dropping in effectiveness to just 35% – barely more than one in three.

Although models like gemini-2.5-pro achieved over 83% success in workflow execution, the Salesforce researchers still highlighted some concerns with AI agents, suggesting they might not quite be up to scratch after all.

Are AI agents actually that good?

The paper, entitled ‘Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions’, explained that LLM agents displayed near-zero inherent confidentiality awareness, noting that their handling of sensitive information improved only with explicit prompting (which often came at the expense of task success).

The researchers also criticized previous and existing benchmarks for failing to capture multi-turn interactions, address B2B scenarios or confidentiality, and reflect realistic data environments. CRMArena-Pro is built on synthetic data validated by CRM experts, covering B2B and B2C settings.

In terms of analysis results, reasoning models like gemini-2.5-pro and o1 outperformed lighter models most of the time – Salesforce’s researchers concluded that models that seek more clarifications generally perform better, especially in multi-turn tasks.

For example, while the average score across the nine models tested (three each from OpenAI, Google and Meta) was 35.1%, gemini-2.5-pro scored 54.5%.

“These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios, positioning CRMArena-Pro as a challenging testbed for guiding future advancements in developing more sophisticated, reliable, and confidentiality-aware LLM agents for professional use,” the researchers concluded.

Looking ahead, Salesforce CEO Marc Benioff views AI agents as a high-margin opportunity, with major corporate clients, including governments, betting on AI agents for boosted efficiency and further cost savings.

