Samsung’s TRUEBench benchmark puts AI chatbots on trial to see if they’re ready to replace real workers in everyday offices

Samsung TRUEBench subjects AI chatbots to strict rules with no partial credit
Samsung uses 2,485 tests across languages to mimic office workloads
Inputs range from short prompts to documents over twenty thousand characters

The adoption of AI tools in workplaces has grown rapidly, raising concerns not only about automation but also about how these systems are judged.

Until now, most benchmarks have been narrow in scope, testing AI writing tools and chatbots with simple prompts that rarely resemble office life.

Samsung has stepped into this debate with TRUEBench, a new framework it says is designed to track whether AI models can handle tasks that resemble actual work.

Testing AI in the workplace

TRUEBench, short for Trustworthy Real-world Usage Evaluation Benchmark, contains 2,485 test sets spread across ten categories and twelve languages.

Unlike conventional benchmarks that focus on one-off questions in English, it introduces longer, more complex tasks such as multi-step document summarization and translation across multiple languages.

Samsung says inputs vary from a handful of characters to over twenty thousand, an attempt to reflect both quick requests and long reports.

The company argues these test sets expose the limits of AI chatbot platforms when they face real-world conditions rather than classroom-style queries.

Each test has strict requirements: unless every specified condition is met, the model fails. This produces results that are more demanding and less forgiving than those of many existing benchmarks, which often credit partial answers.
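
To illustrate the difference between all-or-nothing scoring and partial credit, here is a minimal Python sketch. It is not Samsung's implementation: the function names, the example conditions, and the sample response are all hypothetical, chosen only to show how one missed condition changes the outcome.

```python
from typing import Callable, List

# A condition is any check that returns True when the response satisfies it.
Condition = Callable[[str], bool]


def strict_score(response: str, conditions: List[Condition]) -> int:
    """All-or-nothing: 1 only if every condition passes, otherwise 0."""
    return int(all(check(response) for check in conditions))


def partial_score(response: str, conditions: List[Condition]) -> float:
    """Partial credit: the fraction of conditions that pass."""
    if not conditions:
        return 0.0
    return sum(check(response) for check in conditions) / len(conditions)


# Hypothetical conditions for a one-line summary request.
conditions = [
    lambda r: len(r) <= 60,            # stays within a length limit
    lambda r: "revenue" in r.lower(),  # mentions the key topic
    lambda r: r.endswith("."),         # ends as a complete sentence
]

# Hypothetical response that meets two of the three conditions.
response = "Quarterly revenue rose 4% on strong phone sales"

print(strict_score(response, conditions))   # 0, because one condition fails
print(partial_score(response, conditions))  # two of three conditions pass
```

Under the strict rule the response fails outright, while a partial-credit scheme would still award it most of the marks.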

“Samsung Research brings deep expertise and a competitive edge through its real-world AI experience,” said Paul (Kyungwhoon) Cheun, CTO of the DX Division at Samsung Electronics and Head of Samsung Research.

“We expect TRUEBench to establish evaluation standards for productivity and solidify Samsung’s technological leadership.”

Samsung Research outlines a process where humans and AI cooperate in designing the evaluation criteria.

Human annotators first set the conditions, then AI reviews them to detect contradictions or unnecessary constraints.

The criteria are refined repeatedly until they are consistent and precise.

Automatic scoring is then applied to AI models, minimizing subjective judgments and making comparisons more transparent.

One of the unusual aspects of TRUEBench is its publication on Hugging Face, where leaderboards allow direct comparison of up to five models.

In addition to performance scores, Samsung also discloses the average response length, a metric that helps weigh efficiency alongside accuracy.
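
As a rough sketch of how a score and an average response length might be read side by side, the Python below summarizes two made-up models. The model names, the records, and the tie-breaking rule are placeholders and assumptions; the article does not describe how Samsung's leaderboard orders its entries.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Each record is (passed_all_conditions, response_text). Placeholder data only.
Results = Dict[str, List[Tuple[bool, str]]]


def summarize(results: Results) -> List[Tuple[str, float, float]]:
    """Return (model, pass_rate, average_length) rows, best first."""
    rows = []
    for model, records in results.items():
        pass_rate = mean(1.0 if passed else 0.0 for passed, _ in records)
        avg_len = mean(len(text) for _, text in records)
        rows.append((model, pass_rate, avg_len))
    # Assumed ordering: highest pass rate first, shorter answers break ties.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


# Hypothetical models and responses, invented for illustration.
results = {
    "model_a": [(True, "Done."),
                (False, "A much longer answer that misses one condition.")],
    "model_b": [(True, "Completed as requested."),
                (False, "No.")],
}

for model, pass_rate, avg_len in summarize(results):
    print(f"{model}: pass rate {pass_rate:.0%}, average length {avg_len:.1f} chars")
```

Here both placeholder models pass half their tests, so the shorter average response decides the order.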

The decision to open parts of the system suggests a push for credibility, although it also exposes Samsung’s approach to scrutiny.

Since the advent of AI, many workers have wondered how productivity will be measured when AI systems are given similar responsibilities.

With TRUEBench, managers may have a way to judge whether an AI chatbot can replace or supplement staff.

Yet despite TRUEBench's ambitions, benchmarks, however broad, remain synthetic measures and cannot fully capture the messiness of workplace communication or decision-making.

TRUEBench may set higher standards for evaluation, but whether it can resolve fears of job displacement, or simply sharpen them, remains an open question.
