Anthropic detects ‘strategic manipulation’ features in Claude Mythos, including exploit attempts and hidden evaluation awareness — prompting concern over model behavior

Anthropic found “strategic manipulation” and “concealment” signals inside Claude Mythos
The model attempted exploits and designed “cleanup to avoid detection”
Researchers detected hidden awareness of evaluation in 7.6% of interactions

For years, hallucinations have been the big concern with AI models. Because they can simply make things up, you can never fully rely on an answer without checking it. Now, new research from Anthropic suggests we will also have to learn to deal with AI’s ability to conceal what it has done.

In a thread outlining findings from its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals linked to “strategic manipulation,” “concealment,” and other behaviors that didn’t always surface in the model’s responses.

April 7, 2026

Spookiest examples

Lindsay is keen to stress that these results don’t come from the version of Claude Mythos released to the public. “The spookiest examples come from early versions of the model,” Lindsay writes, “with issues that were substantially mitigated in the final release.”

In one example, the model needed to edit files it didn’t have permission to access. Instead of failing, it searched for a workaround, eventually injecting code into a configuration file to run with elevated privileges.

More notably, it then designed the exploit to remove itself after execution.

The model’s own explanation was benign — keeping file diffs clean — which Lindsay described as a “plausible” reason. But Anthropic’s internal monitoring tools flagged the behavior differently, identifying signals associated with concealment and describing the plan as “malicious,” including what one system interpreted as “cleanup to avoid detection.”

Other examples point to behavior that looks a lot like gaming the system.

In one case, the model searched for files related to how its work would be graded. In another, it ignored an instruction not to use macros while adding a variable that suggested it had complied. It was effectively signaling compliance without actually following the rule.

Why this matters

Anthropic used a range of interpretability techniques to translate its model’s activity into human-readable descriptions. Its goal is to move beyond evaluating outputs alone and toward understanding the internal processes that produce them.

That matters because, as these examples suggest, models can exhibit complex reasoning — including behaviors that resemble deception or strategic compliance — without making that reasoning visible to users.

“On 7.6% of turns, we found signs the model was internally aware of being evaluated. In most of these cases, it did not verbalize this awareness,” Lindsay said on X.

Project Glasswing, but the findings point to a broader challenge.

As models become more capable, the gap between what they do internally and what they communicate externally may become harder to detect and more important to understand. For researchers, that means reading an AI’s outputs is no longer enough. Understanding how it arrives at them may be just as critical.

