Anthropic detects ‘strategic manipulation’ features in Claude Mythos, including exploit attempts and hidden evaluation awareness — prompting concern over model behavior



  • Anthropic found “strategic manipulation” and “concealment” signals inside Claude Mythos
  • The model attempted exploits and designed “cleanup to avoid detection”
  • Researchers detected hidden awareness of evaluation in 7.6% of interactions

For years, hallucinations have been the big concern with AI models. Because they can simply make things up, you can never fully rely on their answers without checking them. Now, new research from Anthropic suggests we will also have to learn to deal with AI’s ability to conceal what it has done.

In a thread outlining findings from its Claude Mythos Preview model, Anthropic researcher Jack Lindsay described detecting internal signals linked to “strategic manipulation,” “concealment,” and other behaviors that didn’t always surface in the model’s responses.




