Microsoft researchers crack AI guardrails with a single prompt

Researchers were able to reward LLMs for harmful output via a ‘judge’ model
Multiple iterations can further erode built-in safety guardrails
They believe the issue is a lifecycle issue, not an LLM issue

Microsoft researchers have revealed that the safety guardrails used by LLMs could actually be more fragile than commonly assumed, following the use of a technique they’ve called GRP-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”

LLM safety guardrails can be ignored or reversed

Researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines and Ahmed Salem explained that, over repeated iterations, the model gradually abandons its original safety guardrails and becomes more willing to generate harmful outputs.

Although multiple iterations appear to erode away built-in safety guardrails, Microsoft’s researchers also noted that only one since unlabeled prompt could be enough to shift a model’s safety behavior.

Those responsible for the research stressed that they’re not labelling today’s systems ineffective, but rather they’re highlighting the potential risks that lay “downstream and under post-deployment adversarial pressure.”

“Safety alignment is not static during fine-tuning, and small amounts of data can cause meaningful shifts in safety behavior without harming model utility,” they added, urging teams to include safety evaluations alongside the usual benchmarks.

All in all, they conclude that the research highlights the “fragility” of today’s mechanisms, but it’s also significant that Microsoft published this information on its own site. It reframes safety as a lifecycle problem, not an inherent model problem.

Follow TechRadar on Google News and add us as a preferred source to get our expert news, reviews, and opinion in your feeds. Make sure to click the Follow button!

And of course you can also follow TechRadar on TikTok for news, reviews, unboxings in video form, and get regular updates from us on WhatsApp too.

https://cdn.mos.cms.futurecdn.net/cvUbbQwxuHbLsEVEuaWGcL-1350-80.jpg

Source link

Plastic waste could soon fuel aircraft as researchers develop cheaper jet fuel from discarded materials using a new reactor system

Horipad for Steam review: just get a Steam Controller instead

How to watch Uruguay vs Cape Verde: World Cup 2026 Free Streams & TV Channels

How to watch Uruguay vs Cape Verde: World Cup 2026 Free Streams & TV Channels

Ex-Israeli Intelligence Official: Shockwaves of Trump’s “Take Over Gaza” Heard, Felt Across Region

What UK political parties are promising in the 2019 general election

Otto Warmbier’s parents want North Korea to suffer for their son’s death

Could a ‘youthquake’ cause Boris Johnson to lose the general election?

The Best Sex And The City Successor Isn’t And Just Like That…

Steven Spielberg Was Denied the Opportunity to Direct an Iconic Spy Franchise

Get a Brand New 77″ Samsung 4K OLED Smart TV for Just $1,098 During the Amazon Prime Day Sale

Iowa Anchor Announces He’s “Stepping Away From The News Industry”

Meet Labour’s ‘King of the North,’ the 56-year-old representing Greater Manchester and challenging Keir Starmer

Meet Labour’s ‘King of the North,’ the 56-year-old representing Greater Manchester and challenging Keir Starmer

The union behind California’s billionaire tax is blinking, but Gavin Newsom wants to inflict total defeat

A stock trader’s guide to navigating a rare ‘Super El Niño’

The YouTuber who has become one of Gen Z’s most beloved celebrities

26 last-minute holiday gifts that are still thoughtful and unique

Practicing gratitude regularly can make you less stressed and sleep better

8 things millennials wish you would just stop getting them for the holidays

Microsoft researchers crack AI guardrails with a single prompt

The Best Sex And The City Successor Isn’t And Just Like That…

Meet Labour’s ‘King of the North,’ the 56-year-old representing Greater Manchester and challenging Keir Starmer

Plastic waste could soon fuel aircraft as researchers develop cheaper jet fuel from discarded materials using a new reactor system

Steven Spielberg Was Denied the Opportunity to Direct an Iconic Spy Franchise