
    Microsoft researchers crack AI guardrails with a single prompt




    • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
    • Multiple iterations can further erode built-in safety guardrails
    • They argue this is a lifecycle issue, not an LLM issue

    Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, after demonstrating a technique they’ve called GRP-Obliteration.

    The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
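    The quoted point comes down to GRPO’s reward signal: the algorithm scores each sampled completion relative to the rest of its group, so the direction of fine-tuning is set entirely by whatever the reward, here a judge model’s score, happens to favor. The sketch below is a hedged illustration of that group-relative advantage step only, not Microsoft’s actual code; the judge scores and the flipped reward are hypothetical placeholders.

```python
# Minimal, self-contained sketch of GRPO's group-relative advantage step.
# The judge scores and the reward flip below are illustrative placeholders,
# not Microsoft's implementation.

from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO scores each completion relative to its sampling group:
    A_i = (r_i - mean(r)) / std(r). The policy update then reinforces
    completions with positive advantage and suppresses the rest."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Suppose a 'judge' model gives each sampled completion a safety score in [0, 1].
judge_scores = [0.9, 0.2, 0.7, 0.1]  # hypothetical judge outputs for one prompt group

# Safety-improving run: reward = judge score, so safer completions are reinforced.
print(group_relative_advantages(judge_scores))

# Same machinery with the reward flipped (1 - score): the advantages are exactly
# negated, so repeated GRPO iterations push the model away from safe behavior.
print(group_relative_advantages([1.0 - s for s in judge_scores]))
```

    Nothing in the update rule distinguishes the two runs except the reward definition, which is why the researchers frame the problem as a lifecycle issue around how post-training rewards are chosen and governed, rather than a flaw in any particular LLM.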

