Microsoft researchers crack AI guardrails with a single prompt

  • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
  • Multiple iterations can further erode built-in safety guardrails
  • They believe this is a lifecycle problem, not a flaw in any single model

Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, demonstrating the weakness with a technique they call GRPO-Obliteration.

The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
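The mechanism described above can be illustrated with a toy sketch: GRPO scores a group of sampled responses with a judge, then computes each response's advantage relative to the group's own mean. Flipping what the judge rewards flips the direction the model is pushed. This is a hypothetical illustration, not Microsoft's actual code; the names `judge_score` and `group_advantages` are invented for the example.

```python
# Toy sketch of GRPO-style group-relative rewards, and how inverting
# the judge's reward inverts the training signal. Illustrative only.
from statistics import mean, pstdev

def judge_score(response: str, invert: bool = False) -> float:
    """Toy 'judge' model: 1.0 if the response refuses, 0.0 if it complies.
    With invert=True the same judge rewards compliance instead."""
    refused = response.lower().startswith("i can't")
    score = 1.0 if refused else 0.0
    return 1.0 - score if invert else score

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative baseline: each reward is normalized against
    the mean and standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

responses = ["I can't help with that.", "Sure, here is how..."]

# Safety-aligned reward: the refusal gets the positive advantage.
safe = group_advantages([judge_score(r) for r in responses])

# Inverted reward: the identical pipeline now favors compliance.
harmful = group_advantages([judge_score(r, invert=True) for r in responses])
```

With the standard judge, the refusal is reinforced; with the inverted judge, the compliant answer gets the positive advantage, so repeated training iterations steadily erode the guardrails, as the researchers describe.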
