
    Microsoft researchers crack AI guardrails with a single prompt




    • Researchers were able to reward LLMs for harmful output via a ‘judge’ model
    • Multiple iterations can further erode built-in safety guardrails
    • They argue this is a lifecycle issue, not an LLM issue

    Microsoft researchers have revealed that the safety guardrails used by LLMs may be more fragile than commonly assumed, after demonstrating a technique they’ve called GRP-Obliteration.

    The researchers discovered that Group Relative Policy Optimization (GRPO), a technique typically used to improve safety, can also be used to degrade safety: “When we change what the model is rewarded for, the same technique can push it in the opposite direction.”
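    The quoted point comes down to GRPO’s reward signal: the algorithm scores each sampled completion relative to the rest of its group, so the direction of fine-tuning is set entirely by whatever the reward, here a judge model’s score, happens to favor. The sketch below is a hedged illustration of that group-relative advantage step only, not Microsoft’s actual code; the judge scores and the flipped reward are hypothetical placeholders.

```python
# Minimal, self-contained sketch of GRPO's group-relative advantage step.
# The judge scores and the reward flip below are illustrative placeholders,
# not Microsoft's implementation.

from statistics import mean, stdev

def group_relative_advantages(rewards):
    """GRPO scores each completion relative to its sampling group:
    A_i = (r_i - mean(r)) / std(r). The policy update then reinforces
    completions with positive advantage and suppresses the rest."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Suppose a 'judge' model gives each sampled completion a safety score in [0, 1].
judge_scores = [0.9, 0.2, 0.7, 0.1]  # hypothetical judge outputs for one prompt group

# Safety-improving run: reward = judge score, so safer completions are reinforced.
print(group_relative_advantages(judge_scores))

# Same machinery with the reward flipped (1 - score): the advantages are exactly
# negated, so repeated GRPO iterations push the model away from safe behavior.
print(group_relative_advantages([1.0 - s for s in judge_scores]))
```

    Nothing in the update rule distinguishes the two runs except the reward definition, which is why the researchers frame the problem as a lifecycle issue around how post-training rewards are chosen and governed, rather than a flaw in any particular LLM.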

