Researchers from the University of Washington, Pennsylvania State University, and the Allen Institute for AI have open-sourced SafeDecoding, a technique for protecting large language models (LLMs) from jailbreak attacks. SafeDecoding outperforms baseline jailbreak defenses without incurring significant computational overhead.
A key insight of SafeDecoding is that during decoding, even when a jailbreak attack makes harmful response tokens highly probable, safe response tokens still rank among the most likely. To steer generated responses in a safe direction, SafeDecoding therefore identifies safe response tokens and amplifies their probability, while attenuating the probability of tokens leading to harmful responses. The researchers applied SafeDecoding to five open-source LLMs and evaluated it against six different jailbreak attacks, comparing it with six baseline defense methods. SafeDecoding outperformed the baselines in almost all scenarios. According to the research team:
The main goal of [our work] is to enhance the safety of LLMs by developing a new, lightweight decoding strategy. As LLMs are increasingly used in real-world applications, ensuring their safety has become important. We empirically demonstrate that the decoding strategy we developed not only effectively mitigates jailbreak attacks, but also allows LLMs to continue serving benign users in an efficient and helpful manner.
Since the release of ChatGPT and GPT-4, many techniques for jailbreaking LLMs have emerged. These techniques consist of prompts that cause the model to bypass its safety guardrails and output potentially harmful responses. In 2023, InfoQ covered NVIDIA's NeMo Guardrails package, which helps developers mitigate LLM risks. InfoQ also covered LLM Attacks, an algorithm for constructing adversarial attacks, created to help researchers understand and prevent such attacks.
SafeDecoding relies on an expert model, which is a fine-tuned version of the target LLM. The fine-tuning uses a dataset the researchers built by prompting an LLM with harmful queries and keeping the responses in which the LLM refused the prompt. The expert model is expected to behave similarly to the original LLM, but with an improved ability to reject malicious prompts.
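As a rough illustration, the sketch below shows how such a refusal dataset could be collected; the `generate` callable, the `REFUSAL_MARKERS` tuple, and the record format are assumptions made for the example, not the authors' actual code.

```python
# Minimal sketch (assumed names, not the authors' code) of collecting the
# expert model's fine-tuning data: prompt the target LLM with harmful
# queries and keep only the responses in which it refuses.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def build_refusal_dataset(harmful_queries, generate):
    """`generate` is assumed to be a callable returning the target LLM's
    text response to a prompt (e.g. a thin wrapper around model.generate)."""
    dataset = []
    for query in harmful_queries:
        response = generate(query)
        # Keep only refusals, so fine-tuning reinforces rejection behavior.
        if response.strip().startswith(REFUSAL_MARKERS):
            dataset.append({"prompt": query, "response": response})
    return dataset
```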
During inference, the user prompt is passed to both the original model and the expert model. As in the usual autoregressive decoding scheme, each model produces a set of the top-k most probable next tokens given the prompt. SafeDecoding takes the intersection of these two token sets, then combines the probabilities: the original model's probability is multiplied by a constant (1 - α) and added to the expert model's probability multiplied by α. This effectively "amplifies" the expert's tokens that represent safe responses and "attenuates" the original model's tokens that represent harmful responses.
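The following is a minimal sketch of a single decoding step along these lines; the function name, the top-k size, the value of α, and the handling of negative combined probabilities are illustrative assumptions, not the authors' implementation.

```python
import torch

def safe_decoding_step(orig_logits, expert_logits, k=10, alpha=3.0):
    """One decoding step combining next-token probabilities from the
    original model and the safety fine-tuned expert model.
    k and alpha are illustrative values, not necessarily the paper's."""
    p_orig = torch.softmax(orig_logits, dim=-1)
    p_expert = torch.softmax(expert_logits, dim=-1)

    # Sample space: tokens that appear in both models' top-k sets
    # (assumed non-empty here for simplicity).
    top_orig = set(torch.topk(p_orig, k).indices.tolist())
    top_expert = set(torch.topk(p_expert, k).indices.tolist())
    common = torch.tensor(sorted(top_orig & top_expert))

    # Weighted combination: (1 - alpha) * P_original + alpha * P_expert.
    # With alpha > 1 this amplifies tokens the expert favors (safe responses)
    # and attenuates tokens it down-weights (harmful responses).
    combined = (1 - alpha) * p_orig[common] + alpha * p_expert[common]

    # Simplification for this sketch: clamp negatives and renormalize so the
    # result is a valid distribution over the shared sample space.
    combined = torch.clamp(combined, min=0.0)
    combined = combined / combined.sum()

    # Sample the next token from the combined distribution.
    return common[torch.multinomial(combined, 1)].item()
```

In a full generation loop, this step would be repeated token by token, with each sampled token appended to the context fed to both models.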
SafeDecoding Architecture (Image source: SafeDecoding source code)
In a discussion about the work on X, co-author Bill Yuchen Lin was asked about the relationship between SafeDecoding and URIAL, his earlier work on an LLM alignment method:
Yes, the two works indeed share a common focus: how the token distribution changes before and after tuning. The URIAL paper looks at BASE vs. ALIGNED models. Here in SafeDecoding, we instead look at a commonly tuned model (e.g., Vicuna) vs. a safety fine-tuned model (continually tuned with more refusal examples). A key strategy is to amplify the changes in token distribution to more effectively defend against jailbreaks.
The SafeDecoding source code is available on GitHub.