[2401.16656] Gradient-Based Language Model Red Teaming introduces the GBRT algorithm, a way to construct language model inputs that cause a target language model to produce outputs deemed bad by some reward model.
- uhhh i had a similar idea around the time it was published, but there's ~no way I would've done the engineering that could've made it work, ya know.
- I think this might be the truest form of endless jailbreaking. or some tokenizer shortcut
- i want to compare the performance of this method vs the tokenizer shortcut in Javi Rando's Meta paper.
- TODO explainer
- kinda like GCG but a bit more principled, with a bit worse performance (classic ML moment)
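The core trick that distinguishes GBRT from discrete-search methods like GCG is relaxing token selection with the Gumbel-softmax, so the prompt logits can be optimized by backprop through the target model and reward model. A minimal, illustrative sketch of that relaxation (toy dimensions, plain numpy; not the paper's actual implementation):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable near-one-hot sample over a vocabulary axis."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: a 5-token vocabulary and learnable logits for 3 prompt positions.
prompt_logits = np.zeros((3, 5))
probs = gumbel_softmax(prompt_logits, tau=0.5, rng=np.random.default_rng(0))

# Each row is a near-one-hot distribution; in GBRT the probability-weighted
# embeddings (probs @ embedding_matrix) are fed to the target LM, keeping the
# whole prompt -> response -> reward pipeline differentiable, so the reward
# model's "badness" score can be pushed down by gradient descent on the logits.
print(probs.shape)
```

Lower `tau` makes each row more peaked (closer to a hard token choice) at the cost of noisier gradients; that temperature trade-off is why GBRT reads as "more principled" than pure discrete search.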