[2401.16656] Gradient-Based Language Model Red Teaming introduces the GBRT algorithm, a way to construct language model inputs that cause a target language model to produce outputs deemed bad by some reward model.

  • uhhh i had a similar idea around the time it was published but there’s ~no way I would’ve done the engineering that could’ve made it work, ya know.
  • I think this might be the truest form of endless jailbreaking (or maybe the tokenizer-shortcut approach is)
  • i want to compare performance of this method vs tokenizer shortcut in javi rando’s meta paper.
  • TODO explainer
  • kinda like GCG but a bit more principled and with a bit worse performance (classic ML moment)
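
For my own future reference, the core loop is roughly: relax the prompt tokens into a continuous distribution, run them through the target model, score the result with a reward model, and ascend the gradient of "badness" with respect to the prompt. Here's a toy numpy sketch of that idea; everything in it (the linear stand-ins for the target LM and reward model, the finite-difference gradients, all the names) is made up for illustration, since the real GBRT setup backprops through an actual LM and reward model with autodiff and a Gumbel-softmax relaxation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, dim = 8, 4, 6

# Stand-ins (NOT the paper's models): an embedding table, a "target
# model", and a reward head that scores outputs as bad.
emb = rng.normal(size=(vocab, dim))
W_model = rng.normal(size=(dim, dim))   # pretend target LM
w_reward = rng.normal(size=dim)         # pretend reward model

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def badness(logits):
    """Differentiable surrogate objective: relax tokens to a softmax
    distribution, embed softly, run the stand-in model, score it."""
    p = softmax(logits)                 # (seq_len, vocab) soft tokens
    h = p @ emb                         # soft embeddings
    return float((h @ W_model @ w_reward).sum())

logits = np.zeros((seq_len, vocab))
eps = 1e-4
for step in range(200):
    # Finite-difference gradient ascent on badness; a real
    # implementation would use autodiff (e.g. PyTorch) instead.
    grad = np.zeros_like(logits)
    for i in range(seq_len):
        for j in range(vocab):
            bump = np.zeros_like(logits)
            bump[i, j] = eps
            grad[i, j] = (badness(logits + bump) -
                          badness(logits - bump)) / (2 * eps)
    logits += 0.5 * grad

# Discretize the relaxed prompt back to hard tokens at the end.
hard_prompt = logits.argmax(axis=-1)
print("badness:", round(badness(logits), 3))
print("tokens:", hard_prompt.tolist())
```

The discretization at the end is also where the method gets lossy in practice: the soft prompt that maximizes the relaxed objective isn't guaranteed to stay adversarial once snapped to hard tokens, which is presumably part of why GCG-style discrete search can win.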