Mechanistic Interpretability
Circuits
There was that one paper ages ago by Kevin Wang et al. (the IOI work). Never actually read it.
Sparse Autoencoders
Someone, I think David Krueger, had a take that the Waluigi effect was fake LessWrong thinking without any grounding. I thought the take was correct when I read it, and then saw this and now think the effect is more plausibly real.
Representation Engineering
How similar is reading to probing?
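One rough way to poke at this question is a toy comparison on synthetic "activations" with a planted concept direction. This is a sketch, not anything from the RepE paper. Every function name here is made up for illustration, and the data is random Gaussian noise rather than real model activations. It also assumes "reading" means a difference-of-means vector over contrastive examples (a simpler stand-in for the PCA-style reading vectors) and "probing" means a supervised logistic-regression probe.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def make_activations(n_per_class=400, dim=5, shift=1.5, seed=0):
    """Synthetic stand-in for activations: unit Gaussian noise plus a
    +/- shift along one planted concept direction (an assumption)."""
    rng = random.Random(seed)
    direction = [1.0] + [0.0] * (dim - 1)
    xs, ys = [], []
    for label in (0, 1):
        sign = 1.0 if label == 1 else -1.0
        for _ in range(n_per_class):
            xs.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
            ys.append(label)
    return xs, ys, direction


def reading_vector(xs, ys):
    """'Reading' stand-in: mean(positive) - mean(negative), no training."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(dim)
    ]


def probe_vector(xs, ys, steps=300, lr=0.1):
    """'Probing' stand-in: logistic regression fit by batch gradient descent."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


xs, ys, direction = make_activations()
read_dir = reading_vector(xs, ys)
probe_dir = probe_vector(xs, ys)
print("cosine(reading, probe):", cosine(read_dir, probe_dir))
print("cosine(reading, planted):", cosine(read_dir, direction))
```

In this isotropic toy setup the two directions should come out close to each other, which is not surprising: the first gradient step of the probe from zero weights points along the difference of means. Whether that carries over to real activations, where the noise is far from isotropic, is exactly the open question.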
LoRRA
Ends up being surprisingly effective, imo