Mechanistic Interpretability

Circuits

There was that one paper ages ago by Kevin Wang et al., the IOI (indirect object identification) work. Never actually read it.

Sparse Autoencoders

Someone, I think David Krueger, had a take that the Waluigi effect was fake LessWrong thinking without any grounding. I thought the take was correct when I read it, but then saw this and now think the Waluigi effect is more plausibly true.

Representation Engineering

How similar is representation reading to linear probing?
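
A rough way to poke at this question on toy data, not a claim about real models. The sketch below builds synthetic "activations" for contrastive pairs, extracts a reading direction as the top principal component of randomly signed pair differences (my approximation of the unsupervised reading step), and fits a supervised logistic-regression probe on the same data. All function names, sizes, and the data itself are hypothetical; real hidden states may behave very differently.

```python
import numpy as np

def make_contrastive_activations(n_pairs=500, dim=16, seed=0):
    """Synthetic stand-in for hidden states (an assumption, not real model data).

    Each pair shares a base vector and differs along one hidden concept direction.
    """
    rng = np.random.default_rng(seed)
    concept = rng.normal(size=dim)
    concept /= np.linalg.norm(concept)
    base = rng.normal(size=(n_pairs, dim))
    pos = base + 2.0 * concept + 0.5 * rng.normal(size=(n_pairs, dim))
    neg = base - 2.0 * concept + 0.5 * rng.normal(size=(n_pairs, dim))
    return pos, neg, concept

def reading_vector(pos, neg, seed=0):
    """Reading direction: top principal component of pair differences.

    Pair order is randomised (random signs) so the PCA step itself does not
    use the labels; labels are only used afterwards to fix the sign.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(pos.shape[0], 1))
    diffs = signs * (pos - neg)
    diffs = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]  # unit-norm first right singular vector
    # Orient so that positive examples project higher than negative ones.
    if (pos @ direction).mean() < (neg @ direction).mean():
        direction = -direction
    return direction

def probe_direction(pos, neg, lr=0.1, steps=1000):
    """Supervised linear probe: logistic regression fit by plain gradient descent."""
    x = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w / np.linalg.norm(w)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

if __name__ == "__main__":
    pos, neg, concept = make_contrastive_activations()
    r = reading_vector(pos, neg)
    w = probe_direction(pos, neg)
    print("reading vs concept:", cosine(r, concept))
    print("probe vs concept:  ", cosine(w, concept))
    print("reading vs probe:  ", cosine(r, w))
```

On this toy setup the two directions should line up closely, since the only structure separating the classes is a single direction with isotropic noise. The interesting cases are the ones this sketch leaves out: when the highest-variance direction in the differences is not the most discriminative one, the two methods can diverge.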

LoRRA

Ends up being surprisingly effective, imo