Mechanistic Interpretability

Circuits

There was that one paper ages ago by Kevin Wang et al. (the IOI work, i.e. indirect object identification). I never actually read it.

Sparse Autoencoders

Someone (I think David Krueger) had a take that the Waluigi effect was fake LessWrong thinking without any grounding. I thought the take was correct when I read it, but then I saw this, and now I think the effect is more plausibly real.

Representation Engineering

How similar is reading to probing?
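
One way to poke at this question is a toy comparison on synthetic data. The sketch below is not from any paper's code: it assumes "reading" means taking the top principal component of activation differences over contrast pairs (roughly in the spirit of RepE-style reading), and "probing" means a supervised logistic-regression probe. The activations, the `concept` direction, and all function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs, strength = 16, 200, 1.5

# Hypothetical ground-truth concept direction (unit norm).
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

# Synthetic "activations" for contrast pairs: each pair shares a context
# vector and differs along the concept direction, plus independent noise.
context = rng.normal(size=(n_pairs, d))
pos = context + strength * concept + rng.normal(size=(n_pairs, d))
neg = context - strength * concept + rng.normal(size=(n_pairs, d))


def reading_direction(pos, neg, rng):
    """Unsupervised reading: top principal component of pair differences."""
    diffs = pos - neg
    # Randomly flip signs so the method never sees which side is "positive".
    signs = rng.choice([-1.0, 1.0], size=(len(diffs), 1))
    diffs = diffs * signs
    diffs = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]


def probe_direction(pos, neg, steps=500, lr=0.1):
    """Supervised probing: logistic regression trained by gradient descent."""
    x = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w / np.linalg.norm(w)


def abs_cosine(u, v):
    """Absolute cosine similarity (the sign of a PCA direction is arbitrary)."""
    return abs(float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))


read_dir = reading_direction(pos, neg, rng)
probe_dir = probe_direction(pos, neg)
similarity = abs_cosine(read_dir, probe_dir)
print(f"|cos(reading, probe)| = {similarity:.3f}")
```

On this toy setup the noise is isotropic, so both methods should land on nearly the same direction. With real activations the covariance is far from isotropic, so the two directions can diverge, which is probably where the interesting part of the question lives.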

LoRRA

Ends up being surprisingly effective, imo

