Scaling Monosemanticity: Extracting Interpretable Features from Claude

transformer-circuits.pub

Anthropic identifies millions of interpretable features inside Claude

“If we can understand what is happening inside these models, we can actually verify alignment claims”
