
notes on superposition and polysemanticity

working notes on one of the central problems in mechanistic interpretability. this is the stuff i think about most and the area where i think breakthroughs matter the most.

the problem

neural networks need to represent more features than they have neurons. a layer with 512 dimensions might need to encode thousands of distinct concepts. how? by encoding features as directions in activation space rather than as individual neurons. multiple features share the same neurons. this is superposition.
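a quick numpy sketch of why the geometry allows this: in high dimensions you can fit far more nearly-orthogonal directions than you have axes, so each feature gets a direction and pairwise interference stays small. sizes here are illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 512, 2000  # illustrative: 2000 features in a 512-dim space

# give each feature a random unit direction in activation space
directions = rng.standard_normal((n_features, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# off-diagonal dot products measure interference between feature pairs
gram = directions @ directions.T
off_diag = np.abs(gram[~np.eye(n_features, dtype=bool)])
print(f"mean interference: {off_diag.mean():.3f}")  # on the order of 1/sqrt(d)
print(f"max interference:  {off_diag.max():.3f}")
```

the interference is small but nonzero, which is exactly why a single neuron (one axis) ends up with projections from many features.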

polysemanticity is the flip side. a single neuron responds to multiple unrelated concepts. one neuron might activate for both "academic citations" and "dollar amounts." that's not a bug. it's a consequence of the network packing more information into fewer dimensions than a one-to-one mapping would allow.

why this matters

if every feature mapped to one neuron, interpretability would be easy. you'd just look at each neuron, figure out what it does, and you'd have a complete understanding of the network. superposition makes this impossible. the features are entangled. you can't read them off individual neurons.

this is arguably the core technical challenge of mech interp. most other problems (understanding circuits, tracing information flow, identifying algorithms) become much easier once you can cleanly extract individual features.

sparse autoencoders

the current best approach is training sparse autoencoders (SAEs) on model activations. the SAE learns to decompose the activation vector into a larger set of sparse features. each feature ideally corresponds to one interpretable concept.

this works surprisingly well. anthropic's work on Claude found thousands of interpretable features using this technique. features for specific concepts, entities, writing styles, code patterns. the decomposition isn't perfect. some features are still hard to interpret. but the hit rate is high enough to be useful.

open questions i'm thinking about: how do you evaluate whether an SAE has found the "true" features vs. artifacts? how do you handle features that are inherently compositional? what's the right sparsity penalty? how do you scale this to the largest models without the compute costs being prohibitive?

the compression hypothesis

my current mental model is that superposition is a form of lossy compression. the network needs to store more information than it has dimensions, so it compresses. the compression exploits statistical structure in the data. features that rarely co-occur can share dimensions because they rarely interfere.

this predicts that the amount of superposition should increase with the number of features relative to the number of dimensions, and decrease as features become more correlated. both of these seem to be true empirically. it also suggests that understanding the data distribution is key to understanding the superposition structure.
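a toy check of the interference story: superpose many features in few dimensions, read each feature back off its own direction, and watch the readout error grow as features co-occur more often. all sizes and activity rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 64, 512, 2000  # illustrative: 512 features in 64 dims, 2000 samples
dirs = rng.standard_normal((m, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

def readout_error(p):
    # each feature independently active with probability p, value 1
    acts = (rng.random((n, m)) < p).astype(float)
    X = acts @ dirs                   # superposed representation
    est = X @ dirs.T                  # read each feature off its direction
    return np.abs(est - acts).mean()  # interference from co-active features

for p in (0.005, 0.02, 0.1):
    print(f"p={p}: mean readout error {readout_error(p):.3f}")
```

when features are rarely co-active, each readout is nearly clean; as activity rises, the interference terms from other active features swamp the signal. that's the compression tradeoff in miniature.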


status: active research notes. updating frequently as i read new papers and run experiments. this is the area i want to contribute to most.