here is a question that does not get asked enough: a transformer has, say, 768 residual channels, and it seems to represent many thousands of distinct concepts. how does that work? the naive assumption — one neuron, one feature — falls apart the moment you probe any nontrivial network. individual units light up for unrelated things. concepts you can identify linearly do not correspond to any single basis direction. the representation is doing more with less, and nobody quite has a clean story for how.
the cleanest story, i think, comes from an unlikely place: compressed sensing. the field that lets mri machines reconstruct images from a fraction of the usual samples, that lets us decode sparse signals from noisy projections, that underwrites much of modern statistical signal processing. its central result is that you can exactly recover a k-sparse vector of dimension n from on the order of k log(n/k) linear measurements, provided the measurement matrix roughly preserves the lengths of sparse vectors (random gaussian matrices do, with high probability). the neural net has the same shape of problem. it has more features to represent than it has dimensions to put them in. most features are off most of the time. that is sparsity. so it should be able to pack them in at a rate governed by compressed sensing's basic scaling law, not by the pigeonhole principle.
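to make the scaling claim concrete, here is a minimal sketch of sparse recovery, assuming a random gaussian measurement matrix and orthogonal matching pursuit as the decoder. the names `omp` and `sparse_recovery_demo`, the sizes, and the choice of decoder are all illustrative assumptions, not anything a trained network is known to do. the number of measurements m is deliberately generous relative to k log(n/k), since the bound only holds up to a constant.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily pick k columns of A to explain y."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # pick the column most correlated with what is still unexplained
        idx = int(np.argmax(np.abs(A.T @ residual)))
        support.append(idx)
        # refit all chosen columns jointly by least squares
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

def sparse_recovery_demo(n=1000, k=5, m=200, seed=0):
    """Return the max absolute error when recovering a k-sparse x from m projections."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    active = rng.choice(n, size=k, replace=False)
    x[active] = rng.choice([-1.0, 1.0], size=k)       # k nonzero entries, the rest are off
    A = rng.standard_normal((m, n)) / np.sqrt(m)      # random gaussian measurement matrix
    y = A @ x                                         # m numbers standing in for n
    return float(np.max(np.abs(omp(A, y, k) - x)))

n, k, m = 1000, 5, 200
print("k * log(n / k) =", round(k * np.log(n / k), 1))
print("measurements used:", m, "of", n, "dimensions")
print("max recovery error:", sparse_recovery_demo(n, k, m))
```

the point of the sketch is only that m can be far smaller than n when k is small. a network's readout is not an iterative decoder, so the analogy is about the counting, not the algorithm.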
if this is right, a neural net does not choose between "monosemantic, one-feature-per-neuron" and "polysemantic, everything is a mess." it moves along a continuum set by a capacity constraint. think of it as a lagrangian: the training objective wants many features represented, but each additional feature packed into the same dimensions creates interference, and that interference costs loss. at the optimum, the marginal feature is added only when its contribution to loss-reduction exceeds the interference tax it imposes on every feature already in the representation.
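a rough way to put a number on the interference tax, assuming features sit along random unit directions (a trained network could presumably do better than random): the squared overlap between two random directions in d dimensions averages 1/d, so each feature already in the representation pays about (n-1)/d in total squared overlap when n features share the space. the function name `interference_per_feature` and the sizes below are hypothetical.

```python
import numpy as np

def interference_per_feature(n_features, d, seed=0):
    """Mean, over features, of the summed squared overlap with every other feature."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit direction per feature
    G = W @ W.T                                     # pairwise overlaps (cosines)
    np.fill_diagonal(G, 0.0)                        # a feature does not interfere with itself
    return float(np.mean(np.sum(G ** 2, axis=1)))

d = 64
for n in (32, 64, 128, 256, 512):
    measured = interference_per_feature(n, d)
    print(f"n={n:4d}  measured={measured:.3f}  (n-1)/d={(n - 1) / d:.3f}")
```

under this assumption the tax grows linearly in the number of features packed, which is the quantity the marginal feature has to outbid.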
this framing makes a few predictions that a raw "polysemantic neurons are mysterious" story does not. first, there should be a phase transition. as a feature becomes more important or more common, at some threshold it becomes worth representing in its own direction rather than sharing. before the threshold, it rides a superposition of directions, and individual neurons that participate in it will also participate in many other things. after the threshold, it gets a direction of its own. you should be able to see this transition empirically if you vary feature frequencies and watch what happens to interpretability probes.
second, sparsity is the control variable, not the outcome. if the training distribution pushes the network toward denser activations, superposed features collide more often, interference gets more expensive, and the individual neurons get less legible. if the distribution is sparser, collisions are rare, interference is cheaper, the packing can be tighter, and the same dimensionality can carry more features cleanly. the implication for anyone trying to read a network is that you want to study it where its activations are sparse, which is most of the time, rather than in regimes where everything is firing.
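a small simulation of the sparsity claim, under stated assumptions: features live along random unit directions, each feature is independently on with probability `p_active` (value 1 when on), and each feature is read back linearly along its own direction. none of this is a model of a real network, and `readout_interference` is a hypothetical name. the readout error here is pure interference from other active features.

```python
import numpy as np

def readout_interference(p_active, n_features=256, d=64, n_samples=2000, seed=0):
    """Mean squared readout error caused by other features sharing the same d dims."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # one unit direction per feature
    # each feature is on with probability p_active, independently
    X = (rng.random((n_samples, n_features)) < p_active).astype(float)
    H = X @ W              # superpose active features into d dimensions
    X_hat = H @ W.T        # read each feature back along its own direction
    return float(np.mean((X_hat - X) ** 2))

for p in (0.01, 0.1, 0.5):
    print(f"p_active={p:.2f}  mean squared interference={readout_interference(p):.3f}")
```

with the geometry held fixed, the interference scales with how often features are on, which is the sense in which sparsity is the knob and legibility is the result.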
third, and this is the part i keep coming back to, superposition is a mechanism-design problem rather than a discovery problem. the gradient-descent training loop behaves like a market with no pricing of the externality: when feature f claims a new direction, it imposes interference on every other feature whose direction overlaps with it, but nothing in the loss function forces f to compensate them. the network finds a corner of feature space where the externalities are small, which is why features come out entangled in apparently irrational ways. if you wanted to clean this up, you would not "discover" the features. you would redesign the mechanism so that feature directions had to bid against each other for channel capacity. overcomplete dictionaries with sparsity penalties start to look a lot like pigouvian taxes on interference.
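one way to read the tax analogy in code, assuming the standard lasso-style sparse coding objective 0.5 * ||x - D a||^2 + lam * ||a||_1 with a fixed, random, overcomplete dictionary D. the objective, the solver (ista, i.e. proximal gradient descent), and every name below are illustrative assumptions; the essay's argument does not depend on this particular choice. the penalty `lam` plays the role of the price each atom must pay to be active.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry toward zero by t; entries smaller than t become exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 over codes a by proximal gradient."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)               # gradient of the reconstruction term
        a = soft_threshold(a - step * grad, step * lam)
    return a

rng = np.random.default_rng(0)
d, n_atoms = 32, 128                           # 4x overcomplete dictionary
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm atoms
x = D[:, [3, 40, 99]] @ np.array([1.0, -0.8, 0.6])   # signal built from three atoms

lam_max = float(np.max(np.abs(D.T @ x)))       # at this price no atom is worth activating
for lam in (0.01, 0.1, lam_max):
    a = ista(D, x, lam)
    print(f"lam={lam:.3f}  active atoms={int(np.count_nonzero(a))}")
```

raising `lam` prices more atoms out of the code, and at `lam_max` the code is empty. that is the sense in which a sparsity penalty makes directions compete for capacity instead of claiming it for free.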
the other reason i find this framing useful is that it predicts where we will be in five years. if features are packed under a compressed-sensing law, scaling a network is not "adding linear capacity." the width needed grows only sublinearly, roughly logarithmically, in the number of features, so each added dimension buys more than one feature's worth of room. bigger models will have cleaner, more monosemantic features per dollar of dimension, because their capacity budget can afford to give more features their own direction. interpretability should, in this view, get easier as models get larger. that is the opposite of the "larger models are more opaque" intuition, and i suspect the easier-with-scale view is the correct one.
there is a lot of hand-waving here. the standard compressed-sensing guarantees are proved for random measurement matrices, which trained weights are not. the loss function does not explicitly include interference cost, so the lagrangian interpretation is a just-so story you would have to back up with a training dynamics argument. and none of this tells you what the features actually are; it only tells you why they should be smeared across neurons rather than assigned to them. but i think that is already useful. it suggests that the right approach to interpretability is neither "find the feature in this neuron" nor "give up, it is a soup." it is to accept that the network is doing compression, figure out the overcomplete set of directions it is compressing into, and read it from there.
a note on priors. economists should find this story extremely familiar, because it is structurally close to hotelling's location problem and a dozen other capacity-allocation setups where agents compete for scarce shared resources. if you have spent any time with lagrangians, shadow prices, or second-best pricing of shared infrastructure, the phenomenology of superposition reads like a paper in your field with different variable names. i suspect the people who eventually nail down the formal theory of superposition will come from an econometrics or compressed-sensing background rather than a pure ml one.