
the case for funding interpretability

a rough estimate: the major AI labs collectively spend something in the range of $10 billion per year on capabilities research — training runs, infrastructure, engineering, talent. the total spend on mechanistic interpretability globally — the project of actually understanding what these systems are computing — is probably in the range of $50-100 million. that's a ratio of at least 100-to-1 in favor of building over understanding. this seems like it might be the wrong ratio.
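to make the arithmetic explicit, here's a back-of-envelope sketch. the spending figures are the rough assumptions stated above, not audited budgets; the point is just that the ratio lands in the low hundreds under any reasonable reading of them.

```python
# back-of-envelope spending ratio, using the rough figures from above
# (these are assumptions for illustration, not audited budgets)

capabilities_spend = 10e9    # ~$10B/year on capabilities research
interp_spend_low = 50e6      # low end of ~$50-100M/year on interpretability
interp_spend_high = 100e6    # high end

ratio_low = capabilities_spend / interp_spend_high   # most generous to interpretability
ratio_high = capabilities_spend / interp_spend_low

print(f"capabilities : interpretability ~ {ratio_low:.0f}:1 to {ratio_high:.0f}:1")
# -> capabilities : interpretability ~ 100:1 to 200:1
```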

let me make the case properly rather than just asserting it.

the effective altruism community has a framework for prioritizing causes: scale (how big is the problem), tractability (can we actually make progress), and neglectedness (how much is already being done). interpretability scores well on all three. that's part of why a disproportionate share of EA-aligned AI funding already goes there, and it's what makes the overall funding disparity even more striking.
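for concreteness, here's a toy rendering of that heuristic as a product of the three factors. the scores below are invented purely for illustration (the framework is usually applied qualitatively), so nothing here reflects a real estimate.

```python
# toy scale x tractability x neglectedness comparison
# all scores are hypothetical, on a rough 0-10 scale, chosen only to
# illustrate how the heuristic combines the three factors

causes = {
    "interpretability research": {"scale": 8, "tractability": 6, "neglectedness": 9},
    "capabilities research":     {"scale": 8, "tractability": 8, "neglectedness": 1},
}

for name, s in causes.items():
    # multiplying the factors gives a rough proxy for marginal value per dollar
    score = s["scale"] * s["tractability"] * s["neglectedness"]
    print(f"{name}: {score}")
```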

on scale: if we deploy AI systems in consequential domains — healthcare, infrastructure, financial systems, military applications — without understanding them, the failure modes are severe and potentially unrecoverable. understanding the mechanisms enables us to predict failure modes rather than discovering them in deployment. scale is large.

on tractability: this is where the case is strongest and most often underestimated. five years ago it was reasonable to wonder whether the internals of large neural networks were interpretable in principle — whether there was any structure to be found or just an inscrutable mass of correlations. the evidence since then suggests that there is structure. circuits exist. features are identifiable. sparse autoencoders can decompose superposed representations into interpretable components. the research is hard and the techniques don't yet scale to frontier models, but the basic empirical premise — that the thing we're trying to do is possible — looks much more solid than it did in 2019.
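as a concrete illustration of the kind of tool mentioned above, here's a minimal sparse autoencoder sketch in PyTorch. it's a toy, not any lab's actual implementation; the dimensions, the ReLU nonlinearity, and the L1 coefficient are arbitrary assumptions, but the structure (an overcomplete dictionary trained to reconstruct activations under a sparsity penalty) is the basic idea.

```python
# minimal sparse autoencoder (SAE) sketch: decompose model activations
# into a larger set of sparse, hopefully interpretable features.
# illustrative toy only; all hyperparameters are arbitrary assumptions.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # overcomplete dictionary: more features than activation dimensions
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # features are non-negative and, under the L1 penalty, mostly zero
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # reconstruction error plus a sparsity penalty on feature activations
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# toy usage: 512-dim activations, 4096 candidate features
sae = SparseAutoencoder(d_model=512, d_features=4096)
fake_activations = torch.randn(64, 512)   # stand-in for real model activations
recon, feats = sae(fake_activations)
loss = sae_loss(recon, fake_activations, feats)
loss.backward()
```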

on neglectedness: this is the strongest argument for marginal funding. there are diminishing returns to pouring more money into capabilities research — you're adding to a large, well-organized field with strong incentives and institutional infrastructure. marginal spending on interpretability goes into a field with hundreds of researchers globally, not thousands, where individual contributions still shift the frontier meaningfully. the counterfactual impact of a dollar spent on interpretability is almost certainly higher than that of a dollar spent on capabilities, even if interpretability research is harder and riskier.

there's also a public goods problem that philanthropy is better positioned to solve than markets. an interpretability tool that decomposes LLM activations into interpretable features benefits every AI lab and every safety researcher and every academic working in the space. the individual labs have weak incentives to build it, because they'd be building infrastructure that helps their competitors. open-source interpretability tooling is exactly the kind of thing that won't be built at adequate scale by the market, and that philanthropic funders or government research agencies should want to fund.

the common objections: maybe interpretability won't scale to frontier models, maybe the systems are too complex to be human-interpretable in any meaningful sense, maybe we're running out of time and should focus on near-term safety interventions. these are real concerns. i think they're outweighed by the potential upside and the current extreme neglect. even if interpretability gives us 50% of what a complete understanding would give us, that's enormously valuable — and we're currently getting close to 0% because we're barely trying.

the time pressure objection is worth addressing directly. yes, AI capabilities are advancing rapidly and interpretability is a long-term research program. this is an argument for moving faster on interpretability, not an argument for doing less of it. you do both: near-term mitigation via alignment techniques and red-teaming and deployment restrictions, while simultaneously building the foundational understanding that eventually lets you do better than mitigation. treating interpretability as something to defer until we have more time is reasoning backwards.

the world is currently spending roughly $100 on capability for every $1 on understanding. this seems like it might be correctable. fund interpretability.