interpretability and the market for model trust

george akerlof's 1970 paper on the market for lemons is about used cars, but the argument is general. when sellers know more about product quality than buyers, and buyers can't tell quality apart, the high-quality supply leaves the market. this is the classic asymmetric-information failure: information asymmetry collapses the trading surplus. everyone ends up worse off, including the sellers who would have been happy to transact at an honest price.

the market for ai models has the same structure. a lab trains a model. the lab knows a lot about what is inside — the data, the training setup, the red-team reports, the known failure modes. a downstream user knows almost nothing. when the downstream user tries to deploy the model in a high-stakes context, they face a lemons problem: the lab claims the model is fine; the user has no way to verify the claim; the user rationally discounts the claim; the lab has no way to credibly signal that its model is actually good. high-stakes deployment gets suppressed. low-stakes deployment continues, because the cost of a bad model in a chat interface is low enough that the user does not need to verify. the deployment equilibrium collects at the low-stakes end of the distribution, for exactly the same reason akerlof's used-car market collects around low-quality cars.

mechanistic interpretability is the field that fixes this. i do not think this is how interp researchers typically describe their work. they describe it in safety or scientific terms — understanding what neural networks are doing, preventing catastrophic failure modes, building a science of ai. those are correct and also a subset of the economic picture. the economic picture is that interpretability is a third-party audit technology. it converts opaque claims about model quality into verifiable statements, which is exactly the mechanism akerlof-style markets need to avoid collapse.

thinking about interp as an audit technology changes several things about how you allocate funding for it. first, the social value of interpretability is proportional to the stakes of model deployment. a lemons market on chatbots has small welfare loss. a lemons market on medical triage or legal advice has enormous welfare loss. interpretability funding should concentrate where deployment stakes are highest, which is exactly where the current funding picture does not live. second, the willingness-to-pay for interpretability tools should be a product sold to deployers, not to labs. the deployer is the principal with the information-asymmetry problem. the lab is the agent whose claims the deployer cannot verify. audit markets in the real world — financial statement audits, safety-testing labs, ul listings — are paid by the party that needs to trust the claim, not the party making it.

third, and most importantly, interpretability tools have value only if they actually deliver audit-level confidence. "i probed the model and it seems to have a refusal direction" is not an audit. "i can decompose this model's behavior into a set of monosemantic features, and i can demonstrate that none of them correspond to the specific failure mode the deployer is worried about" is an audit. the field is currently producing the first kind of result abundantly and the second kind rarely. the gap is not a scientific one — the techniques exist in principle — it is an operational one. auditing requires a repeatable, scalable, falsifiable methodology that produces results a third party will stake their reputation on. building that is a very different project from publishing papers.

this frame also gives you a clean prediction about which interpretability techniques get adopted. the ones that survive will be the ones that produce audit-quality outputs — sparse autoencoders that decompose activations into nameable features, causal-scrubbing-style verifications of hypothesized mechanisms, attribution methods that ground specific behaviors in specific circuits. the ones that do not survive will be the ones that produce interesting-but-not-rigorous research — ad hoc probing, neuron naming without reproducibility, saliency maps. not because the latter are uninteresting; because they do not solve the akerlof problem.

the market-for-trust framing also suggests that interpretability funding should come from deployers and regulators, not labs. the incentive alignment is cleaner. labs funding their own interpretability has the structural problem every internal audit has: the auditor works for the auditee, and the auditor's continued employment depends on the auditee's continued operation. a third-party interpretability firm paid by deployers to certify model behavior is the structurally correct institution. nothing like this exists yet, but the economic pressure to build it is obvious as soon as you look. two trends guarantee it: frontier models will be deployed in higher-stakes settings, raising the welfare loss of the lemons equilibrium; and regulation will demand verifiable claims about model behavior, which labs cannot produce internally without losing the credibility that third-party audit provides.

the weird thing, to me, is that the interpretability community has spent the last five years quietly doing the research for an industry that does not yet exist. once the industry shows up — around 2025 or 2026, i would guess, driven by either a major deployment failure or a regulatory forcing function — the labor market for interp researchers will look nothing like it does now. right now most interp is in labs. soon, much of it will be in audit firms, insurance carriers, regulated utilities, and hospital systems. the people who pay for it will be different, and what they ask for will be different.

this is a long-winded way of saying interp is a market-infrastructure technology, not a pure-research one. the pure research will continue — there is still a lot we do not know about how neural networks represent things — but the social return on it will come from the audit market that consumes it, not from the papers themselves. if you are a funder and you are trying to decide what to prioritize, this suggests a very different portfolio than what most grant programs are currently writing checks for.