06 Aug 2023

Why Experiment Wins Underdeliver

A product team ships ten experiment wins in a quarter. Each one cleared the decision rule, each showed a lift on its primary metric, and no single analysis looked obviously broken. Then the quarter closes and the top-line number barely moves. That disappointment is common enough that it deserves a name.

An A/B test is one of the few instruments a product team has that can yield genuinely causal evidence. Random assignment, not the $p$-value, gives the comparison its causal interpretation: this change, on average, moved this metric for the users represented by the analysis. Statistical significance says that, if the null model and test assumptions held, a difference this large or larger would be rare. That is why experiments are so valuable, and why the quarter’s disappointment usually traces to their limits rather than to the method itself.

The shortfall comes from the quiet leap between what a test estimates and what a launch delivers. A well-run experiment estimates the effect of one change on one metric over a short window for the population in the analysis, on the assumption that units do not interfere. Real benefit is broader: the effect of all changes together on the business, over the long run, for everyone. These are different quantities, and confusing them is an expensive mistake.

Much of the shortfall comes from violating one of those qualifiers: one change, one metric, a short window, the analyzed population, no interference. Closing it takes more than careful analysis of each test in isolation; it takes a measurement system (holdouts, guardrails, reverse experiments) built to track what actually matters.

The Launch Gap

That shortfall is the launch gap: forecasted impact from shipped experiment readouts minus the aggregate impact later estimated at the level of the business. In practice, the second number should come from a holdout, reverse experiment, or other analysis that separates launch impact from seasonality, traffic mix, and unrelated changes. It has three broad sources:

Bias inside the experiment. The measurement was biased for its own estimand: an assumption failed, and the test did not cleanly estimate even what it claimed to. The winner’s curse, instrumentation bugs, and interference are the first kind.
The wrong quantity. The estimate is unbiased for what the experiment defined, but what the experiment defined is not the business outcome it was extrapolated to. Cannibalization, dilution, hidden costs, and short- versus long-term divergence are the second kind.
Effects that do not compose. Individual readouts may be reasonable on their own, but summing or launching them together changes the context. Non-additivity, saturation, and interaction effects are the third kind.

Bias Inside the Experiment

Selection and the Winner’s Curse

One cause is invisible inside any single experiment: teams do not launch a random sample of their experiments; they launch the ones that won. Conditioning on “statistically significant and positive” is a selection rule, and selection on a noisy estimate biases that estimate upward. This is regression to the mean in a business context; experimenters call it the winner’s curse.

The intuition: a measured effect is the true effect plus noise. Among the experiments that clear the significance bar, the ones where the noise happened to be favorable are over-sampled. The expected measured lift among shipped winners, $\mathbb{E}[\text{measured} \mid \text{shipped}]$, therefore exceeds their expected true lift, $\mathbb{E}[\text{true} \mid \text{shipped}]$. The effect is most severe when statistical power is low (small or borderline effects, the bulk of real product work): there, the surviving winners are dominated by lucky draws.

A short simulation, using a simplified normal estimator and absolute lift units, shows how large the bias can be. The particular numbers are illustrative; the point is the direction and rough magnitude of the selection effect:

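import math
import random

random.seed(42)  # fixed seed so the run is reproducible

n_experiments = 10_000   # size of the simulated experiment portfolio
n_per_arm     = 5_000    # users per arm in each experiment
sigma         = 1.0      # per-user standard deviation of the metric
se            = sigma * math.sqrt(2 / n_per_arm)  # standard error of the difference in means

sum_meas = sum_true = n_ship = 0
for _ in range(n_experiments):
    # 20% of ideas have a real effect; the remaining 80% do nothing.
    has_effect = random.random() < 0.20
    true_lift  = random.gauss(0.02, 0.01) if has_effect else 0.0
    # The readout is the true lift plus sampling noise.
    measured   = random.gauss(true_lift, se)
    # Decision rule: ship only positive results whose z-score exceeds 1.96.
    if measured / se > 1.96:
        n_ship   += 1
        sum_meas += measured
        sum_true += true_lift

# Compare what the shipped winners reported with what they truly delivered.
print(f"shipped {n_ship / n_experiments:.1%} of experiments")
print(f"avg measured lift among shipped: {sum_meas / n_ship:.4f}")
print(f"avg TRUE     lift among shipped: {sum_true / n_ship:.4f}")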

With this seed, the simplified model produces:

shipped 6.1% of experiments
avg measured lift among shipped: 0.0500
avg TRUE     lift among shipped: 0.0178

The shipped winners report an average lift of 0.0500 but truly deliver 0.0178, an inflation of roughly 2.8×. Because most ideas do nothing, some of those winners are not merely inflated; their true effect is null, and noise alone carried them past the bar. Nothing here requires fraud or bad faith; each experiment can be analyzed correctly by its stated rule. The bias is a property of the portfolio and the decision rule, not of any one test.

Two familiar habits make the selection rule worse. The first is peeking at results and stopping the moment significance appears, which inflates the false-positive rate unless a sequential procedure is used.¹ The second is running many metrics or segments and celebrating whichever crossed the line, the multiple-comparisons problem. Both amplify the same underlying selection bias.
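The peeking problem is easy to reproduce. The sketch below simulates A/A tests (the null is true in every one) and compares stopping at the first significant interim look against a single look at the end. The number of looks, the equal batch sizes, and the seed are assumptions chosen for illustration, not a description of any particular platform:

```python
import math
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

n_sims = 5_000   # simulated A/A tests: no true effect anywhere
n_looks = 10     # hypothetical number of interim looks per test
z_crit = 1.96    # two-sided 5% critical value

peek_hits = 0    # tests that crossed the bar at ANY look
final_hits = 0   # tests significant at the single, final look

for _ in range(n_sims):
    running_sum = 0.0
    crossed = False
    for look in range(1, n_looks + 1):
        # Each look adds one equal-sized batch; under the null its
        # standardized contribution is a standard normal draw.
        running_sum += rng.gauss(0.0, 1.0)
        z = running_sum / math.sqrt(look)
        if abs(z) > z_crit:
            crossed = True
    peek_hits += crossed              # would have stopped and declared a win
    final_hits += abs(z) > z_crit     # z here is the final-look statistic

peek_rate = peek_hits / n_sims
final_rate = final_hits / n_sims
print(f"false-positive rate, stop at first significant look: {peek_rate:.1%}")
print(f"false-positive rate, single look at the end:         {final_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the stop-when-significant rate comes out several times higher. Sequential procedures exist to spend that error budget deliberately across looks.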

Twyman’s Law

Before subtle business explanations matter, the readout itself has to be real. Twyman’s law, a maxim from media audience research, captures the idea: any figure that looks interesting or different is usually wrong. In practice, extraordinary experimental results are more often instrumentation artifacts (a logging change, a redirect that drops a cohort, a broken bot filter) than genuine product wins. A common trust check is the Sample Ratio Mismatch (SRM) test: if assignment was 50/50 but the observed split is 50.3/49.7 on millions of users, the $p$-value on that ratio will be vanishingly small, and downstream metrics are suspect because the randomization itself is compromised.² A win that fails the SRM check should be treated first as a bug report, not as a launch candidate.
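A minimal SRM check is a chi-square goodness-of-fit test on the two arm counts. The sketch below uses only the standard library; the function name and the 2,000,000-user example are assumptions made for illustration:

```python
import math

def srm_p_value(n_treatment: int, n_control: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for the observed arm split."""
    total = n_treatment + n_control
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = ((n_treatment - exp_t) ** 2 / exp_t
            + (n_control - exp_c) ** 2 / exp_c)
    # With 1 degree of freedom, the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical 50.3 / 49.7 split on 2,000,000 users with a 50/50 design.
p_mismatch = srm_p_value(1_006_000, 994_000)
# Hypothetical split that is off by only 0.025 percentage points per arm.
p_balanced = srm_p_value(1_000_500, 999_500)

print(f"50.3/49.7 split:     p = {p_mismatch:.2e}")
print(f"near-perfect split:  p = {p_balanced:.3f}")
```

The first split looks harmless to the eye but is overwhelmingly unlikely under correct 50/50 assignment; the second is well within ordinary sampling noise. Teams typically alert on a very small threshold because the check runs on every experiment.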

Interference: The Broken Counterfactual

An A/B test relies on the control group being a valid counterfactual, what would have happened to the treated users had they not been treated. Formally this is the Stable Unit Treatment Value Assumption (SUTVA) of the causal-inference literature: a unit’s outcome depends only on its own assignment, not on anyone else’s. It breaks down in several important settings.

Two-sided marketplaces. When a treatment makes buyers more effective at booking scarce listings, the treated buyers book more, but the listings they take are no longer available to the control buyers, who now book less. Treatment did not create supply; it reallocated a fixed pool from control to treatment. The measured lift is partly a transfer, not a gain, and can shrink sharply or disappear at 100% rollout when there is no untreated group left to borrow from. Controlled experiments on a large online marketplace have demonstrated this kind of contamination and shown how it inflates marketplace results.³
Social networks. If a feature is more engaging, treated users may message, share with, or otherwise activate their control-group friends, lifting the control and shrinking the measured difference (an underestimate), or the reverse. Either way the arms are no longer independent. Large social platforms have documented this extensively.⁴
Shared finite resources. Treatment and control draw on the same ad budget, inventory, caching tier, or rate limiter. Heavy use by one starves the other, coupling the arms through the backend.

Under interference there is no single “the effect”; what gets measured depends on the split ratio, exposure pattern, and rollout regime. Without specifying those, the usual individual-level treatment effect is not the quantity the business needs, and it does not extrapolate cleanly to full deployment.
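The marketplace case can be sketched with a toy model in which supply is fixed and binding. Everything here is hypothetical (the attempt rates, buyer count, supply, and the function name); the only point is that a user-level split can show a healthy lift while total bookings do not move at full rollout:

```python
import random

def simulate_market(treat_share, n_buyers=10_000, supply=3_000, seed=0):
    """Toy market: buyers attempt to book, and scarce listings go to a
    random subset of those who attempt. All parameters are illustrative."""
    rng = random.Random(seed)
    treated = [rng.random() < treat_share for _ in range(n_buyers)]
    # Assumed behavior: treatment raises the chance a buyer attempts to book.
    attempts = [i for i, is_t in enumerate(treated)
                if rng.random() < (0.60 if is_t else 0.50)]
    rng.shuffle(attempts)
    winners = attempts[:supply]  # only `supply` attempts can succeed
    booked_t = sum(treated[i] for i in winners)
    booked_c = len(winners) - booked_t
    n_t = sum(treated)
    n_c = n_buyers - n_t
    return {
        "treated_rate": booked_t / n_t if n_t else None,
        "control_rate": booked_c / n_c if n_c else None,
        "total_bookings": len(winners),
    }

split = simulate_market(0.5)        # the A/B test
all_control = simulate_market(0.0)  # world with no launch
all_treated = simulate_market(1.0)  # world after 100% rollout

lift = split["treated_rate"] / split["control_rate"] - 1
print(f"measured lift in the 50/50 test: {lift:+.1%}")
print(f"total bookings, nobody treated:   {all_control['total_bookings']}")
print(f"total bookings, everyone treated: {all_treated['total_bookings']}")
```

In the split, treated buyers win a larger share of the same listings, so the readout is positive. In both full-rollout worlds total bookings equal the supply cap, so the launch delivers nothing at the market level.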

The Wrong Quantity

An experiment can be internally valid and still estimate the wrong quantity for launch. The mismatch usually comes from scope: the metric is too narrow, the population is too narrow, the time window is too short, or the context is too special.

Metric Scope: Cannibalization and Hidden Costs

A feature can lift a local metric without improving the global one by diverting activity that would have happened anyway from another surface the experiment does not count. This cannibalization is among the easiest versions of the gap to build in, because nothing flags a metric boundary drawn too tightly.

A new home-screen carousel increases clicks on the carousel, the success metric, but the clicks come from the search box and the navigation below it. Sessions, orders, and revenue are flat: engagement has merely moved from one box to another while the scorecard records a win. Push notifications lift app opens while suppressing organic visits that would have happened anyway. A prominent promotion lifts promoted-item sales while shrinking full-margin sales of substitutes. In each case the local estimate is honest and the global effect is a wash or a loss. Cannibalization is a scope error: the metric’s boundary is drawn too tightly around the feature, so internal transfers look like external gains.

Hidden costs are the same mistake in the opposite direction: the benefit is inside the metric, while the cost is outside it. Many features have a tax: latency, payload weight, memory, a background job, an extra network round trip. Individually each tax is usually too small to register against an engagement win, so it ships. Collectively the taxes compound into a product that is measurably slower. Large-scale experiments at major search and commerce sites have found that even small latency increases, often on the order of 100 ms, can hurt revenue or engagement.⁵ The defense is an Overall Evaluation Criterion (OEC) broad enough to net out internal transfers, plus guardrails on the adjacent surfaces and system costs a feature may affect.

Population Scope: Triggered Users Are Not Everyone

A triggered analysis reports the effect for users who actually reached the change, while the business outcome is spread over everyone. This is dilution: when only a fraction of users trigger a feature, the company-wide lift is, as a first approximation, the triggered effect scaled by that fraction. A +5% conversion lift among the 30% who reach a new checkout step is closer to +1.5% over all traffic.
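The arithmetic behind that first approximation is worth making explicit, because it hides an assumption: the scaling factor is the triggered users' share of the baseline metric, which equals their share of users only if they convert at the same rate as everyone else. A small sketch, with hypothetical function names and a hypothetical second scenario:

```python
import math

def share_of_baseline(triggered_user_share, relative_conversion_rate=1.0):
    """Fraction of baseline conversions that come from triggered users.
    relative_conversion_rate is the triggered users' baseline rate divided
    by the non-triggered users' rate (1.0 means identical rates)."""
    t = triggered_user_share * relative_conversion_rate
    return t / (t + (1 - triggered_user_share))

def diluted_lift(triggered_lift, triggered_share_of_baseline):
    """Overall relative lift implied by a triggered-only relative lift."""
    return triggered_lift * triggered_share_of_baseline

# The example from the text: +5% among the 30% who trigger, same base rate.
simple = diluted_lift(0.05, share_of_baseline(0.30))
# Hypothetical variant: triggered users convert at twice the rate of others.
weighted = diluted_lift(0.05, share_of_baseline(0.30, 2.0))

print(f"same base rates:            {simple:+.2%}")
print(f"triggered users convert 2x: {weighted:+.2%}")
```

When triggered users are heavier converters than the rest, the diluted lift is larger than the user share alone suggests, and smaller when they are lighter.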

The distinction matters because different analyses answer different questions. An intent-to-treat readout estimates the effect of assigning eligible users to treatment, including those who never encounter the changed surface. A triggered readout is useful for diagnosing the feature mechanism, but it is a reporting lens rather than the whole launch forecast. The trigger must also be defined carefully: if treatment itself changes who becomes “triggered,” conditioning on triggered users can introduce selection bias. A good readout reports both the triggered effect and the diluted overall effect, and states which one is being used for the launch forecast.

Time Scope: Novelty, Primacy, and Long-Term Reversal

The time qualifier separates a short measurement window from the long run. A two-week experiment estimates a two-week effect, and user behavior in the first two weeks is often unrepresentative of the steady state.

Novelty effect. A change draws interaction simply because it is new. Users click the unfamiliar button to see what it does; the curiosity fades. The short-term measurement is inflated accordingly.
Primacy (learning) effect. The mirror image: users are accustomed to the old design and need time to adapt, so a genuinely better treatment underperforms at first and improves as people learn it. Here the short-term measurement is deflated, and a good change may be wrongly killed.

A redesigned navigation tab might get a burst of exploratory clicks in week one even if it does not make the product better. The opposite can happen when a cleaner checkout flow initially slows returning users who have memorized the old sequence, then outperforms once the new path is learned. Both show up as a trend in the effect over time rather than a flat line, and can be probed by segmenting new versus tenured users or by running longer and inspecting the daily effect series.
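One rough way to inspect a daily effect series is to compare early and late averages and fit a simple trend line. The series below is invented for illustration (a decaying lift that would suggest novelty), and the helper name is hypothetical:

```python
def ols_slope(ys):
    """Least-squares slope of ys against day index 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

# Hypothetical daily relative lift over a 14-day experiment.
daily_lift = [0.080, 0.065, 0.055, 0.048, 0.040, 0.036, 0.031,
              0.029, 0.026, 0.025, 0.023, 0.022, 0.021, 0.020]

week1 = sum(daily_lift[:7]) / 7
week2 = sum(daily_lift[7:]) / 7
slope = ols_slope(daily_lift)

print(f"mean lift, week 1: {week1:+.2%}")
print(f"mean lift, week 2: {week2:+.2%}")
print(f"trend per day:     {slope:+.4f}")
```

A negative slope with a week-two mean well below week one points toward novelty; the reverse pattern points toward primacy. Daily estimates are noisy, so a real analysis would put uncertainty intervals on both the daily points and the trend before drawing conclusions.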

The divergence is sharpest when the short-term effect and the long-term effect have opposite signs. The textbook case is ad load: showing more ads, or more aggressive ones, can raise near-term revenue. But long-horizon studies show that heavier ad load can induce “ad blindness”: users learn to ignore ads and click less over time, so a revenue gain this week can decay into a durable loss.⁶ Features that trade future engagement for immediate conversion often fall here. A short experiment usually cannot reveal a sign flip that only appears after weeks of user learning; catching it takes a tool built for the long horizon.

Context Scope: External Validity and Drift

Even a clean estimate holds for the conditions under which it was taken. A test run during a holiday peak, a marketing push, or a competitor’s outage measures an effect entangled with that moment, and the number need not survive into ordinary weeks. This is a failure of external validity: seasonality, promotional calendars, and a shifting traffic mix make any fixed window an unrepresentative sample of the deployment it is meant to predict. The defense is measurement across several representative periods, or a long-term holdout that spans them.

Effects That Do Not Compose

The Non-Additivity Problem

Even with every effect estimated without bias, the individual results cannot simply be summed. Three checkout improvements that each measured +1% in isolation should not be expected to combine to +3%; shipped together they might be worth +1.4%.

Effects are non-additive for structural reasons:

Overlapping mechanisms. If two features both nudge the same hesitant users over the same purchase threshold, each one, measured in isolation against a control that lacks both, claims the full credit for those conversions. Run together, they cannot each convert the same user twice. Their incremental contributions overlap, so the naive sum double-counts.
Ceilings and saturation. Many metrics are bounded (a user converts at most once per session; attention is finite). As improvements stack up, each subsequent one operates on a smaller pool of remaining headroom, and marginal returns diminish.
The winner’s curse, again. Every addend in the sum is itself an inflated winner, so the sum compounds the same selection bias.

“Annualized impact” decks built by summing experiment readouts are therefore systematically optimistic, often by a large factor. A direct way to know the combined effect of many changes is to measure them combined, which is what a holdout is meant to do.
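The overlapping-mechanisms argument can be made concrete with a toy model. Assume, purely for illustration, that a fixed pool of hesitant sessions exists and that each feature independently converts a fixed fraction of whatever part of that pool is still unconverted:

```python
import math

hesitant_share = 0.02  # hypothetical: 2% of sessions are persuadable
convert_prob = 0.5     # hypothetical: each feature alone converts half of them
n_features = 3

# Measured alone against a control lacking all three features,
# each feature gets credit for its full share of the pool.
isolated_gain = hesitant_share * convert_prob
naive_sum = n_features * isolated_gain

# Shipped together, a session can only convert once, so the features
# compete for the same pool instead of adding up.
combined_gain = hesitant_share * (1 - (1 - convert_prob) ** n_features)

print(f"each feature in isolation: +{isolated_gain:.2%} (absolute)")
print(f"naive sum of three:        +{naive_sum:.2%}")
print(f"all three together:        +{combined_gain:.2%}")
```

Under these assumptions each feature measures one percentage point alone, the deck would claim three, and the combination delivers 1.75. The exact numbers depend entirely on the assumed pool and conversion fraction; the shortfall relative to the naive sum does not.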

Interaction Effects

Non-additivity is about what happens when shipped changes are combined. Interaction effects are the measurement-time version: one running experiment can change the context in which another experiment is estimated.

Large experimentation platforms run hundreds of experiments concurrently by overlapping them on the same traffic. The implicit assumption is that effects are roughly orthogonal, that experiment A’s effect is the same whether or not a unit is also in experiment B. Usually this holds, and large-scale practice suggests strong interactions are rarer than feared.⁷ But they exist: two experiments that both restyle the same button, or two recommendation models that compete for the same slot, can conflict so that the combined effect is nothing like the sum of the two measured in isolation. The danger is subtle: an undetected antagonistic interaction means each per-experiment readout measures a context (the presence of the other treatment) that may not exist at launch. This is why mature platforms use isolation groups for risky changes and run automated interaction detection across the concurrent slate.
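When two experiments overlap on the same traffic, the four cells of the resulting 2×2 layout give a direct interaction estimate. The cell means below are hypothetical and chosen to show an antagonistic case:

```python
import math

# Hypothetical mean conversion rate in each cell of two overlapping tests.
cell_mean = {
    ("control", "control"): 0.100,  # neither treatment
    ("A", "control"):       0.104,  # only experiment A's treatment
    ("control", "B"):       0.103,  # only experiment B's treatment
    ("A", "B"):             0.102,  # both treatments
}

# Effect of A when B is absent, and when B is present.
effect_a_without_b = cell_mean[("A", "control")] - cell_mean[("control", "control")]
effect_a_with_b = cell_mean[("A", "B")] - cell_mean[("control", "B")]

# The interaction is the difference between those two effects.
# Zero means the experiments are orthogonal; negative means they conflict.
interaction = effect_a_with_b - effect_a_without_b

print(f"effect of A without B: {effect_a_without_b:+.3f}")
print(f"effect of A with B:    {effect_a_with_b:+.3f}")
print(f"interaction:           {interaction:+.3f}")
```

Here A helps on its own but slightly hurts once B is present, which a pooled per-experiment readout would blur into a single averaged number. A production check would also need a standard error for the interaction, since each cell holds only a fraction of the traffic.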

The Measurement System

Analyzing each test correctly is necessary but not sufficient. The measurement system has to answer a sequence of launch decisions: what counts as success, what invalidates a readout, what blocks launch, what the roadmap delivered in aggregate, and whether a selected win is stable enough to trust.

The remedies mirror the failures:

Failure mode	Why wins underdeliver	Main defense
Winner’s curse	Selected winners overstate their true effects	Shrinkage, splitting, replication
Instrumentation bugs	The experiment readout is not trustworthy	SRM and other validity checks
Interference	Treatment and control affect each other	Cluster, switchback, or resource-split designs
Cannibalization	Local metrics count internal transfers as gains	A broad OEC and guardrails on adjacent surfaces
Dilution	A triggered-only effect shrinks once spread over all users	Reporting triggered and overall effects
Hidden costs (performance drag)	Small per-feature costs compound into a slower product	Latency and performance guardrails
Novelty and primacy	First-weeks behavior misrepresents the steady state	Longer runs, segmenting new versus tenured users
Short- versus long-term divergence	The durable effect can differ in sign from the short-term one	Long-term holdouts, reverse experiments
External validity	A single window is an unrepresentative sample of deployment conditions	Runs across representative periods, long-term holdouts
Non-additivity	Individually measured effects do not sum	Global holdouts
Interactions	Concurrent experiments change each other’s measured context	Isolation groups, interaction detection

The Right Yardstick: The OEC

The OEC answers the first decision: what does “win” mean? It is the metric, or principled combination of metrics, used as the experiment’s decision rule because moving it is expected to improve long-term business value.⁸ It is broader than whichever local metric a feature team happens to optimize. Drawn at the right scope, it nets out internal transfers: sessions or successful bookings are harder to game than carousel clicks. Because retention or lifetime value is too slow to measure per experiment, teams rely on short-term proxies; the hard part is validating that the proxy still predicts the objective.

Guardrails and Validity Checks

Guardrails answer the launch-blocking question: what must not get worse? They protect metrics such as page-load latency, crash and error rates, unsubscribe and opt-out rates, and top-line revenue. A business guardrail can block launch regardless of the primary metric, turning diffuse costs into explicit trip-wires.

A separate class of validity checks answers whether the experiment can be trusted at all: SRM, cache-hit ratios, logging completeness, assignment balance, and bot-filter stability. These do not protect the business outcome directly; they police the measurement system. Both classes matter, but they answer different questions: “Did this hurt something important?” and “Is this readout valid?”⁹

Global and Long-Term Holdouts

Global holdouts answer the portfolio question: what did the shipped roadmap deliver in aggregate? The direct cure for sum-of-the-parts and long-term decay is aggregate measurement: everything shipped, over a long horizon, against a clean baseline. A global holdout reserves a small, randomly chosen slice of users, such as 1%, who are kept out of launches in a domain for an extended period. Everyone else gets the full stream of shipped features.

The difference between the all-features population and the holdout is the net cumulative impact of the roadmap, with overlaps, cannibalization, interactions, performance drag, and novelty decay observed together. A global holdout estimates the roadmap’s aggregate effect, not the value of each feature inside it. This is often smaller than the sum of the individual experiment wins, and that discrepancy is the launch gap, measured directly.
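As a sketch of that comparison, the launch gap is the summed forecast minus the holdout-based estimate. All figures below are invented for illustration, and the forecasts are summed naively on purpose, since that is how impact decks are usually built:

```python
import math

# Hypothetical relative lifts forecast from five shipped experiment readouts.
experiment_forecasts = [0.012, 0.008, 0.015, 0.010, 0.006]

# Hypothetical per-user metric means at the end of the holdout period.
holdout_mean = 10.00  # users kept out of every launch
exposed_mean = 10.22  # users who received the full roadmap

forecast_total = sum(experiment_forecasts)
holdout_lift = exposed_mean / holdout_mean - 1
launch_gap = forecast_total - holdout_lift

print(f"sum of experiment forecasts: {forecast_total:+.1%}")
print(f"holdout-measured lift:       {holdout_lift:+.1%}")
print(f"launch gap:                  {launch_gap:+.1%}")
```

Because the holdout is small, its estimate carries a wide interval; a real readout would report that uncertainty alongside the point estimate before attributing the whole gap to the failure modes above.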

The cost is real. A slice of users forgoes new features, traffic is tied up for months, and the holdout experience can become stale. Holdouts also need explicit exceptions for security fixes, compliance changes, abuse prevention, serious bug fixes, and changes where withholding launch would create clear user harm. Those costs are why holdouts are reserved for whole programs, not individual changes.

Reverse Experiments (Holdbacks)

Reverse experiments answer the settled-value question: after novelty and learning have mostly played out, does the feature still matter? A holdout is set up before launches; a reverse experiment (also called a holdback or give-back) is set up after. A feature is shipped to 100%, allowed to settle, and then a fraction of users is moved back to the old experience. By then, novelty has decayed and learning has largely completed, so the measured difference is closer to steady-state impact. The one new transient is the holdback group’s adjustment to losing a familiar feature, so the comparison is read after that reaction settles. A reverse experiment also re-measures a selected launch on fresh data, less tied to the lucky draw that may have helped the original win.

Designing Around Interference

When SUTVA fails, the design question is not “which regression fixes this?” but “what unit can be randomized without contaminating the counterfactual?” The unit is chosen to contain the spillover, so that interference happens within arms rather than across them:

Cluster randomization assigns whole network communities, ego-clusters, or geographic regions to the same arm, so social and resource spillovers stay inside a cluster. Ego-cluster designs, which group each user with their immediate neighbors, are a representative approach.¹⁰
Switchback experiments randomize time intervals for an entire marketplace (the whole city is treated from 1 to 2pm, control from 2 to 3pm), which is a common tool for two-sided platforms where user-level splits contaminate shared supply.
Budget- or inventory-split designs physically partition the contended resource between arms so neither can starve the other.

These designs trade statistical power (there are far fewer independent clusters than users) for validity. In settings where user-level randomization would produce a precise but wrong number, that loss of power is the price of estimating the right quantity.
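Mechanically, cluster and switchback designs change only the unit that gets hashed into an arm. A minimal sketch using a salted hash, where the salt, the identifier formats, and the function name are all hypothetical:

```python
import hashlib

def assign_arm(unit_id: str, salt: str = "exp-checkout-v2") -> str:
    """Deterministically map a randomization unit to an arm.
    The unit can be a cluster ID or a city-and-time-window key."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Cluster randomization: every user inherits their cluster's arm,
# so spillovers between cluster-mates stay inside one arm.
user_cluster = {"u1": "cluster-17", "u2": "cluster-17", "u3": "cluster-42"}
user_arm = {user: assign_arm(cluster) for user, cluster in user_cluster.items()}

# Switchback: the unit is a whole market during one time window.
window_arm = {
    window: assign_arm(f"city-A:{window}")
    for window in ("2023-08-06T13", "2023-08-06T14", "2023-08-06T15")
}

print(user_arm)
print(window_arm)
```

The analysis then has to treat the cluster or window, not the user, as the unit of observation, which is where the loss of power comes from. Real switchback schedules often use balanced or alternating patterns rather than independent hashing per window.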

Debiasing the Winner’s Curse

Debiasing answers the forecast question: after selecting on a noisy win, what impact should the launch plan expect? Empirical-Bayes shrinkage pulls extreme estimates toward the prior mean by an amount that grows with their noise, directly counteracting regression to the mean.¹¹ Experiment splitting offers a simple alternative: one half of the data selects the winner, and the held-out half estimates its effect, so the estimate is no longer conditioned on the same data that selected it.¹²
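The core of shrinkage fits in a few lines. The sketch below is the textbook normal-normal posterior mean, not the specific estimator from the cited work, and the prior mean and spread are assumed values that a real system would estimate from its own history of experiments:

```python
import math

def shrink(measured: float, se: float,
           prior_mean: float, prior_sd: float) -> float:
    """Posterior mean of the true effect under a normal prior and
    normal sampling noise. Noisier estimates are pulled harder."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return prior_mean + weight * (measured - prior_mean)

# Hypothetical prior fitted to past experiments: small effects on average.
prior_mean, prior_sd = 0.004, 0.010

# A shipped winner that measured +0.050 with a standard error of 0.020.
noisy = shrink(0.050, 0.020, prior_mean, prior_sd)
# The same readout from a much larger test (standard error 0.005).
precise = shrink(0.050, 0.005, prior_mean, prior_sd)

print(f"measured +0.0500, se 0.020 -> forecast {noisy:+.4f}")
print(f"measured +0.0500, se 0.005 -> forecast {precise:+.4f}")
```

The noisy winner is pulled most of the way back toward the prior, while the precise one keeps most of its measured lift. The forecast is only as good as the prior, so the prior should be refit as the portfolio changes.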

Process choices also help. Adequate statistical power up front leaves less room for noise to decide which experiments win, and pre-registered metrics keep selection from ranging across a dozen outcomes until one crosses the line. Each shrinks the curse before any correction is applied.

Longer Runs and Replication

Longer runs and replication answer the stability question: is this effect durable enough to launch? Running longer lets novelty and primacy effects play out and turns a single point estimate into a trend. Replication is the empirical antidote to both the winner’s curse and Twyman’s law: a true effect tends to reproduce, while a lucky draw or logging artifact usually does not. In a culture that ships on $p < 0.05$, a second measurement is often the simplest way to find out whether the first one was real.

What Survives Launch

Winning experiments are still evidence, and usually the best evidence a product team has. The mistake is treating a local, short-term causal estimate as if it were automatically the durable value of a launch. Real impact depends on the portfolio around the change, the metric’s scope, the time horizon, the users and resources that interfere with one another, and the costs the experiment did not score. A trustworthy experimentation system does more than find winners: it measures what survives contact with launch.

References

1. Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), 2017. 1517–1525.
2. Aleksander Fabijan, Jayant Gupchup, Somit Gupta, Jeff Omhover, Wen Qin, Lukas Vermeer, and Pavel A. Dmitriev. 2019. Diagnosing sample ratio mismatch in online controlled experiments: A taxonomy and rules of thumb for practitioners. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), 2019. 2156–2164.
3. Thomas Blake and Dominic Coey. 2014. Why marketplace experimentation is harder than it seems: The role of test-control interference. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC ’14), 2014. 567–582.
4. Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 2015. 2227–2236.
5. Ron Kohavi, Alex Deng, Roger Longbotham, and Ya Xu. 2014. Seven rules of thumb for web site experimenters. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14), 2014. 1857–1866.
6. Henning Hohnhold, Deirdre O’Brien, and Diane Tang. 2015. Focusing on the long-term: It’s good for users and business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 2015. 1849–1858.
7. Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13), 2013. 1168–1176.
8. Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. 2009. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181.
9. Pavel A. Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Jason Vaz. 2017. A dirty dozen: Twelve common metric interpretation pitfalls in online controlled experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), 2017. 1427–1436.
10. Guillaume Saint-Jacques, Maneesh Varshney, Jeremy Simpson, and Ya Xu. 2019. Using ego-clusters to measure network effects at LinkedIn. arXiv preprint arXiv:1903.08755 (2019).
11. Minyong R. Lee and Milan Shen. 2018. Winner’s curse: Bias estimation for total effects of features in online controlled experiments. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18), 2018. 491–499.
12. Dominic Coey and Tom Cunningham. 2019. Improving treatment effect estimators through experiment splitting. In The World Wide Web Conference (WWW ’19), 2019. 285–295.

Boyang Yue