How robust are experimental results to changes in design? And can researchers anticipate which changes matter most? We consider a specific context, a real-effort task with multiple behavioral treatments, and examine the stability of the results along six dimensions: (i) pure replication; (ii) demographics; (iii) geography and culture; (iv) the task; (v) the output measure; and (vi) the presence of a consent form. We use the rank-order correlation across treatments as a measure of stability, and compare the observed correlation both to the correlation under a benchmark of full stability (which allows for noise) and to expert forecasts. The academic experts expect that the pure replication will be close to perfect, that the results will differ sizably across demographic groups (age/gender/education), and that changes to the task and output will have a further impact. We find near-perfect replication of the experimental results and full stability of the results across demographics, significantly higher than the experts expected. The results differ considerably across changes to the task and output measure, mostly because the task change adds noise to the findings. The results are also stable to the absence of a consent form. Overall, the full-stability benchmark is an excellent predictor of the observed stability, while expert forecasts are not very informative. This suggests that researchers’ predictions about external validity may be less informative than they expect. We discuss the implications of both the methods and the results for conceptual replication.
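To make the stability measure concrete, the following is a minimal sketch of how a rank-order (Spearman) correlation across treatments could be compared with a noise-adjusted full-stability benchmark. It is an illustration under stated assumptions, not the authors' procedure: the function names, the normal sampling noise, and the simulation approach are all hypothetical choices made here, and any numbers passed in would be placeholders rather than results from the paper.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def full_stability_benchmark(true_effects, se_a, se_b, draws=2000, seed=0):
    """Average rank correlation expected if both designs share the same
    true treatment effects and differ only by sampling noise.

    Assumption (hypothetical): noise is normal with standard errors
    se_a and se_b in the two designs.
    """
    rng = random.Random(seed)
    simulated = []
    for _ in range(draws):
        design_a = [rng.gauss(effect, se_a) for effect in true_effects]
        design_b = [rng.gauss(effect, se_b) for effect in true_effects]
        simulated.append(spearman(design_a, design_b))
    return mean(simulated)
```

In use, one would pass the treatment-level means from the original and the modified design to `spearman`, and compare the result with `full_stability_benchmark` evaluated at the estimated effects and standard errors. An observed correlation close to the benchmark would indicate that the remaining disagreement in rankings is about what sampling noise alone would produce.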
