Assessing heterogeneity and power in replications of psychological experiments

Abstract

In this study, we reanalyze recent empirical research on replication from a meta-analytic perspective. We argue that there are different ways to define ‘replication failure,’ and that analyses can either explore variation among replication studies or assess whether their results contradict the findings of the original study. We apply this framework to a set of psychological findings that have been replicated and assess the sensitivity of these analyses. We find that tests for replication that involve only a single replication study are almost always severely underpowered. Among the 40 findings for which ensembles of multisite direct replications were conducted, between 11 and 17 ensembles (28% to 43%, depending on how replication is defined) produced heterogeneous effects. This heterogeneity could not be fully explained by the moderators documented in replication research programs. We also find that these ensembles were not always well powered to detect potentially meaningful levels of heterogeneity. Finally, we identify several discrepancies between the results of original studies and the distributions of effects found by multisite replications, but note that these analyses also have low power. We conclude by arguing that efforts to assess replication would benefit from further methodological work on designing replication studies so that analyses are sufficiently sensitive.
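As a rough illustration of the sensitivity issue (not the paper's own analysis), the sketch below computes the power of Cochran's Q test to detect between-site heterogeneity, assuming equal within-site sampling variances and normally distributed effect estimates; all parameter values (20 sites, n = 100 per group, tau = 0.1 in standardized-mean-difference units) are hypothetical.

    # Minimal sketch: power of Cochran's Q test for heterogeneity across k sites,
    # assuming equal within-site variance v and normal effect estimates.
    from scipy import stats

    def q_test_power(k, tau2, v, alpha=0.05):
        # Under these assumptions, Q ~ (1 + tau2/v) * chi-square(k - 1),
        # so power is the upper-tail probability of the scaled chi-square
        # exceeding the alpha-level critical value.
        crit = stats.chi2.ppf(1 - alpha, df=k - 1)
        return stats.chi2.sf(crit / (1 + tau2 / v), df=k - 1)

    # Hypothetical example: 20 sites, v ~ 2/n for a standardized mean
    # difference near zero with n = 100 per group, and tau = 0.1.
    print(q_test_power(k=20, tau2=0.1**2, v=2 / 100))  # roughly 0.4

Under these assumed values, even a 20-site ensemble has only modest power against a heterogeneity level that many would consider meaningful, which is the kind of sensitivity shortfall the abstract describes.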

Publication
Psychological Bulletin