Reconsidering statistical methods for assessing replication
Abstract: Recent empirical evaluations of replication in psychology have reported startlingly few successful replication attempts. At the same time, they have noted that the proper way to analyze replication studies is far from a settled matter and have thus analyzed their data in several different ways. This presents two challenges to interpreting the results of these programs. First, different analysis methods assess different operational definitions of replication. Second, the properties of these methods are not necessarily common knowledge; it is possible for a successful replication to be deemed a failure by nearly all of the metrics used, and it is not always immediately clear how likely such errors are to occur. In this article, we describe the methods commonly used in replication research and how they imply specific operational definitions of replication. We then compute the probabilities of false failure determinations (i.e., concluding that a successful replication has failed) and false success determinations. These error probabilities are shown to be high (often over 50%) and in many cases uncontrolled. We then demonstrate that such errors are probable for the data to which these methods have been applied in the literature. We show that the probability that some conclusions in the literature about replication are incorrect can be as high as 75-80%.
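As a rough illustration of what a false failure probability looks like, the sketch below takes one common operational definition, namely that a replication "succeeds" when its estimate is statistically significant in the same direction as the original, and computes how often a replication with the same true effect would nonetheless be declared a failure. This is a minimal sketch under stated assumptions, not the computation reported in the article: the normal approximation, the function name `false_failure_probability`, and the parameters `theta` (true effect) and `se` (standard error of the replication estimate) are all hypothetical choices made for illustration.

```python
from statistics import NormalDist


def false_failure_probability(theta, se, alpha=0.05):
    """Probability that a replication is NOT significant in the positive
    direction, given that its true effect is theta (assumed positive).

    Assumes the replication estimate is normally distributed with mean
    theta and standard error se, and that "success" means a two-sided
    test at level alpha rejects with a positive estimate.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # The replication fails when estimate / se <= z_crit,
    # which happens with probability Phi(z_crit - theta / se).
    return std_normal.cdf(z_crit - theta / se)


if __name__ == "__main__":
    # Hypothetical effect sizes, each with a hypothetical standard error of 0.2.
    for theta in (0.2, 0.4, 0.6):
        p = false_failure_probability(theta, se=0.2)
        print(f"theta={theta:.1f}  P(false failure)={p:.3f}")
```

Under this definition the false failure probability is simply one minus the power of the replication study, so it is governed by the ratio of the true effect to the standard error rather than being fixed at a nominal level such as `alpha`.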