Empirical evaluations of replication have become increasingly common, ranging from systematic attempts at one-off replications (e.g., Open Science Collaboration, 2015) to the Many Labs approach, in which multiple labs independently run the same experiment (Klein et al., 2014). The design of such programs has largely contended with difficult questions about which experimental components are necessary for a set of studies to be considered replicates. However, another important consideration is that replicate studies be designed to support sufficiently sensitive analyses. For instance, if hypothesis tests are to be conducted about replication, studies should be designed to ensure these tests are well-powered; if they are not, it can be difficult to determine conclusively whether replication attempts succeeded or failed. This paper describes methods for designing ensembles of replication studies to ensure that they are both adequately sensitive and cost-efficient. It describes two potential analyses of replication studies (hypothesis tests and variance component estimation) and approaches to obtaining optimal designs for each. Using these results, it assesses the sensitivity and optimality of the Many Labs design and finds that while it may have been sufficiently powered to detect some larger differences between studies, other designs would have been less costly or more sensitive (or, in some cases, both).