We take a very different approach to replications than previous initiatives. Here is a brief summary of how we define our terms and think about replication:
We are interested in effects, like "Prozac helps people with depression", "LK-99 is a superconductor", or "people drink less from red-labeled cups than from blue-labeled cups."
We define a replication as "an experiment which is done to test an effect claim made in prior research" (closely following Nosek & Errington, 2020).
The replication experiment may use methods very similar to the prior experiment, or completely different methods. The important thing is whether the effect being tested for is the same, not the particular methods used to test for the effect.
We classify the results of replication experiments into four categories.
The classification we just mentioned says nothing about the magnitude of an effect, and our first analysis is not concerned with magnitude. We do, however, pull down data on effect magnitudes when possible. Comparing the effect magnitudes measured by different experiments can be tricky: for instance, one study might test whether Prozac helps with depression using the Beck Depression Inventory administered after three weeks, while another might use the Hamilton Depression Rating Scale administered at three months. Where possible, we compare experimental findings using a standard scale or a standardized effect size measure such as Cohen's d.
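To make the comparison concrete, here is a minimal Python sketch of how Cohen's d is computed from two groups of scores. The function name and the illustrative numbers are hypothetical, not taken from any actual study:

```python
import math

def cohens_d(group_a, group_b):
    """Standardized mean difference (Cohen's d) between two samples,
    using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Bessel-corrected sample variances.
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    # Pool the two variances, weighted by degrees of freedom.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical depression-scale scores (lower = less depressed).
treatment = [12, 9, 15, 11, 8, 13]
placebo = [18, 16, 21, 17, 19, 15]
print(f"Cohen's d = {cohens_d(treatment, placebo):.2f}")  # negative: treatment scored lower
```

Because the raw mean difference is divided by the pooled standard deviation, results measured on different instruments (e.g., Beck vs. Hamilton) end up in comparable, unitless terms.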
As just mentioned, a replication must involve a new experiment, so we do not consider re-analyses of previously published data to be replications. We note that previous authors have called such reanalyses "technical replications". Our database does not currently contain technical replications, although it may in the future.
We are interested in the discovery of mistakes through such reanalyses, but we consider those "errata". In the future we may create a separate database for mistakes, errors, etc.
Our definition of a "replication experiment" is very broad, and some may think it too broad. Researchers discussing replication generally distinguish two or more forms of replication, such as:
"Technical" replication (also called or "robustness checking") - where a new experiment is not done, but raw data from an existing experiment is reanalyzed using the reported procedures. Or, it may involve simply running provided code on data to get results (this is called "frictionless" reproduction).
"Exact", "direct", or "narrow-sense" replication - where an experimental procedure is repeated as closely as possible, usually following the specifications for the procedure given in the original paper. This is the most common understanding of the term "replication".
"Close" or "systematic" replication - where an experimental procedure is repeated closely, but with one or more intentional changes.
"Conceptual" or "broad-sense" replication - where a finding from a previous experiment is tested in a new experiment using a different experimental procedure.
While all these terms have merit, they are hard to define precisely, with the exception of "technical replication", which we don't really view as a replication since no new experiment is done.
Clearly, replication is a spectrum, ranging from "direct"/"exact" replication to "broad-sense" (conceptual) replication. Instead of trying to distinguish different degrees of replication, we consider any experiment on the spectrum to be a replication. This simplifies our work greatly and makes it easier to scale our database.
At the end of the day, we are most interested in whether scientists' claims hold up when subjected to additional experimental tests. If science is healthy, claims should hold up. If science is unhealthy, claims will fail to replicate (due to methodological, logical, or mathematical errors, other mistakes, improper reporting, intentional fraud, etc.).
We typically define the "original experiment" as the earliest published experiment that resulted in the claim that an effect exists, and a "replication experiment" as any experiment published afterwards that tested for the same effect.
We believe the proper level of analysis is effects, not papers. While it is often true that a scientific paper has one central claim, most papers report many effects, and a replication paper may replicate some of those effects while failing to replicate others.
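To illustrate both conventions at once (original vs. replication experiments, and effect-level rather than paper-level bookkeeping), here is a minimal Python sketch. The DOIs, effect names, and result labels are placeholders we made up; in particular, "positive"/"no effect" are not our actual four result categories:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    doi: str        # paper reporting the experiment
    year: int       # publication year
    outcomes: dict  # effect claim -> result label, for each effect tested

def split_original_and_replications(experiments, claim):
    """Among all experiments that tested a given effect claim, treat the
    earliest-published one as the original experiment and every later
    one as a replication experiment."""
    tested = sorted((e for e in experiments if claim in e.outcomes),
                    key=lambda e: e.year)
    return tested[0], tested[1:]

experiments = [
    Experiment("10.1000/xyz-2010", 2010,
               {"effect A": "positive", "effect B": "positive"}),
    # A single replication paper can replicate one effect from a prior
    # paper (effect A) while failing to replicate another (effect B).
    Experiment("10.1000/abc-2015", 2015,
               {"effect A": "positive", "effect B": "no effect"}),
]

original, replications = split_original_and_replications(experiments, "effect B")
print(original.doi)                   # 10.1000/xyz-2010
print([r.doi for r in replications])  # ['10.1000/abc-2015']
```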
Scientific papers often make one or more major claims, like "Prozac helps with depression". In a narrow sense, all an experiment may have shown is that "Prozac helps with depression in people diagnosed with depression by a clinician at a Houston-area hospital system in the years 2010-2015", but based on theoretical considerations the authors claim their work gives strong evidence for the general effect that "Prozac helps with depression". We are most interested in the replicability of the general effects that scientists claim, not narrower readings of an experiment. Other examples of general claims are "neutrinos can travel faster than light", "the MMR vaccine causes autism", "power posing increases success in job interviews", and "water can exist in a polymerized form" (all of these claims failed replication and are now considered false).