Mathematical Methods for Classifying Replication Outcomes
The Replications Database provides three statistical methods for classifying whether a replication attempt was successful. All methods operate on effect sizes that have been converted to Pearson's $r$ (see Effect Size Normalization).
| Method | Question Asked |
|---|---|
| Statistically Significant Effect in the Same Direction? | Is the replication effect statistically significant in the same direction as the original? |
| Original Effect Size in Replication 95% CI | Does the original effect size fall within the replication's 95% confidence interval? |
| Replication Effect Size in Original 95% CI | Does the replication effect size fall within the original's 95% confidence interval? |
Statistically Significant Effect in the Same Direction?
This method evaluates whether the replication study achieves a statistically significant result in the same direction as the original study.
Rationale
The simplest criterion for replication success: if the original study found a significant effect in one direction, a successful replication should also find a significant effect in that same direction. When the original study was not significant, the method checks whether the replication agrees (also non-significant) or disagrees (significant).
Algorithm
This method is inspired by the FReD R package (criterion = "significance_r"), with modifications to handle non-significant originals and to prefer reported $p$-values over computed ones.
Step 1: Determine $p$-Values
For both the original and replication studies, the $p$-value is determined using this priority:
- Use the reported $p$-value from the database (`original_p_value` or `replication_p_value`) if available
- Otherwise, compute the $p$-value from the normalized Pearson $r$ and sample size $n$:

$$t = r \sqrt{\frac{n-2}{1-r^2}}$$

Compute the two-tailed $p$-value from the $t$-distribution with $n-2$ degrees of freedom.
Reported $p$-values are preferred because the conversion from other effect size types (Cohen's $d$, eta-squared, etc.) to Pearson $r$ introduces error, especially with small samples. This can cause computed $p$-values to disagree with reported ones on significance in approximately 10% of cases.
Step 2: Check if the Original Study Was Significant
If $p_{\text{orig}} \geq 0.05$, the original was not significant. In this case, we check whether the replication agrees:
- If the replication is also not significant ($p_{\text{rep}} \geq 0.05$): both studies agree there is no effect → Success
- If the replication is significant ($p_{\text{rep}} < 0.05$): the studies disagree → Failure
Step 3: If the Original Was Significant, Test the Replication
If the original was significant ($p_{\text{orig}} < 0.05$), check the replication's significance and direction consistency:
- Same direction: $\operatorname{sign}(r_{\text{rep}}) = \operatorname{sign}(r_{\text{orig}})$
- Opposite direction: $\operatorname{sign}(r_{\text{rep}}) \neq \operatorname{sign}(r_{\text{orig}})$
Classification
| Condition | Outcome |
|---|---|
| Original not significant ($p \geq 0.05$), replication also not significant ($p \geq 0.05$) | Success |
| Original not significant ($p \geq 0.05$), replication significant ($p < 0.05$) | Failure |
| Original significant, replication significant ($p < 0.05$) with same direction | Success |
| Original significant, replication significant ($p < 0.05$) with opposite direction | Reversal |
| Original significant, replication not significant ($p \geq 0.05$) | Failure |
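A minimal sketch of this decision logic in JavaScript (the language of the jStat-based implementation described below); the function name, argument order, and the `ALPHA` constant are illustrative assumptions, not the project's actual API:

```javascript
// Sketch of the significance-based classification; names are hypothetical.
const ALPHA = 0.05; // conventional significance threshold

function classifyBySignificance(pOrig, pRep, rOrig, rRep) {
  if (pOrig >= ALPHA) {
    // Non-significant original: success means the replication agrees.
    return pRep >= ALPHA ? "success" : "failure";
  }
  if (pRep >= ALPHA) return "failure"; // significant original, non-significant replication
  // Both significant: compare directions of the normalized effects.
  return Math.sign(rOrig) === Math.sign(rRep) ? "success" : "reversal";
}
```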
Original Effect Size in Replication 95% Confidence Interval
This method checks whether the original effect size is a plausible value given the replication results, by testing if it falls within the replication's confidence interval.
Rationale
If the original finding is "true," we would expect the original effect size to be consistent with the replication's estimate. This is operationalized by checking whether the original effect falls within the 95% confidence interval of the replication effect.
This method is implemented consistently with the FReD R package (criterion = "consistency_ci").
Confidence Interval Source
The method uses a two-strategy approach to maximize compatibility with original papers:
Strategy 1 (Primary): Pre-computed CI with Raw Effect Sizes
If the database contains a pre-computed 95% CI for the replication effect size (in the `replication_es_95_CI` column), this CI is compared against the raw original effect size (`original_es`). This matches the methodology used in the original replication studies, where effect sizes and CIs are in their native units (Cohen's d, Hazard Ratio, etc.).
Strategy 2 (Fallback): Computed CI with Normalized Effect Sizes
If no pre-computed CI is available, the CI is computed using the Fisher $z$-transformation method from the normalized Pearson's $r$ values and sample sizes (see Computing Confidence Intervals).
Classification
| Condition | Outcome |
|---|---|
| Original ES within replication 95% CI | Success |
| Original ES outside replication 95% CI | Failure |
| Cannot obtain CI (missing data) | Inconclusive |
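As a sketch, the two strategies and the containment test might look as follows. The `replication_es_95_CI` and `original_es` fields mirror the documented columns; the normalized-effect fields (`replication_r`, `replication_n`, `original_r`) and the `fisherCI` helper (sketched under Computing Confidence Intervals below) are assumptions:

```javascript
// Hypothetical sketch of "original ES in replication 95% CI".
function originalInReplicationCI(record) {
  let ci = record.replication_es_95_CI; // Strategy 1: pre-computed CI, native units
  let es = record.original_es;          // raw original effect size
  if (!ci) {                            // Strategy 2: fall back to normalized r
    ci = fisherCI(record.replication_r, record.replication_n);
    es = record.original_r;
  }
  if (!ci || es == null) return "inconclusive"; // missing data
  return es >= ci.lower && es <= ci.upper ? "success" : "failure";
}
```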
Advantages
- Accounts for uncertainty in the replication estimate
- Does not require significance in either study
- Provides a more nuanced assessment than simple significance testing
- Effect size magnitude matters, not just statistical significance
- When pre-computed CIs are available, results match original paper methodology
Replication Effect Size in Original 95% Confidence Interval
This method checks whether the replication effect size is a plausible value given the original results, by testing if it falls within the original's confidence interval.
Rationale
This is the "mirror" of the previous method. If the replication is measuring the same underlying effect, we would expect the replication effect size to be consistent with the original's estimate. This is operationalized by checking whether the replication effect falls within the 95% confidence interval of the original effect.
This method is particularly useful when the original study had a larger sample size than the replication, giving it a narrower confidence interval.
Confidence Interval Source
The method uses a two-strategy approach to maximize compatibility with original papers:
Strategy 1 (Primary): Pre-computed CI with Raw Effect Sizes
If the database contains a pre-computed 95% CI for the original effect size (in the `original_es_95_CI` column), this CI is compared against the raw replication effect size (`replication_es`). This matches the methodology used in the original replication studies, where effect sizes and CIs are in their native units (Cohen's d, Hazard Ratio, etc.).
Strategy 2 (Fallback): Computed CI with Normalized Effect Sizes
If no pre-computed CI is available, the CI is computed using the Fisher $z$-transformation method from the normalized Pearson's $r$ values and sample sizes (see Computing Confidence Intervals).
Classification
| Condition | Outcome |
|---|---|
| Replication ES within original 95% CI | Success |
| Replication ES outside original 95% CI | Failure |
| Cannot obtain CI (missing data) | Inconclusive |
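The mirrored check is the same sketch with the roles swapped; again, the field names beyond the documented columns are assumptions:

```javascript
// Mirror of originalInReplicationCI: the original supplies the CI,
// the replication supplies the point estimate.
function replicationInOriginalCI(record) {
  let ci = record.original_es_95_CI;  // Strategy 1: pre-computed original CI
  let es = record.replication_es;
  if (!ci) {                          // Strategy 2: compute from normalized r
    ci = fisherCI(record.original_r, record.original_n);
    es = record.replication_r;
  }
  if (!ci || es == null) return "inconclusive";
  return es >= ci.lower && es <= ci.upper ? "success" : "failure";
}
```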
Comparison with "Original in Replication CI"
These two methods can give different results:
- Original in Replication CI asks: "Is the original effect plausible given the replication data?"
- Replication in Original CI asks: "Is the replication effect plausible given the original data?"
The difference matters when sample sizes differ substantially. A small replication study will have a wide CI, making it easy for the original effect to fall within it (high "success" rate). Conversely, if the original study was large with a narrow CI, the replication effect must be very close to the original to fall within it.
Computing Confidence Intervals (Fisher $z$-Transformation)
When pre-computed confidence intervals are not available in the database, they are computed using the Fisher $z$-transformation method.
Algorithm
Step 1: Fisher $r$-to-$z$ Transformation
The sampling distribution of $r$ is not normal, especially for values far from zero. The Fisher transformation converts $r$ to an approximately normally distributed variable $z$:

$$z = \operatorname{artanh}(r) = \frac{1}{2} \ln\!\left(\frac{1+r}{1-r}\right)$$

Step 2: Compute Standard Error in $z$-space
The standard error of $z$ depends only on sample size:

$$SE_z = \frac{1}{\sqrt{n-3}}$$

where $n$ is the sample size. This requires $n > 3$.
Step 3: Compute 95% Confidence Interval in $z$-space

$$z_{\text{lower}} = z - 1.96 \cdot SE_z, \qquad z_{\text{upper}} = z + 1.96 \cdot SE_z$$

Step 4: Inverse Fisher $z$-to-$r$ Transformation
Transform the confidence bounds back to the $r$ scale:

$$r_{\text{lower}} = \tanh(z_{\text{lower}}), \qquad r_{\text{upper}} = \tanh(z_{\text{upper}})$$

This yields asymmetric confidence intervals in $r$-space, which is statistically appropriate since $r$ is bounded by $[-1, 1]$.
Example
Given (illustrative values):
- Original effect: $r_{\text{orig}} = 0.30$
- Replication effect: $r_{\text{rep}} = 0.20$
- Replication sample size: $n_{\text{rep}} = 100$
Computing the replication CI:
- Fisher transform: $z = \operatorname{artanh}(0.20) \approx 0.2027$
- Standard error: $SE_z = 1/\sqrt{100-3} \approx 0.1015$
- CI in $z$-space: $[0.2027 - 1.96 \cdot 0.1015,\ 0.2027 + 1.96 \cdot 0.1015] = [0.0037, 0.4017]$
- CI in $r$-space: $[\tanh(0.0037), \tanh(0.4017)] \approx [0.004, 0.381]$
- Is $r_{\text{orig}} = 0.30$ in $[0.004, 0.381]$? Yes → Success
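A minimal, self-contained sketch of the four steps in JavaScript, assuming $n > 3$; `fisherCI` is a hypothetical helper name:

```javascript
// 95% CI for a correlation via the Fisher z-transformation.
function fisherCI(r, n, zCrit = 1.96) {
  if (!(n > 3) || !(Math.abs(r) < 1)) return null; // SE requires n > 3
  const z = Math.atanh(r);          // Step 1: Fisher r-to-z
  const se = 1 / Math.sqrt(n - 3);  // Step 2: standard error in z-space
  return {                          // Steps 3-4: CI in z-space, back to r
    lower: Math.tanh(z - zCrit * se),
    upper: Math.tanh(z + zCrit * se),
  };
}

// Reproduces the worked example above:
// fisherCI(0.20, 100) → { lower: ≈0.004, upper: ≈0.381 }
```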
Computing $p$-Values from Correlation Coefficients
The significance-based outcome method requires $p$-values for both original and replication studies. When the database contains a reported $p$-value (`original_p_value` or `replication_p_value`), that value is used directly. Otherwise, $p$-values are computed from the normalized Pearson correlation coefficients as described below.
From Correlation to $t$-Statistic
For a Pearson correlation coefficient $r$ computed from $n$ observations, the test statistic

$$t = r \sqrt{\frac{n-2}{1-r^2}}$$

follows a $t$-distribution under the null hypothesis ($\rho = 0$), with $\nu = n - 2$ degrees of freedom.
Computing Two-Tailed $p$-Values
The two-tailed $p$-value is computed from the $t$-distribution cumulative distribution function (CDF). For a $t$-statistic $t$ with $\nu$ degrees of freedom:

$$p = I_x\!\left(\frac{\nu}{2}, \frac{1}{2}\right)$$

where $x = \frac{\nu}{\nu + t^2}$ and $I_x(a, b)$ is the regularized incomplete beta function.
Regularized Incomplete Beta Function
The regularized incomplete beta function is defined as:

$$I_x(a, b) = \frac{1}{B(a, b)} \int_0^x u^{a-1} (1-u)^{b-1} \, du$$

where $B(a, b)$ is the complete beta function.
Implementation
The two-tailed $p$-value is computed using the jStat JavaScript statistical library, which provides a well-tested implementation of the Student's $t$-distribution CDF. Specifically:

$$p = 2 \cdot \left(1 - F_\nu(|t|)\right)$$

where $F_\nu$ is the $t$-distribution CDF with $\nu = n - 2$ degrees of freedom.
Example
Given (illustrative values):
- Correlation: $r = 0.18$
- Sample size: $n = 100$
Computing the $p$-value:
- Compute the $t$-statistic: $t = 0.18 \sqrt{\frac{98}{1 - 0.18^2}} \approx 1.81$
- Degrees of freedom: $\nu = 98$
- Compute $x = \frac{98}{98 + 1.81^2} \approx 0.968$
- Compute $I_{0.968}(49, 0.5)$ using the continued-fraction expansion
- Two-tailed $p$-value: $p \approx 0.073$
Since $p \geq 0.05$, this correlation is not statistically significant at the conventional threshold.
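Putting the steps together, a sketch using jStat's Student's $t$ CDF (`jStat.studentt.cdf`); the wrapper function and import style are assumptions about packaging, not the project's actual code:

```javascript
const { jStat } = require("jstat"); // npm package "jstat"

// Two-tailed p-value for a Pearson r with sample size n.
function pFromR(r, n) {
  if (!(n > 2) || !(Math.abs(r) < 1)) return null;
  const t = r * Math.sqrt((n - 2) / (1 - r * r)); // t-statistic
  const df = n - 2;                               // degrees of freedom
  return 2 * (1 - jStat.studentt.cdf(Math.abs(t), df));
}

// Matches the worked example: pFromR(0.18, 100) ≈ 0.073
```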
References
LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A brief guide to evaluate replications. Meta-Psychology, 3.
Röseler, L., & Kühberger, A. (2025). FReD: The Framework for Replication Databases. MetaArXiv Preprints.
Röseler, L., Weber, L., Helber, J., et al. (2024). The Replication Database: Documenting the Replicability of Psychological Science. Journal of Open Psychology Data.