
Mathematical Methods for Classifying Replication Outcomes

The Replications Database provides three statistical methods for classifying whether a replication attempt was successful. All methods operate on effect sizes that have been converted to Pearson's r (see Effect Size Normalization).

| Method | Question Asked |
| --- | --- |
| Statistically Significant Effect in the Same Direction? | Is the replication effect statistically significant in the same direction as the original? |
| Original Effect Size in Replication 95% CI | Does the original effect size fall within the replication's 95% confidence interval? |
| Replication Effect Size in Original 95% CI | Does the replication effect size fall within the original's 95% confidence interval? |

Statistically Significant Effect in the Same Direction?

This method evaluates whether the replication study achieves a statistically significant result in the same direction as the original study.

Rationale

The simplest criterion for replication success: if the original study found a significant effect in one direction, a successful replication should also find a significant effect in that same direction. When the original study was not significant, the method checks whether the replication agrees (also non-significant) or disagrees (significant).

Algorithm

This method is inspired by the FReD R package (criterion = "significance_r"), with modifications to handle non-significant originals and to prefer reported p-values over computed ones.

Step 1: Determine p-Values

For both the original and replication studies, the p-value is determined using this priority:

  1. Use the reported p-value from the database (original_p_value or replication_p_value) if available
  2. Otherwise, compute the p-value from the normalized Pearson r and sample size n:

t = r \cdot \sqrt{\frac{n - 2}{1 - r^2}}

Compute the two-tailed p-value with df = n - 2 degrees of freedom.

Reported p-values are preferred because the conversion from other effect size types (Cohen's d, eta-squared, etc.) to Pearson r introduces error, especially with small samples. This can cause computed p-values to disagree with reported ones on significance in approximately 10% of cases.
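The priority rule can be sketched as a small helper. This is an illustrative sketch, not the database's actual code: `resolvePValue` and its parameter names are hypothetical, and the t-distribution computation is injected as `computeP` so the sketch stays self-contained.

```javascript
// Prefer a reported p-value; otherwise compute one from r and n.
// Returns null when neither route is available.
function resolvePValue(reportedP, r, n, computeP) {
  if (reportedP !== null && reportedP !== undefined) {
    return reportedP; // 1. a reported value from the database wins
  }
  if (r !== null && n !== null && n > 2) {
    return computeP(r, n); // 2. fall back to the r-to-t conversion
  }
  return null; // insufficient data
}
```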

Step 2: Check if the Original Study Was Significant

If p_O ≥ 0.05, the original was not significant. In this case, we check whether the replication agrees:

  • If the replication is also not significant (p_R ≥ 0.05): both studies agree there is no effect → Success
  • If the replication is significant (p_R < 0.05): the studies disagree → Failure

Step 3: If the Original Was Significant, Test the Replication

If the original was significant (p_O < 0.05), check the replication's significance and direction consistency:

  • Same direction: sign(r_O) = sign(r_R)
  • Opposite direction: sign(r_O) ≠ sign(r_R)

Classification

| Condition | Outcome |
| --- | --- |
| Original not significant (p_O ≥ 0.05), replication also not significant (p_R ≥ 0.05) | Success |
| Original not significant (p_O ≥ 0.05), replication significant (p_R < 0.05) | Failure |
| Original significant, replication significant (p_R < 0.05) with same direction | Success |
| Original significant, replication significant (p_R < 0.05) with opposite direction | Reversal |
| Original significant, replication not significant (p_R ≥ 0.05) | Failure |
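The classification logic above can be condensed into a short function. This is an illustrative sketch, not the site's production code; the function and parameter names are hypothetical.

```javascript
// Sketch of the significance-and-direction classification.
// pO, pR: p-values of the original and replication studies.
// rO, rR: normalized Pearson correlations (used only for direction).
const ALPHA = 0.05;

function classifySignificance(pO, pR, rO, rR) {
  if (pO >= ALPHA) {
    // Original not significant: success means the replication agrees.
    return pR >= ALPHA ? "Success" : "Failure";
  }
  if (pR >= ALPHA) {
    // Original significant but replication is not.
    return "Failure";
  }
  // Both significant: same direction is success, opposite is a reversal.
  return Math.sign(rO) === Math.sign(rR) ? "Success" : "Reversal";
}
```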

Original Effect Size in Replication 95% Confidence Interval

This method checks whether the original effect size is a plausible value given the replication results, by testing if it falls within the replication's confidence interval.

Rationale

If the original finding is "true," we would expect the original effect size to be consistent with the replication's estimate. This is operationalized by checking whether the original effect falls within the 95% confidence interval of the replication effect.

This method is implemented consistently with the FReD R package (criterion = "consistency_ci").

Confidence Interval Source

The method uses a two-strategy approach to maximize compatibility with original papers:

Strategy 1 (Primary): Pre-computed CI with Raw Effect Sizes

If the database contains a pre-computed 95% CI for the replication effect size (in the replication_es_95_CI column), this CI is compared against the raw original effect size (original_es). This matches the methodology used in original replication studies, where effect sizes and CIs are in their native units (Cohen's d, Hazard Ratio, etc.).

Strategy 2 (Fallback): Computed CI with Normalized Effect Sizes

If no pre-computed CI is available, the CI is computed using the Fisher z-transformation method from the normalized Pearson's r values and sample sizes (see Computing Confidence Intervals).

Classification

| Condition | Outcome |
| --- | --- |
| Original ES within replication 95% CI | Success |
| Original ES outside replication 95% CI | Failure |
| Cannot obtain CI (missing data) | Inconclusive |
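The two strategies and the classification can be combined in one containment check. A sketch under assumptions: `row` mimics a database record using the column names mentioned above plus hypothetical `original_r` / `replication_r` / `replication_n` fields for the normalized values, and `fisherCI` stands for the fallback computation described under Computing Confidence Intervals.

```javascript
// "Original Effect Size in Replication 95% CI" — sketch of the
// two-strategy check. Returns "Success", "Failure", or "Inconclusive".
function classifyOriginalInReplicationCI(row, fisherCI) {
  // Strategy 1: pre-computed CI in native units vs. raw original ES.
  if (row.replication_es_95_CI != null && row.original_es != null) {
    const [lo, hi] = row.replication_es_95_CI;
    return lo <= row.original_es && row.original_es <= hi
      ? "Success" : "Failure";
  }
  // Strategy 2: Fisher-z CI from normalized r and n.
  if (row.replication_r != null && row.replication_n > 3 &&
      row.original_r != null) {
    const [lo, hi] = fisherCI(row.replication_r, row.replication_n);
    return lo <= row.original_r && row.original_r <= hi
      ? "Success" : "Failure";
  }
  return "Inconclusive"; // cannot obtain a CI
}
```

The mirror method ("Replication Effect Size in Original 95% CI") swaps the roles of the two studies but is otherwise identical.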

Advantages

  • Accounts for uncertainty in the replication estimate
  • Does not require significance in either study
  • Provides a more nuanced assessment than simple significance testing
  • Effect size magnitude matters, not just statistical significance
  • When pre-computed CIs are available, results match original paper methodology

Replication Effect Size in Original 95% Confidence Interval

This method checks whether the replication effect size is a plausible value given the original results, by testing if it falls within the original's confidence interval.

Rationale

This is the "mirror" of the previous method. If the replication is measuring the same underlying effect, we would expect the replication effect size to be consistent with the original's estimate. This is operationalized by checking whether the replication effect falls within the 95% confidence interval of the original effect.

This method is particularly useful when the original study had a larger sample size than the replication, giving it a narrower confidence interval.

Confidence Interval Source

The method uses a two-strategy approach to maximize compatibility with original papers:

Strategy 1 (Primary): Pre-computed CI with Raw Effect Sizes

If the database contains a pre-computed 95% CI for the original effect size (in the original_es_95_CI column), this CI is compared against the raw replication effect size (replication_es). This matches the methodology used in original replication studies, where effect sizes and CIs are in their native units (Cohen's d, Hazard Ratio, etc.).

Strategy 2 (Fallback): Computed CI with Normalized Effect Sizes

If no pre-computed CI is available, the CI is computed using the Fisher z-transformation method from the normalized Pearson's r values and sample sizes (see Computing Confidence Intervals).

Classification

| Condition | Outcome |
| --- | --- |
| Replication ES within original 95% CI | Success |
| Replication ES outside original 95% CI | Failure |
| Cannot obtain CI (missing data) | Inconclusive |

Comparison with "Original in Replication CI"

These two methods can give different results:

  • Original in Replication CI asks: "Is the original effect plausible given the replication data?"
  • Replication in Original CI asks: "Is the replication effect plausible given the original data?"

The difference matters when sample sizes differ substantially. A small replication study will have a wide CI, making it easy for the original effect to fall within it (high "success" rate). Conversely, if the original study was large with a narrow CI, the replication effect must be very close to the original to fall within it.


Computing Confidence Intervals (Fisher z-Transformation)

When pre-computed confidence intervals are not available in the database, they are computed using the Fisher z-transformation method.

Algorithm

Step 1: Fisher r-to-z Transformation

The sampling distribution of r is not normal, especially for values far from zero. The Fisher transformation converts r to an approximately normally distributed variable z:

z = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) = \text{arctanh}(r)

Step 2: Compute Standard Error in z-space

The standard error of z depends only on sample size:

SE_z = \frac{1}{\sqrt{n - 3}}

where n is the sample size. This requires n > 3.

Step 3: Compute 95% Confidence Interval in z-space

z_{lower} = z - 1.96 \cdot SE_z \qquad z_{upper} = z + 1.96 \cdot SE_z

Step 4: Inverse Fisher z-to-r Transformation

Transform the confidence bounds back to the r scale:

r=e2z1e2z+1=tanh(z)r = \frac{e^{2z} - 1}{e^{2z} + 1} = \tanh(z)

This yields asymmetric confidence intervals in r-space, which is statistically appropriate since r is bounded by [-1, 1].
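The four steps can be condensed into a short dependency-free function (a sketch; the function name is illustrative):

```javascript
// 95% confidence interval for a Pearson r via the Fisher z-transform.
// Requires n > 3; zCrit = 1.96 gives the conventional 95% interval.
function fisherCI(r, n, zCrit = 1.96) {
  if (n <= 3) throw new RangeError("Fisher z CI requires n > 3");
  const z = Math.atanh(r);                 // Step 1: r -> z
  const se = 1 / Math.sqrt(n - 3);         // Step 2: SE in z-space
  const zLo = z - zCrit * se;              // Step 3: CI bounds in z-space
  const zHi = z + zCrit * se;
  return [Math.tanh(zLo), Math.tanh(zHi)]; // Step 4: back to r-space
}
```

For instance, `fisherCI(0.28, 100)` reproduces the worked example that follows, approximately [0.089, 0.452].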

Example

Given:

  • Original effect: r_O = 0.35
  • Replication effect: r_R = 0.28
  • Replication sample size: n = 100

Computing the replication CI:

  1. Fisher transform: z_R = arctanh(0.28) = 0.288
  2. Standard error: SE_z = 1/√97 = 0.102
  3. CI in z-space: [0.288 - 1.96 × 0.102, 0.288 + 1.96 × 0.102] = [0.089, 0.487]
  4. CI in r-space: [tanh(0.089), tanh(0.487)] = [0.089, 0.452]
  5. Is 0.35 in [0.089, 0.452]? Yes → Success

Computing p-Values from Correlation Coefficients

The significance-based outcome method requires p-values for both original and replication studies. When the database contains a reported p-value (original_p_value or replication_p_value), that value is used directly. Otherwise, p-values are computed from the normalized Pearson r correlation coefficients as described below.

From Correlation to t-Statistic

For a Pearson correlation coefficient r computed from n observations, the test statistic follows a t-distribution under the null hypothesis (H_0: ρ = 0):

t = r \cdot \sqrt{\frac{n - 2}{1 - r^2}}

with df = n - 2 degrees of freedom.

Computing Two-Tailed p-Values

The two-tailed p-value is computed from the t-distribution cumulative distribution function (CDF). For a t-statistic with ν degrees of freedom:

p = 2 \cdot P(T > |t|) = I_x\left(\frac{\nu}{2}, \frac{1}{2}\right)

where x = ν / (ν + t²) and I_x(a, b) is the regularized incomplete beta function.

Regularized Incomplete Beta Function

The regularized incomplete beta function is defined as:

I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} = \frac{1}{B(a, b)} \int_0^x t^{a-1}(1-t)^{b-1} \, dt

where B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} is the complete beta function.

Implementation

The two-tailed p-value is computed using the jStat JavaScript statistical library, which provides a well-tested implementation of the Student's t-distribution CDF. Specifically:

p = 2 \cdot P(T < -|t|) = 2 \cdot F_t(-|t|;\, \nu)

where F_t is the t-distribution CDF with ν = n - 2 degrees of freedom.
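For readers who want to check the numbers without jStat, the incomplete beta identity above can be evaluated directly with a standard continued-fraction expansion (the Lentz-style recurrence popularized by Numerical Recipes). This is a dependency-free sketch, not the site's production code:

```javascript
// Two-tailed p-value for a Pearson r with sample size n, via
// p = I_x(nu/2, 1/2) with x = nu / (nu + t^2).

// Lanczos approximation to ln Gamma(x).
function logGamma(x) {
  const g = [76.18009172947146, -86.50532032941677, 24.01409824083091,
             -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5];
  let ser = 1.000000000190015;
  let xx = x;
  let tmp = x + 5.5;
  tmp -= (x + 0.5) * Math.log(tmp);
  for (let j = 0; j < 6; j++) ser += g[j] / ++xx;
  return -tmp + Math.log(2.5066282746310005 * ser / x);
}

// Continued-fraction expansion used by the incomplete beta function.
function betacf(a, b, x) {
  const EPS = 3e-12, FPMIN = 1e-300;
  const qab = a + b, qap = a + 1, qam = a - 1;
  let c = 1, d = 1 - qab * x / qap;
  if (Math.abs(d) < FPMIN) d = FPMIN;
  d = 1 / d;
  let h = d;
  for (let m = 1; m <= 200; m++) {
    const m2 = 2 * m;
    let aa = m * (b - m) * x / ((qam + m2) * (a + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    h *= d * c;
    aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
    d = 1 + aa * d; if (Math.abs(d) < FPMIN) d = FPMIN;
    c = 1 + aa / c; if (Math.abs(c) < FPMIN) c = FPMIN;
    d = 1 / d;
    const del = d * c;
    h *= del;
    if (Math.abs(del - 1) < EPS) break;
  }
  return h;
}

// Regularized incomplete beta function I_x(a, b).
function regIncBeta(a, b, x) {
  if (x <= 0) return 0;
  if (x >= 1) return 1;
  const bt = Math.exp(logGamma(a + b) - logGamma(a) - logGamma(b) +
                      a * Math.log(x) + b * Math.log(1 - x));
  return x < (a + 1) / (a + b + 2)
    ? bt * betacf(a, b, x) / a
    : 1 - bt * betacf(b, a, 1 - x) / b;
}

// Two-tailed p-value from r and n (requires n > 2 and |r| < 1).
function pFromR(r, n) {
  const df = n - 2;
  const t = r * Math.sqrt(df / (1 - r * r));
  return regIncBeta(df / 2, 0.5, df / (df + t * t));
}
```

With the worked example that follows, `pFromR(0.35, 25)` gives a value close to 0.086.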

Example

Given:

  • Correlation: r = 0.35
  • Sample size: n = 25

Computing the pp-value:

  1. Compute the t-statistic: t = 0.35 × √(23 / (1 - 0.1225)) = 0.35 × √26.21 = 1.792
  2. Degrees of freedom: df = 23
  3. Compute x = 23 / (23 + 3.21) = 0.878
  4. Compute I_x(11.5, 0.5) using the continued-fraction expansion
  5. Two-tailed p-value: p ≈ 0.086

Since p > 0.05, this correlation is not statistically significant at the conventional threshold.


References

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A brief guide to evaluate replications. Meta-Psychology, 3.

Röseler, L., & Kühberger, A. (2025). FReD: The Framework for Replication Databases. MetaArXiv Preprints.

Röseler, L., Weber, L., Helber, J., et al. (2024). The Replication Database: Documenting the Replicability of Psychological Science. Journal of Open Psychology Data.