Skip to content

Validation Against GCTA-COJO

cojopy is a Python re-implementation of the COJO (Conditional and Joint) analysis from GCTA (Yang et al., 2012). Both tools take GWAS summary statistics and an LD reference as input, then perform conditional/joint association analysis. The key difference is that GCTA-COJO computes LD internally from individual-level genotype data (.bed/.bim/.fam), while cojopy takes a pre-computed LD correlation matrix directly.

To verify that cojopy produces identical results to GCTA-COJO when given the same LD information, we built an end-to-end simulation that generates data from scratch and compares results across all three analysis modes.

Simulation Design

Parameter Value
Samples 2,000
SNPs 1,000
LD blocks 5 (200 SNPs each, exponential-decay correlation)
Causal SNPs 5 (one per block)
Effect sizes 0.25 -- 0.50
Heritability (h²) 0.30
P-value threshold 5 x 10⁻⁸
Collinearity cutoff (r²) 0.90

Pipeline

  1. Simulate genotypes -- Multivariate-normal draws discretized to 0/1/2 dosages, arranged in 5 independent LD blocks.
  2. Simulate phenotypes -- Linear model with 5 known causal SNPs plus Gaussian noise scaled to target h².
  3. Run GWAS -- PLINK 1.9 --linear to get marginal effect estimates; --freq for allele frequencies; --r square for the LD matrix.
  4. Format .ma file -- Allele-aligned summary statistics (SNP, A1, A2, freq, b, se, p, N). A1 is matched to the .bim file to prevent allele flips.
  5. Run GCTA-COJO -- --cojo-slct, --cojo-joint, --cojo-cond using genotype data (.bed) for LD.
  6. Run cojopy -- slct, joint, cond using the same .ma file and the PLINK-derived LD matrix.
  7. Compare -- Pearson correlation and maximum absolute difference on beta, SE, and p-values.

Because both tools derive LD from the same genotype data (GCTA internally, cojopy via PLINK --r square), any differences in output reflect implementation differences rather than input differences.

Results

All three analysis modes passed with Pearson r > 0.999:

Analysis SNPs Beta r SE r Max |diff beta| Max |diff SE|
slct 5 (identical set) 1.000000 1.000000 4.25 x 10⁻⁷ 2.01 x 10⁻⁸
joint 5 1.000000 1.000000 4.25 x 10⁻⁷ 2.01 x 10⁻⁸
cond 999 1.000000 0.999995 5.61 x 10⁻⁷ 5.00 x 10⁻⁸

All discrepancies are at the ~10⁻⁷ level, well within floating-point arithmetic tolerance. Raw output tables are shown below.

Stepwise Selection (slct)

Both tools selected exactly the same 5 causal SNPs in the same order. Column mapping: GCTA bJ/bJ_se/pJ = cojopy joint_beta/joint_se/joint_p.

GCTA-COJO (--cojo-slct):

SNP freq b se p bJ bJ_se pJ
rs51 0.30 0.7495 0.04705 3.924e-57 0.779143 0.04999 9.036e-55
rs261 0.30 0.6191 0.04799 4.501e-38 0.639229 0.05003 2.195e-37
rs471 0.30 0.5056 0.04866 2.754e-25 0.502901 0.04995 7.721e-24
rs681 0.30 0.4850 0.04874 2.522e-23 0.474079 0.05003 2.644e-21
rs891 0.30 0.2630 0.04959 1.139e-07 0.325802 0.05005 7.568e-11

cojopy (slct):

SNP original_beta original_se original_p joint_beta joint_se joint_p
rs51 0.7495 0.04705 6.790e-54 0.779143 0.04999 9.036e-55
rs261 0.6191 0.04799 1.185e-36 0.639229 0.05003 2.195e-37
rs471 0.5056 0.04866 1.100e-24 0.502901 0.04995 7.721e-24
rs681 0.4850 0.04874 8.487e-23 0.474079 0.05003 2.644e-21
rs891 0.2630 0.04959 1.265e-07 0.325802 0.05005 7.568e-11

Joint Analysis (joint)

Given the same 5 SNPs, the joint results are identical to slct (deterministic computation, same SNP set).

Conditional Analysis (cond)

Conditional on rs51, first 10 SNPs shown. Column mapping: GCTA bC/bC_se/pC = cojopy cond_beta/cond_se/cond_p.

GCTA-COJO (--cojo-cond):

SNP b se p bC bC_se pC
rs1 0.0768 0.04990 0.1238 0.03221 0.04992 0.5187
rs2 0.04534 0.04993 0.3639 0.006084 0.04993 0.9030
rs3 0.03242 0.04994 0.5162 -0.007728 0.04993 0.8770
rs4 0.04687 0.04993 0.3479 -0.009332 0.04993 0.8517
rs5 0.01306 0.04994 0.7937 -0.02263 0.04993 0.6504
rs6 0.02343 0.04994 0.6389 -0.02385 0.04993 0.6329
rs7 0.0008159 0.04993 0.9870 -0.03931 0.04992 0.4310
rs8 0.001178 0.04994 0.9812 -0.04520 0.04992 0.3653
rs9 0.006928 0.04995 0.8897 -0.04036 0.04994 0.4189
rs10 0.03566 0.04993 0.4751 -0.03034 0.04992 0.5434

cojopy (cond):

SNP original_beta original_se original_p cond_beta cond_se cond_p
rs1 0.0768 0.04990 0.1240 0.03221 0.04992 0.5187
rs2 0.04534 0.04993 0.3640 0.006084 0.04993 0.9030
rs3 0.03242 0.04994 0.5163 -0.007728 0.04993 0.8770
rs4 0.04687 0.04993 0.3480 -0.009332 0.04993 0.8517
rs5 0.01306 0.04994 0.7937 -0.02263 0.04993 0.6504
rs6 0.02343 0.04994 0.6390 -0.02385 0.04993 0.6329
rs7 0.0008159 0.04993 0.9870 -0.03931 0.04992 0.4310
rs8 0.001178 0.04994 0.9812 -0.04520 0.04992 0.3653
rs9 0.006928 0.04995 0.8897 -0.04036 0.04994 0.4189
rs10 0.03566 0.04993 0.4752 -0.03034 0.04992 0.5434

Why the Results Match

Both GCTA-COJO and cojopy implement the same formulas from Yang et al. (2012):

  • Phenotype variance (Eq. 8): median of per-SNP variance estimates.
  • Effective sample size (Eq. 13): accounts for the effect each SNP explains.
  • Joint effects (Eq. 12): beta_joint = (X'X)^{-1} D beta_marginal, where X'X is reconstructed from LD correlations, allele frequencies, and effective sample sizes.
  • Conditional effects: same matrix algebra, partitioning selected vs. unselected SNPs.

The only source of numerical difference is floating-point arithmetic order, which explains the ~10⁻⁷ magnitude of discrepancies.

Reproducing the Validation

cd simulation
python run_simulation.py

The script automatically downloads GCTA and PLINK (first run only), generates all data, runs both tools, and reports pass/fail with scatter plots in simulation/output/comparison/.

Reference

Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics. 2012;44(4):369-375. doi:10.1038/ng.2213