Validation Against GCTA-COJO¶
cojopy is a Python re-implementation of the COJO (Conditional and Joint) analysis from GCTA (Yang et al., 2012). Both tools take GWAS summary statistics and an LD reference as input, then perform conditional/joint association analysis. The key difference is that GCTA-COJO computes LD internally from individual-level genotype data (.bed/.bim/.fam), while cojopy takes a pre-computed LD correlation matrix directly.
To verify that cojopy produces identical results to GCTA-COJO when given the same LD information, we built an end-to-end simulation that generates data from scratch and compares results across all three analysis modes.
Simulation Design¶
| Parameter | Value |
|---|---|
| Samples | 2,000 |
| SNPs | 1,000 |
| LD blocks | 5 (200 SNPs each, exponential-decay correlation) |
| Causal SNPs | 5 (one per block) |
| Effect sizes | 0.25 -- 0.50 |
| Heritability (h²) | 0.30 |
| P-value threshold | 5 x 10⁻⁸ |
| Collinearity cutoff (r²) | 0.90 |
Pipeline¶
- Simulate genotypes -- Multivariate-normal draws discretized to 0/1/2 dosages, arranged in 5 independent LD blocks.
- Simulate phenotypes -- Linear model with 5 known causal SNPs plus Gaussian noise scaled to target h².
- Run GWAS -- PLINK 1.9
--linearto get marginal effect estimates;--freqfor allele frequencies;--r squarefor the LD matrix. - Format
.mafile -- Allele-aligned summary statistics (SNP, A1, A2, freq, b, se, p, N). A1 is matched to the.bimfile to prevent allele flips. - Run GCTA-COJO --
--cojo-slct,--cojo-joint,--cojo-condusing genotype data (.bed) for LD. - Run cojopy --
slct,joint,condusing the same.mafile and the PLINK-derived LD matrix. - Compare -- Pearson correlation and maximum absolute difference on beta, SE, and p-values.
Because both tools derive LD from the same genotype data (GCTA internally, cojopy via PLINK --r square), any differences in output reflect implementation differences rather than input differences.
Results¶
All three analysis modes passed with Pearson r > 0.999:
| Analysis | SNPs | Beta r | SE r | Max |diff beta| | Max |diff SE| |
|---|---|---|---|---|---|
| slct | 5 (identical set) | 1.000000 | 1.000000 | 4.25 x 10⁻⁷ | 2.01 x 10⁻⁸ |
| joint | 5 | 1.000000 | 1.000000 | 4.25 x 10⁻⁷ | 2.01 x 10⁻⁸ |
| cond | 999 | 1.000000 | 0.999995 | 5.61 x 10⁻⁷ | 5.00 x 10⁻⁸ |
All discrepancies are at the ~10⁻⁷ level, well within floating-point arithmetic tolerance. Raw output tables are shown below.
Stepwise Selection (slct)¶
Both tools selected exactly the same 5 causal SNPs in the same order. Column mapping: GCTA bJ/bJ_se/pJ = cojopy joint_beta/joint_se/joint_p.
GCTA-COJO (--cojo-slct):
| SNP | freq | b | se | p | bJ | bJ_se | pJ |
|---|---|---|---|---|---|---|---|
| rs51 | 0.30 | 0.7495 | 0.04705 | 3.924e-57 | 0.779143 | 0.04999 | 9.036e-55 |
| rs261 | 0.30 | 0.6191 | 0.04799 | 4.501e-38 | 0.639229 | 0.05003 | 2.195e-37 |
| rs471 | 0.30 | 0.5056 | 0.04866 | 2.754e-25 | 0.502901 | 0.04995 | 7.721e-24 |
| rs681 | 0.30 | 0.4850 | 0.04874 | 2.522e-23 | 0.474079 | 0.05003 | 2.644e-21 |
| rs891 | 0.30 | 0.2630 | 0.04959 | 1.139e-07 | 0.325802 | 0.05005 | 7.568e-11 |
cojopy (slct):
| SNP | original_beta | original_se | original_p | joint_beta | joint_se | joint_p |
|---|---|---|---|---|---|---|
| rs51 | 0.7495 | 0.04705 | 6.790e-54 | 0.779143 | 0.04999 | 9.036e-55 |
| rs261 | 0.6191 | 0.04799 | 1.185e-36 | 0.639229 | 0.05003 | 2.195e-37 |
| rs471 | 0.5056 | 0.04866 | 1.100e-24 | 0.502901 | 0.04995 | 7.721e-24 |
| rs681 | 0.4850 | 0.04874 | 8.487e-23 | 0.474079 | 0.05003 | 2.644e-21 |
| rs891 | 0.2630 | 0.04959 | 1.265e-07 | 0.325802 | 0.05005 | 7.568e-11 |
Joint Analysis (joint)¶
Given the same 5 SNPs, the joint results are identical to slct (deterministic computation, same SNP set).
Conditional Analysis (cond)¶
Conditional on rs51, first 10 SNPs shown. Column mapping: GCTA bC/bC_se/pC = cojopy cond_beta/cond_se/cond_p.
GCTA-COJO (--cojo-cond):
| SNP | b | se | p | bC | bC_se | pC |
|---|---|---|---|---|---|---|
| rs1 | 0.0768 | 0.04990 | 0.1238 | 0.03221 | 0.04992 | 0.5187 |
| rs2 | 0.04534 | 0.04993 | 0.3639 | 0.006084 | 0.04993 | 0.9030 |
| rs3 | 0.03242 | 0.04994 | 0.5162 | -0.007728 | 0.04993 | 0.8770 |
| rs4 | 0.04687 | 0.04993 | 0.3479 | -0.009332 | 0.04993 | 0.8517 |
| rs5 | 0.01306 | 0.04994 | 0.7937 | -0.02263 | 0.04993 | 0.6504 |
| rs6 | 0.02343 | 0.04994 | 0.6389 | -0.02385 | 0.04993 | 0.6329 |
| rs7 | 0.0008159 | 0.04993 | 0.9870 | -0.03931 | 0.04992 | 0.4310 |
| rs8 | 0.001178 | 0.04994 | 0.9812 | -0.04520 | 0.04992 | 0.3653 |
| rs9 | 0.006928 | 0.04995 | 0.8897 | -0.04036 | 0.04994 | 0.4189 |
| rs10 | 0.03566 | 0.04993 | 0.4751 | -0.03034 | 0.04992 | 0.5434 |
cojopy (cond):
| SNP | original_beta | original_se | original_p | cond_beta | cond_se | cond_p |
|---|---|---|---|---|---|---|
| rs1 | 0.0768 | 0.04990 | 0.1240 | 0.03221 | 0.04992 | 0.5187 |
| rs2 | 0.04534 | 0.04993 | 0.3640 | 0.006084 | 0.04993 | 0.9030 |
| rs3 | 0.03242 | 0.04994 | 0.5163 | -0.007728 | 0.04993 | 0.8770 |
| rs4 | 0.04687 | 0.04993 | 0.3480 | -0.009332 | 0.04993 | 0.8517 |
| rs5 | 0.01306 | 0.04994 | 0.7937 | -0.02263 | 0.04993 | 0.6504 |
| rs6 | 0.02343 | 0.04994 | 0.6390 | -0.02385 | 0.04993 | 0.6329 |
| rs7 | 0.0008159 | 0.04993 | 0.9870 | -0.03931 | 0.04992 | 0.4310 |
| rs8 | 0.001178 | 0.04994 | 0.9812 | -0.04520 | 0.04992 | 0.3653 |
| rs9 | 0.006928 | 0.04995 | 0.8897 | -0.04036 | 0.04994 | 0.4189 |
| rs10 | 0.03566 | 0.04993 | 0.4752 | -0.03034 | 0.04992 | 0.5434 |
Why the Results Match¶
Both GCTA-COJO and cojopy implement the same formulas from Yang et al. (2012):
- Phenotype variance (Eq. 8): median of per-SNP variance estimates.
- Effective sample size (Eq. 13): accounts for the effect each SNP explains.
- Joint effects (Eq. 12):
beta_joint = (X'X)^{-1} D beta_marginal, whereX'Xis reconstructed from LD correlations, allele frequencies, and effective sample sizes. - Conditional effects: same matrix algebra, partitioning selected vs. unselected SNPs.
The only source of numerical difference is floating-point arithmetic order, which explains the ~10⁻⁷ magnitude of discrepancies.
Reproducing the Validation¶
The script automatically downloads GCTA and PLINK (first run only), generates all data, runs both tools, and reports pass/fail with scatter plots in simulation/output/comparison/.
Reference¶
Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics. 2012;44(4):369-375. doi:10.1038/ng.2213