Methodology: LD-based Conditional & Joint Analysis¶
This document provides a self-contained mathematical derivation of the formulas used in cojopy, following Yang et al. (2012) Nature Genetics. Each section maps equations to the corresponding code paths so that readers can trace every quantity from theory to implementation.
1. Notation¶
| Symbol | Meaning | Code variable |
|---|---|---|
| \(y\) | Phenotype vector (\(N \times 1\)) | — (not stored; summary-level only) |
| \(x_j\) | Genotype vector for SNP \(j\) (\(N \times 1\)), coded as 0/1/2 dosage | — |
| \(X\) | Genotype matrix (\(N \times p\)) | — |
| \(\hat{\beta}_j\) | Marginal OLS effect of SNP \(j\) | beta, original_beta |
| \(\text{SE}(\hat{\beta}_j)\) | Standard error of \(\hat{\beta}_j\) | se, original_se |
| \(p_j\) | P-value for \(\hat{\beta}_j\) | p, original_p |
| \(f_j\) | Effect-allele frequency of SNP \(j\) | freq |
| \(N_j\) | Sample size for SNP \(j\) | n |
| \(r_{jk}\) | LD (Pearson) correlation between SNPs \(j\) and \(k\) | ld_matrix[j, k] |
| \(V_p\) | Phenotype variance | pheno_var |
| \(\text{Var}(x_j)\) | Genotype variance \(= 2f_j(1-f_j)\) under HWE | var_x |
| \(D\) | Diagonal matrix with \(D_{jj} = N_j \cdot \text{Var}(x_j)\) | D |
| \(\hat{\beta}_J\) | Joint (multi-SNP) effect estimates | joint_betas |
| \(\hat{\beta}_{i \mid S}\) | Conditional effect of SNP \(i\) given selected set \(S\) | cond_beta |
| \(S\) | Set of selected (conditioning) SNP indices | selected_indices |
2. Linear Regression Model¶
2.1 Single-SNP marginal regression¶
For each SNP \(j\), the marginal model is:

\[
y = x_j \beta_j + \epsilon_j.
\]

The OLS estimate is:

\[
\hat{\beta}_j = (x_j' x_j)^{-1} x_j' y.
\]
2.2 Multi-SNP joint regression¶
When fitting \(p\) SNPs simultaneously:

\[
y = X \beta_J + \epsilon,
\]

the OLS normal equations give:

\[
(X'X)\, \hat{\beta}_J = X' y,
\]

so

\[
\hat{\beta}_J = (X'X)^{-1} X' y.
\]
2.3 Marginal vs. joint effects¶
From the marginal estimate for SNP \(j\):

\[
\hat{\beta}_j = (x_j' x_j)^{-1} x_j' y.
\]

Multiplying both sides by \((x_j' x_j)\):

\[
(x_j' x_j)\, \hat{\beta}_j = x_j' y.
\]

In matrix form for all SNPs:

\[
D \hat{\beta} = X' y,
\]

where \(D = \text{diag}(x_1'x_1, \ldots, x_p'x_p)\) collects the diagonal entries of \(X'X\). This identity bridges marginal and joint effects:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}.
\]
Code reference
This relationship is the foundation of the entire COJO method — it allows computing joint effects from marginal summary statistics plus an LD matrix, without access to individual-level data.
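As a numerical sanity check, the identity \(\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}\) can be verified on simulated individual-level data. The sketch below is illustrative only (made-up data, not part of cojopy): it confirms that the summary-level route reproduces the direct joint OLS fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 3

# Simulate centred genotypes and a phenotype with two causal SNPs.
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=N)
y -= y.mean()

# Joint OLS from individual-level data: (X'X) beta_J = X'y.
beta_joint_full = np.linalg.solve(X.T @ X, X.T @ y)

# Marginal effects and the diagonal D = diag(x_j' x_j).
D = np.diag(np.einsum("ij,ij->j", X, X))
beta_marginal = (X.T @ y) / np.diag(D)

# Summary-level route: beta_J = (X'X)^{-1} D beta_marginal.
beta_joint_summary = np.linalg.solve(X.T @ X, D @ beta_marginal)

assert np.allclose(beta_joint_full, beta_joint_summary)
```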
3. From Individual Data to Summary Statistics¶
3.1 OLS expressions¶
From standard OLS theory:

\[
\hat{\beta}_j = \frac{x_j' y}{x_j' x_j},
\qquad
\text{SE}(\hat{\beta}_j)^2 = \frac{\sigma_\epsilon^2}{x_j' x_j}.
\]

Since \(x_j' x_j = N_j \cdot \text{Var}(x_j)\) when \(x_j\) is centred, and writing \(\sigma_\epsilon^2 \approx V_p\) (the residual variance is approximated by the phenotype variance for small per-SNP effects):

\[
\text{SE}(\hat{\beta}_j)^2 \approx \frac{V_p}{N_j \cdot \text{Var}(x_j)}.
\]
3.2 The \(X'X\) matrix and LD¶
The off-diagonal entries of \(X'X\) relate to LD:

\[
(X'X)_{jk} = N\, r_{jk} \sqrt{\text{Var}(x_j)\, \text{Var}(x_k)},
\]

where \(r_{jk}\) is the Pearson correlation between genotype vectors \(x_j\) and \(x_k\). When SNPs \(j\) and \(k\) have different sample sizes (as in meta-analysis), \(N\) is replaced by \(\min(N_j, N_k)\).
4. Phenotype Variance Estimation — Eq. (8)¶
4.1 Derivation¶
Rearranging the SE expression from Section 3.1:

\[
V_p \approx N_j \cdot \text{Var}(x_j) \cdot \text{SE}(\hat{\beta}_j)^2.
\]

However, this ignores the variance explained by SNP \(j\) itself. A more accurate decomposition (Yang et al. 2012, Eq. 8) accounts for the SNP's own contribution:

\[
\hat{V}_p^{(j)} = \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \frac{N_j}{N_j - 1}\, \hat{\beta}_j^2 \right].
\]

Each SNP yields an independent estimate of \(V_p\). The median across all SNPs provides a robust estimator:

\[
\hat{V}_p = \operatorname{median}_j\, \hat{V}_p^{(j)}.
\]
4.2 Why median?¶
The median is preferred over the mean because a small number of SNPs with large effects can produce extreme per-SNP \(V_p\) estimates. The median is resistant to such outliers.
Code reference
_cal_pheno_var() — cojopy.py:661-667
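A minimal sketch of the Eq. (8) estimator, assuming vectorised summary statistics. The function name `estimate_pheno_var` is illustrative; cojopy's `_cal_pheno_var` may differ in details.

```python
import numpy as np

def estimate_pheno_var(beta, se, freq, n):
    """Median of per-SNP phenotype-variance estimates (Eq. 8).

    beta, se, freq, n: 1-D arrays of equal length with the marginal
    effect, its SE, the effect-allele frequency, and the sample size.
    """
    var_x = 2.0 * freq * (1.0 - freq)  # genotype variance under HWE
    vp_per_snp = var_x * (n * se**2 + beta**2 * n / (n - 1.0))
    return np.median(vp_per_snp)
```

For example, if every SNP's SE is exactly \(\sqrt{V_p / (N \cdot \text{Var}(x_j))}\) with \(V_p = 1\) and \(\hat{\beta}_j = 0\), the estimator returns 1 for each SNP and hence a median of 1.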
5. Effective Sample Size — Eq. (13)¶
5.1 Motivation¶
In meta-analysis, each SNP may be measured in a different subset of studies, so \(N_j\) varies. Even when \(N\) is constant, the "effective" sample size for reconstructing \((X'X)_{jj}\) must account for the variance explained by the SNP itself.
5.2 Derivation¶
Starting from the per-SNP variance decomposition (Eq. 8):

\[
\hat{V}_p = \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \frac{N_j}{N_j - 1}\, \hat{\beta}_j^2 \right].
\]

For \(N_j \gg 1\), \(\frac{N_j}{N_j-1} \approx 1\), so approximately:

\[
\hat{V}_p \approx \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \hat{\beta}_j^2 \right].
\]

Solving for \(N_j\):

\[
N_j = \frac{\hat{V}_p / \text{Var}(x_j) - \hat{\beta}_j^2}{\text{SE}(\hat{\beta}_j)^2}.
\]
Given the global estimate \(\hat{V}_p\) from Section 4, this formula yields the effective sample size for each SNP.
Code reference
_cal_effective_n() — cojopy.py:669-680
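The inversion above is a one-liner; the sketch below (illustrative name `effective_n`, not necessarily cojopy's exact `_cal_effective_n` signature) makes the dependence on \(\hat{V}_p\) explicit.

```python
import numpy as np

def effective_n(beta, se, freq, vp_hat):
    """Effective per-SNP sample size from the large-N form of Eq. (8)."""
    var_x = 2.0 * freq * (1.0 - freq)  # genotype variance under HWE
    return (vp_hat / var_x - beta**2) / se**2
```

By construction, feeding in an SE generated as \(\sqrt{V_p / (N \cdot \text{Var}(x_j))}\) with \(\hat{\beta}_j = 0\) recovers \(N\) exactly.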
6. Reconstructing \(X'X\) from Summary Statistics¶
With \(\hat{V}_p\) and per-SNP effective \(N_j\) in hand, we can reconstruct the \(X'X\) matrix entirely from summary statistics and an external LD reference.
6.1 Genotype variance under HWE¶
Under Hardy–Weinberg equilibrium, the variance of the 0/1/2 dosage is:

\[
\text{Var}(x_j) = 2 f_j (1 - f_j).
\]
6.2 Off-diagonal entries¶

\[
(X'X)_{jk} = \min(N_j, N_k)\, r_{jk} \sqrt{\text{Var}(x_j)\, \text{Var}(x_k)}.
\]

The \(\min\) accounts for different sample sizes: only the overlapping samples contribute to the cross-product.
6.3 Diagonal entries¶

\[
(X'X)_{jj} = N_j \cdot \text{Var}(x_j).
\]
6.4 The \(D\) matrix¶
The \(D\) matrix is simply the diagonal of \(X'X\):

\[
D = \text{diag}\left( (X'X)_{11}, \ldots, (X'X)_{pp} \right).
\]
Code reference
_calculate_joint_stats() — cojopy.py:458-469
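Putting Sections 6.1–6.3 together, the reconstruction can be sketched as follows (illustrative helper, not cojopy's internal layout):

```python
import numpy as np

def reconstruct_xtx(freq, n_eff, ld):
    """Rebuild X'X from allele frequencies, effective sample sizes, and LD.

    freq, n_eff: 1-D arrays of length p; ld: p x p correlation matrix.
    """
    var_x = 2.0 * freq * (1.0 - freq)       # Var(x_j) under HWE
    sd = np.sqrt(var_x)
    n_min = np.minimum.outer(n_eff, n_eff)  # min(N_j, N_k) for each pair
    # Off-diagonals: min(N_j, N_k) * r_jk * sqrt(Var(x_j) Var(x_k)).
    # Diagonal falls out automatically: r_jj = 1 gives N_j * Var(x_j) = D_jj.
    return n_min * ld * np.outer(sd, sd)
```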
7. Joint Analysis — Eq. (12)¶
7.1 Joint effect estimates¶
From Section 2.3, the joint effects are:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta},
\]
where \(\hat{\beta}\) is the vector of marginal effects, \(D\) is the diagonal matrix from Section 6.4, and \(X'X\) is the reconstructed matrix from Section 6.
7.2 Derivation¶
Starting from the normal equations:

\[
(X'X)\, \hat{\beta}_J = X' y.
\]

We showed in Section 2.3 that \(D\hat{\beta} \approx X'y\). Substituting:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}.
\]
7.3 Joint standard errors¶
The variance-covariance matrix of \(\hat{\beta}_J\) is:

\[
\text{Var}(\hat{\beta}_J) = \sigma_\epsilon^2 (X'X)^{-1} \approx V_p\, (X'X)^{-1}.
\]

Therefore:

\[
\text{SE}(\hat{\beta}_{J,j}) = \sqrt{V_p \left[ (X'X)^{-1} \right]_{jj}}.
\]
7.4 Joint P-values¶

\[
p_{J,j} = 2\, \Phi\!\left( -\left| \frac{\hat{\beta}_{J,j}}{\text{SE}(\hat{\beta}_{J,j})} \right| \right),
\]

where \(\Phi\) is the standard normal CDF.
Code reference
_calculate_joint_stats() — cojopy.py:471-509
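Sections 7.1–7.4 combine into a few lines of linear algebra. This sketch (illustrative function, not cojopy's `_calculate_joint_stats` signature) assumes `xtx` has already been reconstructed as in Section 6.

```python
import numpy as np
from scipy import stats

def joint_stats(beta_marginal, xtx, vp_hat):
    """Joint effects, SEs, and P-values from marginal betas and X'X."""
    D = np.diag(np.diag(xtx))               # diagonal of X'X as a matrix
    xtx_inv = np.linalg.inv(xtx)
    beta_joint = xtx_inv @ D @ beta_marginal
    se_joint = np.sqrt(vp_hat * np.diag(xtx_inv))
    z = beta_joint / se_joint
    p_joint = 2.0 * stats.norm.sf(np.abs(z))
    return beta_joint, se_joint, p_joint
```

With a diagonal `xtx` (no LD), the joint effects reduce to the marginal ones, as the formula requires.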
8. Conditional Analysis¶
8.1 Setup¶
Partition the SNPs into a selected (conditioning) set \(S\) and the remaining set \(S^c\). We want the effect of each SNP \(i \in S^c\) conditional on all SNPs in \(S\).
Define the block structure of \(X'X\) over \(S \cup \{i\}\):

\[
X'X =
\begin{pmatrix}
B_1 & C_i' \\
C_i & B_{2,ii}
\end{pmatrix},
\]
where:
- \(B_1 = (X'X)_{S,S}\) — the \(|S| \times |S|\) sub-matrix among selected SNPs
- \(C_i = (X'X)_{i,S}\) — the row vector linking SNP \(i\) to the selected set
- \(B_{2,ii} = (X'X)_{ii} = D_{ii}\) — the diagonal entry for SNP \(i\)
8.2 Conditional effect¶
For a single SNP \(i \in S^c\), the conditional effect given \(S\) is:

\[
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1}\, C_i\, B_1^{-1} D_S\, \hat{\beta}_S,
\]
where:
- \(\hat{\beta}_i\) is the marginal effect of SNP \(i\)
- \(\hat{\beta}_S\) is the vector of marginal effects for SNPs in \(S\)
- \(D_S = \text{diag}(N_j \cdot \text{Var}(x_j))_{j \in S}\) is the diagonal sub-matrix for the selected set
- \(D_{ii} = N_i \cdot \text{Var}(x_i)\)
8.3 Derivation sketch¶
From the partitioned normal equations, the full system is:

\[
\begin{pmatrix}
B_1 & C_i' \\
C_i & D_{ii}
\end{pmatrix}
\begin{pmatrix}
\hat{\beta}_{S \mid i} \\
\hat{\beta}_{i \mid S}
\end{pmatrix}
=
\begin{pmatrix}
X_S' y \\
x_i' y
\end{pmatrix}
=
\begin{pmatrix}
D_S \hat{\beta}_S \\
D_{ii} \hat{\beta}_i
\end{pmatrix}.
\]

From the second row:

\[
C_i \hat{\beta}_{S \mid i} + D_{ii} \hat{\beta}_{i \mid S} = D_{ii} \hat{\beta}_i
\quad\Longrightarrow\quad
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1} C_i \hat{\beta}_{S \mid i}.
\]

From the first row, \(\hat{\beta}_{S \mid i} \approx B_1^{-1} D_S \hat{\beta}_S\) (treating the contribution of the single SNP \(i\) as negligible). Substituting:

\[
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1}\, C_i\, B_1^{-1} D_S\, \hat{\beta}_S.
\]
8.4 Conditional standard error¶

\[
\text{SE}(\hat{\beta}_{i \mid S}) \approx \sqrt{\frac{V_p}{D_{ii}}}.
\]

This follows because, treating the selected SNPs as fixed covariates, the residual variance after conditioning is approximated by \(V_p\), so the sampling variance of \(\hat{\beta}_{i \mid S}\) is approximately \(V_p / D_{ii}\).
8.5 Conditional P-value¶

\[
p_{i \mid S} = 2\, \Phi\!\left( -\left| \frac{\hat{\beta}_{i \mid S}}{\text{SE}(\hat{\beta}_{i \mid S})} \right| \right).
\]
Code reference
_calculate_conditional_stats() — cojopy.py:329-433
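The Section 8.2 formula can be sketched directly from the blocks of `xtx` (illustrative helper; cojopy's `_calculate_conditional_stats` handles all remaining SNPs at once):

```python
import numpy as np

def conditional_beta(i, selected, beta_marginal, xtx):
    """Effect of SNP i conditional on the selected set (Section 8.2)."""
    B1 = xtx[np.ix_(selected, selected)]     # (X'X) among selected SNPs
    Ci = xtx[i, selected]                    # row linking SNP i to S
    D_S = np.diag(np.diag(B1))               # D entries for selected SNPs
    D_ii = xtx[i, i]
    # beta_{S|i} ~= B1^{-1} D_S beta_S (first-row approximation).
    beta_S_given = np.linalg.solve(B1, D_S @ beta_marginal[selected])
    return beta_marginal[i] - (Ci @ beta_S_given) / D_ii
```

When SNP \(i\) is in complete linkage equilibrium with the selected set (\(C_i = 0\)), the conditional effect equals the marginal one, as expected.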
9. Stepwise Selection Algorithm¶
The stepwise model selection procedure combines conditional analysis (forward step) and joint analysis (backward elimination) in an iterative loop.
9.1 Forward step¶
- Initialise: select the SNP with the smallest marginal P-value if \(p < p_\text{cutoff}\) (default \(5 \times 10^{-8}\)).
- Iterate: compute conditional statistics for all remaining SNPs given the current selected set \(S\); identify the SNP with the smallest conditional P-value.
- Collinearity check: before adding a candidate SNP, verify \(r^2 < r^2_\text{cutoff}\) (default 0.9) with every SNP already in \(S\). If the check fails, exclude the candidate and continue.
- Add: if the conditional P-value passes the threshold and collinearity check, add the SNP to \(S\).
9.2 Backward elimination¶
After each forward addition, perform joint analysis on all SNPs in \(S\). Remove any SNP whose joint P-value exceeds \(p_\text{cutoff}\). Record removed SNPs to prevent re-selection in subsequent iterations (avoids infinite loops).
9.3 Termination¶
The algorithm stops when:
- No remaining SNP has a conditional P-value below \(p_\text{cutoff}\), or
- The candidate SNP was previously removed by backward elimination, or
- No available SNPs remain.
9.4 Final output¶
After convergence, a final joint analysis is run on the selected set to produce the reported joint effect estimates, SEs, and P-values.
Code reference
conditional_selection() — cojopy.py:156-327
The forward step is the while continue_selection loop; backward elimination
is handled by _backward_elimination() at cojopy.py:627-653.
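The forward/backward loop of Sections 9.1–9.3 can be summarised in a skeleton. This is a simplified sketch, not cojopy's `conditional_selection`: the `cond_p_fn`, `joint_p_fn`, and `r2_fn` callables stand in for the conditional statistics (Section 8), joint statistics (Section 7), and LD lookups, and collinear candidates are recorded as removed rather than retried.

```python
def stepwise_select(p_marginal, cond_p_fn, joint_p_fn, r2_fn,
                    p_cutoff=5e-8, r2_cutoff=0.9):
    """Skeleton of COJO stepwise selection (forward + backward)."""
    selected, removed = [], set()
    # Initialise with the smallest marginal P-value, if genome-wide significant.
    best = min(range(len(p_marginal)), key=lambda j: p_marginal[j])
    if p_marginal[best] >= p_cutoff:
        return selected
    selected.append(best)
    while True:
        candidates = [j for j in range(len(p_marginal))
                      if j not in selected and j not in removed]
        if not candidates:
            break
        # Forward step: best conditional P-value given the current set.
        cond_p = {j: cond_p_fn(j, selected) for j in candidates}
        j_best = min(cond_p, key=cond_p.get)
        if cond_p[j_best] >= p_cutoff:
            break
        if any(r2_fn(j_best, k) >= r2_cutoff for k in selected):
            removed.add(j_best)              # collinear with the model: skip
            continue
        selected.append(j_best)
        # Backward elimination on the enlarged model.
        for k, p in joint_p_fn(selected).items():
            if p > p_cutoff:
                selected.remove(k)
                removed.add(k)               # never re-select (avoids loops)
    return selected
```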
10. Implementation Notes¶
10.1 Log-space P-value computation¶
For highly significant associations, P-values can underflow to zero in 64-bit floating point. cojopy computes P-values in log space to avoid this:

\[
\log p = \log 2 + \log \Phi(-|z|),
\qquad
z = \hat{\beta} / \text{SE}(\hat{\beta}),
\]

and then exponentiates, clamping to the smallest representable positive float
(np.finfo(np.float64).tiny).
Code reference
_calculate_p_value() — cojopy.py:655-659
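A minimal sketch of this idea, assuming SciPy's log-survival function (the name `p_from_z` is illustrative, not cojopy's `_calculate_p_value` signature):

```python
import numpy as np
from scipy import stats

def p_from_z(z):
    """P-value computed in log space, clamped to the smallest positive float."""
    log_p = np.log(2.0) + stats.norm.logsf(np.abs(z))
    return np.maximum(np.exp(log_p), np.finfo(np.float64).tiny)
```

For example, `2.0 * stats.norm.sf(100.0)` underflows to exactly zero, while `p_from_z(100.0)` stays strictly positive.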
10.2 Singular matrix handling¶
When the selected SNP set includes near-collinear variants, \(X'X\) or \(B_1\)
may be numerically singular. In such cases, the code falls back to the
Moore-Penrose pseudo-inverse (np.linalg.pinv) and logs a warning.
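The fallback pattern is simple; a sketch (illustrative helper name, minus the logging):

```python
import numpy as np

def safe_inverse(a):
    """Invert a matrix, falling back to the pseudo-inverse when singular."""
    try:
        return np.linalg.inv(a)
    except np.linalg.LinAlgError:
        # Near-collinear selected SNPs make X'X (or B_1) singular;
        # the Moore-Penrose pseudo-inverse gives a least-squares solution.
        return np.linalg.pinv(a)
```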
10.3 Negative variance¶
If \((X'X)^{-1}_{jj} < 0\) (which can occur with ill-conditioned matrices), the joint SE becomes undefined. The code sets SE = \(\infty\) and P = 1.0 for such SNPs, effectively excluding them from interpretation.
11. Reference¶
Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics. 2012;44(4):369-375. doi:10.1038/ng.2213