Methodology: LD-based Conditional & Joint Analysis¶
This document provides a self-contained mathematical derivation of the formulas used in cojopy, following Yang et al. (2012) Nature Genetics. Each section maps equations to the corresponding code paths so that readers can trace every quantity from theory to implementation.
1. Notation¶
| Symbol | Meaning | Code variable |
|---|---|---|
| \(y\) | Phenotype vector (\(N \times 1\)) | — (not stored; summary-level only) |
| \(x_j\) | Genotype vector for SNP \(j\) (\(N \times 1\)), coded as 0/1/2 dosage | — |
| \(X\) | Genotype matrix (\(N \times p\)) | — |
| \(\hat{\beta}_j\) | Marginal OLS effect of SNP \(j\) | beta, original_beta |
| \(\text{SE}(\hat{\beta}_j)\) | Standard error of \(\hat{\beta}_j\) | se, original_se |
| \(p_j\) | P-value for \(\hat{\beta}_j\) | p, original_p |
| \(f_j\) | Effect-allele frequency of SNP \(j\) | freq |
| \(N_j\) | Sample size for SNP \(j\) | n |
| \(r_{jk}\) | LD (Pearson) correlation between SNPs \(j\) and \(k\) | ld_matrix[j, k] |
| \(V_p\) | Phenotype variance | pheno_var |
| \(\text{Var}(x_j)\) | Genotype variance \(= 2f_j(1-f_j)\) under HWE | var_x |
| \(D\) | Diagonal matrix with \(D_{jj} = N_j \cdot \text{Var}(x_j)\) | D |
| \(\hat{\beta}_J\) | Joint (multi-SNP) effect estimates | joint_betas |
| \(\hat{\beta}_{i \mid S}\) | Conditional effect of SNP \(i\) given selected set \(S\) | cond_beta |
| \(S\) | Set of selected (conditioning) SNP indices | selected_indices |
2. Linear Regression Model¶
2.1 Single-SNP marginal regression¶
For each SNP \(j\), the marginal model is:

\[
y = x_j \beta_j + \epsilon_j.
\]

The OLS estimate is:

\[
\hat{\beta}_j = (x_j' x_j)^{-1} x_j' y.
\]
2.2 Multi-SNP joint regression¶
When fitting \(p\) SNPs simultaneously:

\[
y = X \beta_J + \epsilon,
\]

the OLS normal equations give:

\[
(X'X)\, \hat{\beta}_J = X' y,
\]

so

\[
\hat{\beta}_J = (X'X)^{-1} X' y.
\]
2.3 Marginal vs. joint effects¶
From the marginal estimate for SNP \(j\):

\[
\hat{\beta}_j = (x_j' x_j)^{-1} x_j' y.
\]

Multiplying both sides by \((x_j' x_j)\):

\[
(x_j' x_j)\, \hat{\beta}_j = x_j' y.
\]

In matrix form for all SNPs:

\[
D \hat{\beta} = X' y,
\]

where \(D = \text{diag}(x_1'x_1, \ldots, x_p'x_p)\) collects the diagonal entries of \(X'X\). This identity bridges marginal and joint effects:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}.
\]
Code reference
This relationship is the foundation of the entire COJO method — it allows computing joint effects from marginal summary statistics plus an LD matrix, without access to individual-level data.
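As a numerical sanity check, the identity \(\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}\) can be verified on simulated individual-level data. The sketch below is illustrative only (made-up data, not part of cojopy): it confirms that the summary-level route reproduces the direct joint OLS fit.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 3

# Simulate centred genotypes and a phenotype with two causal SNPs.
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=N)
y -= y.mean()

# Joint OLS from individual-level data: (X'X) beta_J = X'y.
beta_joint_full = np.linalg.solve(X.T @ X, X.T @ y)

# Marginal effects and the diagonal D = diag(x_j' x_j).
D = np.diag(np.einsum("ij,ij->j", X, X))
beta_marginal = (X.T @ y) / np.diag(D)

# Summary-level route: beta_J = (X'X)^{-1} D beta_marginal.
beta_joint_summary = np.linalg.solve(X.T @ X, D @ beta_marginal)

assert np.allclose(beta_joint_full, beta_joint_summary)
```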
3. From Individual Data to Summary Statistics¶
3.1 OLS expressions¶
From standard OLS theory:

\[
\hat{\beta}_j = \frac{x_j' y}{x_j' x_j},
\qquad
\text{SE}(\hat{\beta}_j)^2 = \frac{\sigma_\epsilon^2}{x_j' x_j}.
\]

Since \(x_j' x_j = N_j \cdot \text{Var}(x_j)\) when \(x_j\) is centred, and writing \(\sigma_\epsilon^2 \approx V_p\) (the residual variance is approximated by the phenotype variance for small per-SNP effects):

\[
\text{SE}(\hat{\beta}_j)^2 \approx \frac{V_p}{N_j \cdot \text{Var}(x_j)}.
\]
3.2 The \(X'X\) matrix and LD¶
The off-diagonal entries of \(X'X\) relate to LD:

\[
(X'X)_{jk} = N\, r_{jk} \sqrt{\text{Var}(x_j)\, \text{Var}(x_k)},
\]

where \(r_{jk}\) is the Pearson correlation between genotype vectors \(x_j\) and \(x_k\). When SNPs \(j\) and \(k\) have different sample sizes (as in meta-analysis), \(N\) is replaced by \(\min(N_j, N_k)\).
4. Phenotype Variance Estimation — Eq. (8)¶
4.1 Derivation¶
Rearranging the SE expression from Section 3.1:

\[
V_p \approx N_j \cdot \text{Var}(x_j) \cdot \text{SE}(\hat{\beta}_j)^2.
\]

However, this ignores the variance explained by SNP \(j\) itself. A more accurate decomposition (Yang et al. 2012, Eq. 8) accounts for the SNP's own contribution:

\[
\hat{V}_p^{(j)} = \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \frac{N_j}{N_j - 1}\, \hat{\beta}_j^2 \right].
\]

Each SNP yields an independent estimate of \(V_p\). The median across all SNPs provides a robust estimator:

\[
\hat{V}_p = \operatorname{median}_j\, \hat{V}_p^{(j)}.
\]
4.2 Why median?¶
The median is preferred over the mean because a small number of SNPs with large effects can produce extreme per-SNP \(V_p\) estimates. The median is resistant to such outliers.
Code reference
_cal_pheno_var() — cojopy.py:661-667
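A minimal sketch of the Eq. (8) estimator, assuming vectorised summary statistics. The function name `estimate_pheno_var` is illustrative; cojopy's `_cal_pheno_var` may differ in details.

```python
import numpy as np

def estimate_pheno_var(beta, se, freq, n):
    """Median of per-SNP phenotype-variance estimates (Eq. 8).

    beta, se, freq, n: 1-D arrays of equal length with the marginal
    effect, its SE, the effect-allele frequency, and the sample size.
    """
    var_x = 2.0 * freq * (1.0 - freq)  # genotype variance under HWE
    vp_per_snp = var_x * (n * se**2 + beta**2 * n / (n - 1.0))
    return np.median(vp_per_snp)
```

For example, if every SNP's SE is exactly \(\sqrt{V_p / (N \cdot \text{Var}(x_j))}\) with \(V_p = 1\) and \(\hat{\beta}_j = 0\), the estimator returns 1 for each SNP and hence a median of 1.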
5. Effective Sample Size — Eq. (13)¶
5.1 Motivation¶
In meta-analysis, each SNP may be measured in a different subset of studies, so \(N_j\) varies. Even when \(N\) is constant, the "effective" sample size for reconstructing \((X'X)_{jj}\) must account for the variance explained by the SNP itself.
5.2 Derivation¶
Starting from the per-SNP variance decomposition (Eq. 8):

\[
\hat{V}_p = \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \frac{N_j}{N_j - 1}\, \hat{\beta}_j^2 \right].
\]

For \(N_j \gg 1\), \(\frac{N_j}{N_j-1} \approx 1\), so approximately:

\[
\hat{V}_p \approx \text{Var}(x_j) \left[ N_j \cdot \text{SE}(\hat{\beta}_j)^2 + \hat{\beta}_j^2 \right].
\]

Solving for \(N_j\):

\[
N_j = \frac{\hat{V}_p / \text{Var}(x_j) - \hat{\beta}_j^2}{\text{SE}(\hat{\beta}_j)^2}.
\]
Given the global estimate \(\hat{V}_p\) from Section 4, this formula yields the effective sample size for each SNP.
Code reference
_cal_effective_n() — cojopy.py:669-680
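The inversion above is a one-liner; the sketch below (illustrative name `effective_n`, not necessarily cojopy's exact `_cal_effective_n` signature) makes the dependence on \(\hat{V}_p\) explicit.

```python
import numpy as np

def effective_n(beta, se, freq, vp_hat):
    """Effective per-SNP sample size from the large-N form of Eq. (8)."""
    var_x = 2.0 * freq * (1.0 - freq)  # genotype variance under HWE
    return (vp_hat / var_x - beta**2) / se**2
```

By construction, feeding in an SE generated as \(\sqrt{V_p / (N \cdot \text{Var}(x_j))}\) with \(\hat{\beta}_j = 0\) recovers \(N\) exactly.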
6. Reconstructing \(X'X\) from Summary Statistics¶
With \(\hat{V}_p\) and per-SNP effective \(N_j\) in hand, we can reconstruct the \(X'X\) matrix entirely from summary statistics and an external LD reference.
6.1 Genotype variance under HWE¶
Under Hardy–Weinberg equilibrium, the variance of the 0/1/2 dosage is:

\[
\text{Var}(x_j) = 2 f_j (1 - f_j).
\]
6.2 Off-diagonal entries¶

\[
(X'X)_{jk} = \min(N_j, N_k)\, r_{jk} \sqrt{\text{Var}(x_j)\, \text{Var}(x_k)}.
\]

The \(\min\) accounts for different sample sizes: only the overlapping samples contribute to the cross-product.
6.3 Diagonal entries¶

\[
(X'X)_{jj} = N_j \cdot \text{Var}(x_j).
\]
6.4 The \(D\) matrix¶
The \(D\) matrix is simply the diagonal of \(X'X\):

\[
D = \text{diag}\left( (X'X)_{11}, \ldots, (X'X)_{pp} \right).
\]
Code reference
_calculate_joint_stats() — cojopy.py:458-469
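Putting Sections 6.1–6.3 together, the reconstruction can be sketched as follows (illustrative helper, not cojopy's internal layout):

```python
import numpy as np

def reconstruct_xtx(freq, n_eff, ld):
    """Rebuild X'X from allele frequencies, effective sample sizes, and LD.

    freq, n_eff: 1-D arrays of length p; ld: p x p correlation matrix.
    """
    var_x = 2.0 * freq * (1.0 - freq)       # Var(x_j) under HWE
    sd = np.sqrt(var_x)
    n_min = np.minimum.outer(n_eff, n_eff)  # min(N_j, N_k) for each pair
    # Off-diagonals: min(N_j, N_k) * r_jk * sqrt(Var(x_j) Var(x_k)).
    # Diagonal falls out automatically: r_jj = 1 gives N_j * Var(x_j) = D_jj.
    return n_min * ld * np.outer(sd, sd)
```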
7. Joint Analysis — Eq. (12)¶
7.1 Joint effect estimates¶
From Section 2.3, the joint effects are:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta},
\]
where \(\hat{\beta}\) is the vector of marginal effects, \(D\) is the diagonal matrix from Section 6.4, and \(X'X\) is the reconstructed matrix from Section 6.
7.2 Derivation¶
Starting from the normal equations:

\[
(X'X)\, \hat{\beta}_J = X' y.
\]

We showed in Section 2.3 that \(D\hat{\beta} \approx X'y\). Substituting:

\[
\hat{\beta}_J = (X'X)^{-1} D \hat{\beta}.
\]
7.3 Joint standard errors¶
The variance-covariance matrix of \(\hat{\beta}_J\) is:

\[
\text{Var}(\hat{\beta}_J) = \sigma_\epsilon^2 (X'X)^{-1} \approx V_p\, (X'X)^{-1}.
\]

Therefore:

\[
\text{SE}(\hat{\beta}_{J,j}) = \sqrt{V_p \left[ (X'X)^{-1} \right]_{jj}}.
\]
7.4 Joint P-values¶

\[
p_{J,j} = 2\, \Phi\!\left( -\left| \frac{\hat{\beta}_{J,j}}{\text{SE}(\hat{\beta}_{J,j})} \right| \right),
\]

where \(\Phi\) is the standard normal CDF.
Code reference
_calculate_joint_stats() — cojopy.py:471-509
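Sections 7.1–7.4 combine into a few lines of linear algebra. This sketch (illustrative function, not cojopy's `_calculate_joint_stats` signature) assumes `xtx` has already been reconstructed as in Section 6.

```python
import numpy as np
from scipy import stats

def joint_stats(beta_marginal, xtx, vp_hat):
    """Joint effects, SEs, and P-values from marginal betas and X'X."""
    D = np.diag(np.diag(xtx))               # diagonal of X'X as a matrix
    xtx_inv = np.linalg.inv(xtx)
    beta_joint = xtx_inv @ D @ beta_marginal
    se_joint = np.sqrt(vp_hat * np.diag(xtx_inv))
    z = beta_joint / se_joint
    p_joint = 2.0 * stats.norm.sf(np.abs(z))
    return beta_joint, se_joint, p_joint
```

With a diagonal `xtx` (no LD), the joint effects reduce to the marginal ones, as the formula requires.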
8. Conditional Analysis¶
8.1 Setup¶
Partition the SNPs into a selected (conditioning) set \(S\) and the remaining set \(S^c\). We want the effect of each SNP \(i \in S^c\) conditional on all SNPs in \(S\).
Define the block structure of \(X'X\) over \(S \cup \{i\}\):

\[
X'X =
\begin{pmatrix}
B_1 & C_i' \\
C_i & B_{2,ii}
\end{pmatrix},
\]
where:
- \(B_1 = (X'X)_{S,S}\) — the \(|S| \times |S|\) sub-matrix among selected SNPs
- \(C_i = (X'X)_{i,S}\) — the row vector linking SNP \(i\) to the selected set
- \(B_{2,ii} = (X'X)_{ii} = D_{ii}\) — the diagonal entry for SNP \(i\)
8.2 Conditional effect¶
For a single SNP \(i \in S^c\), the conditional effect given \(S\) is:

\[
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1}\, C_i\, B_1^{-1} D_S\, \hat{\beta}_S,
\]
where:
- \(\hat{\beta}_i\) is the marginal effect of SNP \(i\)
- \(\hat{\beta}_S\) is the vector of marginal effects for SNPs in \(S\)
- \(D_S = \text{diag}(N_j \cdot \text{Var}(x_j))_{j \in S}\) is the diagonal sub-matrix for the selected set
- \(D_{ii} = N_i \cdot \text{Var}(x_i)\)
8.3 Derivation sketch¶
From the partitioned normal equations, the full system is:

\[
\begin{pmatrix}
B_1 & C_i' \\
C_i & D_{ii}
\end{pmatrix}
\begin{pmatrix}
\hat{\beta}_{S \mid i} \\
\hat{\beta}_{i \mid S}
\end{pmatrix}
=
\begin{pmatrix}
X_S' y \\
x_i' y
\end{pmatrix}
=
\begin{pmatrix}
D_S \hat{\beta}_S \\
D_{ii} \hat{\beta}_i
\end{pmatrix}.
\]

From the second row:

\[
C_i \hat{\beta}_{S \mid i} + D_{ii} \hat{\beta}_{i \mid S} = D_{ii} \hat{\beta}_i
\quad\Longrightarrow\quad
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1} C_i \hat{\beta}_{S \mid i}.
\]

From the first row, \(\hat{\beta}_{S \mid i} \approx B_1^{-1} D_S \hat{\beta}_S\) (treating the contribution of the single SNP \(i\) as negligible). Substituting:

\[
\hat{\beta}_{i \mid S} = \hat{\beta}_i - D_{ii}^{-1}\, C_i\, B_1^{-1} D_S\, \hat{\beta}_S.
\]
8.4 Conditional standard error¶

\[
\text{SE}(\hat{\beta}_{i \mid S}) \approx \sqrt{\frac{V_p}{D_{ii}}}.
\]

This follows because, treating the selected SNPs as fixed covariates, the residual variance after conditioning is approximated by \(V_p\), so the sampling variance of \(\hat{\beta}_{i \mid S}\) is approximately \(V_p / D_{ii}\).
8.5 Conditional P-value¶

\[
p_{i \mid S} = 2\, \Phi\!\left( -\left| \frac{\hat{\beta}_{i \mid S}}{\text{SE}(\hat{\beta}_{i \mid S})} \right| \right).
\]
Code reference
_calculate_conditional_stats() — cojopy.py:329-433
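The Section 8.2 formula can be sketched directly from the blocks of `xtx` (illustrative helper; cojopy's `_calculate_conditional_stats` handles all remaining SNPs at once):

```python
import numpy as np

def conditional_beta(i, selected, beta_marginal, xtx):
    """Effect of SNP i conditional on the selected set (Section 8.2)."""
    B1 = xtx[np.ix_(selected, selected)]     # (X'X) among selected SNPs
    Ci = xtx[i, selected]                    # row linking SNP i to S
    D_S = np.diag(np.diag(B1))               # D entries for selected SNPs
    D_ii = xtx[i, i]
    # beta_{S|i} ~= B1^{-1} D_S beta_S (first-row approximation).
    beta_S_given = np.linalg.solve(B1, D_S @ beta_marginal[selected])
    return beta_marginal[i] - (Ci @ beta_S_given) / D_ii
```

When SNP \(i\) is in complete linkage equilibrium with the selected set (\(C_i = 0\)), the conditional effect equals the marginal one, as expected.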
9. Stepwise Selection Algorithm¶
The stepwise model selection procedure combines conditional analysis (forward step) and joint analysis (backward elimination) in an iterative loop.
9.1 Forward step¶
- Initialise: select the SNP with the smallest marginal P-value if \(p < p_\text{cutoff}\) (default \(5 \times 10^{-8}\)).
- Iterate: compute conditional statistics for all remaining SNPs given the current selected set \(S\); identify the SNP with the smallest conditional P-value.
- Collinearity check: before adding a candidate SNP, verify \(r^2 < r^2_\text{cutoff}\) (default 0.9) with every SNP already in \(S\). If the check fails, exclude the candidate and continue.
- Add: if the conditional P-value passes the threshold and collinearity check, add the SNP to \(S\).
9.2 Backward elimination¶
After each forward addition, perform joint analysis on all SNPs in \(S\). Remove any SNP whose joint P-value exceeds \(p_\text{cutoff}\). Record removed SNPs to prevent re-selection in subsequent iterations (avoids infinite loops).
9.3 Termination¶
The algorithm stops when:
- No remaining SNP has a conditional P-value below \(p_\text{cutoff}\), or
- The candidate SNP was previously removed by backward elimination, or
- No available SNPs remain.
9.4 Final output¶
After convergence, a final joint analysis is run on the selected set to produce the reported joint effect estimates, SEs, and P-values.
Code reference
conditional_selection() — cojopy.py:156-327
The forward step is the while continue_selection loop; backward elimination
is handled by _backward_elimination() at cojopy.py:627-653.
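The forward/backward loop of Sections 9.1–9.3 can be summarised in a skeleton. This is a simplified sketch, not cojopy's `conditional_selection`: the `cond_p_fn`, `joint_p_fn`, and `r2_fn` callables stand in for the conditional statistics (Section 8), joint statistics (Section 7), and LD lookups, and collinear candidates are recorded as removed rather than retried.

```python
def stepwise_select(p_marginal, cond_p_fn, joint_p_fn, r2_fn,
                    p_cutoff=5e-8, r2_cutoff=0.9):
    """Skeleton of COJO stepwise selection (forward + backward)."""
    selected, removed = [], set()
    # Initialise with the smallest marginal P-value, if genome-wide significant.
    best = min(range(len(p_marginal)), key=lambda j: p_marginal[j])
    if p_marginal[best] >= p_cutoff:
        return selected
    selected.append(best)
    while True:
        candidates = [j for j in range(len(p_marginal))
                      if j not in selected and j not in removed]
        if not candidates:
            break
        # Forward step: best conditional P-value given the current set.
        cond_p = {j: cond_p_fn(j, selected) for j in candidates}
        j_best = min(cond_p, key=cond_p.get)
        if cond_p[j_best] >= p_cutoff:
            break
        if any(r2_fn(j_best, k) >= r2_cutoff for k in selected):
            removed.add(j_best)              # collinear with the model: skip
            continue
        selected.append(j_best)
        # Backward elimination on the enlarged model.
        for k, p in joint_p_fn(selected).items():
            if p > p_cutoff:
                selected.remove(k)
                removed.add(k)               # never re-select (avoids loops)
    return selected
```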
10. Implementation Notes¶
10.1 Log-space P-value computation¶
For highly significant associations, P-values can underflow to zero in 64-bit floating point. cojopy computes P-values in log space to avoid this:

\[
\log p = \log 2 + \log \Phi(-|z|),
\qquad
z = \hat{\beta} / \text{SE}(\hat{\beta}),
\]

and then exponentiates, clamping to the smallest representable positive float
(np.finfo(np.float64).tiny).
Code reference
_calculate_p_value() — cojopy.py:655-659
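A minimal sketch of this idea, assuming SciPy's log-survival function (the name `p_from_z` is illustrative, not cojopy's `_calculate_p_value` signature):

```python
import numpy as np
from scipy import stats

def p_from_z(z):
    """P-value computed in log space, clamped to the smallest positive float."""
    log_p = np.log(2.0) + stats.norm.logsf(np.abs(z))
    return np.maximum(np.exp(log_p), np.finfo(np.float64).tiny)
```

For example, `2.0 * stats.norm.sf(100.0)` underflows to exactly zero, while `p_from_z(100.0)` stays strictly positive.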
10.2 Singular matrix handling¶
When the selected SNP set includes near-collinear variants, \(X'X\) or \(B_1\)
may be numerically singular. In such cases, the code falls back to the
Moore-Penrose pseudo-inverse (np.linalg.pinv) and logs a warning.
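The fallback pattern is simple; a sketch (illustrative helper name, minus the logging):

```python
import numpy as np

def safe_inverse(a):
    """Invert a matrix, falling back to the pseudo-inverse when singular."""
    try:
        return np.linalg.inv(a)
    except np.linalg.LinAlgError:
        # Near-collinear selected SNPs make X'X (or B_1) singular;
        # the Moore-Penrose pseudo-inverse gives a least-squares solution.
        return np.linalg.pinv(a)
```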
10.3 Negative variance¶
If \((X'X)^{-1}_{jj} < 0\) (which can occur with ill-conditioned matrices), the joint SE becomes undefined. The code sets SE = \(\infty\) and P = 1.0 for such SNPs, effectively excluding them from interpretation.
11. Reference¶
Yang J, Ferreira T, Morris AP, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nature Genetics. 2012;44(4):369-375. doi:10.1038/ng.2213