Meta-Analysis¶
The credtools meta command performs meta-analysis of summary statistics and LD matrices across multiple ancestries or studies. This step combines evidence from different populations to improve fine-mapping resolution and power.
Overview¶
Meta-analysis in credtools integrates genetic evidence across populations while accounting for different LD structures and effect sizes. The meta-analysis process:
- Combines summary statistics using inverse-variance weighted fixed-effects meta-analysis
- Merges LD matrices using sample-size weighted averaging
- Computes heterogeneity metrics before combining data
- Creates unified datasets for downstream fine-mapping
- Supports flexible strategies — combine all, by population, or keep separate
- Preserves ancestry-specific information when needed
Why meta-analyze?
Multi-ancestry fine-mapping leverages diverse LD patterns across populations to narrow credible sets. Even when effect sizes are similar, different LD structures help distinguish causal variants from their tagging neighbors.
Quick Start¶
Try It with Test Data¶
credtools meta exampledata/test_meta/loci_list.txt /tmp/meta_output/ \
--meta-method meta_all \
--threads 2
Input Format¶
The input is a tab-delimited loci_list.txt file produced by the credtools chunk step. It must contain the following columns:
| Column | Type | Description |
|---|---|---|
| locus_id | str | Locus identifier (e.g., chr1_1000_3000) |
| prefix | str | File path prefix for sumstats/LD files |
| popu | str | Population code (e.g., EUR, AFR, EAS) |
| cohort | str | Cohort or study name |
| sample_size | int | Sample size for this cohort |
| chr | int | Chromosome number |
| start | int | Locus start position |
| end | int | Locus end position |
Each locus_id can have multiple rows representing different cohorts/populations. Rows with the same locus_id must share the same chr, start, and end values.
Tip
The input file is typically chunked/loci_list.txt generated by credtools chunk. You do not need to create it manually.
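The boundary-consistency requirement above can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of credtools, and the file paths and cohort names in the inline example are hypothetical:

```python
import csv
import io

# Hypothetical loci_list.txt content illustrating the expected layout:
# two cohorts share locus chr1_1000_3000 with identical boundaries.
LOCI_TSV = """\
locus_id\tprefix\tpopu\tcohort\tsample_size\tchr\tstart\tend
chr1_1000_3000\tchunked/chr1_1000_3000.EUR.UKB\tEUR\tUKB\t350000\t1\t1000\t3000
chr1_1000_3000\tchunked/chr1_1000_3000.AFR.MVP\tAFR\tMVP\t120000\t1\t1000\t3000
"""

def check_locus_boundaries(tsv_text):
    """Verify that rows sharing a locus_id agree on chr, start, and end."""
    seen = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row["chr"], row["start"], row["end"])
        prev = seen.setdefault(row["locus_id"], key)
        if prev != key:
            raise ValueError(f"Inconsistent boundaries for {row['locus_id']}")
    return sorted(seen)

print(check_locus_boundaries(LOCI_TSV))  # ['chr1_1000_3000']
```

Running this on a real `chunked/loci_list.txt` (read with `open(...).read()`) surfaces the same boundary mismatches that would otherwise fail later in the meta step.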
Command Reference¶
Arguments:
| Argument | Description |
|---|---|
| INPUTS | Path to loci list file (tab-delimited) |
| OUTDIR | Output directory for meta-analysis results |
Options:
| Option | Short | Description | Default |
|---|---|---|---|
| --meta-method | -m | Meta-analysis method (meta_all, meta_by_population, no_meta) | meta_all |
| --threads | -t | Number of parallel threads | 1 |
| --calculate-lambda-s | -cls | Calculate lambda_s parameter using estimate_s_rss | False |
| --log-file | -l | Write log output to a file | None |
Meta-Analysis Methods¶
meta_all (default)¶
Combines all ancestries into a single meta-analyzed dataset per locus.
- Maximizes statistical power by pooling all available data
- Summary statistics combined via inverse-variance weighting (IVW)
- LD matrices combined via sample-size weighted averaging
- Population and cohort labels joined with + (e.g., AFR+EUR)
- Best for traits with consistent effects across ancestries
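The two combining rules used by meta_all can be sketched as follows. This is an illustration of the standard formulas, not the credtools implementation; variant alignment and allele harmonization across cohorts are omitted:

```python
import math

def ivw_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis for one SNP."""
    weights = [1.0 / se**2 for se in ses]          # w_i = 1 / SE_i^2
    beta_meta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se_meta = math.sqrt(1.0 / sum(weights))
    return beta_meta, se_meta

def weighted_ld(ld_matrices, sample_sizes):
    """Sample-size weighted average of aligned LD matrices (nested lists)."""
    total_n = sum(sample_sizes)
    dim = len(ld_matrices[0])
    return [
        [sum(n * r[i][j] for n, r in zip(sample_sizes, ld_matrices)) / total_n
         for j in range(dim)]
        for i in range(dim)
    ]

# Two cohorts, one SNP: beta/SE pairs are combined by precision.
beta, se = ivw_meta([0.10, 0.14], [0.02, 0.03])    # beta ≈ 0.1123, se ≈ 0.0166

# Two 2x2 LD matrices weighted by sample size (300 vs 100).
ld = weighted_ld([[[1.0, 0.8], [0.8, 1.0]],
                  [[1.0, 0.4], [0.4, 1.0]]], [300, 100])  # off-diagonal → 0.7
```

Note that the larger cohort dominates both the effect estimate and the merged LD matrix, which is why meta_all is best suited to homogeneous effects.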
meta_by_population¶
Performs meta-analysis within each population separately.
- Groups input loci by population code
- Multi-cohort populations are meta-analyzed (same as meta_all within the group)
- Single-cohort populations are intersected without meta-analysis
- Preserves population-specific LD patterns
- Suitable when effect sizes differ between populations
no_meta¶
Processes each cohort independently; no combining is performed.
- Each input locus is intersected (sumstats ∩ LD) but not combined
- Preserves all population- and cohort-specific information
- Required for multi-ancestry fine-mapping tools (e.g., SuSiEx, MuSuSiE)
- Useful for comparing results across populations
Output Format¶
Output Files¶
Each locus produces a directory containing meta-analyzed data and heterogeneity assessments:
meta_output/
├── loci_info.txt # Updated loci info for downstream steps
├── heterogeneity.txt.gz # Global heterogeneity summary
├── chr1_1000_3000/ # Per-locus directory
│ ├── AFR+EUR_meta2cohorts_a1b2c3d4.sumstats.gz # Meta-analyzed summary statistics
│ ├── AFR+EUR_meta2cohorts_a1b2c3d4.ld.npz # Meta-analyzed LD matrix (float16)
│ ├── AFR+EUR_meta2cohorts_a1b2c3d4.ldmap.gz # LD variant map
│ ├── heterogeneity.txt.gz # Per-locus heterogeneity summary
│ ├── ld_4th_moment.txt.gz # LD 4th moment metric
│ ├── ld_decay.txt.gz # LD decay analysis
│ ├── cochran_q.txt.gz # Cochran's Q (multi-cohort only)
│ └── snp_missingness.txt.gz # SNP missingness (multi-cohort only)
└── chr2_5000_8000/
└── ...
The file prefix follows the pattern {popu}_{cohort} for single cohorts, or {popu}_meta{N}cohorts_{hash} for meta-analyzed results.
Output Columns (sumstats.gz)¶
Meta-analyzed summary statistics contain the standard credtools schema:
| Column | Type | Description |
|---|---|---|
| SNPID | str | Unique variant identifier (chr-bp-allele1-allele2) |
| CHR | int8 | Chromosome |
| BP | int32 | Base pair position |
| EA | str | Effect allele |
| NEA | str | Non-effect allele |
| EAF | float32 | Sample-size weighted effect allele frequency |
| BETA | float32 | IVW meta-analysis effect size |
| SE | float32 | Meta-analysis standard error |
| P | float64 | Meta-analysis p-value |
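The P column follows from the IVW BETA and SE via the z-score under a normal approximation. A sketch of that conversion, assuming the standard two-sided formula rather than anything taken from the credtools source:

```python
import math

def meta_p_value(beta, se):
    """Two-sided p-value from z = beta/se under the normal approximation."""
    z = beta / se
    # 2 * P(Z > |z|) for standard normal Z, computed via the
    # complementary error function: erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

meta_p_value(0.0, 1.0)    # z = 0   → p = 1.0
meta_p_value(1.96, 1.0)   # z = 1.96 → p ≈ 0.05
```

Storing P as float64 (while BETA/SE are float32) preserves the very small p-values that strong meta-analyzed signals produce.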
Heterogeneity Analysis¶
Before meta-analysis combines data, credtools automatically computes heterogeneity metrics to assess consistency across input cohorts. This helps identify loci where meta-analysis may be inappropriate.
When is heterogeneity computed?
Heterogeneity is always computed on the original per-cohort data before any combining. The results are saved even when using no_meta mode.
Per-Locus Heterogeneity Files¶
| File | Description | When produced |
|---|---|---|
| ld_4th_moment.txt.gz | 4th moment of LD matrix entries per cohort; indicates LD distribution shape | Always |
| ld_decay.txt.gz | LD decay rate per cohort; how LD decreases with genomic distance | Always |
| cochran_q.txt.gz | Cochran's Q test for effect size heterogeneity across cohorts | Multi-cohort only |
| snp_missingness.txt.gz | SNP presence/absence matrix across cohorts | Multi-cohort only |
Heterogeneity Summary Table¶
The heterogeneity.txt.gz file (both per-locus and global) contains one row per cohort:
| Column | Type | Description |
|---|---|---|
| locus_id | str | Locus identifier |
| popu | str | Population code (e.g., EUR, AFR) |
| cohort | str | Cohort name (e.g., UKB, MVP) |
| ld_4th_moment_mean | float | Mean 4th moment of LD matrix for this cohort |
| ld_decay_rate | float | Exponential decay rate of LD with distance |
| missing_rate | float | Fraction of SNPs missing relative to the union |
| cochran_q_median | float | Median Cochran's Q statistic across SNPs |
| i_squared_median | float | Median I² heterogeneity index across SNPs |
| n_het_snps | int | Number of SNPs with significant heterogeneity (Q p-value < 0.05) |
Interpreting heterogeneity
- High ld_decay_rate differences between cohorts suggest divergent LD structures
- Large n_het_snps indicates many variants with inconsistent effect sizes
- High missing_rate for a cohort means poor variant overlap; consider filtering or using no_meta
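The per-SNP statistics behind cochran_q_median and i_squared_median follow the textbook definitions. A minimal sketch (standard formulas, not the credtools code):

```python
def cochran_q_i2(betas, ses):
    """Cochran's Q and I² for one SNP across cohorts.

    Q = sum_i w_i * (beta_i - beta_fixed)^2 with w_i = 1/SE_i^2;
    I² = max(0, (Q - df) / Q) with df = n_cohorts - 1.
    """
    weights = [1.0 / se**2 for se in ses]
    beta_fixed = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Two cohorts with clearly different effects for the same SNP:
q, i2 = cochran_q_i2([0.10, 0.30], [0.05, 0.05])  # Q = 8.0, I² = 0.875
```

With df = 1, Q = 8.0 corresponds to a chi-squared p-value well below 0.05, so this SNP would count toward n_het_snps.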
Choosing the Right Method¶
| Scenario | Recommended Method |
|---|---|
| Effect sizes consistent across populations | meta_all |
| Maximum statistical power needed | meta_all |
| Multiple studies per ancestry, some cross-ancestry heterogeneity | meta_by_population |
| Studies within ancestry are more homogeneous than across | meta_by_population |
| Effect sizes differ substantially between populations | no_meta |
| Using multi-ancestry tools (SuSiEx, MuSuSiE) | no_meta |
| Comparing ancestry-specific signals | no_meta |
Integration with Workflow¶
Meta-analysis fits into the credtools pipeline after chunking:
graph LR
A[Munged GWAS files] -->|credtools chunk| B[Per-locus files + LD matrices]
B -->|credtools meta| C[Meta-analyzed loci + heterogeneity]
C -->|credtools qc| D[QC'd loci]
D -->|credtools finemap| E[Credible sets]
# Step 1: Identify loci, chunk data, and extract LD matrices
credtools chunk munged/sumstat_info_updated.txt chunked/
# Step 2: Perform meta-analysis
credtools meta chunked/loci_list.txt meta/ --meta-method meta_all
# Step 3: Run quality control (optional)
credtools qc meta/loci_info.txt qc_results/
# Step 4: Perform fine-mapping
credtools finemap meta/loci_info.txt finemap_results/
Troubleshooting¶
Memory issues with large datasets
Reduce the number of threads or process subsets of loci separately. Large LD matrices can consume significant RAM: each N×N float32 matrix uses ~4N² bytes, so a 10,000-variant locus needs roughly 400 MB.
Inconsistent ancestry labels
Ensure ancestry identifiers match exactly between summary statistics and LD matrices. Labels are case-sensitive (EUR ≠ eur).
Missing input files
Check that all required files from the credtools chunk step are present. Each row in the loci list requires {prefix}.sumstats.gz, {prefix}.ld.npz, and {prefix}.ldmap.gz.
ValueError: All input loci must have the same start/end position
This occurs when loci grouped by locus_id have different boundary coordinates. Ensure the start and end columns are consistent within each locus_id group.
Some loci produce warnings about curve fitting
The LD decay analysis uses exponential curve fitting, which may warn when the data does not fit well. This is informational and does not affect meta-analysis results.
Best Practices¶
Recommendations
- Review heterogeneity first: check heterogeneity.txt.gz to identify problematic loci before proceeding to fine-mapping
- Start with meta_all for maximum power, then compare with meta_by_population if heterogeneity is high
- Use --log-file for a detailed audit trail of all processing steps
- Match thread count to cores: generally use 1 thread per CPU core, reduce for memory-intensive analyses
- Use fast storage — place the output directory on SSD when possible for large studies
- Keep intermediate results — meta-analysis outputs are needed for reproducibility
- Use consistent identifiers — maintain consistent locus and ancestry naming throughout your workflow