Summary Statistics Munging
The credtools munge command standardizes GWAS summary statistics from various formats into a consistent format suitable for fine-mapping analysis. This preprocessing step ensures all input data follows a uniform schema before entering the fine-mapping pipeline.
Overview
Different GWAS studies use different column names, data types, and conventions for their summary statistics. The munging process handles these inconsistencies automatically:
- Standardizes column names to the credtools schema (CHR, BP, EA, NEA, BETA, SE, P, etc.)
- Validates and cleans data — removes invalid variants (e.g., non-ACGT alleles, out-of-range p-values, zero standard errors)
- Generates unique SNP identifiers in `chr-bp-allele1-allele2` format
- Supports multiple input formats, including tab-, comma-, and space-delimited files and gzip-compressed files
- Handles multi-ancestry datasets with consistent processing across populations
Why munge first?
Fine-mapping tools are sensitive to data quality issues. Running credtools munge ensures your summary statistics are clean, consistently formatted, and free of common problems (duplicate variants, invalid alleles, numerical artifacts) before analysis.
Quick Start
Try It with Test Data
credtools ships with example data so you can verify your installation immediately:
credtools munge \
"exampledata/test_mock_data/EUR_all_loci.sumstats,exampledata/test_mock_data/AFR_all_loci.sumstats,exampledata/test_mock_data/EAS_all_loci.sumstats" \
/tmp/munge_output/ \
--force
Expected output:
Munging summary statistics...
Loaded 3 input file(s) from direct paths
Munging files: 100%|██████████| 3/3 [00:02<00:00, 1.06it/s]
Successfully munged 3 files
✓ EUR_all_loci: 10000 variants -> /tmp/munge_output/EUR_all_loci.munged.txt.gz
✓ AFR_all_loci: 10000 variants -> /tmp/munge_output/AFR_all_loci.munged.txt.gz
✓ EAS_all_loci: 10000 variants -> /tmp/munge_output/EAS_all_loci.munged.txt.gz
Input Formats
Option 1: Direct File Paths
Pass one or more file paths directly. Multiple files are separated by commas:
# Single file
credtools munge /path/to/gwas.txt output_dir/
# Multiple files (comma-separated, no spaces)
credtools munge "/path/to/eur.txt,/path/to/afr.txt,/path/to/eas.txt" output_dir/
The file stem (filename without extension) is used as the identifier for each output file.
Option 2: Population Configuration File
For multi-ancestry studies, a tab-delimited configuration file provides richer metadata. The file must contain four columns:
| Column | Description |
|---|---|
| `popu` | Population/ancestry label (e.g., EUR, AFR, EAS) |
| `cohort` | Cohort or study name |
| `sample_size` | Sample size for this population |
| `path` | Path to the summary statistics file |
Example population_config.txt:
popu cohort sample_size path
EUR UKBB 400000 /data/eur_gwas.txt
AFR AAGC 50000 /data/afr_gwas.txt
EAS BBJ 180000 /data/eas_gwas.txt
Tip
Using a population config file is recommended for multi-ancestry fine-mapping, as the metadata (population labels, sample sizes) is carried forward to downstream steps.
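As a rough illustration of this layout, the sketch below parses a population config with Python's standard library and checks that the four required columns are present. The helper name `read_population_config` is hypothetical and this is not credtools' own loader; it only mirrors the format described above.

```python
import csv
import io

REQUIRED_COLUMNS = ("popu", "cohort", "sample_size", "path")


def read_population_config(handle):
    """Parse a tab-delimited population config into a list of dicts.

    Hypothetical helper for illustration; not part of the credtools API.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"config is missing columns: {missing}")
    rows = []
    for row in reader:
        # Sample sizes are whole numbers; fail early if one is not.
        row["sample_size"] = int(row["sample_size"])
        rows.append(row)
    return rows


# Same layout as the example config above, with explicit tab separators.
example = (
    "popu\tcohort\tsample_size\tpath\n"
    "EUR\tUKBB\t400000\t/data/eur_gwas.txt\n"
    "AFR\tAAGC\t50000\t/data/afr_gwas.txt\n"
)
rows = read_population_config(io.StringIO(example))
for row in rows:
    print(row["popu"], row["cohort"], row["sample_size"], row["path"])
```

A check like this catches the most common mistake in hand-written configs, which is separating fields with spaces instead of tabs: the header then collapses into a single column and the required names are reported as missing.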
Command Reference
Arguments:
| Argument | Description |
|---|---|
| `INPUT_CONFIG` | File path(s) or population config file |
| `OUTPUT_DIR` | Output directory for munged files |
Options:
| Option | Short | Description | Default |
|---|---|---|---|
| `--config` | `-c` | JSON file specifying column name mappings | None |
| `--force` | `-f` | Overwrite existing output files | False |
| `--interactive` | `-i` | Interactively create column mapping configuration | False |
| `--log-file` | `-l` | Write log output to a file | None |
Column Auto-Detection
The munger automatically recognizes common GWAS column name conventions. You do not need a config file if your input uses any of these names:
| Standard Name | Recognized Aliases |
|---|---|
| CHR | CHROM, #CHROM, chromosome, Chromosome |
| BP | POS, Position, position, base_pair_location, pos |
| EA | A1, effect_allele, ALT, Allele1 |
| NEA | A2, other_allele, REF, Allele2 |
| BETA | beta, Beta, effect, Effect |
| SE | StdErr, stderr, standard_error |
| P | PVAL, P_BOLT_LMM, pvalue, P-value, p_value |
| EAF | FRQ, FREQ, frequency, Freq1 |
| RSID | rsid, rs |
| N | n, sample_size, NMISS |
Headers that match none of these aliases exactly fall back to fuzzy substring matching (e.g., "chromosome_id" → CHR, "base_position" → BP).
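The two-stage lookup (exact alias first, substring fallback second) can be sketched as follows. This is a minimal illustration and not credtools' implementation: the alias lists cover only three of the standard names from the table, and the substring hints in `FUZZY_HINTS` are assumptions chosen to reproduce the two examples above.

```python
# Exact aliases for a subset of the table above (CHR, BP, and P only, for brevity).
ALIASES = {
    "CHR": ["CHR", "CHROM", "#CHROM", "chromosome", "Chromosome"],
    "BP": ["BP", "POS", "Position", "position", "base_pair_location", "pos"],
    "P": ["P", "PVAL", "P_BOLT_LMM", "pvalue", "P-value", "p_value"],
}

# Assumed substring hints for the fuzzy pass; credtools' real rules may differ.
FUZZY_HINTS = {"CHR": ["chrom"], "BP": ["position"], "P": ["pval"]}


def detect_column(header):
    """Return the standard name for a header, or None if nothing matches."""
    # Pass 1: exact, case-sensitive alias lookup.
    for standard, names in ALIASES.items():
        if header in names:
            return standard
    # Pass 2: case-insensitive substring match against the hints.
    lowered = header.lower()
    for standard, hints in FUZZY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return standard
    return None


for header in ["#CHROM", "chromosome_id", "base_position", "P_BOLT_LMM", "notes"]:
    print(header, "->", detect_column(header))
```

Running the exact pass first keeps well-known headers from being claimed by a loose substring rule. A header that matches neither pass returns `None`, which is the case a column mapping config is meant to cover.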
When do I need a column mapping config?
Only when your file uses column names that cannot be auto-detected. For most standard GWAS formats (PLINK, BOLT-LMM, SAIGE, REGENIE, etc.), auto-detection works out of the box.
Column Mapping Configuration
For files with non-standard column names, provide a JSON configuration file via --config:
{
"column_mapping": {
"chromosome_name": "CHR",
"genomic_position": "BP",
"tested_allele": "EA",
"reference_allele": "NEA",
"effect_size": "BETA",
"standard_error": "SE",
"p_value": "P"
}
}
Warning
The mapping direction is original_name → standard_name. Make sure the keys match your input file's actual column names exactly.
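To make the mapping direction concrete, the sketch below loads a shortened version of the config and renames a header with it. The function `rename_header` and the extra `info_score` column are hypothetical, and whether credtools itself raises on an unmatched key is not stated here; the check is included because a misspelled key would otherwise leave that column unmapped without any warning.

```python
import json

# Shortened version of the config above; keys are the input file's column names.
config_text = """
{
  "column_mapping": {
    "chromosome_name": "CHR",
    "genomic_position": "BP",
    "p_value": "P"
  }
}
"""


def rename_header(header, mapping):
    """Rename columns using an original_name -> standard_name mapping.

    Raises KeyError if a mapping key is absent from the header.
    Hypothetical helper for illustration; not part of the credtools API.
    """
    unknown = [name for name in mapping if name not in header]
    if unknown:
        raise KeyError(f"mapping keys not found in input header: {unknown}")
    # Columns without a mapping entry keep their original name.
    return [mapping.get(name, name) for name in header]


mapping = json.loads(config_text)["column_mapping"]
header = ["chromosome_name", "genomic_position", "p_value", "info_score"]
renamed = rename_header(header, mapping)
print(renamed)
```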
Interactive Configuration
If you are unsure about column mappings, pass `--interactive` (`-i`) and credtools will guide you through creating them. In this mode, credtools will:
- Examine the headers of your input files
- Suggest mappings based on auto-detection
- Save the configuration to a JSON file for reuse
- Apply the mapping and munge your files
Output Format
Output Files
Each input file produces one gzip-compressed, tab-delimited output file. The filename follows the pattern `{identifier}.munged.txt.gz`, where the identifier comes from:
- Population config: `{popu}_{cohort}`
- Direct file paths: the file stem (e.g., `my_gwas.txt` → `my_gwas`)
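The naming rules can be sketched in a few lines of Python, which is handy for predicting output paths in a workflow script. The three function names are hypothetical. Note also that `Path.stem` strips only the final suffix, so how credtools treats double extensions such as `.txt.gz` is not covered by this sketch.

```python
from pathlib import Path


def identifier_from_path(path):
    """Identifier for direct-path input: the file stem."""
    return Path(path).stem


def identifier_from_config(popu, cohort):
    """Identifier for a population-config row: {popu}_{cohort}."""
    return f"{popu}_{cohort}"


def output_name(identifier):
    """Output filename pattern: {identifier}.munged.txt.gz."""
    return f"{identifier}.munged.txt.gz"


print(output_name(identifier_from_path("/path/to/my_gwas.txt")))
print(output_name(identifier_from_config("EUR", "UKBB")))
```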
Output Columns
All munged files contain exactly 11 columns in this order:
| Column | Type | Description | Nullable |
|---|---|---|---|
| CHR | int8 | Chromosome (1–23, where 23 = X) | No |
| BP | int32 | Base pair position (GRCh37/hg19) | No |
| SNPID | str | Unique identifier (`chr-bp-allele1-allele2`) | No |
| EA | str | Effect allele (uppercase ACGT) | No |
| NEA | str | Non-effect allele (uppercase ACGT) | No |
| EAF | float32 | Effect allele frequency | Yes |
| BETA | float32 | Effect size (log-odds for binary traits) | No |
| SE | float32 | Standard error of BETA | No |
| P | float64 | Association p-value | No |
| N | int32 | Sample size | Yes |
| RSID | str | dbSNP rsID | Yes |
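A quick sanity check on a munged file is to compare its header against this column list. The sketch below assumes the output file starts with a header row naming the columns, which the table implies but does not state outright. The helper name `check_munged_header` and the demo row are made up for illustration, including the allele order inside the demo SNPID.

```python
import gzip
import os
import tempfile

EXPECTED_COLUMNS = [
    "CHR", "BP", "SNPID", "EA", "NEA", "EAF", "BETA", "SE", "P", "N", "RSID",
]


def check_munged_header(path):
    """Return True if the first line of a gzip-compressed, tab-delimited file
    lists exactly the 11 expected columns in order."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
    return header == EXPECTED_COLUMNS


# Demonstration on a tiny synthetic file (made-up values, not real credtools output).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.munged.txt.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("\t".join(EXPECTED_COLUMNS) + "\n")
    handle.write("1\t12345\t1-12345-A-G\tA\tG\t0.25\t0.01\t0.005\t0.0455\t1000\trs1\n")

header_ok = check_munged_header(demo_path)
print(header_ok)
```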
Data Cleaning Rules
During munging, the following quality filters are applied:
| Filter | Rule | Rationale |
|---|---|---|
| Invalid chromosome | CHR not in 1–23 → removed | Only autosomal + X supported |
| Invalid position | BP < 0 or BP > 300 Mb → removed | Outside human chromosome range |
| Invalid alleles | Not matching `^[ACGT]+$` → removed | Only standard DNA bases allowed |
| Identical alleles | EA == NEA → removed | Non-informative variant |
| Zero p-value | P = 0 → removed | Numerical underflow artifact |
| P-value > 1 | P > 1 → removed | Invalid statistical value |
| Zero/negative SE | SE ≤ 0 → removed | Impossible standard error |
| Duplicate variants | Same SNPID → keep smallest P | Retain most significant signal |
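The rules in this table translate fairly directly into code. The following is a minimal pure-Python sketch of them, not credtools' implementation. It assumes each variant is a dict with already-typed values and a precomputed SNPID, that row-level filters run before deduplication, and that negative p-values are dropped along with zeros (the table only mentions P = 0).

```python
import re

ALLELE_PATTERN = re.compile(r"^[ACGT]+$")
MAX_BP = 300_000_000  # 300 Mb upper bound from the table


def passes_filters(v):
    """Apply the row-level rules from the table to one variant (a dict)."""
    return (
        1 <= v["CHR"] <= 23                               # autosomes plus X (23)
        and 0 <= v["BP"] <= MAX_BP                        # position in range
        and ALLELE_PATTERN.match(v["EA"]) is not None     # standard bases only
        and ALLELE_PATTERN.match(v["NEA"]) is not None
        and v["EA"] != v["NEA"]                           # alleles must differ
        and 0 < v["P"] <= 1                               # drops P = 0 and P > 1
        and v["SE"] > 0                                   # drops zero/negative SE
    )


def clean(variants):
    """Filter variants, then keep the smallest P per SNPID."""
    best = {}
    for v in variants:
        if not passes_filters(v):
            continue
        kept = best.get(v["SNPID"])
        if kept is None or v["P"] < kept["P"]:
            best[v["SNPID"]] = v
    return list(best.values())


def make(snpid, chrom, bp, ea, nea, p, se):
    """Build a made-up variant record for the demonstration."""
    return {"SNPID": snpid, "CHR": chrom, "BP": bp, "EA": ea, "NEA": nea, "P": p, "SE": se}


variants = [
    make("1-100-A-G", 1, 100, "A", "G", 0.01, 0.1),
    make("1-100-A-G", 1, 100, "A", "G", 0.001, 0.1),   # duplicate with smaller P
    make("1-200-A-N", 1, 200, "A", "N", 0.5, 0.1),     # invalid allele
    make("24-300-C-T", 24, 300, "C", "T", 0.5, 0.1),   # unsupported chromosome
    make("2-400-C-T", 2, 400, "C", "T", 0.0, 0.1),     # zero p-value
    make("2-500-C-T", 2, 500, "C", "T", 0.5, 0.0),     # zero standard error
]
cleaned = clean(variants)
print(len(cleaned), cleaned[0]["SNPID"], cleaned[0]["P"])
```

Of the six made-up records, only the first SNPID survives, and the copy with the smaller p-value is the one retained.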
Integration with Downstream Steps
Munged files feed directly into the credtools pipeline:
graph LR
A[Raw GWAS files] -->|credtools munge| B[Munged files]
B -->|credtools chunk| C[Locus-specific files + LD matrices]
C -->|credtools finemap| D[Credible sets]
# Step 1: Munge summary statistics
credtools munge population_config.txt munged/
# Step 2: Identify loci, chunk data, and extract LD matrices
credtools chunk munged/sumstat_info_updated.txt chunks/
# Step 3: Run fine-mapping
credtools finemap chunks/loci_list.txt results/
Troubleshooting
My file has non-standard column names and auto-detection fails
Use --interactive mode to create a column mapping configuration, or manually create a JSON config file with the --config option. See Column Mapping Configuration above.
I get 'Missing mandatory columns' after munging
Your input file is missing one or more of: chromosome, position, effect allele, non-effect allele, beta, standard error, or p-value. Check that these columns exist (possibly under different names) and provide a column mapping if needed.
Many variants are removed during munging
This is expected for raw GWAS output. Common causes include:
- Non-biallelic variants or indels with non-ACGT encoding
- Variants on unsupported chromosomes (Y, MT); X is retained as chromosome 23
- Duplicated variant entries
- Variants with missing or invalid statistics
Run with --log-file log.txt to see detailed per-step filtering counts.
Output file already exists and is skipped
By default, credtools munge skips output files that already exist. Pass --force (or -f) to overwrite them.
Memory issues with very large files
For genome-wide files with tens of millions of variants, ensure sufficient RAM (typically 4–8 GB). If memory is limited, consider pre-filtering by chromosome or MAF before munging.
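One way to pre-filter by chromosome without loading the whole file is to stream it row by row. The sketch below is an assumption-laden illustration and not a credtools feature: the helper names are hypothetical, the raw file is assumed to be tab-delimited with a chromosome column named `CHR`, and chromosomes are assumed to be coded as plain numbers ("1", not "chr1").

```python
import csv
import gzip
import io


def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text based on the .gz suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode, newline="")
    return open(path, mode, newline="")


def filter_rows_by_chromosome(src, dst, chromosome, chr_column="CHR", delimiter="\t"):
    """Stream rows from src to dst, keeping only one chromosome.

    Only one row is held in memory at a time. Returns the number of
    data rows written (the header is not counted).
    """
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[chr_column] == str(chromosome):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory demonstration with made-up rows. For real files, wrap the call as:
#   with open_text("gwas.txt.gz") as src, open_text("gwas_chr1.txt.gz", "wt") as dst:
#       filter_rows_by_chromosome(src, dst, 1)
src = io.StringIO("CHR\tBP\tP\n1\t100\t0.5\n2\t200\t0.1\n1\t300\t0.2\n")
dst = io.StringIO()
kept = filter_rows_by_chromosome(src, dst, 1)
print(kept)
print(dst.getvalue(), end="")
```

Each per-chromosome file can then be munged separately, by passing the paths as a comma-separated list.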
Best Practices
Recommendations
- Always keep original files — munging creates new files and never modifies the input
- Use population config files for multi-ancestry studies to preserve metadata
- Save column mapping configs for reproducibility with similar datasets
- Check the validation output — ensure all files show ✓ before proceeding
- Use `--log-file` for a detailed audit trail of all filtering steps