# Loci Identification and Chunking

The `credtools chunk` command identifies independent genetic loci from munged summary statistics and splits the data into locus-specific files ready for fine-mapping. This step defines the genomic regions that will be analyzed independently.
## Overview
After munging your summary statistics, you need to identify independent genetic signals and create focused datasets for fine-mapping. The chunking process:
- Identifies independent loci based on distance and significance thresholds
- Merges overlapping regions across different ancestries when appropriate
- Creates locus-specific files containing only variants within each region
- Generates credtools-compatible loci lists for downstream analysis
- Handles multi-ancestry data with consistent loci across populations
> **Prerequisites:** You must run `credtools munge` before chunking. The `chunk` command expects munged summary statistics files (`.munged.txt.gz`) as input. See the Munge documentation for details.
## Quick Start

### Try It with Test Data
credtools ships with example data so you can verify your installation immediately. A successful run on the bundled three-population config prints output like:

```
Loaded 3 population files from config
Identifying independent loci...
Processing ancestries: 100%|██████████| 3/3
Merging overlapping loci across ancestries
Chunking 5 loci...
Chunking loci: 100%|██████████| 5/5
Successfully processed 5 loci
Generated 15 chunked files
```

> **Tip:** You can also pass comma-separated file paths directly instead of a config file; see Input Formats below.
## Input Formats

### Option 1: Direct File Paths
Pass one or more munged file paths directly. Multiple files are separated by commas:

```bash
# Single file
credtools chunk EUR.munged.txt.gz output_dir/

# Multiple files (comma-separated, no spaces)
credtools chunk "EUR.munged.txt.gz,AFR.munged.txt.gz,EAS.munged.txt.gz" output_dir/
```

The file stem (e.g., `EUR_cohort1` from `EUR_cohort1.munged.txt.gz`) is used as the ancestry/cohort key.
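As an illustrative sketch of that naming rule (this helper is hypothetical, not credtools' internal code), deriving the key amounts to stripping the `.munged.txt.gz` suffix:

```python
from pathlib import Path

def ancestry_key(path: str) -> str:
    """Derive the ancestry/cohort key by stripping the
    .munged.txt.gz suffix from the file name (illustrative)."""
    name = Path(path).name
    return name.split(".munged")[0]

print(ancestry_key("munged/EUR_cohort1.munged.txt.gz"))  # EUR_cohort1
```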
### Option 2: Population Configuration File

For multi-ancestry studies, a tab-delimited configuration file provides richer metadata. This is the same format used by `credtools munge`:
| Column | Description |
|---|---|
| `popu` | Population/ancestry label (e.g., EUR, AFR, EAS) |
| `cohort` | Cohort or study name |
| `sample_size` | Sample size for this population |
| `path` | Path to the munged summary statistics file |
| `ld_ref` | (Optional) Path to LD reference panel (PLINK prefix) |
Example `population_config.txt`:

```
popu    cohort  sample_size  path                            ld_ref
EUR     UKBB    400000       munged/EUR_UKBB.munged.txt.gz   /data/ref/EUR
AFR     AAGC    50000        munged/AFR_AAGC.munged.txt.gz   /data/ref/AFR
EAS     BBJ     180000       munged/EAS_BBJ.munged.txt.gz    /data/ref/EAS
```
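Before running, it can be worth sanity-checking that a config file really is tab-delimited with the required columns. A quick standalone check under the stated format (the `read_config` helper is hypothetical, not part of credtools):

```python
import csv
import io

# Required columns per the table above; ld_ref is optional.
CONFIG_COLUMNS = {"popu", "cohort", "sample_size", "path"}

def read_config(text: str) -> list:
    """Parse a tab-delimited population config and check that the
    required columns are present."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    present = set(rows[0].keys()) if rows else set()
    missing = CONFIG_COLUMNS - present
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return rows

example = (
    "popu\tcohort\tsample_size\tpath\n"
    "EUR\tUKBB\t400000\tmunged/EUR_UKBB.munged.txt.gz\n"
)
print(read_config(example)[0]["popu"])  # EUR
```

A JSON file fails this check immediately, which mirrors the warning below about the config being plain TSV.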
> **Info:** This is the same TSV configuration format as the `credtools munge` output (`sumstat_info_updated.txt`). When an `ld_ref` column is present, the `chunk` command will also extract LD matrices automatically.

> **Warning:** The configuration file is tab-delimited plain text, not JSON. If your file has a `.json` extension or uses JSON syntax, it will not be parsed correctly.
## Command Reference

**Arguments:**

| Argument | Description |
|---|---|
| `INPUT_CONFIG` | Comma-separated munged file paths, or a population config file |
| `OUTPUT_DIR` | Output directory for chunked files |
**Options:**

| Option | Short | Description | Default |
|---|---|---|---|
| `--distance` | `-d` | Distance threshold for independence (bp) | `500000` |
| `--pvalue` | `-p` | P-value threshold for significance | `5e-8` |
| `--merge-overlapping` | `-m` | Merge overlapping loci across ancestries | `True` |
| `--use-most-sig` | `-u` | Use most significant SNP if no significant SNPs | `True` |
| `--min-variants` | `-v` | Minimum variants per locus | `10` |
| `--custom-chunks` | `-cc` | Custom chunk file with chr, start, end columns | `None` |
| `--ld-format` | `-f` | LD computation format (`plink`/`vcf`) | `plink` |
| `--keep-intermediate` | `-k` | Keep intermediate files | `False` |
| `--threads` | `-t` | Number of threads | `1` |
| `--log-file` | `-l` | Log output to specified file | `None` |
## Algorithm Details

### Loci Identification Process
1. **Significance filtering**: identify SNPs below the p-value threshold
2. **Distance-based clustering**: group significant SNPs within the distance threshold
3. **Lead SNP selection**: choose the most significant SNP in each cluster
4. **Region definition**: create windows around lead SNPs (±distance/2)
5. **Overlap resolution**: merge overlapping regions across ancestries if requested
6. **Quality filtering**: remove loci with too few variants
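Steps 1–4 can be sketched in Python. This is an illustration of the described procedure under simplifying assumptions, not credtools' actual implementation:

```python
def identify_loci(snps, distance=500_000, pvalue=5e-8):
    """Distance-based loci identification sketch.
    `snps` is an iterable of (chrom, bp, p) tuples."""
    # Step 1: significance filtering
    sig = sorted(s for s in snps if s[2] < pvalue)
    # Step 2: distance-based clustering of significant SNPs
    clusters = []
    for snp in sig:
        if (clusters and snp[0] == clusters[-1][-1][0]
                and snp[1] - clusters[-1][-1][1] <= distance):
            clusters[-1].append(snp)
        else:
            clusters.append([snp])
    regions = []
    for cluster in clusters:
        # Step 3: lead SNP = most significant SNP in the cluster
        chrom, lead_bp, _ = min(cluster, key=lambda s: s[2])
        # Step 4: window of +/- distance/2 around the lead SNP
        regions.append((chrom, max(0, lead_bp - distance // 2),
                        lead_bp + distance // 2))
    # Steps 5-6 (cross-ancestry merging, minimum-variant filtering)
    # need the full variant table and are omitted here.
    return regions
```

Two significant SNPs 50 kb apart fall into one cluster at the default 500 kb threshold, while a hit several megabases away starts a new locus.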
### Multi-Ancestry Handling
When processing multiple ancestries:
- Each ancestry is processed independently first
- Overlapping loci across ancestries are merged into unified regions
- The most significant lead SNP across all ancestries is selected for merged loci
- Ancestry information is preserved in the output (comma-separated labels)
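The merge step above can be sketched as an interval union that keeps the best lead p-value and comma-separated ancestry labels. This is an illustration under assumptions, not credtools' exact code:

```python
def merge_across_ancestries(loci):
    """Merge overlapping loci found in different ancestries into
    unified regions. Each locus is (chrom, start, end, lead_p, ancestry)."""
    merged = []
    for locus in sorted(loci):
        last = merged[-1] if merged else None
        if last and locus[0] == last[0] and locus[1] <= last[2]:
            # Overlap on the same chromosome: take the union of the
            # intervals, keep the most significant lead p-value, and
            # preserve ancestry labels comma-separated.
            merged[-1] = (
                last[0],
                last[1],
                max(last[2], locus[2]),
                min(last[3], locus[3]),
                last[4] + "," + locus[4],
            )
        else:
            merged.append(locus)
    return merged
```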
## Integration with Pipeline

```mermaid
graph LR
    A[Raw GWAS files] -->|credtools munge| B[Munged files]
    B -->|credtools chunk| C[Locus-specific files + LD matrices]
    C -->|credtools finemap| D[Credible sets]
```

```bash
# Step 1: Munge summary statistics
credtools munge population_config.txt munged/

# Step 2: Identify loci, chunk data, and extract LD matrices
credtools chunk munged/sumstat_info_updated.txt chunks/

# Step 3: Run fine-mapping
credtools finemap chunks/loci_list.txt results/
```

> **Info:** When your population config file includes an `ld_ref` column, `credtools chunk` automatically extracts LD matrices during the chunking step, combining the chunk and prepare stages into one command.
## Output Files

The chunk command produces the following directory structure:

```
output_dir/
├── identified_loci.txt          # Summary of all identified loci
├── loci_list.txt                # Credtools-compatible loci list for fine-mapping
├── sumstat_info_updated.txt     # Updated population config (when using config input)
├── chunks/
│   ├── chunk_info.txt           # Metadata about all chunk files
│   ├── EUR_UKBB.chr1_1000_2000.sumstats.gz
│   ├── AFR_AAGC.chr1_1000_2000.sumstats.gz
│   └── ...
└── prepared/                    # Only when ld_ref is provided
    ├── EUR_UKBB.chr1_1000_2000.ld.gz
    └── ...
```
### `identified_loci.txt`

| Column | Description |
|---|---|
| `chr` | Chromosome number |
| `start` | Locus start position (bp) |
| `end` | Locus end position (bp) |
| `lead_snp` | Lead SNP identifier (chr-bp-allele1-allele2) |
| `lead_bp` | Lead SNP base pair position |
| `lead_p` | Lead SNP p-value |
| `ancestry` | Ancestry labels (comma-separated if merged) |
| `n_variants` | Number of variants in the locus |
| `locus_id` | Unique locus identifier (`chr{chr}_{start}_{end}`) |
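The identifier formats in the table can be reproduced with simple string formatting. These helpers are illustrative (the exact `lead_snp` separator layout is assumed from the "chr-bp-allele1-allele2" description):

```python
def locus_id(chrom: int, start: int, end: int) -> str:
    """Build the locus identifier in the chr{chr}_{start}_{end} format."""
    return f"chr{chrom}_{start}_{end}"

def lead_snp_id(chrom: int, bp: int, a1: str, a2: str) -> str:
    """Build the lead SNP identifier (chromosome-bp-allele1-allele2)."""
    return f"{chrom}-{bp}-{a1}-{a2}"

print(locus_id(1, 1000, 2000))  # chr1_1000_2000
```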
### `chunk_info.txt`

| Column | Description |
|---|---|
| `locus_id` | Unique locus identifier |
| `ancestry` | Ancestry/cohort key |
| `chr` | Chromosome number |
| `start` | Locus start position (bp) |
| `end` | Locus end position (bp) |
| `n_variants` | Number of variants in this chunk |
| `sumstats_file` | Path to the chunked sumstats file |
### `loci_list.txt`

| Column | Description |
|---|---|
| `locus_id` | Unique locus identifier |
| `chr` | Chromosome number |
| `start` | Locus start position (bp) |
| `end` | Locus end position (bp) |
| `popu` | Population/ancestry code |
| `cohort` | Cohort identifier |
| `sample_size` | Sample size |
| `prefix` | File prefix for credtools downstream tools |
## Examples

### Example 1: Conservative Loci Definition

```bash
# Use stricter thresholds for fewer, more significant loci
credtools chunk munged/sumstat_info_updated.txt output/ \
    --distance 1000000 \
    --pvalue 1e-8 \
    --min-variants 50
```
### Example 2: Liberal Loci Definition

```bash
# Use relaxed thresholds for more comprehensive coverage
credtools chunk munged/sumstat_info_updated.txt output/ \
    --distance 250000 \
    --pvalue 1e-5 \
    --min-variants 5
```
### Example 3: Custom Genomic Regions

```bash
# Use pre-defined genomic regions instead of automatic identification
credtools chunk munged/sumstat_info_updated.txt output/ \
    --custom-chunks custom_regions.txt
```

The custom chunk file should be tab-delimited with at minimum `chr`, `start`, and `end` columns.
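For example, a minimal custom chunk file might look like this (tab-delimited; the regions shown are illustrative):

```
chr	start	end
1	1000000	2000000
2	5000000	6000000
```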
### Example 4: With Logging and Intermediate Files

```bash
# Keep intermediate files and write detailed log
credtools chunk population_config.txt output/ \
    --keep-intermediate \
    --log-file chunk_run.log
```
## Parameter Guidelines

### Distance Threshold (`--distance`)
- 500kb (default): Balanced approach, suitable for most analyses
- 250kb: More granular loci, better for dense association regions
- 1Mb: Conservative approach, reduces computational burden
- Consider LD patterns: Use longer distances in populations with extended LD
### P-value Threshold (`--pvalue`)
- 5e-8 (default): Genome-wide significance threshold
- 1e-5: Suggestive significance, more inclusive
- 1e-8: Very stringent, fewer but highly significant loci
- Population-specific: Consider ancestry-specific significance levels
### Minimum Variants (`--min-variants`)
- 10 (default): Ensures reasonable LD computation
- 5: More inclusive, useful for sparse regions
- 20–50: Conservative, better for high-quality fine-mapping
## Troubleshooting

**No loci identified**

Lower the p-value threshold (`--pvalue 1e-5`) or rely on `--use-most-sig` (enabled by default) to use the most significant SNP per chromosome even when nothing reaches genome-wide significance. Also verify that your munged files actually contain significant associations.

**Too many small loci**

Increase the distance threshold (`--distance 1000000`) or raise the minimum variant count (`--min-variants 50`). Small loci may not provide enough information for reliable fine-mapping.

**Memory issues with large datasets**

Use `--threads` to enable parallel processing and ensure sufficient RAM (typically 4–8 GB for genome-wide data). If memory is limited, consider processing ancestries individually before merging.

**Inconsistent loci across ancestries**

Ensure that all munged files use consistent chromosome and position encoding (same genome build). Keep `--merge-overlapping` (the default) enabled to unify overlapping loci from different ancestries into shared regions.
**LD extraction fails**

Check that the `ld_ref` column in your population config points to valid PLINK prefix paths (i.e., `.bed`, `.bim`, and `.fam` files exist). Verify that `--ld-format` matches your reference panel format (`plink` or `vcf`).
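A quick way to spot an incomplete reference panel is to check the three PLINK files directly (a standalone helper, not a credtools command):

```python
from pathlib import Path

def missing_plink_files(prefix: str) -> list:
    """List which of the .bed/.bim/.fam files are absent for a
    given PLINK prefix; an empty list means the fileset is complete."""
    return [prefix + ext for ext in (".bed", ".bim", ".fam")
            if not Path(prefix + ext).exists()]

# Example: missing_plink_files("/data/ref/EUR") returns [] when the
# reference panel is complete.
```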
**Custom chunks file not working**

Ensure your custom chunk file is tab-delimited with at minimum `chr`, `start`, `end` columns. The chromosome column should use numeric values (1–22, not "chr1"). Check for trailing whitespace or encoding issues.
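These checks can be automated with a small pre-flight script (an illustrative validator, not credtools' own validation):

```python
import csv
import io

CHUNK_COLUMNS = {"chr", "start", "end"}

def validate_custom_chunks(text: str) -> list:
    """Collect common problems in a custom chunk file: missing columns,
    non-numeric chromosomes, and stray whitespace."""
    problems = []
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = [h.strip() for h in rows[0]] if rows else []
    missing = CHUNK_COLUMNS - set(header)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    chr_idx = header.index("chr")
    for n, row in enumerate(rows[1:], start=2):
        if not row[chr_idx].isdigit():
            # Chromosomes should be numeric (1-22), not "chr1"
            problems.append(f"line {n}: non-numeric chr value {row[chr_idx]!r}")
        if any(field != field.strip() for field in row):
            problems.append(f"line {n}: leading/trailing whitespace")
    return problems
```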
## Best Practices

> **Recommendations**
>
> - **Start with defaults**: use default parameters initially, then optimize based on results
> - **Use population config files** for multi-ancestry studies to preserve metadata and enable automatic LD extraction
> - **Review identified loci**: check `identified_loci.txt` against known associations for your trait
> - **Balance precision vs. coverage**: stricter thresholds give higher confidence but may miss signals
> - **Use `--log-file`** for a detailed audit trail of all processing steps
> - **Document parameters**: keep track of thresholds used for reproducibility