# Loci Identification and Chunking

The `credtools chunk` command identifies independent genetic loci from munged summary statistics and splits the data into locus-specific files ready for fine-mapping. This step is crucial because it defines the genomic regions that will be analyzed independently.
## Overview
After munging your summary statistics, you need to identify independent genetic signals and create focused datasets for fine-mapping. The chunking process:
- Identifies independent loci based on distance and significance thresholds
- Merges overlapping regions across different ancestries when appropriate
- Creates locus-specific files containing only variants within each region
- Generates credtools-compatible loci lists for downstream analysis
- Handles multi-ancestry data with consistent loci across populations
## When to Use

Use `credtools chunk` when you have:
- Munged summary statistics ready for fine-mapping
- Genome-wide data that needs to be split into independent regions
- Multi-ancestry studies requiring consistent loci definitions
- Large datasets that benefit from parallel processing of smaller regions
## Basic Usage
### Single Ancestry
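For a single ancestry, pass the munged file and an output directory. A minimal sketch (the file and directory names are placeholders; the single-file form mirrors the population-specific examples later on this page):

```shell
# Chunk one munged ancestry file into locus-specific files
credtools chunk EUR.munged.txt.gz chunked/
```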
### Multiple Ancestries
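For multiple ancestries, pass a JSON file mapping ancestry codes to munged files (the format is described in the next section). A sketch with illustrative names:

```shell
# Chunk several ancestries at once using a JSON mapping
credtools chunk munged_files.json chunked/
```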
### Using File Configuration
Create a JSON file mapping ancestries to munged files:
```json
{
  "EUR": "munged/EUR.munged.txt.gz",
  "ASN": "munged/ASN.munged.txt.gz",
  "AFR": "munged/AFR.munged.txt.gz"
}
```
Then run:
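```shell
# The output directory name is illustrative
credtools chunk munged_files.json chunked/
```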
## Command Options
| Option | Description | Default |
|---|---|---|
| `--distance` / `-d` | Distance threshold for independence (bp) | 500000 |
| `--pvalue` / `-p` | P-value threshold for significance | 5e-8 |
| `--merge-overlapping` / `-m` | Merge overlapping loci across ancestries | True |
| `--use-most-sig` / `-u` | Use the most significant SNP when no SNPs pass the threshold | True |
| `--min-variants` / `-v` | Minimum variants per locus | 10 |
| `--threads` / `-t` | Number of threads | 1 |
## Algorithm Details

### Loci Identification Process
1. Significance filtering: identify SNPs below the p-value threshold
2. Distance-based clustering: group significant SNPs within the distance threshold
3. Lead SNP selection: choose the most significant SNP in each cluster
4. Region definition: create windows around lead SNPs (±distance/2)
5. Overlap resolution: merge overlapping regions across ancestries if requested
6. Quality filtering: remove loci with too few variants
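The process above can be sketched in Python. This is a simplified, single-chromosome illustration of the steps, not the actual credtools implementation:

```python
def identify_loci(snps, distance=500_000, pvalue=5e-8, min_variants=10):
    """Sketch of distance-based loci identification on one chromosome.

    snps: list of (position, p_value) tuples for all variants.
    Returns a list of (lead_pos, start, end, n_variants) loci.
    """
    # 1. Significance filtering
    sig = sorted(pos for pos, p in snps if p < pvalue)
    # 2. Distance-based clustering of significant SNPs
    clusters, current = [], []
    for pos in sig:
        if current and pos - current[-1] > distance:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    loci = []
    for cluster in clusters:
        # 3. Lead SNP selection: most significant SNP in the cluster
        members = {pos: p for pos, p in snps if pos in set(cluster)}
        lead = min(members, key=members.get)
        # 4. Region definition: window of +/- distance/2 around the lead SNP
        start, end = lead - distance // 2, lead + distance // 2
        # 6. Quality filtering: drop loci with too few variants
        n = sum(1 for pos, _ in snps if start <= pos <= end)
        if n >= min_variants:
            loci.append((lead, start, end, n))
    return loci
```

With the defaults this reproduces the 500 kb windows described above; step 5 (cross-ancestry merging) is covered in the next section.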
### Multi-Ancestry Handling
When processing multiple ancestries:
- Each ancestry is processed independently first
- Overlapping loci across ancestries can be merged into unified regions
- The most significant lead SNP across all ancestries is selected for merged loci
- Ancestry information is preserved in the output
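Cross-ancestry merging is essentially interval merging. A minimal sketch (illustrative only, not the credtools source):

```python
def merge_loci(loci):
    """Merge overlapping (start, end, lead_pos, lead_p) loci pooled from
    all ancestries, keeping the most significant lead SNP (smallest p)
    for each merged region."""
    merged = []
    for start, end, lead, p in sorted(loci):
        if merged and start <= merged[-1][1]:  # overlaps the previous region
            pstart, pend, plead, pp = merged[-1]
            # extend the region and keep the better lead SNP
            merged[-1] = (pstart, max(pend, end),
                          lead if p < pp else plead, min(p, pp))
        else:
            merged.append((start, end, lead, p))
    return merged
```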
## Expected Output
The chunking process creates several important files:
### Main Output Files

- `identified_loci.txt`: Summary of all identified loci with coordinates and lead SNPs
- `chunks/`: Directory containing locus-specific summary statistics files
- `chunk_info.txt`: Metadata about all generated chunk files
- `loci_list.txt`: Credtools-compatible loci list for fine-mapping
### Chunk File Structure

Each locus generates ancestry-specific files:

```
chunks/
├── EUR.chr1_12345_67890.sumstats.gz
├── ASN.chr1_12345_67890.sumstats.gz
├── AFR.chr1_12345_67890.sumstats.gz
└── ...
```
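The file names encode the ancestry and locus coordinates, so they can be parsed back if you need to script over the chunk directory. A small helper assuming the naming pattern shown above:

```python
import re

def parse_chunk_name(filename):
    """Parse '<ANC>.chr<C>_<start>_<end>.sumstats.gz' into its parts."""
    m = re.match(r"(\w+)\.chr([0-9XYMT]+)_(\d+)_(\d+)\.sumstats\.gz$", filename)
    if not m:
        raise ValueError(f"unrecognized chunk file name: {filename}")
    anc, chrom, start, end = m.groups()
    return {"ancestry": anc, "chr": chrom, "start": int(start), "end": int(end)}
```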
### Loci List Format

The `loci_list.txt` file contains:
| Column | Description |
|---|---|
| locus_id | Unique identifier (`chr_start_end`) |
| chr | Chromosome number |
| start | Start position (bp) |
| end | End position (bp) |
| popu | Population/ancestry code |
| cohort | Cohort identifier |
| sample_size | Sample size (placeholder) |
| prefix | File prefix for credtools |
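An illustrative line built from the columns above (tab-separated; the values, and whether a header row is present, are hypothetical):

```
chr1_12345_67890	1	12345	67890	EUR	cohort1	10000	chunks/EUR.chr1_12345_67890
```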
## Examples

### Example 1: Conservative Loci Definition

```shell
# Use stricter thresholds for fewer, more significant loci
credtools chunk munged_files.json output/ \
    --distance 1000000 \
    --pvalue 1e-8 \
    --min-variants 50
```
### Example 2: Liberal Loci Definition

```shell
# Use relaxed thresholds for more comprehensive coverage
credtools chunk munged_files.json output/ \
    --distance 250000 \
    --pvalue 1e-5 \
    --min-variants 5
```
### Example 3: Population-Specific Analysis

```shell
# Don't merge overlapping loci across ancestries
credtools chunk ancestry_files.json output/ \
    --merge-overlapping false
```
### Example 4: Parallel Processing

```shell
# Use multiple threads for faster processing
credtools chunk large_dataset.json output/ \
    --threads 8
```
## Parameter Guidelines

### Distance Threshold (`--distance`)
- 500kb (default): Balanced approach, suitable for most analyses
- 250kb: More granular loci, better for dense association regions
- 1Mb: Conservative approach, reduces computational burden
- Consider LD patterns: Longer distance in populations with extended LD
### P-value Threshold (`--pvalue`)
- 5e-8 (default): Genome-wide significance threshold
- 1e-5: Suggestive significance, more inclusive
- 1e-8: Very stringent, fewer but highly significant loci
- Population-specific: Consider ancestry-specific significance levels
### Minimum Variants (`--min-variants`)
- 10 (default): Ensures reasonable LD computation
- 5: More inclusive, useful for sparse regions
- 20-50: Conservative, better for high-quality fine-mapping
## Integration with Workflow
Chunked data feeds directly into the preparation step:
```shell
# 1. Munge summary statistics
credtools munge ancestry_files.json munged/

# 2. Identify loci and chunk data
credtools chunk munged/ chunked/

# 3. Prepare LD matrices (uses chunk_info.txt)
credtools prepare chunked/chunk_info.txt genotype_config.json prepared/

# 4. Run fine-mapping (uses final_loci_list.txt)
credtools finemap prepared/final_loci_list.txt results/
```
## Quality Control

### Reviewing Loci

After chunking, examine the results:
```shell
# Check number of loci identified
wc -l chunked/identified_loci.txt

# Review loci summary
head -20 chunked/identified_loci.txt

# Check chunk file counts
ls chunked/chunks/ | wc -l
```
### Ancestry Coverage
Verify consistent coverage across ancestries:
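One quick check, sketched below; it assumes the `chunked/chunks/` layout and file naming shown in the Chunk File Structure section:

```shell
# Count chunk files per ancestry; equal counts suggest consistent loci
for anc in EUR ASN AFR; do
  n=$(ls chunked/chunks/${anc}.*.sumstats.gz 2>/dev/null | wc -l)
  echo "$anc: $n loci"
done
```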
## Troubleshooting

### Common Issues
**No loci identified**: Lower the p-value threshold, or check that the munged files contain significant associations.

**Too many small loci**: Increase the distance threshold or the minimum variants requirement.

**Memory issues with large datasets**: Use parallel processing with `--threads` and ensure sufficient RAM.

**Inconsistent loci across ancestries**: Check that the munged files use consistent chromosome and position formats.
### Performance Optimization

**Large datasets**: Use more threads (`--threads 4-8`) for faster processing.

**Memory constraints**: Process ancestries separately by providing single files rather than multiple files.

**Storage space**: Consider the trade-off between the number of loci and the storage required for chunk files.
## Advanced Usage

### Custom Significance Levels

For population-specific analyses, you might use different p-value thresholds:
```shell
# European ancestry (well-powered)
credtools chunk EUR.munged.txt.gz eur_chunks/ --pvalue 5e-8

# Smaller ancestries (more lenient)
credtools chunk ASN.munged.txt.gz asn_chunks/ --pvalue 1e-6
```
### Region-Specific Analysis

To focus on specific genomic regions:
```shell
# Pre-filter summary statistics to chromosome 1
# (NR==1 keeps the header row; assumes chromosome is in column 1)
zcat EUR.munged.txt.gz | awk 'NR==1 || $1==1' | gzip > EUR_chr1.munged.txt.gz
credtools chunk EUR_chr1.munged.txt.gz chr1_chunks/
```
## Tips for Success
- Start with defaults: Use default parameters initially, then optimize based on results
- Consider your goals: More loci means more comprehensive coverage but a heavier computational load
- Check overlap: Review identified loci for known associations in your trait
- Balance precision vs coverage: Stricter thresholds give higher confidence but may miss signals
- Document parameters: Keep track of thresholds used for reproducibility
- Validate with literature: Compare identified loci with known associations for your trait