LD Matrix Preparation¶
The credtools prepare command extracts linkage disequilibrium (LD) matrices from genotype data and creates the final input files needed for fine-mapping analysis. This step bridges chunked summary statistics with reference genotype data.
Overview¶
Fine-mapping requires accurate LD information to model the correlation structure between variants. The preparation process:
- Extracts LD matrices from reference genotype panels for each locus
- Matches variants between summary statistics and genotype data
- Handles allele flipping and strand orientation issues
- Creates optimized files for fast fine-mapping computation
- Supports multiple formats including PLINK and VCF genotype files
- Enables parallel processing for computational efficiency
When to Use¶
Use credtools prepare when you have:
- Chunked summary statistics from the previous step
- Reference genotype data (PLINK .bed/.bim/.fam or VCF files)
- Matched ancestry between summary statistics and reference panels
- The need to create final inputs for credtools fine-mapping
Basic Usage¶
Standard Workflow¶
# Create genotype configuration file
echo '{
"EUR": "/path/to/eur_reference",
"ASN": "/path/to/asn_reference",
"AFR": "/path/to/afr_reference"
}' > genotype_config.json
# Prepare LD matrices
credtools prepare chunked/chunk_info.txt genotype_config.json prepared/
Single Ancestry¶
# For single ancestry, still use JSON format
echo '{"EUR": "/path/to/reference"}' > genotype_config.json
credtools prepare chunk_info.txt genotype_config.json prepared/
Command Options¶
Option | Description | Default |
---|---|---|
--threads / -t | Number of threads for parallel processing | 1 |
--ld-format / -f | LD computation format (plink/vcf) | plink |
--keep-intermediate / -k | Keep intermediate files | False |
Input Requirements¶
Chunk Info File¶
The chunk_info.txt file from the previous chunking step contains:
- Locus coordinates and identifiers
- Paths to chunked summary statistics files
- Ancestry and sample information
Genotype Configuration¶
JSON file mapping ancestry codes to genotype file prefixes:
{
"EUR": "/data/reference/eur_1kg_phase3",
"ASN": "/data/reference/asn_1kg_phase3",
"AFR": "/data/reference/afr_1kg_phase3"
}
For PLINK format, provide the prefix (without .bed/.bim/.fam extensions). For VCF format, provide the full path to the VCF file.
Reference Panel Requirements¶
PLINK format (.bed/.bim/.fam):
- Binary genotype files with the complete .bed/.bim/.fam trio
- BIM file with chromosome, SNP ID, genetic distance, position, and alleles
- FAM file with sample information
VCF format (planned support):
- Compressed VCF files (.vcf.gz) with a tabix index
- Proper chromosome and position formatting
- Consistent allele encoding
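Before launching a long preparation run, it can save time to sanity-check the configuration up front. The sketch below (the function name and return shape are illustrative, not part of credtools) loads the JSON mapping and reports any ancestry whose PLINK trio is incomplete:

```python
import json
import os

# Hypothetical helper: validate a genotype configuration before running
# `credtools prepare`. For each ancestry, confirm the PLINK prefix points
# at a complete .bed/.bim/.fam trio on disk.
def validate_genotype_config(config_path):
    with open(config_path) as fh:
        config = json.load(fh)
    missing = {}
    for ancestry, prefix in config.items():
        absent = [ext for ext in (".bed", ".bim", ".fam")
                  if not os.path.exists(prefix + ext)]
        if absent:
            missing[ancestry] = absent
    return missing  # empty dict means every trio is complete
```

Running this before `credtools prepare` turns a mid-run PLINK failure into an immediate, readable error.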
Algorithm Details¶
Processing Pipeline¶
- Variant extraction: Extract variants within each locus region from genotype data
- LD computation: Calculate correlation matrix using PLINK or custom methods
- Data intersection: Match variants between summary statistics and LD data
- Allele alignment: Handle flipped alleles and strand orientation
- Quality control: Filter variants and validate LD matrix properties
- File generation: Create compressed output files for fine-mapping
Allele Handling¶
The preparation step carefully handles:
- Strand flipping: automatic detection and correction of strand issues
- Allele ordering: consistent alphabetical ordering of allele pairs
- Reference matching: alignment between summary statistics and reference panel alleles
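The core of allele alignment can be sketched as a small decision function. This is an illustrative reimplementation of the logic described above, not credtools internals: given the effect/other alleles from the summary statistics and the allele pair from the reference panel, it returns the sign to apply to the effect estimate, or None when the variant cannot be reconciled:

```python
# Illustrative allele-alignment sketch. Note that palindromic (A/T, C/G)
# variants are inherently strand-ambiguous and are often filtered separately.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_strand(allele):
    return "".join(COMPLEMENT[base] for base in allele)

def align_alleles(ss_ea, ss_oa, ref_a1, ref_a2):
    """Return +1 (concordant), -1 (effect allele swapped), or None (mismatch)."""
    candidates = [
        (ss_ea, ss_oa, 1),                              # direct match
        (ss_oa, ss_ea, -1),                             # alleles swapped
        (flip_strand(ss_ea), flip_strand(ss_oa), 1),    # strand flip
        (flip_strand(ss_oa), flip_strand(ss_ea), -1),   # strand flip + swap
    ]
    for a1, a2, sign in candidates:
        if (a1, a2) == (ref_a1, ref_a2):
            return sign
    return None  # alleles cannot be reconciled with the reference panel
```

A returned -1 means the summary-statistics effect size (and Z-score) must be negated so that effects are expressed relative to the reference panel's allele ordering.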
Expected Output¶
File Structure¶
For each locus and ancestry combination:
prepared/
├── EUR.chr1_12345_67890.sumstats.gz # Intersected summary statistics
├── EUR.chr1_12345_67890.ld.npz # Compressed LD matrix
├── EUR.chr1_12345_67890.ldmap.gz # LD variant mapping
├── ASN.chr1_12345_67890.sumstats.gz
├── ASN.chr1_12345_67890.ld.npz
├── ASN.chr1_12345_67890.ldmap.gz
├── prepared_files.txt # Summary of all prepared files
└── final_loci_list.txt # Updated loci list for fine-mapping
File Formats¶
Sumstats files: Tab-separated, gzipped summary statistics with only variants present in LD matrix.
LD matrix files: NumPy compressed arrays (.npz) containing correlation matrices optimized for memory and speed.
LD map files: Tab-separated mapping files linking matrix positions to genomic coordinates and allele information.
Final loci list: Updated credtools-compatible format ready for fine-mapping.
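The three per-locus files are designed to line up row for row. A minimal loading sketch (the `"ld"` key inside the .npz and the exact column names are assumptions about the layout, not a documented API) that reads one locus and checks the files agree:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: load one prepared locus and confirm that the
# sumstats, LD matrix, and ldmap files are mutually consistent.
def load_locus(prefix):
    sumstats = pd.read_csv(f"{prefix}.sumstats.gz", sep="\t")
    ld = np.load(f"{prefix}.ld.npz")["ld"]
    ldmap = pd.read_csv(f"{prefix}.ldmap.gz", sep="\t")
    # The LD matrix must be square, with one row per mapped variant,
    # and the sumstats should contain exactly the variants in the matrix.
    assert ld.shape[0] == ld.shape[1] == len(ldmap) == len(sumstats)
    return sumstats, ld, ldmap
```

For example, `load_locus("prepared/EUR.chr1_12345_67890")` would return the aligned inputs for one locus/ancestry combination.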
Examples¶
Example 1: Standard Multi-Ancestry Preparation¶
# Set up genotype configuration
echo '{
"EUR": "/reference/1000G_phase3/EUR",
"ASN": "/reference/1000G_phase3/ASN",
"AFR": "/reference/1000G_phase3/AFR"
}' > genotypes.json
# Prepare with parallel processing
credtools prepare chunk_info.txt genotypes.json prepared/ --threads 4
Example 2: Single Large Reference Panel¶
# Use same reference for multiple ancestries
echo '{
"EUR": "/reference/1000G_phase3/ALL",
"ASN": "/reference/1000G_phase3/ALL",
"AFR": "/reference/1000G_phase3/ALL"
}' > genotypes.json
credtools prepare chunk_info.txt genotypes.json prepared/
Example 3: High-Performance Setup¶
# Maximize parallel processing and keep intermediate files for debugging
credtools prepare chunk_info.txt genotypes.json prepared/ \
--threads 8 \
--keep-intermediate
Genotype Data Setup¶
Using 1000 Genomes Project Data¶
Download and prepare 1000 Genomes reference panels:
# Download 1000G Phase 3 data
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
tar -xzf 1000GP_Phase3.tgz
# Convert to PLINK format (example for chromosome 1)
plink --vcf 1000GP_Phase3_chr1.vcf.gz --make-bed --out 1000G_chr1
# Filter the merged fileset by ancestry (assumes per-chromosome files were first merged into 1000G_merged)
plink --bfile 1000G_merged --keep EUR_samples.txt --make-bed --out EUR_reference
Population-Specific Panels¶
For best results, use ancestry-matched reference panels:
{
"EUR": "/reference/UKBB_EUR_50k",
"ASN": "/reference/BBJ_ASN_10k",
"AFR": "/reference/H3Africa_AFR_5k"
}
Performance Optimization¶
Parallel Processing¶
The --threads option parallelizes across ancestries:
# Optimal thread count: number of ancestries or CPU cores, whichever is smaller
credtools prepare chunk_info.txt genotypes.json prepared/ --threads 6
Memory Management¶
For large datasets:
- Balance thread count against available RAM; each thread needs its own working memory
- Budget roughly 4-8 GB of RAM per thread
- Consider processing subsets of loci if memory-constrained
Storage Considerations¶
Prepared files are optimized for size:
- LD matrices use float16 precision (sufficient for fine-mapping)
- Files are compressed for a minimal storage footprint
- Intermediate files are removed by default
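The float16 trade-off is easy to quantify. Correlation values lie in [-1, 1], where float16 rounding error stays below about 5e-4, which is well within what fine-mapping needs; the sketch below (using synthetic data) shows the footprint reduction and worst-case rounding error:

```python
import numpy as np

# Build a synthetic 50x50 correlation matrix from random data,
# then downcast it to float16 as the prepared .ld.npz files do.
r = np.corrcoef(np.random.default_rng(0).normal(size=(50, 500)))
r16 = r.astype(np.float16)

print(r.nbytes, r16.nbytes)                       # float16 is 1/4 of float64
print(np.abs(r - r16.astype(np.float64)).max())   # worst-case rounding error
```

Combined with .npz compression, this keeps even genome-wide collections of locus matrices manageable on ordinary storage.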
Integration with Workflow¶
Prepared files feed directly into fine-mapping:
# 1. Munge summary statistics
credtools munge ancestry_files.json munged/
# 2. Identify loci and chunk data
credtools chunk munged/ chunked/
# 3. Prepare LD matrices
credtools prepare chunked/chunk_info.txt genotype_config.json prepared/
# 4. Run fine-mapping (uses final_loci_list.txt)
credtools finemap prepared/final_loci_list.txt results/
# 5. Results are saved in results/ directory
Quality Control¶
Checking Preparation Success¶
# Count successful preparations
grep -c "created" prepared/prepared_files.txt
# Check for failures
grep "failed\|error" prepared/prepared_files.txt
# Verify file completeness
ls prepared/*.ld.npz | wc -l
Variant Intersection Stats¶
Review how many variants were successfully matched:
# Check intersection efficiency per locus
awk '{print $1, $7}' prepared/prepared_files.txt | head -10
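The same completeness check can be done programmatically. A sketch (the function name is illustrative; it only assumes the prepared/ file layout shown earlier) that lists loci whose LD matrix exists but whose sumstats or ldmap companion is missing:

```python
import glob
import os

# Hypothetical QC helper: find loci in prepared/ that have an .ld.npz
# file but are missing the matching .sumstats.gz or .ldmap.gz file.
def incomplete_loci(prepared_dir):
    incomplete = []
    for ld_file in sorted(glob.glob(os.path.join(prepared_dir, "*.ld.npz"))):
        prefix = ld_file[: -len(".ld.npz")]
        companions = [f"{prefix}.sumstats.gz", f"{prefix}.ldmap.gz"]
        if not all(os.path.exists(path) for path in companions):
            incomplete.append(os.path.basename(prefix))
    return incomplete
```

An empty result means every locus with an LD matrix also has its summary-statistics and mapping files, so fine-mapping will not fail on missing inputs.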
Troubleshooting¶
Common Issues¶
No variants found: Check that reference genotype files cover the genomic regions of interest and use the correct chromosome encoding (1-22 vs chr1-chr22).
Allele mismatches: Verify that summary statistics and reference panels use consistent allele encoding (ACGT letters vs numeric codes) and the same strand convention.
PLINK errors: Ensure PLINK is installed and accessible in PATH. Check that genotype files are not corrupted.
Memory errors: Reduce the number of threads or process loci in smaller batches.
File permission errors: Verify read access to genotype files and write access to output directory.
Performance Issues¶
Slow processing: Increase --threads up to the number of available CPU cores.
Disk space: Monitor storage usage, especially with --keep-intermediate. Clean up failed runs.
Network storage: If using network-mounted genotype files, consider copying to local storage first.
Advanced Usage¶
Custom LD Computation¶
For specialized reference panels or non-standard formats:
# Use VCF format (when supported)
credtools prepare chunk_info.txt genotypes.json prepared/ --ld-format vcf
Debugging Failed Loci¶
# Keep intermediate files for troubleshooting
credtools prepare chunk_info.txt genotypes.json prepared/ \
--keep-intermediate
# Check PLINK log files
ls prepared/*_temp.log
Subset Processing¶
To process only specific loci:
# Filter chunk_info.txt to specific loci
grep "chr1_" chunk_info.txt > chr1_chunks.txt
credtools prepare chr1_chunks.txt genotypes.json prepared/
Best Practices¶
- Use ancestry-matched references: Match reference panels to summary statistics ancestry
- Verify coordinate systems: Ensure consistent genome builds (GRCh37/hg19 vs GRCh38/hg38)
- Monitor resource usage: Balance thread count with available memory
- Test with small datasets: Validate setup with a few loci before processing all data
- Backup genotype files: Keep copies of processed reference panels
- Document configurations: Save genotype configurations for reproducibility
- Quality control: Always review preparation summary before fine-mapping
Tips for Success¶
- Start simple: Begin with single ancestry and small number of loci
- Check early: Verify first few loci process correctly before running all
- Plan storage: Ensure adequate disk space for output files
- Use consistent naming: Keep ancestry codes consistent across all steps
- Monitor progress: Watch console output for processing status and errors