Preprocessing API
These functions power the `credtools munge` and `credtools chunk` commands.
Munging
Munge summary statistics using smunger integration.
This module provides functionality to reformat and standardize GWAS summary statistics from various formats into a consistent format suitable for fine-mapping.
create_munge_config(sample_files, output_config, interactive=True)
Create configuration file for munging by examining sample files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sample_files` | `Dict[str, str]` | Dictionary mapping identifiers to sample file paths. | required |
| `output_config` | `str` | Output path for the configuration file. | required |
| `interactive` | `bool` | Whether to use interactive mode for column mapping, by default True. | `True` |
Notes
This function helps users create configuration files by examining the headers of input files and providing suggested column mappings.
Source code in credtools/preprocessing/munge.py
munge_sumstats(input_files, output_dir, config_file=None, force_overwrite=False, **kwargs)
Munge summary statistics files using smunger integration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_files` | `Union[str, List[str], Dict[str, str]]` | Input summary statistics file(s). Can be a single file path (`str`), a list of file paths (`List[str]`), or a dictionary mapping ancestry/cohort names to file paths (`Dict[str, str]`). | required |
| `output_dir` | `str` | Output directory for munged files. | required |
| `config_file` | `Optional[str]` | Path to configuration file specifying column mappings, by default None. | `None` |
| `force_overwrite` | `bool` | Whether to overwrite existing output files, by default False. | `False` |
| `**kwargs` | | Additional arguments passed to smunger functions. | `{}` |

Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | Dictionary mapping input identifiers to output file paths. |

Raises:
| Type | Description |
|---|---|
| `ImportError` | If smunger package is not available. |
| `FileNotFoundError` | If input files do not exist. |
| `ValueError` | If input format is invalid. |
Examples:
>>> # Single file
>>> result = munge_sumstats("gwas_eur.txt", "output/")
>>>
>>> # Multiple files with ancestry labels
>>> files = {"EUR": "gwas_eur.txt", "ASN": "gwas_asn.txt"}
>>> result = munge_sumstats(files, "output/")
Source code in credtools/preprocessing/munge.py
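As an illustration of the three accepted `input_files` shapes, the sketch below coerces each of them into a single dictionary. `normalize_input_files` is a hypothetical helper written for this page, not part of credtools, and keying unlabeled files by the first part of their file name is an assumption.

```python
from pathlib import Path
from typing import Dict, List, Union


def normalize_input_files(
    input_files: Union[str, List[str], Dict[str, str]],
) -> Dict[str, str]:
    """Hypothetical helper: coerce the accepted input shapes into one dict."""
    if isinstance(input_files, dict):
        # Already labeled by ancestry/cohort; return a shallow copy.
        return dict(input_files)
    if isinstance(input_files, str):
        input_files = [input_files]
    if isinstance(input_files, list):
        # Assumption: unlabeled files are keyed by the file name up to the first dot.
        return {Path(p).name.split(".")[0]: p for p in input_files}
    raise ValueError(f"Unsupported input type: {type(input_files).__name__}")
```

With this scheme, two files that share a name stem would collide, which is one reason the dictionary form with explicit labels is the safer choice for multi-ancestry input.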
validate_munged_files(munged_files, required_columns=None)
Validate munged summary statistics files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `munged_files` | `Dict[str, str]` | Dictionary mapping identifiers to munged file paths. | required |
| `required_columns` | `Optional[List[str]]` | List of required column names to check, by default None. | `None` |

Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict]` | Dictionary with validation results for each file. |
Source code in credtools/preprocessing/munge.py
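A minimal sketch of what a per-file validation result could look like, assuming tab-delimited munged files and a default set of required columns. The function name `check_munged_headers`, the result keys, and `DEFAULT_REQUIRED` are all hypothetical; the actual checks and result layout of `validate_munged_files` may differ.

```python
import gzip
from pathlib import Path
from typing import Dict, List, Optional

# Assumption: columns a munged file is expected to carry.
DEFAULT_REQUIRED = ["CHR", "BP", "EA", "NEA"]


def check_munged_headers(
    munged_files: Dict[str, str],
    required_columns: Optional[List[str]] = None,
) -> Dict[str, Dict]:
    """Hypothetical check: does each file exist and contain the required columns?"""
    required = required_columns or DEFAULT_REQUIRED
    results: Dict[str, Dict] = {}
    for name, path in munged_files.items():
        entry = {"exists": Path(path).is_file(), "missing_columns": [], "valid": False}
        if entry["exists"]:
            # Read only the header line; gzip files are opened transparently.
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as handle:
                header = handle.readline().rstrip("\r\n").split("\t")
            entry["missing_columns"] = [c for c in required if c not in header]
            entry["valid"] = not entry["missing_columns"]
        results[name] = entry
    return results
```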
Chunking
Chunk whole genome summary statistics into independent loci.
This module provides functionality to identify independent lead SNPs and create regional chunks suitable for fine-mapping analysis across multiple ancestries.
chunk_sumstats(loci_df, sumstats_files, output_dir, threads=1, compress=True)
Chunk summary statistics files into loci-specific files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `loci_df` | `DataFrame` | DataFrame with loci coordinates from `identify_independent_loci`. | required |
| `sumstats_files` | `Dict[str, str]` | Dictionary mapping ancestry names to sumstats file paths. | required |
| `output_dir` | `str` | Output directory for chunked files. | required |
| `threads` | `int` | Number of threads for parallel processing, by default 1. | `1` |
| `compress` | `bool` | Whether to compress output files, by default True. | `True` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with information about generated files. |
Examples:
>>> loci_df = identify_independent_loci(files, "output/")
>>> file_info = chunk_sumstats(loci_df, files, "output/chunks/")
Source code in credtools/preprocessing/chunk.py
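Conceptually, chunking means selecting the variants that fall inside each locus window. The sketch below shows that selection on plain Python dictionaries rather than DataFrames so it runs without dependencies. The locus keys (`chr`, `start`, `end`) and the `chr{chr}_{start}_{end}` naming scheme are assumptions for illustration, not the column names or file names credtools uses.

```python
from typing import Dict, List


def rows_in_locus(rows: List[dict], chrom: int, start: int, end: int) -> List[dict]:
    """Return variants on `chrom` whose position lies in the closed interval [start, end]."""
    return [r for r in rows if r["CHR"] == chrom and start <= r["BP"] <= end]


def chunk_rows(rows: List[dict], loci: List[dict]) -> Dict[str, List[dict]]:
    """Group variants by locus, keyed by a hypothetical 'chr{chr}_{start}_{end}' ID."""
    return {
        f"chr{locus['chr']}_{locus['start']}_{locus['end']}": rows_in_locus(
            rows, locus["chr"], locus["start"], locus["end"]
        )
        for locus in loci
    }
```

A variant that falls inside two overlapping windows appears in both chunks, which is why merging overlapping loci before chunking matters.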
create_loci_list_for_credtools(chunk_info_df, ld_info_df=None, output_file='loci_list.txt')
Create loci list file compatible with credtools format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_info_df` | `DataFrame` | DataFrame from `chunk_sumstats` with file information. | required |
| `ld_info_df` | `Optional[DataFrame]` | DataFrame with LD file information, by default None. | `None` |
| `output_file` | `str` | Output file path, by default "loci_list.txt". | `'loci_list.txt'` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame in credtools loci list format. |
Source code in credtools/preprocessing/chunk.py
identify_independent_loci(sumstats_files, output_dir, distance_threshold=500000, pvalue_threshold=5e-08, merge_overlapping=True, use_most_sig_if_no_sig=True, min_variants_per_locus=10, **kwargs)
Identify independent loci across multiple ancestries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `sumstats_files` | `Union[Dict[str, str], str]` | Dictionary mapping ancestry/cohort names to munged sumstats files, or single file path. | required |
| `output_dir` | `str` | Output directory for results. | required |
| `distance_threshold` | `int` | Distance threshold in base pairs for independence, by default 500000. | `500000` |
| `pvalue_threshold` | `float` | P-value threshold for significance, by default 5e-8. | `5e-08` |
| `merge_overlapping` | `bool` | Whether to merge overlapping loci across ancestries, by default True. | `True` |
| `use_most_sig_if_no_sig` | `bool` | Whether to use most significant SNP if no significant SNPs found, by default True. | `True` |
| `min_variants_per_locus` | `int` | Minimum number of variants required per locus, by default 10. | `10` |
| `**kwargs` | | Additional parameters. | `{}` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with identified loci coordinates and lead SNPs. |
Examples:
>>> files = {"EUR": "eur.munged.txt.gz", "ASN": "asn.munged.txt.gz"}
>>> loci_df = identify_independent_loci(files, "output/")
Source code in credtools/preprocessing/chunk.py
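To make the roles of `distance_threshold`, `pvalue_threshold`, and `use_most_sig_if_no_sig` concrete, here is a sketch of a common distance-based approach: walk the variants from most to least significant and keep one as a lead only if no earlier lead lies within the threshold on the same chromosome. This is a generic greedy scheme under stated assumptions (inclusive `<=` comparisons, windows of one threshold on each side of the lead), not a transcription of the credtools implementation. `pick_lead_snps` and `loci_windows` are hypothetical names.

```python
from typing import List


def pick_lead_snps(
    variants: List[dict],
    distance_threshold: int = 500_000,
    pvalue_threshold: float = 5e-8,
    use_most_sig_if_no_sig: bool = True,
) -> List[dict]:
    """Greedy selection of lead SNPs that are mutually farther apart than the threshold."""
    ranked = sorted(variants, key=lambda v: v["P"])
    candidates = [v for v in ranked if v["P"] <= pvalue_threshold]
    if not candidates and use_most_sig_if_no_sig and ranked:
        # Fall back to the single most significant variant.
        candidates = ranked[:1]
    leads: List[dict] = []
    for variant in candidates:
        near_existing = any(
            lead["CHR"] == variant["CHR"]
            and abs(lead["BP"] - variant["BP"]) <= distance_threshold
            for lead in leads
        )
        if not near_existing:
            leads.append(variant)
    return leads


def loci_windows(leads: List[dict], distance_threshold: int = 500_000) -> List[dict]:
    """Build a window of +/- distance_threshold around each lead (assumed window size)."""
    return [
        {
            "chr": lead["CHR"],
            "start": max(0, lead["BP"] - distance_threshold),
            "end": lead["BP"] + distance_threshold,
            "lead_bp": lead["BP"],
        }
        for lead in leads
    ]
```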
Prepare Helpers
Prepare LD matrices and final fine-mapping inputs.
This module provides functionality to extract LD matrices from genotype data and create final input files compatible with credtools fine-mapping pipeline.
prepare_finemap_inputs(chunk_info_df, genotype_files, output_dir, threads=1, ld_format='plink', keep_intermediate=False, **kwargs)
Prepare final fine-mapping input files from chunked sumstats and genotype data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_info_df` | `DataFrame` | DataFrame from `chunk_sumstats` with chunked file information. | required |
| `genotype_files` | `Dict[str, str]` | Dictionary mapping ancestry names to genotype file prefixes. Supports PLINK format (.bed/.bim/.fam) and VCF format. | required |
| `output_dir` | `str` | Output directory for prepared files. | required |
| `threads` | `int` | Number of threads for parallel processing, by default 1. | `1` |
| `ld_format` | `str` | Format for LD computation ("plink", "vcf"), by default "plink". | `'plink'` |
| `keep_intermediate` | `bool` | Whether to keep intermediate files, by default False. | `False` |
| `**kwargs` | | Additional parameters. | `{}` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with information about prepared files. |
Examples:
>>> genotype_files = {"EUR": "eur_genotypes", "ASN": "asn_genotypes"}
>>> prepared_df = prepare_finemap_inputs(chunk_info_df, genotype_files, "output/")
Source code in credtools/preprocessing/prepare.py
Low-Level Munging Helpers
Core munging functions for GWAS summary statistics.
Adapted from smunger (https://github.com/Jianhua-Wang/smunger). Original author: Jianhua Wang. License: MIT.
This module provides the main data cleaning and standardization functions for processing GWAS summary statistics.
make_SNPID_unique(df, remove_duplicates=True, col_chr=ColName.CHR, col_bp=ColName.BP, col_ea=ColName.EA, col_nea=ColName.NEA, col_p=ColName.P)
Generate unique SNP identifiers and optionally remove duplicates.
Adapted from smunger.make_SNPID_unique() function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame. | required |
| `remove_duplicates` | `bool` | Whether to remove duplicated SNPs, keeping the one with smallest p-value. | `True` |
| `col_chr` | `str` | Column name for chromosome. | `CHR` |
| `col_bp` | `str` | Column name for base pair position. | `BP` |
| `col_ea` | `str` | Column name for effect allele. | `EA` |
| `col_nea` | `str` | Column name for non-effect allele. | `NEA` |
| `col_p` | `str` | Column name for p-value. | `P` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with unique SNPID column. |
Source code in credtools/preprocessing/munging/core.py
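The idea behind `remove_duplicates` can be sketched as follows: build an identifier from chromosome, position, and alleles, then keep the row with the smallest p-value for each identifier. The `chr-bp-allele-allele` layout with alphabetically sorted alleles is an assumption made so that rows with swapped EA/NEA share one ID. The exact SNPID format credtools produces is not stated here, and `make_snpid` and `dedupe_by_pvalue` are hypothetical names.

```python
from typing import List


def make_snpid(chrom, bp, ea: str, nea: str) -> str:
    """Build an allele-order-independent identifier (assumed format)."""
    first, second = sorted([ea, nea])
    return f"{chrom}-{bp}-{first}-{second}"


def dedupe_by_pvalue(rows: List[dict]) -> List[dict]:
    """Keep one row per SNPID, preferring the smallest p-value."""
    best = {}
    for row in rows:
        snpid = make_snpid(row["CHR"], row["BP"], row["EA"], row["NEA"])
        if snpid not in best or row["P"] < best[snpid]["P"]:
            # Store a copy with the SNPID attached; the input rows stay untouched.
            best[snpid] = {**row, "SNPID": snpid}
    return list(best.values())
```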
munge(df)
Clean and standardize GWAS summary statistics.
Adapted from smunger.munge() function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame with GWAS summary statistics. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | Cleaned and standardized DataFrame. |
Notes
This function performs comprehensive data cleaning, including:
1. Removing columns with all NA values
2. Cleaning and validating core columns (CHR, BP, alleles)
3. Creating unique SNP identifiers
4. Processing p-values, effect sizes, and other statistics
5. Sorting by chromosome and position
Source code in credtools/preprocessing/munging/core.py
transform_allele(series)
Transform allele values to standard format.
Source code in credtools/preprocessing/munging/core.py
transform_chr(series)
Transform chromosome values to standard format.
Source code in credtools/preprocessing/munging/core.py
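Neither `transform_allele` nor `transform_chr` documents its target format, so the following is only a sketch of typical conventions: strip a `chr` prefix, code X as 23, and upper-case alleles restricted to A/C/G/T. All of these rules, and the scalar helper names `normalize_chr` and `normalize_allele`, are assumptions. The real functions operate on a whole series and may treat sex chromosomes, mitochondrial DNA, or indels differently.

```python
from typing import Optional


def normalize_chr(value) -> Optional[int]:
    """Map labels such as 'chr1', '1', or 'X' to an integer, or None if unrecognized."""
    text = str(value).strip().upper()
    if text.startswith("CHR"):
        text = text[3:]
    # Assumption: X is coded as 23; other non-numeric labels are rejected.
    if text == "X":
        return 23
    if text.isdigit() and 1 <= int(text) <= 23:
        return int(text)
    return None


def normalize_allele(value) -> Optional[str]:
    """Upper-case an allele and reject anything outside A/C/G/T."""
    text = str(value).strip().upper()
    return text if text and set(text) <= set("ACGT") else None
```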
Header detection and mapping utilities for GWAS summary statistics.
Adapted from smunger (https://github.com/Jianhua-Wang/smunger). Original author: Jianhua Wang. License: MIT.
This module provides functionality to detect, map, and standardize column headers from various GWAS summary statistics file formats.
apply_header_mapping(df, mapping)
Apply header mapping to DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame. | required |
| `mapping` | `Dict[str, str]` | Mapping from current column names to new names. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with renamed columns. |
Source code in credtools/preprocessing/munging/headers.py
create_config_template(headers, interactive=False)
Create configuration template for column mapping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `headers` | `List[str]` | List of column headers. | required |
| `interactive` | `bool` | Whether to use interactive mode for mapping. | `False` |

Returns:
| Type | Description |
|---|---|
| `Dict[str, Any]` | Configuration dictionary with column mappings. |
Source code in credtools/preprocessing/munging/headers.py
inspect_headers(file_path, sep=None, nrows=5)
Inspect file headers and return column names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path to the input file. | required |
| `sep` | `str` | Column separator. If None, will try to auto-detect. | `None` |
| `nrows` | `int` | Number of rows to read for inspection. | `5` |

Returns:
| Type | Description |
|---|---|
| `List[str]` | List of column headers. |
Source code in credtools/preprocessing/munging/headers.py
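One simple way to auto-detect a separator when `sep` is None is to count candidate delimiters in the header line and pick the most frequent. The sketch below does exactly that. It is a stand-in for whatever detection `inspect_headers` actually performs, and the candidate list (tab, comma, space) and the name `sniff_headers` are assumptions.

```python
import gzip
from typing import List, Optional


def sniff_headers(file_path: str, sep: Optional[str] = None) -> List[str]:
    """Read the first line of a (possibly gzipped) text file and split it into headers."""
    opener = gzip.open if file_path.endswith(".gz") else open
    with opener(file_path, "rt") as handle:
        first_line = handle.readline().rstrip("\r\n")
    if sep is None:
        # Pick the candidate that occurs most often; ties resolve to the earliest candidate.
        sep = max(["\t", ",", " "], key=first_line.count)
    return [header.strip() for header in first_line.split(sep)]
```

Counting on a single line is fragile for headers that contain spaces inside column names; passing `sep` explicitly avoids the guess.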
map_headers_automatic(headers)
Automatically map headers to standard column names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `headers` | `List[str]` | List of column headers from input file. | required |

Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | Mapping from original headers to standard column names. |
Source code in credtools/preprocessing/munging/headers.py
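Automatic mapping of this kind is usually driven by an alias table: each standard column name has a set of known spellings, and headers are matched case-insensitively. The sketch below shows the pattern. The alias sets are illustrative guesses, not the table credtools or smunger ships, and `auto_map` and `unmapped` are hypothetical names.

```python
from typing import Dict, List

# Illustrative alias table (assumption); keys are the standard column names.
ALIASES = {
    "CHR": {"chr", "chrom", "chromosome"},
    "BP": {"bp", "pos", "position", "base_pair_location"},
    "EA": {"ea", "effect_allele"},
    "NEA": {"nea", "other_allele", "non_effect_allele"},
    "P": {"p", "pval", "p_value", "pvalue"},
    "BETA": {"beta", "effect"},
    "SE": {"se", "stderr", "standard_error"},
}


def auto_map(headers: List[str]) -> Dict[str, str]:
    """Map original headers to standard names; each standard name is used at most once."""
    mapping: Dict[str, str] = {}
    taken = set()
    for header in headers:
        key = header.strip().lower()
        for standard, aliases in ALIASES.items():
            if key in aliases and standard not in taken:
                mapping[header] = standard
                taken.add(standard)
                break
    return mapping


def unmapped(headers: List[str], mapping: Dict[str, str]) -> List[str]:
    """Headers that still need a manual or suggested mapping."""
    return [h for h in headers if h not in mapping]
```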
suggest_missing_mappings(headers, mapped_headers)
Suggest mappings for unmapped headers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `headers` | `List[str]` | Original headers. | required |
| `mapped_headers` | `Dict[str, str]` | Already mapped headers. | required |

Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | Suggestions for unmapped headers. |
Source code in credtools/preprocessing/munging/headers.py
validate_required_columns(df, required=None)
Validate that required columns are present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame to validate. | required |
| `required` | `List[str]` | List of required column names. Uses mandatory columns if None. | `None` |

Returns:
| Type | Description |
|---|---|
| `bool` | True if all required columns are present. |
Source code in credtools/preprocessing/munging/headers.py
Validation functions for GWAS summary statistics columns.
Adapted from smunger (https://github.com/Jianhua-Wang/smunger). Original author: Jianhua Wang. License: MIT.
This module provides validation and cleaning functions for individual columns in GWAS summary statistics data.
check_mandatory_cols(df)
Check if DataFrame contains all mandatory columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame to validate. | required |

Raises:
| Type | Description |
|---|---|
| `ValueError` | If any mandatory columns are missing. |
Source code in credtools/preprocessing/munging/validation.py
validate_allele_consistency(df)
Validate that alleles are consistent and biallelic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame with EA and NEA columns. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with consistent alleles. |
Source code in credtools/preprocessing/munging/validation.py
validate_and_clean_column(df, col_name, col_type, min_val=None, max_val=None, allow_na=True, exclude_min=False, exclude_max=False, transform_func=None)
Validate and clean a single column in the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame. | required |
| `col_name` | `str` | Name of column to validate. | required |
| `col_type` | `type` | Target data type for the column. | required |
| `min_val` | `float` | Minimum allowed value. | `None` |
| `max_val` | `float` | Maximum allowed value. | `None` |
| `allow_na` | `bool` | Whether NA values are allowed. | `True` |
| `exclude_min` | `bool` | Whether to exclude the minimum value itself. | `False` |
| `exclude_max` | `bool` | Whether to exclude the maximum value itself. | `False` |
| `transform_func` | `callable` | Function to transform values before validation. | `None` |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with cleaned column. |
Source code in credtools/preprocessing/munging/validation.py
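The interplay of `min_val`, `max_val`, `exclude_min`, `exclude_max`, and `allow_na` is easiest to see on a plain list of values. The sketch below mirrors the parameter names from the signature above but is a hypothetical re-implementation on lists, not the credtools function. Turning invalid values into `None` and dropping them only when `allow_na=False` is an assumed behavior.

```python
import math
from typing import Any, Callable, List, Optional


def clean_values(
    values: List[Any],
    col_type: type,
    min_val: Optional[float] = None,
    max_val: Optional[float] = None,
    allow_na: bool = True,
    exclude_min: bool = False,
    exclude_max: bool = False,
    transform_func: Optional[Callable] = None,
) -> List[Any]:
    """Coerce values to col_type, null out-of-range entries, optionally drop nulls."""
    cleaned = []
    for raw in values:
        try:
            value = transform_func(raw) if transform_func else raw
            value = col_type(value)
        except (TypeError, ValueError):
            value = None  # not convertible
        if isinstance(value, float) and math.isnan(value):
            value = None
        if value is not None:
            # exclude_min / exclude_max turn the closed bounds into open ones.
            too_low = min_val is not None and (
                value <= min_val if exclude_min else value < min_val
            )
            too_high = max_val is not None and (
                value >= max_val if exclude_max else value > max_val
            )
            if too_low or too_high:
                value = None
        if value is None and not allow_na:
            continue  # drop the entry entirely
        cleaned.append(value)
    return cleaned
```

For a p-value column, for instance, one would plausibly call this with `min_val=0, max_val=1, exclude_min=True`, so that an exact zero is rejected while 1 is kept.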
validate_frequency_consistency(df)
Validate frequency column consistency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame with frequency columns. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with consistent frequency data. |
Source code in credtools/preprocessing/munging/validation.py
validate_pvalue_consistency(df)
Validate p-value consistency with other statistics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame with p-value and other statistical columns. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with consistent p-values. |
Source code in credtools/preprocessing/munging/validation.py
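One consistency check that fits this description is comparing the reported p-value with the two-sided p-value implied by the Z-score `beta / se` under a standard normal. Whether `validate_pvalue_consistency` does exactly this is an assumption. The sketch below only demonstrates the relationship, with a hypothetical tolerance `rel_tol` and hypothetical helper names.

```python
import math


def p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal Z-score: erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))


def pvalue_matches(beta: float, se: float, p: float, rel_tol: float = 0.05) -> bool:
    """True if the reported p-value agrees with the one implied by beta / se."""
    if se <= 0:
        return False  # a non-positive standard error cannot yield a valid Z-score
    return math.isclose(p_from_z(beta / se), p, rel_tol=rel_tol)
```

Reported p-values from tests other than a Wald test (or with heavy rounding) can legitimately deviate, so a tolerance-based flag is safer than an exact comparison.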
validate_statistical_consistency(df)
Validate statistical consistency (e.g., beta and SE relationship).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input DataFrame with statistical columns. | required |

Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with statistically consistent data. |