Core Objects
Use these objects when you want to load data yourself and call CREDTOOLS from Python.
Locus Objects
Class for the input data of the fine-mapping analysis.
Locus(popu, cohort, sample_size, sumstats, locus_start, locus_end, ld=None, if_intersect=False)
Locus class to represent a genomic locus with associated summary statistics and linkage disequilibrium (LD) matrix.
Attributes:
| Name | Type | Description |
|---|---|---|
| original_sumstats | DataFrame | The original summary statistics file. |
| sumstats | DataFrame | The processed summary statistics file. |
| ld | LDMatrix | The LD matrix object. |
| chrom | int | Chromosome. |
| start | int | Start position of the locus. |
| end | int | End position of the locus. |
| n_snps | int | Number of SNPs in the locus. |
| prefix | str | The prefix combining population and cohort. |
| locus_id | str | Unique identifier for the locus. |
| is_matched | bool | Whether the LD matrix and summary statistics file are matched. |
| lambda_s | Optional[float] | The estimated lambda_s parameter from estimate_s_rss function, None if not calculated. |
Notes
If no LD matrix is provided, only the ABF method can be used for fine-mapping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| popu | str | Population code, e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"]. | required |
| cohort | str | Cohort name. | required |
| sample_size | int | Sample size. | required |
| sumstats | DataFrame | Summary statistics DataFrame. | required |
| locus_start | int | Fixed start position for the locus. | required |
| locus_end | int | Fixed end position for the locus. | required |
| ld | Optional[LDMatrix] | LD matrix, by default None. | None |
| if_intersect | bool | Whether to intersect the LD matrix and summary statistics file, by default False. | False |
Warnings
If no LD matrix is provided, a warning is logged that only the ABF method can be used.
Source code in credtools/locus.py
chrom
property
Get the chromosome.
cohort
property
Get the cohort name.
end
property
Get the end position.
is_matched
property
Check if the LD matrix and sumstats file are matched.
locus_id
property
Get the locus ID.
n_snps
property
Get the number of SNPs.
original_sumstats
property
Get the original sumstats file.
popu
property
Get the population code.
prefix
property
Get the prefix of the locus.
sample_size
property
Get the sample size.
start
property
Get the start position.
__repr__()
Return a string representation of the Locus object.
Returns:
| Type | Description |
|---|---|
| str | String representation of the Locus object. |
Source code in credtools/locus.py
copy()
Copy the Locus object.
Returns:
| Type | Description |
|---|---|
| Locus | A copy of the Locus object. |
Source code in credtools/locus.py
LocusSet(loci)
LocusSet class to represent a set of genomic loci.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loci | List[Locus] | List of Locus objects. | required |

Attributes:
| Name | Type | Description |
|---|---|---|
| loci | List[Locus] | List of Locus objects. |
| n_loci | int | Number of loci. |
| chrom | int | Chromosome number. |
| start | int | Start position of the locus. |
| end | int | End position of the locus. |
| locus_id | str | Unique identifier for the locus. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the chromosomes of the loci are not the same. |
Source code in credtools/locus.py
chrom
property
Get the chromosome.
Returns:
| Type | Description |
|---|---|
| int | Chromosome number. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the chromosomes of the loci are not the same. |
end
property
Get the end position.
locus_id
property
Get the locus ID.
n_loci
property
Get the number of loci.
start
property
Get the start position.
__repr__()
Return a string representation of the LocusSet object.
Returns:
| Type | Description |
|---|---|
| str | String representation of the LocusSet object. |
Source code in credtools/locus.py
check_loci_info(loci_info)
Check and validate loci information DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| loci_info | DataFrame | DataFrame containing loci information. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | Validated and type-corrected loci_info DataFrame. |

Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing, data types are incorrect, or locus_id/boundary consistency checks fail. |

Notes
This function performs the following checks:
1. Ensures all required columns are present
2. Validates and converts data types
3. Checks that loci with the same locus_id have the same chr, start, end
4. Validates chromosome, start, and end values
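The locus_id/boundary consistency check can be sketched in plain Python. This is only an illustration of the idea, not the credtools implementation: `find_inconsistent_loci` is a hypothetical helper, and the rows are plain dicts rather than a DataFrame.

```python
from collections import defaultdict


def find_inconsistent_loci(rows):
    """Return locus_ids whose rows disagree on (chr, start, end).

    Hypothetical helper for illustration; `rows` is a list of dicts.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row["locus_id"]].add((row["chr"], row["start"], row["end"]))
    # A locus is consistent only if all of its rows share one boundary triple.
    return sorted(lid for lid, bounds in seen.items() if len(bounds) > 1)


rows = [
    {"locus_id": "L1", "chr": 1, "start": 1000, "end": 2000},
    {"locus_id": "L1", "chr": 1, "start": 1000, "end": 2000},
    {"locus_id": "L2", "chr": 2, "start": 500, "end": 900},
    {"locus_id": "L2", "chr": 2, "start": 500, "end": 950},
]
bad = find_inconsistent_loci(rows)
```

Here `bad` lists only `L2`, whose two rows disagree on the end position.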
Source code in credtools/locus.py
intersect_sumstat_ld(locus)
Intersect the Variant IDs in the LD matrix and the sumstats file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| locus | Locus | Locus object containing LD matrix and summary statistics. | required |

Returns:
| Type | Description |
|---|---|
| Locus | Locus object containing the intersected LD matrix and sumstats file. |

Raises:
| Type | Description |
|---|---|
| ValueError | If LD matrix not found or no common Variant IDs found between the LD matrix and the sumstats file. |
Warnings
If only a few common Variant IDs are found (≤ 10), a warning is logged.
Notes
This function performs the following operations:
- Checks if LD matrix and summary statistics are already matched
- Finds common SNP IDs between LD matrix and summary statistics
- Subsets both datasets to common variants
- Reorders data to maintain consistency
- Returns a new Locus object with intersected data
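The subsetting and reordering steps can be sketched without any library. The helper name `intersect_ids` is hypothetical, the LD matrix is a nested list, and keeping the summary-statistics order is an assumption: the notes above do not say which input defines the final order.

```python
def intersect_ids(sumstat_ids, ld_ids, r):
    """Keep variants present in both inputs, ordered as in the sumstats.

    Hypothetical helper; `r` is a square matrix given as nested lists.
    """
    ld_index = {snp: i for i, snp in enumerate(ld_ids)}
    common = [snp for snp in sumstat_ids if snp in ld_index]
    if not common:
        raise ValueError("No common Variant IDs found.")
    keep = [ld_index[snp] for snp in common]
    # Subset rows and columns of the LD matrix in the same order.
    r_sub = [[r[i][j] for j in keep] for i in keep]
    return common, r_sub


sumstat_ids = ["1-2000-C-T", "1-1000-A-G", "1-3000-A-C"]
ld_ids = ["1-1000-A-G", "1-2000-C-T", "1-4000-G-T"]
r = [
    [1.0, 0.3, 0.1],
    [0.3, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
common, r_sub = intersect_ids(sumstat_ids, ld_ids, r)
```

Only the two shared variants survive, and the matrix rows and columns follow the same order as `common`.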
Source code in credtools/locus.py
load_locus(prefix, popu, cohort, sample_size, locus_start, locus_end, if_intersect=False, calculate_lambda_s=False, **kwargs)
Load the input data of the fine-mapping analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prefix | str | Prefix of the input files. | required |
| popu | str | Population of the input data. | required |
| cohort | str | Cohort of the input data. | required |
| sample_size | int | Sample size of the input data. | required |
| locus_start | int | Fixed start position for the locus. | required |
| locus_end | int | Fixed end position for the locus. | required |
| if_intersect | bool | Whether to intersect the input data with the LD matrix, by default False. | False |
| calculate_lambda_s | bool | Whether to calculate lambda_s parameter using estimate_s_rss function, by default False. | False |
| **kwargs | Any | Additional keyword arguments passed to loading functions. | {} |

Returns:
| Type | Description |
|---|---|
| Locus | Locus object containing the input data. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the required input files are not found. |
Notes
The function looks for files with the following patterns:
- Summary statistics: {prefix}.sumstat or {prefix}.sumstats.gz
- LD matrix: {prefix}.ld or {prefix}.ld.npz
- LD map: {prefix}.ldmap or {prefix}.ldmap.gz
All files are required for proper functioning.
Examples:
>>> locus = load_locus('EUR_study1', 'EUR', 'study1', 50000, locus_start=1000000, locus_end=2000000)
>>> print(f"Loaded locus with {locus.n_snps} SNPs")
Loaded locus with 10000 SNPs
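The file lookup described in the notes can be sketched as follows. The helpers `first_existing` and `resolve_inputs` are hypothetical, and preferring the first listed suffix when both exist is an assumption rather than documented behaviour.

```python
import os
import tempfile


def first_existing(candidates):
    """Return the first path in `candidates` that exists, else raise."""
    for path in candidates:
        if os.path.exists(path):
            return path
    raise ValueError(f"None of these files were found: {candidates}")


def resolve_inputs(prefix):
    """Map each input type to a file on disk, using the documented suffixes."""
    return {
        "sumstats": first_existing([f"{prefix}.sumstat", f"{prefix}.sumstats.gz"]),
        "ld": first_existing([f"{prefix}.ld", f"{prefix}.ld.npz"]),
        "ldmap": first_existing([f"{prefix}.ldmap", f"{prefix}.ldmap.gz"]),
    }


# Demonstrate with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "EUR_study1")
    for suffix in (".sumstats.gz", ".ld.npz", ".ldmap"):
        open(prefix + suffix, "w").close()
    found = resolve_inputs(prefix)
    suffixes = {key: path[len(prefix):] for key, path in found.items()}
```

A missing file of any of the three types raises ValueError, matching the "all files are required" note.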
Source code in credtools/locus.py
load_locus_set(locus_info, if_intersect=False, calculate_lambda_s=False, **kwargs)
Load the input data of the fine-mapping analysis for multiple loci.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| locus_info | DataFrame | DataFrame containing the locus information with required columns: ['prefix', 'popu', 'cohort', 'sample_size', 'chr', 'start', 'end', 'locus_id']. | required |
| if_intersect | bool | Whether to intersect the input data with the LD matrix, by default False. | False |
| calculate_lambda_s | bool | Whether to calculate lambda_s parameter using estimate_s_rss function, by default False. | False |
| **kwargs | Any | Additional keyword arguments passed to load_locus function. | {} |

Returns:
| Type | Description |
|---|---|
| LocusSet | LocusSet object containing the input data. |

Raises:
| Type | Description |
|---|---|
| ValueError | If required columns are missing or if the combination of popu and cohort is not unique. |
Notes
The locus_info DataFrame must contain the following columns:
- prefix: File prefix for each locus
- popu: Population code
- cohort: Cohort name
- sample_size: Sample size for the cohort
- chr: Chromosome number
- start: Start position of the locus
- end: End position of the locus
- locus_id: Locus identifier
All rows must have the same chr, start, end, locus_id values (representing the same locus).
Examples:
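>>> locus_info = pd.DataFrame({
... 'prefix': ['EUR_study1', 'EAS_study2'],
... 'popu': ['EUR', 'EAS'],
... 'cohort': ['study1', 'study2'],
... 'sample_size': [50000, 30000],
... 'chr': [1, 1],
... 'start': [1000000, 1000000],
... 'end': [2000000, 2000000],
... 'locus_id': ['locus1', 'locus1']
... })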
>>> locus_set = load_locus_set(locus_info)
>>> print(f"Loaded {locus_set.n_loci} loci")
Loaded 2 loci
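The two conditions listed under Raises (required columns present, popu/cohort combinations unique) can be sketched in plain Python. `validate_locus_info` is a hypothetical helper working on a list of dicts, not the function credtools uses internally.

```python
from collections import Counter

REQUIRED = ["prefix", "popu", "cohort", "sample_size",
            "chr", "start", "end", "locus_id"]


def validate_locus_info(rows):
    """Raise ValueError on missing columns or repeated (popu, cohort) pairs."""
    for row in rows:
        missing = [col for col in REQUIRED if col not in row]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
    counts = Counter((row["popu"], row["cohort"]) for row in rows)
    repeated = [pair for pair, n in counts.items() if n > 1]
    if repeated:
        raise ValueError(f"popu/cohort combinations are not unique: {repeated}")


rows = [
    {"prefix": "EUR_study1", "popu": "EUR", "cohort": "study1",
     "sample_size": 50000, "chr": 1, "start": 1000000, "end": 2000000,
     "locus_id": "locus1"},
    {"prefix": "EAS_study2", "popu": "EAS", "cohort": "study2",
     "sample_size": 30000, "chr": 1, "start": 1000000, "end": 2000000,
     "locus_id": "locus1"},
]
validate_locus_info(rows)  # passes silently

duplicate_error = None
try:
    validate_locus_info(rows + [rows[0]])
except ValueError as exc:
    duplicate_error = str(exc)
```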
Source code in credtools/locus.py
LD Matrices
Functions for reading and converting lower triangle matrices.
LDMatrix(map_df, r)
Class to store the LD matrix and the corresponding Variant IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| map_df | DataFrame | DataFrame containing the Variant IDs. | required |
| r | ndarray | LD matrix. | required |

Attributes:
| Name | Type | Description |
|---|---|---|
| map | DataFrame | DataFrame containing the Variant IDs. |
| r | ndarray | LD matrix. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the number of rows in the map file does not match the number of rows in the LD matrix. |
Source code in credtools/ldmatrix.py
__check_length()
Check if the number of rows in the map file matches the number of rows in the LD matrix.
Raises:
| Type | Description |
|---|---|
| ValueError | If the number of rows in the map file does not match the number of rows in the LD matrix. |
Source code in credtools/ldmatrix.py
__repr__()
Return a string representation of the LDMatrix object.
Returns:
| Type | Description |
|---|---|
| str | String representation showing the shapes of map and r. |
load_ld(ld_path, map_path, delimiter='\t', if_sort_alleles=True)
Read LD matrices and Variant IDs from files. Pair each matrix with its corresponding Variant IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ld_path | str | Path to the input text file containing the lower triangle matrix or .npz file. | required |
| map_path | str | Path to the input text file containing the Variant IDs. | required |
| delimiter | str | Delimiter used in the input file, by default "\t". | '\t' |
| if_sort_alleles | bool | Sort alleles in the LD map in alphabetical order and change the sign of the LD matrix if the alleles are swapped, by default True. | True |

Returns:
| Type | Description |
|---|---|
| LDMatrix | Object containing the LD matrix and the Variant IDs. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the number of variants in the map file does not match the number of rows in the LD matrix. |
Notes
Future enhancements planned:
- Support for npz files (partially implemented)
- Support for plink bin4 format
- Support for ldstore bcor format
The function validates that the LD matrix and map file have consistent dimensions and optionally sorts alleles for consistent representation.
Examples:
>>> ld_matrix = load_ld('data.ld', 'data.ldmap')
>>> print(f"Loaded LD matrix with {ld_matrix.r.shape[0]} variants")
Loaded LD matrix with 1000 variants
Source code in credtools/ldmatrix.py
load_ld_map(map_path, delimiter='\t')
Read Variant IDs from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| map_path | str | Path to the input text file containing the Variant IDs. | required |
| delimiter | str | Delimiter used in the input file, by default "\t". | '\t' |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing the Variant IDs with columns CHR, BP, A1, A2, and SNPID. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the input file is empty or does not contain the required columns. |
Notes
This function assumes that the input file contains the required columns:
- Chromosome (CHR)
- Base pair position (BP)
- Allele 1 (A1)
- Allele 2 (A2)
The function performs data cleaning including:
- Converting chromosome and position to appropriate types
- Validating alleles are valid DNA bases (A, C, G, T)
- Removing variants where A1 == A2
- Creating unique SNPID identifiers
Examples:
>>> # Create sample map file
>>> contents = "CHR\tBP\tA1\tA2\n1\t1000\tA\tG\n1\t2000\tC\tT\n2\t3000\tT\tC"
>>> with open('map.txt', 'w') as file:
... file.write(contents)
>>> df = load_ld_map('map.txt')
>>> print(df)
SNPID CHR BP A1 A2
0 1-1000-A-G 1 1000 A G
1 1-2000-C-T 1 2000 C T
2 2-3000-C-T 2 3000 T C
Source code in credtools/ldmatrix.py
load_ld_matrix(file_path, delimiter='\t')
Convert a lower triangle matrix from a file to a symmetric square matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the input text file containing the lower triangle matrix. | required |
| delimiter | str | Delimiter used in the input file, by default "\t". | '\t' |

Returns:
| Type | Description |
|---|---|
| ndarray | Symmetric square matrix with diagonal filled with 1. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the input file is empty or does not contain a valid lower triangle matrix. |
| FileNotFoundError | If the specified file does not exist. |
Notes
This function assumes that the input file contains a valid lower triangle matrix with each row on a new line and elements separated by the specified delimiter. For .npz files, it loads the first array key in the file.
Examples:
>>> # Assuming 'lower_triangle.txt' contains:
>>> # 1.0
>>> # 0.1 1.0
>>> # 0.2 0.4 1.0
>>> # 0.3 0.5 0.6 1.0
>>> matrix = load_ld_matrix('lower_triangle.txt')
>>> matrix
array([[1. , 0.1, 0.2, 0.3],
       [0.1, 1. , 0.4, 0.5],
       [0.2, 0.4, 1. , 0.6],
       [0.3, 0.5, 0.6, 1. ]])
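The conversion itself is simple to sketch without NumPy. The function below is a hypothetical stand-in that parses text directly instead of reading a file and returns nested lists; it mirrors the documented behaviour (symmetric result, diagonal set to 1) but is not the library code.

```python
def lower_triangle_to_symmetric(text, delimiter="\t"):
    """Parse a lower-triangle matrix and mirror it into a full square matrix."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("Input is empty.")
    rows = [[float(x) for x in line.split(delimiter)] for line in lines]
    # Row i (0-based) of a lower triangle must hold exactly i + 1 values.
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError("Not a valid lower triangle matrix.")
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            matrix[i][j] = value
            matrix[j][i] = value  # mirror across the diagonal
        matrix[i][i] = 1.0  # diagonal is filled with 1
    return matrix


text = "1.0\n0.1\t1.0\n0.2\t0.4\t1.0\n0.3\t0.5\t0.6\t1.0\n"
matrix = lower_triangle_to_symmetric(text)
```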
Source code in credtools/ldmatrix.py
read_lower_triangle(file_path, delimiter='\t')
Read a lower triangle matrix from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str | Path to the input text file containing the lower triangle matrix. | required |
| delimiter | str | Delimiter used in the input file, by default "\t". | '\t' |

Returns:
| Type | Description |
|---|---|
| ndarray | Lower triangle matrix. |

Raises:
| Type | Description |
|---|---|
| ValueError | If the input file is empty or does not contain a valid lower triangle matrix. |
| FileNotFoundError | If the specified file does not exist. |
Notes
This function reads a lower triangular matrix in which row i contains i elements: the entries from the first column up to and including the diagonal.
Source code in credtools/ldmatrix.py
sort_alleles(ld)
Sort alleles in the LD map in alphabetical order. Change the sign of the LD matrix if the alleles are swapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ld | LDMatrix | LDMatrix object containing the Variant IDs and the LD matrix. | required |

Returns:
| Type | Description |
|---|---|
| LDMatrix | LDMatrix object containing the Variant IDs and the LD matrix with alleles sorted. |
Notes
This function ensures consistent allele ordering by:
- Sorting alleles alphabetically (A1 <= A2)
- Flipping the sign of LD correlations for variants where alleles were swapped
- Maintaining diagonal elements as 1.0
This is important for consistent merging across different datasets.
Examples:
>>> map_df = pd.DataFrame({
... 'SNPID': ['1-1000-A-G', '1-2000-C-T'],
... 'CHR': [1, 1],
... 'BP': [1000, 2000],
... 'A1': ['A', 'T'],
... 'A2': ['G', 'C']
... })
>>> r_matrix = np.array([[1. , 0.1],
... [0.1, 1. ]])
>>> ld = LDMatrix(map_df, r_matrix)
>>> sorted_ld = sort_alleles(ld)
>>> print(sorted_ld.map)
SNPID CHR BP A1 A2
0 1-1000-A-G 1 1000 A G
1 1-2000-C-T 1 2000 C T
>>> sorted_ld.r
array([[ 1. , -0.1],
       [-0.1,  1. ]])
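The sign rule behind this example can be written down compactly: the correlation between variants i and j changes sign when exactly one of them had its alleles swapped. A minimal sketch with nested lists follows; `sort_alleles_with_flip` is a hypothetical name, not the library function.

```python
def sort_alleles_with_flip(alleles, r):
    """Sort each (A1, A2) pair and flip LD signs for swapped variants.

    `alleles` is a list of (A1, A2) tuples; `r` is a nested-list matrix.
    """
    # -1 marks a variant whose alleles had to be swapped to reach A1 <= A2.
    sign = [1 if a1 <= a2 else -1 for a1, a2 in alleles]
    sorted_alleles = [tuple(sorted(pair)) for pair in alleles]
    n = len(alleles)
    # sign[i] * sign[j] is -1 only when exactly one of i, j was swapped,
    # and always +1 on the diagonal, so the diagonal stays at 1.0.
    flipped = [[r[i][j] * sign[i] * sign[j] for j in range(n)] for i in range(n)]
    return sorted_alleles, flipped


alleles = [("A", "G"), ("T", "C")]
r = [[1.0, 0.1], [0.1, 1.0]]
sorted_alleles, flipped = sort_alleles_with_flip(alleles, r)
```

With the inputs from the example above, the second variant is swapped to C/T and the off-diagonal entries become -0.1.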
Source code in credtools/ldmatrix.py
Summary Statistics
Functions for processing summary statistics data.
check_colnames(df)
Check column names in the DataFrame and fill missing columns with None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame to check for column names. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with all required columns, filling missing ones with None. |
Notes
This function ensures that all required summary statistics columns are present in the DataFrame. Missing columns are added with None values.
Source code in credtools/sumstats.py
check_mandatory_cols(df)
Check if the DataFrame contains all mandatory columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to check for mandatory columns. | required |

Returns:
| Type | Description |
|---|---|
| None | |

Raises:
| Type | Description |
|---|---|
| ValueError | If any mandatory columns are missing. |
Notes
Mandatory columns are defined in ColName.mandatory_cols and typically include essential fields like chromosome, position, alleles, effect size, and p-value.
Source code in credtools/sumstats.py
get_significant_snps(df, pvalue_threshold=5e-08, use_most_sig_if_no_sig=True)
Retrieve significant SNPs from the input DataFrame based on a p-value threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input summary statistics containing SNP information. | required |
| pvalue_threshold | float | The p-value threshold for significance, by default 5e-8. | 5e-08 |
| use_most_sig_if_no_sig | bool | Whether to return the most significant SNP if no SNP meets the threshold, by default True. | True |

Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing significant SNPs, sorted by p-value in ascending order. |

Raises:
| Type | Description |
|---|---|
| ValueError | If no significant SNPs are found and use_most_sig_if_no_sig is False. |
| KeyError | If required columns are not present in the input DataFrame. |
Notes
If no SNPs meet the significance threshold and use_most_sig_if_no_sig is True,
the function returns the SNP with the smallest p-value.
Examples:
>>> data = {
... 'SNPID': ['rs1', 'rs2', 'rs3'],
... 'P': [1e-9, 0.05, 1e-8]
... }
>>> df = pd.DataFrame(data)
>>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
>>> print(significant_snps)
SNPID P
0 rs1 1.000000e-09
2 rs3 1.000000e-08
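The fallback logic can be sketched on plain dicts. `significant_snps` is a hypothetical helper, and treating the threshold as inclusive (`<=`) is an assumption; the documentation above does not say whether the comparison is strict.

```python
def significant_snps(rows, pvalue_threshold=5e-8, use_most_sig_if_no_sig=True):
    """Return rows passing the threshold, sorted by ascending p-value."""
    hits = sorted(
        (row for row in rows if row["P"] <= pvalue_threshold),
        key=lambda row: row["P"],
    )
    if hits:
        return hits
    if use_most_sig_if_no_sig and rows:
        # Fall back to the single smallest p-value.
        return [min(rows, key=lambda row: row["P"])]
    raise ValueError("No significant SNPs found.")


rows = [
    {"SNPID": "rs1", "P": 1e-9},
    {"SNPID": "rs2", "P": 0.05},
    {"SNPID": "rs3", "P": 1e-8},
]
passing = [row["SNPID"] for row in significant_snps(rows)]
fallback = [row["SNPID"] for row in significant_snps(rows, pvalue_threshold=1e-12)]
```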
Source code in credtools/sumstats.py
load_sumstats(filename, if_sort_alleles=True, sep=None, nrows=None, skiprows=0, comment=None, gzipped=None)
Load summary statistics from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filename | str | The path to the file containing the summary statistics. The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P. | required |
| if_sort_alleles | bool | Whether to sort alleles in alphabetical order, by default True. | True |
| sep | Optional[str] | The delimiter to use. If None, the delimiter is inferred from the file, by default None. | None |
| nrows | Optional[int] | Number of rows to read. If None, all rows are read, by default None. | None |
| skiprows | int | Number of lines to skip at the start of the file, by default 0. | 0 |
| comment | Optional[str] | Character that marks the start of a comment in the file, by default None. | None |
| gzipped | Optional[bool] | Whether the file is gzipped. If None, it is inferred from the file extension, by default None. | None |

Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing the loaded summary statistics. |
Notes
The function performs the following operations:
- Auto-detects file compression (gzip) from file extension
- Auto-detects delimiter (tab, comma, or space) from file content
- Loads the data using pandas.read_csv
- Applies comprehensive data munging and quality control
- Optionally sorts alleles for consistency
Quality control covers chromosomes, positions, alleles, p-values, effect sizes, and frequencies.
Examples:
>>> # Load summary statistics with automatic format detection
>>> sumstats = load_sumstats('gwas_results.txt.gz')
>>> print(f"Loaded {len(sumstats)} variants")
Loaded 1000000 variants
>>> # Load with specific parameters
>>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
>>> print(sumstats.columns.tolist())
['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']
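The two auto-detection steps can be sketched with the standard library. Both helpers are hypothetical, and the detection rule (try tab, then comma, then space on the header line) is an assumption about one reasonable strategy, not a description of the credtools internals.

```python
import gzip
import os
import tempfile


def open_text(path, gzipped=None):
    """Open a file as text, inferring gzip from the extension when unset."""
    if gzipped is None:
        gzipped = path.endswith(".gz")
    return gzip.open(path, "rt") if gzipped else open(path, "r")


def guess_delimiter(header_line):
    """Return the first of tab, comma, space that occurs in the header."""
    for sep in ("\t", ",", " "):
        if sep in header_line:
            return sep
    raise ValueError("Could not infer the delimiter.")


# Write a tiny gzipped file, then read its header back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gwas.txt.gz")
    with gzip.open(path, "wt") as handle:
        handle.write("CHR\tBP\tP\n1\t1000\t0.01\n")
    with open_text(path) as handle:
        header = handle.readline().rstrip("\n")
    sep = guess_delimiter(header)
    columns = header.split(sep)
```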
Source code in credtools/sumstats.py
make_SNPID_unique(sumstat, remove_duplicates=True, col_chr=ColName.CHR, col_bp=ColName.BP, col_ea=ColName.EA, col_nea=ColName.NEA, col_p=ColName.P)
Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sumstat | DataFrame | The input summary statistics containing SNP information. | required |
| remove_duplicates | bool | Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True. | True |
| col_chr | str | The column name for chromosome information, by default ColName.CHR. | CHR |
| col_bp | str | The column name for base-pair position information, by default ColName.BP. | BP |
| col_ea | str | The column name for effect allele information, by default ColName.EA. | EA |
| col_nea | str | The column name for non-effect allele information, by default ColName.NEA. | NEA |
| col_p | str | The column name for p-value information, by default ColName.P. | P |

Returns:
| Type | Description |
|---|---|
| DataFrame | The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets. |

Raises:
| Type | Description |
|---|---|
| KeyError | If required columns are missing from the input DataFrame. |
| ValueError | If the input DataFrame is empty or becomes empty after processing. |
Notes
This function constructs a unique SNPID by concatenating chromosome, base-pair position, and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of multiple summary statistics without the need for extensive duplicate comparisons.
The unique SNPID format: "chr-bp-sortedEA-sortedNEA"
If duplicates are found and remove_duplicates is False, a suffix "-N" is added to make
identifiers unique, where N is the occurrence number.
Examples:
>>> data = {
... 'CHR': ['1', '1', '2'],
... 'BP': [12345, 12345, 67890],
... 'EA': ['A', 'A', 'G'],
... 'NEA': ['G', 'G', 'A'],
... 'RSID': ['rs1', 'rs2', 'rs3'],
... 'P': [1e-5, 1e-6, 1e-7]
... }
>>> df = pd.DataFrame(data)
>>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
>>> print(unique_df)
SNPID CHR BP EA NEA RSID P
0 1-12345-A-G 1 12345 A G rs2 1.000000e-06
1 2-67890-A-G 2 67890 G A rs3 1.000000e-07
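The identifier format and the keep-smallest-p rule can be reproduced in a few lines of plain Python. The helper names are hypothetical and the rows are dicts instead of a DataFrame, so treat this as a sketch of the idea only.

```python
def make_snpid(chrom, bp, ea, nea):
    """Build 'chr-bp-allele1-allele2' with the two alleles sorted."""
    a1, a2 = sorted([ea, nea])
    return f"{chrom}-{bp}-{a1}-{a2}"


def dedupe_by_pvalue(rows):
    """Keep one row per SNPID: the one with the smallest p-value."""
    best = {}
    for row in rows:
        snpid = make_snpid(row["CHR"], row["BP"], row["EA"], row["NEA"])
        if snpid not in best or row["P"] < best[snpid]["P"]:
            best[snpid] = {"SNPID": snpid, **row}
    return list(best.values())


rows = [
    {"CHR": "1", "BP": 12345, "EA": "A", "NEA": "G", "RSID": "rs1", "P": 1e-5},
    {"CHR": "1", "BP": 12345, "EA": "A", "NEA": "G", "RSID": "rs2", "P": 1e-6},
    {"CHR": "2", "BP": 67890, "EA": "G", "NEA": "A", "RSID": "rs3", "P": 1e-7},
]
unique = dedupe_by_pvalue(rows)
kept = [(row["SNPID"], row["RSID"]) for row in unique]
```

As in the example above, rs2 wins over rs1 at the shared position, and the G/A variant gets the sorted identifier 2-67890-A-G while its EA and NEA columns are left unchanged.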
Source code in credtools/sumstats.py
munge(df)
Munge the summary statistics DataFrame by performing a series of transformations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame containing summary statistics. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | The munged DataFrame with necessary transformations applied. |

Raises:
| Type | Description |
|---|---|
| ValueError | If any mandatory columns are missing. |
Notes
This function performs comprehensive data cleaning and standardization:
- Validates mandatory columns are present
- Removes entirely missing columns
- Cleans chromosome and position data
- Validates and standardizes allele information
- Creates unique SNP identifiers
- Validates p-values, effect sizes, and standard errors
- Processes allele frequencies
- Handles rsID information if present
The function applies strict quality control and may remove variants that don't meet validation criteria.
Source code in credtools/sumstats.py
munge_allele(df)
Munge allele columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with allele columns. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged allele columns. |
Notes
This function:
- Removes rows with missing allele values
- Converts alleles to uppercase
- Validates alleles contain only valid DNA bases (A, C, G, T)
- Removes variants where effect allele equals non-effect allele
Invalid alleles and monomorphic variants are removed and logged.
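The four allele rules translate directly into a filter. The sketch below uses a hypothetical `clean_alleles` helper over dict rows; allowing multi-base alleles such as "AT" is an assumption, since the notes only require that every character be a valid base.

```python
VALID_BASES = set("ACGT")


def clean_alleles(rows):
    """Apply the documented allele checks and return the surviving rows."""
    kept = []
    for row in rows:
        ea, nea = row.get("EA"), row.get("NEA")
        if not ea or not nea:
            continue  # missing allele
        ea, nea = ea.upper(), nea.upper()
        if not (set(ea) <= VALID_BASES and set(nea) <= VALID_BASES):
            continue  # contains a character other than A, C, G, T
        if ea == nea:
            continue  # effect allele equals non-effect allele
        kept.append({**row, "EA": ea, "NEA": nea})
    return kept


rows = [
    {"EA": "a", "NEA": "g"},   # kept, upper-cased
    {"EA": "A", "NEA": None},  # missing allele
    {"EA": "N", "NEA": "T"},   # invalid base
    {"EA": "C", "NEA": "C"},   # identical alleles
    {"EA": "AT", "NEA": "A"},  # kept (multi-base allele)
]
cleaned = clean_alleles(rows)
```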
Source code in credtools/sumstats.py
munge_beta(df)
Munge beta column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with beta column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged beta column. |
Notes
This function:
- Converts beta values to numeric type
- Removes rows with missing beta values
- Converts to appropriate data type
Invalid beta values are removed and logged.
Source code in credtools/sumstats.py
munge_bp(df)
Munge position column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with position column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged position column. |
Notes
This function:
- Removes rows with missing position values
- Converts position to numeric type
- Validates positions are within acceptable range (exclusive: > 0, < 300M)
- Converts to appropriate data type
Invalid position values are removed and logged.
Source code in credtools/sumstats.py
munge_chr(df)
Munge chromosome column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with chromosome column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged chromosome column. |
Notes
This function:
- Removes rows with missing chromosome values
- Converts chromosome to string and removes 'chr' prefix
- Converts X chromosome to numeric value (23)
- Validates chromosome values are within acceptable range
- Converts to appropriate data type
Invalid chromosome values are removed and logged.
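A per-value version of these steps might look like the sketch below. `normalize_chrom` is a hypothetical helper, and the accepted range of 1 to 23 is an assumption: the notes say only that values must fall "within acceptable range".

```python
def normalize_chrom(value, max_chrom=23):
    """Return the chromosome as an int, or None if it is missing or invalid."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower().startswith("chr"):
        text = text[3:]  # drop the 'chr' prefix
    if text.upper() == "X":
        text = "23"  # X is encoded as 23
    if not text.isdigit():
        return None
    chrom = int(text)
    return chrom if 1 <= chrom <= max_chrom else None


raw = ["chr1", "X", "chrX", 22, "MT", "0", None]
normalized = [normalize_chrom(value) for value in raw]
```

Rows whose value maps to None would be the ones dropped and logged.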
Source code in credtools/sumstats.py
munge_eaf(df)
Munge effect allele frequency column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with effect allele frequency column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged effect allele frequency column. |
Notes
This function:
- Converts EAF values to numeric type
- Removes rows with missing EAF values
- Validates EAF values are within range [0, 1] (inclusive)
- Converts to appropriate data type
Invalid EAF values are removed and logged.
Source code in credtools/sumstats.py
munge_maf(df)
Munge minor allele frequency column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with minor allele frequency column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged minor allele frequency column. |
Notes
This function:
- Converts MAF values to numeric type
- Removes rows with missing MAF values
- Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
- Validates MAF values are within acceptable range
- Converts to appropriate data type
Invalid MAF values are removed and logged.
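The folding step is a one-liner once the value is numeric. In this sketch `to_maf` is a hypothetical helper, and rejecting anything outside [0, 1] before folding is an assumption, since the notes do not spell out the acceptable range.

```python
def to_maf(freq):
    """Convert an allele frequency to a minor allele frequency, or None."""
    try:
        value = float(freq)
    except (TypeError, ValueError):
        return None  # not numeric, treated as missing
    if not 0 <= value <= 1:
        return None  # also rejects NaN, which fails both comparisons
    # Frequencies above 0.5 refer to the major allele, so fold them over.
    return min(value, 1 - value)


folded = to_maf(0.8)
unchanged = to_maf("0.25")
invalid = [to_maf(1.5), to_maf("NA"), to_maf(None)]
```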
Source code in credtools/sumstats.py
munge_pvalue(df)
Munge p-value column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with p-value column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged p-value column. |
Notes
This function:
- Converts p-values to numeric type
- Removes rows with missing p-values
- Validates p-values are within acceptable range (exclusive: > 0, < 1)
- Converts to appropriate data type
Invalid p-values are removed and logged.
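The exclusive bounds matter in practice: a p-value of exactly 0 (often numeric underflow) or exactly 1 is dropped. A minimal sketch of the filter, using a hypothetical `keep_valid_pvalues` helper on a plain list:

```python
def keep_valid_pvalues(values):
    """Return the numeric p-values that satisfy 0 < p < 1."""
    kept = []
    for raw in values:
        try:
            p = float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric
        if 0 < p < 1:
            kept.append(p)
    return kept


valid = keep_valid_pvalues(["0.05", 1e-8, 0, 1, "NA", None, 2])
```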
Source code in credtools/sumstats.py
munge_rsid(df)
Munge rsID column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with rsID column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged rsID column. |
Notes
This function converts the rsID column to the appropriate data type as defined in ColType.RSID.
Source code in credtools/sumstats.py
munge_se(df)
Munge standard error column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with standard error column. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with munged standard error column. |
Notes
This function:
- Converts standard error values to numeric type
- Removes rows with missing standard error values
- Validates standard errors are positive (exclusive: > 0)
- Converts to appropriate data type
Invalid standard error values are removed and logged.
Source code in credtools/sumstats.py
rm_col_allna(df)
Remove columns from the DataFrame that are entirely NA.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame from which to remove columns. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame with columns that are entirely NA removed. |
Notes
This function also converts empty strings to None before checking for all-NA columns. Columns that contain only missing values are dropped to reduce memory usage and improve processing efficiency.
Source code in credtools/sumstats.py
sort_alleles(df)
Sort EA and NEA in alphabetical order. Change the sign of beta if EA is not sorted as the first allele.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame with allele columns. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with sorted allele columns. |
Notes
This function ensures consistent allele ordering by:
- Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
- Flipping the sign of beta if alleles were swapped
- Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)
This standardization is important for:
- Consistent merging across datasets
- Meta-analysis compatibility
- LD matrix alignment
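For a single variant, the swap described in the notes amounts to exchanging the alleles, negating beta, and replacing EAF with 1 - EAF. The sketch below uses a hypothetical `sort_ea_nea` helper on one dict row rather than a DataFrame.

```python
def sort_ea_nea(row):
    """Return a copy of `row` with EA <= NEA, adjusting BETA and EAF."""
    ea, nea = row["EA"], row["NEA"]
    if ea <= nea:
        return dict(row)  # already sorted, nothing to change
    flipped = dict(row, EA=nea, NEA=ea, BETA=-row["BETA"])
    if row.get("EAF") is not None:
        # The frequency now refers to the other allele.
        flipped["EAF"] = 1 - row["EAF"]
    return flipped


swapped = sort_ea_nea({"EA": "T", "NEA": "C", "BETA": 0.5, "EAF": 0.25})
untouched = sort_ea_nea({"EA": "A", "NEA": "G", "BETA": 0.1, "EAF": 0.4})
```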