sumstats
Functions for processing summary statistics data.
check_colnames(df)
Check column names in the DataFrame and fill missing columns with None.
Parameters¶
df : pd.DataFrame Input DataFrame to check for column names.
Returns¶
pd.DataFrame DataFrame with all required columns, filling missing ones with None.
Notes¶
This function ensures that all required summary statistics columns are present in the DataFrame. Missing columns are added with None values.
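The column-filling behaviour described above can be sketched as follows. This is a minimal illustration, not the credtools implementation: `fill_missing_columns` and `REQUIRED_COLS` are hypothetical names, and the column list is an assumption taken from the header names listed for `load_sumstats`.

```python
import pandas as pd

# Hypothetical list of required columns (assumption; the real list lives in credtools).
REQUIRED_COLS = ["CHR", "BP", "EA", "NEA", "EAF", "BETA", "SE", "P"]


def fill_missing_columns(df: pd.DataFrame, required=REQUIRED_COLS) -> pd.DataFrame:
    """Return a copy of df in which every required column is present."""
    out = df.copy()
    for col in required:
        if col not in out.columns:
            out[col] = None  # add the missing column, filled with None
    return out


df = pd.DataFrame({"CHR": [1, 2], "BP": [100, 200]})
filled = fill_missing_columns(df)
print(filled.columns.tolist())
```

Working on a copy leaves the caller's DataFrame unchanged; whether the real function copies or modifies in place is not stated in this page.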
Source code in credtools/sumstats.py
check_mandatory_cols(df)
Check if the DataFrame contains all mandatory columns.
Parameters¶
df : pd.DataFrame The DataFrame to check for mandatory columns.
Returns¶
None
Raises¶
ValueError If any mandatory columns are missing.
Notes¶
Mandatory columns are defined in ColName.mandatory_cols and typically include essential fields like chromosome, position, alleles, effect size, and p-value.
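A check of this kind might look like the sketch below. `require_columns` and `MANDATORY_COLS` are hypothetical stand-ins for the real function and for `ColName.mandatory_cols`, whose exact contents are not shown on this page.

```python
import pandas as pd

# Hypothetical stand-in for ColName.mandatory_cols (assumption).
MANDATORY_COLS = ["CHR", "BP", "EA", "NEA", "BETA", "SE", "P"]


def require_columns(df: pd.DataFrame, mandatory=MANDATORY_COLS) -> None:
    """Raise ValueError listing every mandatory column that is absent."""
    missing = [col for col in mandatory if col not in df.columns]
    if missing:
        raise ValueError(f"Missing mandatory columns: {missing}")


df = pd.DataFrame({"CHR": [1], "BP": [100], "EA": ["A"], "NEA": ["G"]})
try:
    require_columns(df)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Collecting all missing names before raising gives the caller one complete error message rather than one failure per column.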
get_significant_snps(df, pvalue_threshold=5e-08, use_most_sig_if_no_sig=True)
Retrieve significant SNPs from the input DataFrame based on a p-value threshold.
Parameters¶
df : pd.DataFrame The input summary statistics containing SNP information.
pvalue_threshold : float, optional The p-value threshold for significance, by default 5e-8.
use_most_sig_if_no_sig : bool, optional Whether to return the most significant SNP if no SNP meets the threshold, by default True.
Returns¶
pd.DataFrame A DataFrame containing significant SNPs, sorted by p-value in ascending order.
Raises¶
ValueError If no significant SNPs are found and use_most_sig_if_no_sig is False, or if the DataFrame is empty.
KeyError If required columns are not present in the input DataFrame.
Notes¶
If no SNPs meet the significance threshold and use_most_sig_if_no_sig is True, the function returns the SNP with the smallest p-value.
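The threshold filter and its fallback can be sketched as below. `select_significant` is a hypothetical name, and the use of `<=` for the comparison is an assumption; this page does not say whether the threshold itself counts as significant.

```python
import pandas as pd


def select_significant(df, pvalue_threshold=5e-8, use_most_sig_if_no_sig=True):
    """Sketch of threshold filtering with a most-significant fallback."""
    if df.empty:
        raise ValueError("The DataFrame is empty.")
    # Assumption: a p-value equal to the threshold is treated as significant.
    sig = df[df["P"] <= pvalue_threshold].sort_values("P")
    if sig.empty:
        if not use_most_sig_if_no_sig:
            raise ValueError("No significant SNPs found.")
        # Fall back to the single row with the smallest p-value.
        sig = df.loc[[df["P"].idxmin()]]
    return sig


df = pd.DataFrame({"SNPID": ["rs1", "rs2", "rs3"], "P": [1e-3, 0.05, 1e-4]})
fallback = select_significant(df)  # nothing passes 5e-8, so rs3 is returned
print(fallback)
```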
Examples¶
>>> import pandas as pd
>>> from credtools.sumstats import get_significant_snps
>>> data = {
...     'SNPID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-9, 0.05, 1e-8]
... }
>>> df = pd.DataFrame(data)
>>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
>>> print(significant_snps)
  SNPID             P
0   rs1  1.000000e-09
2   rs3  1.000000e-08
load_sumstats(filename, if_sort_alleles=True, sep=None, nrows=None, skiprows=0, comment=None, gzipped=None)
Load summary statistics from a file.
Parameters¶
filename : str The path to the file containing the summary statistics. The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P.
if_sort_alleles : bool, optional Whether to sort alleles in alphabetical order, by default True.
sep : Optional[str], optional The delimiter to use. If None, the delimiter is inferred from the file, by default None.
nrows : Optional[int], optional Number of rows to read. If None, all rows are read, by default None.
skiprows : int, optional Number of lines to skip at the start of the file, by default 0.
comment : Optional[str], optional Character that marks the start of a comment in the file, by default None.
gzipped : Optional[bool], optional Whether the file is gzipped. If None, it is inferred from the file extension, by default None.
Returns¶
pd.DataFrame A DataFrame containing the loaded summary statistics.
Notes¶
The function performs the following operations:
- Auto-detects file compression (gzip) from file extension
- Auto-detects delimiter (tab, comma, or space) from file content
- Loads the data using pandas.read_csv
- Applies comprehensive data munging and quality control
- Optionally sorts alleles for consistency
The function infers the delimiter if not provided and handles gzipped files automatically. Comprehensive quality control is applied including validation of chromosomes, positions, alleles, p-values, effect sizes, and frequencies.
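The two auto-detection steps can be illustrated with the standard library alone. This is a sketch under stated assumptions: `is_gzipped` and `guess_delimiter` are hypothetical helpers, the `.gz` suffix rule and the "most frequent candidate in the header line" heuristic are guesses at how detection might work, and the real function may use a different strategy.

```python
import gzip
import io


def is_gzipped(filename: str) -> bool:
    """Guess compression from the file extension (assumed rule: '.gz' suffix)."""
    return filename.endswith(".gz")


def guess_delimiter(first_line: str) -> str:
    """Pick tab, comma, or space, whichever occurs most often in the header."""
    candidates = ["\t", ",", " "]
    return max(candidates, key=first_line.count)


# Build a small gzipped CSV in memory and inspect its header line.
payload = gzip.compress("CHR,BP,P\n1,100,0.5\n".encode())
with gzip.open(io.BytesIO(payload), "rt") as handle:
    first_line = handle.readline()
sep = guess_delimiter(first_line)
print(repr(sep))
```

The detected `sep` could then be passed to `pandas.read_csv`, which this page names as the underlying loader.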
Examples¶
>>> from credtools.sumstats import load_sumstats
>>> # Load summary statistics with automatic format detection
>>> sumstats = load_sumstats('gwas_results.txt.gz')
>>> print(f"Loaded {len(sumstats)} variants")
Loaded 1000000 variants
>>> # Load with specific parameters
>>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
>>> print(sumstats.columns.tolist())
['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']
make_SNPID_unique(sumstat, remove_duplicates=True, col_chr=ColName.CHR, col_bp=ColName.BP, col_ea=ColName.EA, col_nea=ColName.NEA, col_p=ColName.P)
Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.
Parameters¶
sumstat : pd.DataFrame The input summary statistics containing SNP information.
remove_duplicates : bool, optional Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True.
col_chr : str, optional The column name for chromosome information, by default ColName.CHR.
col_bp : str, optional The column name for base-pair position information, by default ColName.BP.
col_ea : str, optional The column name for effect allele information, by default ColName.EA.
col_nea : str, optional The column name for non-effect allele information, by default ColName.NEA.
col_p : str, optional The column name for p-value information, by default ColName.P.
Returns¶
pd.DataFrame The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets.
Raises¶
KeyError If required columns are missing from the input DataFrame.
ValueError If the input DataFrame is empty or becomes empty after processing.
Notes¶
This function constructs a unique SNPID by concatenating chromosome, base-pair position, and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of multiple summary statistics without the need for extensive duplicate comparisons.
The unique SNPID format is "chr-bp-allele1-allele2", where allele1 and allele2 are EA and NEA in alphabetical order.
If duplicates are found and remove_duplicates is False, a suffix "-N" is added to make identifiers unique, where N is the occurrence number.
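The identifier construction and the keep-smallest-p de-duplication can be sketched as follows. `build_snpid` is a hypothetical name, the column names are hard-coded for brevity, and only the `remove_duplicates=True` path is shown; restoring the original row order afterwards is an assumption.

```python
import pandas as pd


def build_snpid(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: add a chr-bp-allele1-allele2 SNPID and keep the best p-value."""
    out = df.copy()
    # Sort each allele pair alphabetically so A/G and G/A share one identifier.
    alleles = ["-".join(sorted(pair)) for pair in zip(out["EA"], out["NEA"])]
    out["SNPID"] = (
        out["CHR"].astype(str)
        + "-"
        + out["BP"].astype(str)
        + "-"
        + pd.Series(alleles, index=out.index)
    )
    # For each SNPID keep the row with the smallest p-value.
    out = out.sort_values("P").drop_duplicates(subset="SNPID", keep="first")
    return out.sort_index().reset_index(drop=True)


df = pd.DataFrame({
    "CHR": ["1", "1", "2"],
    "BP": [12345, 12345, 67890],
    "EA": ["A", "A", "G"],
    "NEA": ["G", "G", "A"],
    "rsID": ["rs1", "rs2", "rs3"],
    "P": [1e-5, 1e-6, 1e-7],
})
unique = build_snpid(df)
print(unique[["SNPID", "rsID", "P"]])
```

Because the alleles are sorted before joining, two datasets that report the same variant with EA and NEA exchanged still produce the same key.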
Examples¶
>>> import pandas as pd
>>> from credtools.sumstats import make_SNPID_unique
>>> data = {
...     'CHR': ['1', '1', '2'],
...     'BP': [12345, 12345, 67890],
...     'EA': ['A', 'A', 'G'],
...     'NEA': ['G', 'G', 'A'],
...     'rsID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-5, 1e-6, 1e-7]
... }
>>> df = pd.DataFrame(data)
>>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
>>> print(unique_df)
         SNPID CHR     BP EA NEA rsID             P
0  1-12345-A-G   1  12345  A   G  rs2  1.000000e-06
1  2-67890-A-G   2  67890  G   A  rs3  1.000000e-07
munge(df)
Munge the summary statistics DataFrame by performing a series of transformations.
Parameters¶
df : pd.DataFrame The input DataFrame containing summary statistics.
Returns¶
pd.DataFrame The munged DataFrame with necessary transformations applied.
Raises¶
ValueError If any mandatory columns are missing.
Notes¶
This function performs comprehensive data cleaning and standardization:
- Validates mandatory columns are present
- Removes entirely missing columns
- Cleans chromosome and position data
- Validates and standardizes allele information
- Creates unique SNP identifiers
- Validates p-values, effect sizes, and standard errors
- Processes allele frequencies
- Handles rsID information if present
The function applies strict quality control and may remove variants that don't meet validation criteria.
munge_allele(df)
Munge allele columns.
Parameters¶
df : pd.DataFrame Input DataFrame with allele columns.
Returns¶
pd.DataFrame DataFrame with munged allele columns.
Notes¶
This function:
- Removes rows with missing allele values
- Converts alleles to uppercase
- Validates alleles contain only valid DNA bases (A, C, G, T)
- Removes variants where effect allele equals non-effect allele
Invalid alleles and monomorphic variants are removed and logged.
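The allele-cleaning steps listed above could be written roughly like this. `clean_alleles` is a hypothetical name, logging is omitted, and accepting multi-base alleles through the pattern `[ACGT]+` is an assumption about how indels are treated.

```python
import pandas as pd


def clean_alleles(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: drop missing, uppercase, validate bases, drop EA == NEA."""
    out = df.dropna(subset=["EA", "NEA"]).copy()
    for col in ("EA", "NEA"):
        out[col] = out[col].astype(str).str.upper()
    # Keep rows whose alleles consist only of A, C, G, T (assumed pattern).
    valid = out["EA"].str.fullmatch("[ACGT]+") & out["NEA"].str.fullmatch("[ACGT]+")
    out = out[valid]
    # Remove variants where both alleles are identical.
    return out[out["EA"] != out["NEA"]]


df = pd.DataFrame({
    "EA": ["a", "G", "N", None, "T"],
    "NEA": ["g", "G", "A", "C", "C"],
})
cleaned = clean_alleles(df)
print(cleaned)
```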
munge_beta(df)
Munge beta column.
Parameters¶
df : pd.DataFrame Input DataFrame with beta column.
Returns¶
pd.DataFrame DataFrame with munged beta column.
Notes¶
This function:
- Converts beta values to numeric type
- Removes rows with missing beta values
- Converts to appropriate data type
Invalid beta values are removed and logged.
munge_bp(df)
Munge position column.
Parameters¶
df : pd.DataFrame Input DataFrame with position column.
Returns¶
pd.DataFrame DataFrame with munged position column.
Notes¶
This function:
- Removes rows with missing position values
- Converts position to numeric type
- Validates positions are within acceptable range
- Converts to appropriate data type
Invalid position values are removed and logged.
munge_chr(df)
Munge chromosome column.
Parameters¶
df : pd.DataFrame Input DataFrame with chromosome column.
Returns¶
pd.DataFrame DataFrame with munged chromosome column.
Notes¶
This function:
- Removes rows with missing chromosome values
- Converts chromosome to string and removes 'chr' prefix
- Converts X chromosome to numeric value (23)
- Validates chromosome values are within acceptable range
- Converts to appropriate data type
Invalid chromosome values are removed and logged.
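A rough sketch of these chromosome steps follows. `clean_chromosomes` is a hypothetical name, and the accepted range of 1 to 23 is an assumption inferred from the X-to-23 mapping; the actual bounds are not given on this page.

```python
import pandas as pd


def clean_chromosomes(df: pd.DataFrame, min_chr: int = 1, max_chr: int = 23) -> pd.DataFrame:
    """Sketch: strip 'chr', map X to 23, keep integer chromosomes in range."""
    out = df.dropna(subset=["CHR"]).copy()
    chrom = out["CHR"].astype(str).str.replace(r"^chr", "", regex=True)
    chrom = chrom.replace({"X": "23"})
    # Values that are not numbers become NaN and fail the range test below.
    chrom = pd.to_numeric(chrom, errors="coerce")
    keep = chrom.between(min_chr, max_chr)
    out = out[keep].copy()
    out["CHR"] = chrom[keep].astype(int)
    return out


df = pd.DataFrame({"CHR": ["chr1", "X", "22", "chrUn", None, "25"]})
cleaned = clean_chromosomes(df)
print(cleaned["CHR"].tolist())
```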
munge_eaf(df)
Munge effect allele frequency column.
Parameters¶
df : pd.DataFrame Input DataFrame with effect allele frequency column.
Returns¶
pd.DataFrame DataFrame with munged effect allele frequency column.
Notes¶
This function:
- Converts EAF values to numeric type
- Removes rows with missing EAF values
- Validates EAF values are within range [0, 1]
- Converts to appropriate data type
Invalid EAF values are removed and logged.
munge_maf(df)
Munge minor allele frequency column.
Parameters¶
df : pd.DataFrame Input DataFrame with minor allele frequency column.
Returns¶
pd.DataFrame DataFrame with munged minor allele frequency column.
Notes¶
This function:
- Converts MAF values to numeric type
- Removes rows with missing MAF values
- Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
- Validates MAF values are within acceptable range
- Converts to appropriate data type
Invalid MAF values are removed and logged.
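The folding of frequencies above 0.5 is the distinctive step here, sketched below. `clean_maf` is a hypothetical name, and treating [0, 0.5] as the acceptable range after folding is an assumption.

```python
import pandas as pd


def clean_maf(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: coerce to numeric, drop missing, fold onto the minor allele."""
    out = df.copy()
    out["MAF"] = pd.to_numeric(out["MAF"], errors="coerce")
    out = out.dropna(subset=["MAF"]).copy()
    # Frequencies above 0.5 describe the major allele, so use 1 - frequency.
    out["MAF"] = out["MAF"].where(out["MAF"] <= 0.5, 1 - out["MAF"])
    # Assumed acceptable range after folding: [0, 0.5].
    return out[(out["MAF"] >= 0) & (out["MAF"] <= 0.5)]


df = pd.DataFrame({"MAF": ["0.1", 0.8, "abc", None, 1.5]})
cleaned = clean_maf(df)
print(cleaned["MAF"].tolist())
```

In this example the value 1.5 folds to -0.5 and is then rejected by the range check, so only the first two rows survive.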
munge_pvalue(df)
Munge p-value column.
Parameters¶
df : pd.DataFrame Input DataFrame with p-value column.
Returns¶
pd.DataFrame DataFrame with munged p-value column.
Notes¶
This function:
- Converts p-values to numeric type
- Removes rows with missing p-values
- Validates p-values are within acceptable range (0, 1)
- Converts to appropriate data type
Invalid p-values are removed and logged.
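The coerce, drop, and validate pattern shared by the numeric munge functions can be sketched for p-values as below. `clean_pvalues` is a hypothetical name, and reading "(0, 1)" as an open interval is an assumption; the real bounds may be inclusive.

```python
import pandas as pd


def clean_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: coerce P to numeric, drop missing, keep 0 < P < 1."""
    out = df.copy()
    # Non-numeric entries such as "NA" become NaN and are dropped next.
    out["P"] = pd.to_numeric(out["P"], errors="coerce")
    out = out.dropna(subset=["P"])
    # Assumption: both endpoints are excluded.
    return out[(out["P"] > 0) & (out["P"] < 1)]


df = pd.DataFrame({"P": ["1e-8", 0.5, 0, 1.2, "NA", None]})
cleaned = clean_pvalues(df)
print(cleaned["P"].tolist())
```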
munge_rsid(df)
Munge rsID column.
Parameters¶
df : pd.DataFrame Input DataFrame with rsID column.
Returns¶
pd.DataFrame DataFrame with munged rsID column.
Notes¶
This function converts the rsID column to the appropriate data type as defined in ColType.RSID.
munge_se(df)
Munge standard error column.
Parameters¶
df : pd.DataFrame Input DataFrame with standard error column.
Returns¶
pd.DataFrame DataFrame with munged standard error column.
Notes¶
This function:
- Converts standard error values to numeric type
- Removes rows with missing standard error values
- Validates standard errors are positive
- Converts to appropriate data type
Invalid standard error values are removed and logged.
rm_col_allna(df)
Remove columns from the DataFrame that are entirely NA.
Parameters¶
df : pd.DataFrame The DataFrame from which to remove columns.
Returns¶
pd.DataFrame A DataFrame with columns that are entirely NA removed.
Notes¶
This function also converts empty strings to None before checking for all-NA columns. Columns that contain only missing values are dropped to reduce memory usage and improve processing efficiency.
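The two steps in this note, blanking empty strings and then dropping all-missing columns, might look like the sketch below. `drop_all_na_columns` is a hypothetical name, and NaN is used here as the missing marker where the note says None.

```python
import pandas as pd


def drop_all_na_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: treat empty strings as missing, then drop all-missing columns."""
    out = df.replace("", float("nan"))
    return out.dropna(axis=1, how="all")


df = pd.DataFrame({
    "CHR": [1, 2],
    "RSID": ["", ""],      # only empty strings, so the column is dropped
    "MAF": [None, None],   # entirely missing, so the column is dropped
    "P": [0.1, ""],        # partly missing, so the column is kept
})
cleaned = drop_all_na_columns(df)
print(cleaned.columns.tolist())
```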
sort_alleles(df)
Sort EA and NEA in alphabetical order. Flip the sign of beta when sorting swaps the two alleles, that is, when EA does not come first alphabetically.
Parameters¶
df : pd.DataFrame Input DataFrame with allele columns.
Returns¶
pd.DataFrame DataFrame with sorted allele columns.
Notes¶
This function ensures consistent allele ordering by:
- Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
- Flipping the sign of beta if alleles were swapped
- Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)
This standardization is important for:
- Consistent merging across datasets
- Meta-analysis compatibility
- LD matrix alignment
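The swap-and-flip logic can be sketched as follows. `sort_allele_pairs` is a hypothetical name, and the column names EA, NEA, BETA, and EAF are taken from the header names listed for `load_sumstats`; this is an illustration rather than the library's code.

```python
import pandas as pd


def sort_allele_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: order EA/NEA alphabetically and adjust BETA and EAF to match."""
    out = df.copy()
    swap = out["EA"] > out["NEA"]  # EA sorts after NEA, so the pair is exchanged
    out.loc[swap, ["EA", "NEA"]] = out.loc[swap, ["NEA", "EA"]].to_numpy()
    # The effect now refers to the other allele, so its sign flips.
    out.loc[swap, "BETA"] = -out.loc[swap, "BETA"]
    if "EAF" in out.columns:
        out.loc[swap, "EAF"] = 1 - out.loc[swap, "EAF"]
    return out


df = pd.DataFrame({
    "EA": ["G", "A"],
    "NEA": ["A", "C"],
    "BETA": [0.2, -0.1],
    "EAF": [0.3, 0.4],
})
sorted_df = sort_allele_pairs(df)
print(sorted_df)
```

Only the first row is exchanged in this example: its alleles become A/G, its beta becomes -0.2, and its EAF becomes 0.7, while the second row is already in order and is left as it was.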