Core Objects¶

Use these objects when you want to load data yourself and call CREDTOOLS from Python.

Locus Objects¶

Class for the input data of the fine-mapping analysis.

`Locus(popu, cohort, sample_size, sumstats, locus_start, locus_end, ld=None, if_intersect=False)` ¶

Locus class to represent a genomic locus with associated summary statistics and linkage disequilibrium (LD) matrix.

Attributes:

Name	Type	Description
`original_sumstats`	`DataFrame`	The original summary statistics file.
`sumstats`	`DataFrame`	The processed summary statistics file.
`ld`	`LDMatrix`	The LD matrix object.
`chrom`	`int`	Chromosome.
`start`	`int`	Start position of the locus.
`end`	`int`	End position of the locus.
`n_snps`	`int`	Number of SNPs in the locus.
`prefix`	`str`	The prefix combining population and cohort.
`locus_id`	`str`	Unique identifier for the locus.
`is_matched`	`bool`	Whether the LD matrix and summary statistics file are matched.
`lambda_s`	`Optional[float]`	The estimated lambda_s parameter from estimate_s_rss function, None if not calculated.

Notes

If no LD matrix is provided, only the ABF method can be used for fine-mapping.

Parameters:

Name	Type	Description	Default
`popu`	`str`	Population code, e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"].	required
`cohort`	`str`	Cohort name.	required
`sample_size`	`int`	Sample size.	required
`sumstats`	`DataFrame`	Summary statistics DataFrame.	required
`locus_start`	`int`	Fixed start position for the locus.	required
`locus_end`	`int`	Fixed end position for the locus.	required
`ld`	`Optional[LDMatrix]`	LD matrix, by default None.	`None`
`if_intersect`	`bool`	Whether to intersect the LD matrix and summary statistics file, by default False.	`False`

Warnings

If no LD matrix is provided, a warning is logged that only the ABF method can be used.

Source code in credtools/locus.py

def __init__(
    self,
    popu: str,
    cohort: str,
    sample_size: int,
    sumstats: pd.DataFrame,
    locus_start: int,
    locus_end: int,
    ld: Optional[LDMatrix] = None,
    if_intersect: bool = False,
) -> None:
    """
    Initialize the Locus object.

    Parameters
    ----------
    popu : str
        Population code. e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"].
    cohort : str
        Cohort name.
    sample_size : int
        Sample size.
    sumstats : pd.DataFrame
        Summary statistics DataFrame.
    locus_start : int
        Fixed start position for the locus.
    locus_end : int
        Fixed end position for the locus.
    ld : Optional[LDMatrix], optional
        LD matrix, by default None.
    if_intersect : bool, optional
        Whether to intersect the LD matrix and summary statistics file, by default False.

    Warnings
    --------
    If no LD matrix is provided, a warning is logged that only ABF method can be used.
    """
    self.sumstats = sumstats
    self._original_sumstats = self.sumstats.copy()
    self._popu = popu
    self._cohort = cohort
    self._sample_size = sample_size
    self._locus_start = locus_start
    self._locus_end = locus_end
    self.lambda_s = None
    if ld:
        self.ld = ld
        if if_intersect:
            inters = intersect_sumstat_ld(self)
            self.sumstats = inters.sumstats
            self.ld = inters.ld
    else:
        logger.warning("LD matrix and map file not found. Can only run ABF method.")
        self.ld = LDMatrix(pd.DataFrame(), np.array([]))

`chrom` `property` ¶

Get the chromosome.

`cohort` `property` ¶

Get the cohort name.

`end` `property` ¶

Get the end position.

`is_matched` `property` ¶

Check if the LD matrix and sumstats file are matched.

`locus_id` `property` ¶

Get the locus ID.

`n_snps` `property` ¶

Get the number of SNPs.

`original_sumstats` `property` ¶

Get the original sumstats file.

`popu` `property` ¶

Get the population code.

`prefix` `property` ¶

Get the prefix of the locus.

`sample_size` `property` ¶

Get the sample size.

`start` `property` ¶

Get the start position.

`__repr__()` ¶

Return a string representation of the Locus object.

Returns:

Type	Description
`str`	String representation of the Locus object.

Source code in credtools/locus.py

def __repr__(self) -> str:
    """
    Return a string representation of the Locus object.

    Returns
    -------
    str
        String representation of the Locus object.
    """
    return f"Locus(popu={self.popu}, cohort={self.cohort}, sample_size={self.sample_size}, chr={self.chrom}, start={self.start}, end={self.end}, sumstats={self.sumstats.shape}, ld={self.ld.r.shape})"

`copy()` ¶

Copy the Locus object.

Returns:

Type	Description
`Locus`	A copy of the Locus object.

Source code in credtools/locus.py

def copy(self) -> "Locus":
    """
    Copy the Locus object.

    Returns
    -------
    Locus
        A copy of the Locus object.
    """
    new_locus = Locus(
        self.popu,
        self.cohort,
        self.sample_size,
        self.sumstats.copy(),
        self._locus_start,
        self._locus_end,
        self.ld.copy(),
        if_intersect=False,
    )
    new_locus.lambda_s = self.lambda_s
    return new_locus

`LocusSet(loci)` ¶

LocusSet class to represent a set of genomic loci.

Parameters:

Name	Type	Description	Default
`loci`	`List[Locus]`	List of Locus objects.	required

Attributes:

Name	Type	Description
`loci`	`List[Locus]`	List of Locus objects.
`n_loci`	`int`	Number of loci.
`chrom`	`int`	Chromosome number.
`start`	`int`	Start position of the locus.
`end`	`int`	End position of the locus.
`locus_id`	`str`	Unique identifier for the locus.

Raises:

Type	Description
`ValueError`	If the chromosomes of the loci are not the same.

Source code in credtools/locus.py

def __init__(self, loci: List[Locus]) -> None:
    """
    Initialize the LocusSet object.

    Parameters
    ----------
    loci : List[Locus]
        List of Locus objects.
    """
    self.loci = loci

`chrom` `property` ¶

Get the chromosome.

Returns:

Type	Description
`int`	Chromosome number.

Raises:

Type	Description
`ValueError`	If the chromosomes of the loci are not the same.

`end` `property` ¶

Get the end position.

`locus_id` `property` ¶

Get the locus ID.

`n_loci` `property` ¶

Get the number of loci.

`start` `property` ¶

Get the start position.

`__repr__()` ¶

Return a string representation of the LocusSet object.

Returns:

Type	Description
`str`	String representation of the LocusSet object.

Source code in credtools/locus.py

def __repr__(self) -> str:
    """
    Return a string representation of the LocusSet object.

    Returns
    -------
    str
        String representation of the LocusSet object.
    """
    return (
        f"LocusSet(\n n_loci={len(self.loci)}, chrom={self.chrom}, start={self.start}, end={self.end}, locus_id={self.locus_id} \n"
        + "\n".join([locus.__repr__() for locus in self.loci])
        + "\n"
        + ")"
    )

`copy()` ¶

Copy the LocusSet object.

Returns:

Type	Description
`LocusSet`	A copy of the LocusSet object.

Source code in credtools/locus.py

def copy(self) -> "LocusSet":
    """
    Copy the LocusSet object.

    Returns
    -------
    LocusSet
        A copy of the LocusSet object.
    """
    return LocusSet([locus.copy() for locus in self.loci])

`check_loci_info(loci_info)` ¶

Check and validate loci information DataFrame.

Parameters:

Name	Type	Description	Default
`loci_info`	`DataFrame`	DataFrame containing loci information.	required

Returns:

Type	Description
`DataFrame`	Validated and type-corrected loci_info DataFrame.

Raises:

Type	Description
`ValueError`	If required columns are missing, data types are incorrect, or locus_id/boundary consistency checks fail.

Notes

This function performs the following checks:

1. Ensures all required columns are present
2. Validates and converts data types
3. Checks that loci with same locus_id have same chr, start, end
4. Validates chromosome, start, and end values

Source code in credtools/locus.py

def check_loci_info(loci_info: pd.DataFrame) -> pd.DataFrame:
    """
    Check and validate loci information DataFrame.

    Parameters
    ----------
    loci_info : pd.DataFrame
        DataFrame containing loci information.

    Returns
    -------
    pd.DataFrame
        Validated and type-corrected loci_info DataFrame.

    Raises
    ------
    ValueError
        If required columns are missing, data types are incorrect,
        or locus_id/boundary consistency checks fail.

    Notes
    -----
    This function performs the following checks:
    1. Ensures all required columns are present
    2. Validates and converts data types
    3. Checks that loci with same locus_id have same chr, start, end
    4. Validates chromosome, start, and end values
    """
    loci_info = loci_info.copy()

    # Check for required columns
    required_cols = [
        "prefix",
        "popu",
        "cohort",
        "sample_size",
        "chr",
        "start",
        "end",
        "locus_id",
    ]
    missing_cols = [col for col in required_cols if col not in loci_info.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Type checking and conversion
    try:
        # Convert numeric columns
        loci_info["sample_size"] = loci_info["sample_size"].astype(int)
        loci_info["chr"] = loci_info["chr"].astype(int)
        loci_info["start"] = loci_info["start"].astype(int)
        loci_info["end"] = loci_info["end"].astype(int)

        # Ensure string columns are strings
        loci_info["prefix"] = loci_info["prefix"].astype(str)
        loci_info["popu"] = loci_info["popu"].astype(str)
        loci_info["cohort"] = loci_info["cohort"].astype(str)
        loci_info["locus_id"] = loci_info["locus_id"].astype(str)

    except (ValueError, TypeError) as e:
        raise ValueError(f"Failed to convert data types: {e}")

    # Validate values
    if (loci_info["sample_size"] <= 0).any():
        raise ValueError("Sample size must be positive")

    if (loci_info["chr"] <= 0).any() or (loci_info["chr"] > 25).any():
        raise ValueError("Chromosome must be between 1 and 25")

    if (loci_info["start"] <= 0).any():
        raise ValueError("Start position must be positive")

    if (loci_info["end"] <= loci_info["start"]).any():
        raise ValueError("End position must be greater than start position")

    # Check for duplicates in popu+cohort+locus_id combination
    if loci_info.duplicated(subset=["popu", "cohort", "locus_id"]).any():
        raise ValueError("Each popu+cohort+locus_id combination must be unique")

    # Check consistency: same locus_id must have same chr, start, end
    locus_boundaries = loci_info.groupby("locus_id")[["chr", "start", "end"]].nunique()
    inconsistent_loci = locus_boundaries[(locus_boundaries > 1).any(axis=1)]

    if not inconsistent_loci.empty:
        raise ValueError(
            f"Inconsistent boundaries for locus_id(s): {inconsistent_loci.index.tolist()}. "
            "Each locus_id must have consistent chr, start, end values across all rows."
        )

    return loci_info
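
The boundary-consistency check can be illustrated without pandas. The following is a minimal standalone sketch, not the library's implementation: the helper name `find_inconsistent_loci` and the sample rows are hypothetical, and plain dictionaries stand in for DataFrame rows. It flags any `locus_id` whose rows disagree on `chr`, `start`, or `end`, which is the condition that makes `check_loci_info` raise.

```python
from collections import defaultdict


def find_inconsistent_loci(rows):
    """Return locus_ids whose rows disagree on chr, start, or end.

    Hypothetical helper for illustration; credtools performs the
    equivalent check with a pandas groupby on the loci_info DataFrame.
    """
    boundaries = defaultdict(set)
    for row in rows:
        # Collect every distinct (chr, start, end) triple seen per locus_id.
        boundaries[row["locus_id"]].add((row["chr"], row["start"], row["end"]))
    # More than one distinct triple means the locus_id is inconsistent.
    return sorted(lid for lid, seen in boundaries.items() if len(seen) > 1)


# Illustrative rows: locus2 has two different end positions.
rows = [
    {"locus_id": "locus1", "chr": 1, "start": 1000, "end": 2000},
    {"locus_id": "locus1", "chr": 1, "start": 1000, "end": 2000},
    {"locus_id": "locus2", "chr": 2, "start": 500, "end": 900},
    {"locus_id": "locus2", "chr": 2, "start": 500, "end": 950},
]
print(find_inconsistent_loci(rows))
```

Comparing whole `(chr, start, end)` triples flags the same loci as counting unique values per column, because a locus has more than one distinct triple exactly when at least one column varies.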

`intersect_sumstat_ld(locus)` ¶

Intersect the Variant IDs in the LD matrix and the sumstats file.

Parameters:

Name	Type	Description	Default
`locus`	`Locus`	Locus object containing LD matrix and summary statistics.	required

Returns:

Type	Description
`Locus`	Locus object containing the intersected LD matrix and sumstats file.

Raises:

Type	Description
`ValueError`	If LD matrix not found or no common Variant IDs found between the LD matrix and the sumstats file.

Warnings

If only a few common Variant IDs are found (≤ 10), a warning is logged.

Notes

This function performs the following operations:

Checks if LD matrix and summary statistics are already matched
Finds common SNP IDs between LD matrix and summary statistics
Subsets both datasets to common variants
Reorders data to maintain consistency
Returns a new Locus object with intersected data

Source code in credtools/locus.py

def intersect_sumstat_ld(locus: Locus) -> Locus:
    """
    Intersect the Variant IDs in the LD matrix and the sumstats file.

    Parameters
    ----------
    locus : Locus
        Locus object containing LD matrix and summary statistics.

    Returns
    -------
    Locus
        Locus object containing the intersected LD matrix and sumstats file.

    Raises
    ------
    ValueError
        If LD matrix not found or no common Variant IDs found between the LD matrix and the sumstats file.

    Warnings
    --------
    If only a few common Variant IDs are found (≤ 10), a warning is logged.

    Notes
    -----
    This function performs the following operations:

    1. Checks if LD matrix and summary statistics are already matched
    2. Finds common SNP IDs between LD matrix and summary statistics
    3. Subsets both datasets to common variants
    4. Reorders data to maintain consistency
    5. Returns a new Locus object with intersected data
    """
    if locus.ld is None:
        raise ValueError("LD matrix not found.")
    if locus.is_matched:
        logger.info("The LD matrix and sumstats file are matched.")
        return locus
    ldmap = locus.ld.map.copy()
    r = locus.ld.r.copy()
    sumstats = locus.sumstats.copy()
    sumstats = sumstats.sort_values([ColName.CHR, ColName.BP], ignore_index=True)
    intersec_sumstats = sumstats[
        sumstats[ColName.SNPID].isin(ldmap[ColName.SNPID])
    ].copy()
    intersec_variants = intersec_sumstats[ColName.SNPID].to_numpy()
    if len(intersec_variants) == 0:
        raise ValueError(
            f"No common Variant IDs found between the LD matrix and the sumstats file for locus {locus.locus_id}."
        )
    elif len(intersec_variants) <= 10:
        logger.warning(
            f"Only a few common Variant IDs found between the LD matrix and the sumstats file(<= 10) for locus {locus.locus_id}."
        )
    ldmap["idx"] = ldmap.index
    ldmap.set_index(ColName.SNPID, inplace=True, drop=False)
    ldmap = ldmap.loc[intersec_variants].copy()
    intersec_index = ldmap["idx"].to_numpy()
    r = r[intersec_index, :][:, intersec_index]
    intersec_sumstats.reset_index(drop=True, inplace=True)
    ldmap.drop("idx", axis=1, inplace=True)
    ldmap = ldmap.reset_index(drop=True)
    intersec_ld = LDMatrix(ldmap, r)
    logger.info(
        "Intersected the Variant IDs in the LD matrix and the sumstats file. "
        f"Number of common Variant IDs: {len(intersec_index)}"
    )
    return Locus(
        locus.popu,
        locus.cohort,
        locus.sample_size,
        intersec_sumstats,
        locus._locus_start,
        locus._locus_end,
        intersec_ld,
        if_intersect=False,
    )
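
The core of the intersection is an index lookup followed by a matching row and column subset of the LD matrix. Below is a minimal standalone sketch of that idea using nested lists instead of NumPy arrays. The function name `intersect_variants` and the variant IDs are hypothetical, and unlike the library function the sketch does not sort the summary statistics by chromosome and position first.

```python
def intersect_variants(sumstat_ids, ld_ids, r):
    """Subset a square matrix r to the variant IDs shared with sumstat_ids.

    Hypothetical helper for illustration. The result follows the order
    of sumstat_ids, so rows of the summary statistics and rows of the
    matrix stay aligned.
    """
    position = {snp: i for i, snp in enumerate(ld_ids)}
    common = [snp for snp in sumstat_ids if snp in position]
    if not common:
        raise ValueError("No common variant IDs found.")
    idx = [position[snp] for snp in common]
    # Apply the same index list to rows and columns, like r[idx, :][:, idx].
    sub_r = [[r[i][j] for j in idx] for i in idx]
    return common, sub_r


# Illustrative data: the third summary-statistics variant is absent from the LD map.
ld_ids = ["1-100-A-G", "1-200-C-T", "1-300-A-C"]
r = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
sumstat_ids = ["1-300-A-C", "1-100-A-G", "1-400-G-T"]
common, sub_r = intersect_variants(sumstat_ids, ld_ids, r)
print(common)
print(sub_r)
```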

`load_locus(prefix, popu, cohort, sample_size, locus_start, locus_end, if_intersect=False, calculate_lambda_s=False, **kwargs)` ¶

Load the input data of the fine-mapping analysis.

Parameters:

Name	Type	Description	Default
`prefix`	`str`	Prefix of the input files.	required
`popu`	`str`	Population of the input data.	required
`cohort`	`str`	Cohort of the input data.	required
`sample_size`	`int`	Sample size of the input data.	required
`locus_start`	`int`	Fixed start position for the locus.	required
`locus_end`	`int`	Fixed end position for the locus.	required
`if_intersect`	`bool`	Whether to intersect the input data with the LD matrix, by default False.	`False`
`calculate_lambda_s`	`bool`	Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.	`False`
`**kwargs`	`Any`	Additional keyword arguments passed to loading functions.	`{}`

Returns:

Type	Description
`Locus`	Locus object containing the input data.

Raises:

Type	Description
`ValueError`	If the required input files are not found.

Notes

The function looks for files with the following patterns:

Summary statistics: {prefix}.sumstat or {prefix}.sumstats.gz
LD matrix: {prefix}.ld or {prefix}.ld.npz
LD map: {prefix}.ldmap or {prefix}.ldmap.gz

All files are required for proper functioning.

Examples:

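>>> # locus_start and locus_end are required; the coordinates here are illustrative
>>> locus = load_locus('EUR_study1', 'EUR', 'study1', 50000, locus_start=1000000, locus_end=2000000)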
>>> print(f"Loaded locus with {locus.n_snps} SNPs")
Loaded locus with 10000 SNPs

Source code in credtools/locus.py

def load_locus(
    prefix: str,
    popu: str,
    cohort: str,
    sample_size: int,
    locus_start: int,
    locus_end: int,
    if_intersect: bool = False,
    calculate_lambda_s: bool = False,
    **kwargs: Any,
) -> Locus:
    """
    Load the input data of the fine-mapping analysis.

    Parameters
    ----------
    prefix : str
        Prefix of the input files.
    popu : str
        Population of the input data.
    cohort : str
        Cohort of the input data.
    sample_size : int
        Sample size of the input data.
    locus_start : int
        Fixed start position for the locus.
    locus_end : int
        Fixed end position for the locus.
    if_intersect : bool, optional
        Whether to intersect the input data with the LD matrix, by default False.
    calculate_lambda_s : bool, optional
        Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.
    **kwargs : Any
        Additional keyword arguments passed to loading functions.

    Returns
    -------
    Locus
        Locus object containing the input data.

    Raises
    ------
    ValueError
        If the required input files are not found.

    Notes
    -----
    The function looks for files with the following patterns:

    - Summary statistics: {prefix}.sumstat or {prefix}.sumstats.gz
    - LD matrix: {prefix}.ld or {prefix}.ld.npz
    - LD map: {prefix}.ldmap or {prefix}.ldmap.gz

    All files are required for proper functioning.

    Examples
    --------
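    >>> # locus_start and locus_end are required; the coordinates here are illustrative
    >>> locus = load_locus('EUR_study1', 'EUR', 'study1', 50000, locus_start=1000000, locus_end=2000000)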
    >>> print(f"Loaded locus with {locus.n_snps} SNPs")
    Loaded locus with 10000 SNPs
    """
    if os.path.exists(f"{prefix}.sumstat"):
        sumstats_path = f"{prefix}.sumstat"
    elif os.path.exists(f"{prefix}.sumstats.gz"):
        sumstats_path = f"{prefix}.sumstats.gz"
    else:
        raise ValueError("Sumstats file not found.")

    sumstats = load_sumstats(sumstats_path, if_sort_alleles=True, **kwargs)
    if os.path.exists(f"{prefix}.ld"):
        ld_path = f"{prefix}.ld"
    elif os.path.exists(f"{prefix}.ld.npz"):
        ld_path = f"{prefix}.ld.npz"
    else:
        raise ValueError("LD matrix file not found.")
    if os.path.exists(f"{prefix}.ldmap"):
        ldmap_path = f"{prefix}.ldmap"
    elif os.path.exists(f"{prefix}.ldmap.gz"):
        ldmap_path = f"{prefix}.ldmap.gz"
    else:
        raise ValueError("LD map file not found.")
    ld = load_ld(ld_path, ldmap_path, if_sort_alleles=True, **kwargs)

    locus = Locus(
        popu,
        cohort,
        sample_size,
        sumstats,
        locus_start,
        locus_end,
        ld=ld,
        if_intersect=if_intersect,
    )

    if calculate_lambda_s:
        try:
            # Import here to avoid circular imports
            from credtools.qc import estimate_s_rss

            locus.lambda_s = estimate_s_rss(locus)
            logger.info(
                f"Calculated lambda_s for locus {locus.locus_id}: {locus.lambda_s}"
            )
        except Exception as e:
            logger.warning(
                f"Failed to calculate lambda_s for locus {locus.locus_id}: {e}"
            )
            locus.lambda_s = None

    return locus
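
The file lookup in `load_locus` repeats one pattern three times: try each candidate suffix in order and fail if none exists. A minimal sketch of that pattern follows. The helper `first_existing` is hypothetical and not part of credtools, and the temporary directory only exists to make the example self-contained.

```python
import os
import tempfile


def first_existing(prefix, suffixes):
    """Return the first path prefix+suffix that exists, else raise ValueError.

    Hypothetical helper for illustration of the lookup order.
    """
    for suffix in suffixes:
        path = f"{prefix}{suffix}"
        if os.path.exists(path):
            return path
    raise ValueError(f"No file found for prefix {prefix!r} with suffixes {suffixes}")


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "EUR_study1")
    # Create only the gzipped summary statistics file.
    open(f"{prefix}.sumstats.gz", "w").close()

    found_name = os.path.basename(
        first_existing(prefix, [".sumstat", ".sumstats.gz"])
    )
    print(found_name)

    # No LD matrix file exists, so the lookup fails like load_locus does.
    try:
        first_existing(prefix, [".ld", ".ld.npz"])
        ld_missing = False
    except ValueError:
        ld_missing = True
    print(ld_missing)
```

Note the asymmetric suffixes in the source: the uncompressed summary statistics file ends in `.sumstat`, while the compressed one ends in `.sumstats.gz`.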

`load_locus_set(locus_info, if_intersect=False, calculate_lambda_s=False, **kwargs)` ¶

Load the input data of the fine-mapping analysis for multiple loci.

Parameters:

Name	Type	Description	Default
`locus_info`	`DataFrame`	DataFrame containing the locus information with required columns: ['prefix', 'popu', 'cohort', 'sample_size', 'chr', 'start', 'end', 'locus_id'].	required
`if_intersect`	`bool`	Whether to intersect the input data with the LD matrix, by default False.	`False`
`calculate_lambda_s`	`bool`	Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.	`False`
`**kwargs`	`Any`	Additional keyword arguments passed to load_locus function.	`{}`

Returns:

Type	Description
`LocusSet`	LocusSet object containing the input data.

Raises:

Type	Description
`ValueError`	If required columns are missing or if the combination of popu and cohort is not unique.

Notes

The locus_info DataFrame must contain the following columns:

prefix: File prefix for each locus
popu: Population code
cohort: Cohort name
sample_size: Sample size for the cohort
chr: Chromosome number
start: Start position of the locus
end: End position of the locus
locus_id: Locus identifier

All rows must have the same chr, start, end, locus_id values (representing the same locus).

Examples:

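>>> locus_info = pd.DataFrame({
...     'prefix': ['EUR_study1', 'EAS_study2'],
...     'popu': ['EUR', 'EAS'],
...     'cohort': ['study1', 'study2'],
...     'sample_size': [50000, 30000],
...     'chr': [1, 1],
...     'start': [1000000, 1000000],
...     'end': [2000000, 2000000],
...     'locus_id': ['locus1', 'locus1']
... })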
>>> locus_set = load_locus_set(locus_info)
>>> print(f"Loaded {locus_set.n_loci} loci")
Loaded 2 loci

Source code in credtools/locus.py

def load_locus_set(
    locus_info: pd.DataFrame,
    if_intersect: bool = False,
    calculate_lambda_s: bool = False,
    **kwargs: Any,
) -> LocusSet:
    """
    Load the input data of the fine-mapping analysis for multiple loci.

    Parameters
    ----------
    locus_info : pd.DataFrame
        DataFrame containing the locus information with required columns:
        ['prefix', 'popu', 'cohort', 'sample_size', 'chr', 'start', 'end', 'locus_id'].
    if_intersect : bool, optional
        Whether to intersect the input data with the LD matrix, by default False.
    calculate_lambda_s : bool, optional
        Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.
    **kwargs : Any
        Additional keyword arguments passed to load_locus function.

    Returns
    -------
    LocusSet
        LocusSet object containing the input data.

    Raises
    ------
    ValueError
        If required columns are missing or if the combination of popu and cohort is not unique.

    Notes
    -----
    The locus_info DataFrame must contain the following columns:

    - prefix: File prefix for each locus
    - popu: Population code
    - cohort: Cohort name
    - sample_size: Sample size for the cohort
    - chr: Chromosome number
    - start: Start position of the locus
    - end: End position of the locus
    - locus_id: Locus identifier

    All rows must have the same chr, start, end, locus_id values (representing the same locus).

    Examples
    --------
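    >>> locus_info = pd.DataFrame({
    ...     'prefix': ['EUR_study1', 'EAS_study2'],
    ...     'popu': ['EUR', 'EAS'],
    ...     'cohort': ['study1', 'study2'],
    ...     'sample_size': [50000, 30000],
    ...     'chr': [1, 1],
    ...     'start': [1000000, 1000000],
    ...     'end': [2000000, 2000000],
    ...     'locus_id': ['locus1', 'locus1']
    ... })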
    >>> locus_set = load_locus_set(locus_info)
    >>> print(f"Loaded {locus_set.n_loci} loci")
    Loaded 2 loci
    """
    # Check and validate the locus_info DataFrame
    locus_info = check_loci_info(locus_info)

    # Check that all rows have the same chr, start, end (same locus)
    if len(locus_info["chr"].unique()) > 1:
        raise ValueError("All rows must have the same chromosome")
    if len(locus_info["start"].unique()) > 1:
        raise ValueError("All rows must have the same start position")
    if len(locus_info["end"].unique()) > 1:
        raise ValueError("All rows must have the same end position")
    if len(locus_info["locus_id"].unique()) > 1:
        raise ValueError("All rows must have the same locus_id")

    # Additional check for load_locus_set: popu+cohort must be unique within this single locus
    if locus_info.duplicated(subset=["popu", "cohort"]).any():
        raise ValueError(
            "Each popu+cohort combination must be unique within a single locus"
        )

    loci = []
    for i, row in locus_info.iterrows():
        loci.append(
            load_locus(
                row["prefix"],
                row["popu"],
                row["cohort"],
                row["sample_size"],
                int(row["start"]),
                int(row["end"]),
                if_intersect,
                calculate_lambda_s,
                **kwargs,
            )
        )
    return LocusSet(loci)

LD Matrices¶

Functions for reading and converting lower triangle matrices.

`LDMatrix(map_df, r)` ¶

Class to store the LD matrix and the corresponding Variant IDs.

Parameters:

Name	Type	Description	Default
`map_df`	`DataFrame`	DataFrame containing the Variant IDs.	required
`r`	`ndarray`	LD matrix.	required

Attributes:

Name	Type	Description
`map`	`DataFrame`	DataFrame containing the Variant IDs.
`r`	`ndarray`	LD matrix.

Raises:

Type	Description
`ValueError`	If the number of rows in the map file does not match the number of rows in the LD matrix.

Source code in credtools/ldmatrix.py

def __init__(self, map_df: pd.DataFrame, r: np.ndarray) -> None:
    """
    Initialize the LDMatrix object.

    Parameters
    ----------
    map_df : pd.DataFrame
        DataFrame containing the Variant IDs.
    r : np.ndarray
        LD matrix.

    Raises
    ------
    ValueError
        If the number of rows in the map file does not match the number of rows in the LD matrix.
    """
    self.map = map_df
    self.r = r
    self.__check_length()

`__check_length()` ¶

Check if the number of rows in the map file matches the number of rows in the LD matrix.

Raises:

Type	Description
`ValueError`	If the number of rows in the map file does not match the number of rows in the LD matrix.

Source code in credtools/ldmatrix.py

def __check_length(self) -> None:
    """
    Check if the number of rows in the map file matches the number of rows in the LD matrix.

    Raises
    ------
    ValueError
        If the number of rows in the map file does not match the number of rows in the LD matrix.
    """
    if len(self.map) != len(self.r):
        raise ValueError(
            "The number of rows in the map file does not match the number of rows in the LD matrix."
        )

`__repr__()` ¶

Return a string representation of the LDMatrix object.

Returns:

Type	Description
`str`	String representation showing the shapes of map and r.

Source code in credtools/ldmatrix.py

def __repr__(self) -> str:
    """
    Return a string representation of the LDMatrix object.

    Returns
    -------
    str
        String representation showing the shapes of map and r.
    """
    return f"LDMatrix(map={self.map.shape}, r={self.r.shape})"

`copy()` ¶

Return a copy of the LDMatrix object.

Returns:

Type	Description
`LDMatrix`	A copy of the LDMatrix object.

Source code in credtools/ldmatrix.py

def copy(self) -> "LDMatrix":
    """
    Return a copy of the LDMatrix object.

    Returns
    -------
    LDMatrix
        A copy of the LDMatrix object.
    """
    return LDMatrix(self.map.copy(), self.r.copy())

`load_ld(ld_path, map_path, delimiter='\t', if_sort_alleles=True)` ¶

Read LD matrices and Variant IDs from files. Pair each matrix with its corresponding Variant IDs.

Parameters:

Name	Type	Description	Default
`ld_path`	`str`	Path to the input text file containing the lower triangle matrix or .npz file.	required
`map_path`	`str`	Path to the input text file containing the Variant IDs.	required
`delimiter`	`str`	Delimiter used in the input file, by default "\t".	`'\t'`
`if_sort_alleles`	`bool`	Sort alleles in the LD map in alphabetical order and change the sign of the LD matrix if the alleles are swapped, by default True.	`True`

Returns:

Type	Description
`LDMatrix`	Object containing the LD matrix and the Variant IDs.

Raises:

Type	Description
`ValueError`	If the number of variants in the map file does not match the number of rows in the LD matrix.

Notes

Future enhancements planned:

Support for npz files (partially implemented)
Support for plink bin4 format
Support for ldstore bcor format

The function validates that the LD matrix and map file have consistent dimensions and optionally sorts alleles for consistent representation.
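
The source of `sort_alleles` is not shown on this page, so the following is only a sketch of the behavior described for `if_sort_alleles`, under the assumption that a swapped allele pair negates that variant's correlations with every other variant. The function name `sort_alleles_sketch` and the data are hypothetical.

```python
def sort_alleles_sketch(alleles, r):
    """Sort each (A1, A2) pair alphabetically and flip signs in r to match.

    Hypothetical illustration. alleles is a list of (A1, A2) tuples and
    r is a square correlation matrix stored as a list of lists.
    """
    # -1 marks a variant whose alleles must be swapped to be in order.
    signs = [1 if a1 <= a2 else -1 for a1, a2 in alleles]
    sorted_alleles = [tuple(sorted(pair)) for pair in alleles]
    n = len(alleles)
    # Entry (i, j) changes sign when exactly one of the two variants was
    # swapped; the diagonal is multiplied by signs[i] twice and is unchanged.
    flipped = [[r[i][j] * signs[i] * signs[j] for j in range(n)] for i in range(n)]
    return sorted_alleles, flipped


alleles = [("A", "G"), ("T", "C")]  # the second pair is out of order
r = [[1.0, 0.3], [0.3, 1.0]]
sorted_alleles, flipped = sort_alleles_sketch(alleles, r)
print(sorted_alleles)
print(flipped)
```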

Examples:

>>> ld_matrix = load_ld('data.ld', 'data.ldmap')
>>> print(f"Loaded LD matrix with {ld_matrix.r.shape[0]} variants")
Loaded LD matrix with 1000 variants

Source code in credtools/ldmatrix.py

def load_ld(
    ld_path: str, map_path: str, delimiter: str = "\t", if_sort_alleles: bool = True
) -> LDMatrix:
    r"""
    Read LD matrices and Variant IDs from files. Pair each matrix with its corresponding Variant IDs.

    Parameters
    ----------
    ld_path : str
        Path to the input text file containing the lower triangle matrix or .npz file.
    map_path : str
        Path to the input text file containing the Variant IDs.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".
    if_sort_alleles : bool, optional
        Sort alleles in the LD map in alphabetical order and change the sign of the
        LD matrix if the alleles are swapped, by default True.

    Returns
    -------
    LDMatrix
        Object containing the LD matrix and the Variant IDs.

    Raises
    ------
    ValueError
        If the number of variants in the map file does not match the number of rows in the LD matrix.

    Notes
    -----
    Future enhancements planned:

    - Support for npz files (partially implemented)
    - Support for plink bin4 format
    - Support for ldstore bcor format

    The function validates that the LD matrix and map file have consistent dimensions
    and optionally sorts alleles for consistent representation.

    Examples
    --------
    >>> ld_matrix = load_ld('data.ld', 'data.ldmap')
    >>> print(f"Loaded LD matrix with {ld_matrix.r.shape[0]} variants")
    Loaded LD matrix with 1000 variants
    """
    ld_df = load_ld_matrix(ld_path, delimiter)
    logger.info(f"Loaded LD matrix with shape {ld_df.shape} from '{ld_path}'.")
    map_df = load_ld_map(map_path, delimiter)
    logger.info(f"Loaded map file with shape {map_df.shape} from '{map_path}'.")
    if ld_df.shape[0] != map_df.shape[0]:
        raise ValueError(
            "The number of variants in the map file does not match the number of rows in the LD matrix.\n"
            f"Number of variants in the map file: {map_df.shape[0]}, number of rows in the LD matrix: {ld_df.shape[0]}\n"
            f"ld_path: {ld_path}, map_path: {map_path}"
        )
    ld = LDMatrix(map_df, ld_df)
    if if_sort_alleles:
        ld = sort_alleles(ld)

    return ld
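
`load_ld_matrix` is called above but its source is not shown here. As a rough sketch of what converting a lower triangle text file into a full symmetric matrix can look like, assume that row `i` of the file holds `i + 1` delimited values ending on the diagonal. The function name `lower_triangle_to_full` and that file layout are assumptions for illustration only.

```python
def lower_triangle_to_full(text, delimiter="\t"):
    """Expand lower-triangle text into a full symmetric matrix (list of lists).

    Hypothetical helper; assumes row i contains i + 1 values, the last
    being the diagonal entry.
    """
    rows = [
        [float(value) for value in line.split(delimiter)]
        for line in text.strip().splitlines()
    ]
    n = len(rows)
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"Row {i} has {len(row)} values, expected {i + 1}.")
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # Mirror each lower-triangle entry across the diagonal.
            full[i][j] = value
            full[j][i] = value
    return full


text = "1.0\n0.2\t1.0\n0.5\t0.1\t1.0\n"
full = lower_triangle_to_full(text)
print(full)
```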

`load_ld_map(map_path, delimiter='\t')` ¶

Read Variant IDs from a file.

Parameters:

Name	Type	Description	Default
`map_path`	`str`	Path to the input text file containing the Variant IDs.	required
`delimiter`	`str`	Delimiter used in the input file, by default "\t".	`'\t'`

Returns:

Type	Description
`DataFrame`	DataFrame containing the Variant IDs with columns CHR, BP, A1, A2, and SNPID.

Raises:

Type	Description
`ValueError`	If the input file is empty or does not contain the required columns.

Notes

This function assumes that the input file contains the required columns:

Chromosome (CHR)
Base pair position (BP)
Allele 1 (A1)
Allele 2 (A2)

The function performs data cleaning including:

Converting chromosome and position to appropriate types
Removing variants whose alleles are missing or contain characters other than A, C, G, T
Removing variants where A1 == A2
Creating unique SNPID identifiers
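
The allele checks listed above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `clean_map` is hypothetical, the column names `A1` and `A2` are hard-coded, and the chromosome and position handling is left out.

```python
import pandas as pd


def clean_map(map_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the allele checks described above."""
    out = map_df.copy()
    for col in ["A1", "A2"]:
        # Drop missing alleles, normalise case, keep only A/C/G/T strings.
        out = out[out[col].notnull()].copy()
        out[col] = out[col].astype(str).str.upper()
        out = out[out[col].str.match(r"^[ACGT]+$")]
    # Variants whose two alleles are identical are removed.
    out = out[out["A1"] != out["A2"]]
    return out.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "CHR": [1, 1, 1, 2],
        "BP": [1000, 2000, 2500, 3000],
        "A1": ["a", "C", "N", "T"],
        "A2": ["G", "C", "T", None],
    }
)
cleaned = clean_map(raw)
```

Of the four input rows, only the first survives: the others have identical alleles, an invalid base, or a missing allele.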

Examples:

>>> # Create sample map file
>>> contents = "CHR\tBP\tA1\tA2\n1\t1000\tA\tG\n1\t2000\tC\tT\n2\t3000\tT\tC"
>>> with open('map.txt', 'w') as file:
...     file.write(contents)
>>> df = load_ld_map('map.txt')
>>> print(df)
    SNPID       CHR    BP A1 A2
0   1-1000-A-G    1  1000  A  G
1   1-2000-C-T    1  2000  C  T
2   2-3000-C-T    2  3000  T  C

Source code in credtools/ldmatrix.py

def load_ld_map(map_path: str, delimiter: str = "\t") -> pd.DataFrame:
    r"""
    Read Variant IDs from a file.

    Parameters
    ----------
    map_path : str
        Path to the input text file containing the Variant IDs.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    pd.DataFrame
        DataFrame containing the Variant IDs with columns CHR, BP, A1, A2, and SNPID.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain the required columns.

    Notes
    -----
    This function assumes that the input file contains the required columns:

    - Chromosome (CHR)
    - Base pair position (BP)
    - Allele 1 (A1)
    - Allele 2 (A2)

    The function performs data cleaning including:

    - Converting chromosome and position to appropriate types
    - Removing variants whose alleles are missing or contain characters other than A, C, G, T
    - Removing variants where A1 == A2
    - Creating unique SNPID identifiers

    Examples
    --------
    >>> # Create sample map file
    >>> contents = "CHR\tBP\tA1\tA2\n1\t1000\tA\tG\n1\t2000\tC\tT\n2\t3000\tT\tC"
    >>> with open('map.txt', 'w') as file:
    ...     file.write(contents)
    >>> df = load_ld_map('map.txt')
    >>> print(df)
        SNPID       CHR    BP A1 A2
    0   1-1000-A-G    1  1000  A  G
    1   1-2000-C-T    1  2000  C  T
    2   2-3000-C-T    2  3000  T  C
    """
    # TODO: use REF/ALT instead of A1/A2
    map_df = pd.read_csv(map_path, sep=delimiter)
    missing_cols = [col for col in ColName.map_cols if col not in map_df.columns]
    if missing_cols:
        raise ValueError(f"Missing columns in the input file: {missing_cols}")
    outdf = munge_chr(map_df)
    outdf = munge_bp(outdf)
    for col in [ColName.A1, ColName.A2]:
        pre_n = outdf.shape[0]
        outdf = outdf[outdf[col].notnull()]
        outdf[col] = outdf[col].astype(str).str.upper()
        outdf = outdf[outdf[col].str.match(r"^[ACGT]+$")]
        after_n = outdf.shape[0]
        logger.debug(f"Remove {pre_n - after_n} rows because of invalid {col}.")
    outdf = outdf[outdf[ColName.A1] != outdf[ColName.A2]]
    outdf = make_SNPID_unique(
        outdf, col_ea=ColName.A1, col_nea=ColName.A2, remove_duplicates=False
    )
    outdf.reset_index(drop=True, inplace=True)
    # TODO: check if allele frequency is available
    return outdf

`load_ld_matrix(file_path, delimiter='\t')` ¶

Convert a lower triangle matrix from a file to a symmetric square matrix.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the input text file containing the lower triangle matrix.	required
`delimiter`	`str`	Delimiter used in the input file, by default "\t".	`'\t'`

Returns:

Type	Description
`ndarray`	Symmetric square matrix with diagonal filled with 1.

Raises:

Type	Description
`ValueError`	If the input file is empty or does not contain a valid lower triangle matrix.
`FileNotFoundError`	If the specified file does not exist.

Notes

This function assumes that the input file contains a valid lower triangle matrix with each row on a new line and elements separated by the specified delimiter. For .npz files, it loads the first array key in the file.
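
The conversion from a lower triangle to a full matrix can be sketched with NumPy alone. The input array below is made up for illustration and stands in for what would be parsed from the file; the steps follow the description above (mirror, set the diagonal to 1, cast to float32, replace NaN with 0).

```python
import numpy as np

# Lower triangle as it would be parsed from the file (upper part zero-filled).
lower = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.1, 1.0, 0.0],
        [0.2, np.nan, 1.0],
    ]
)

# Mirror the lower triangle; the diagonal is doubled here and reset below.
full = lower + lower.T
np.fill_diagonal(full, 1)

# Store as float32 and replace missing correlations with 0.
full = np.nan_to_num(full.astype(np.float32), nan=0.0)
```

Replacing NaN with 0 treats a missing correlation as "no LD", which keeps the matrix usable downstream at the cost of hiding the missingness.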

Examples:

>>> # Assuming 'lower_triangle.txt' contains (values separated by tabs):
>>> # 1.0
>>> # 0.1 1.0
>>> # 0.2 0.4 1.0
>>> # 0.3 0.5 0.6 1.0
>>> matrix = load_ld_matrix('lower_triangle.txt')
>>> print(matrix)
[[1.  0.1 0.2 0.3]
 [0.1 1.  0.4 0.5]
 [0.2 0.4 1.  0.6]
 [0.3 0.5 0.6 1. ]]

Source code in credtools/ldmatrix.py

def load_ld_matrix(file_path: str, delimiter: str = "\t") -> np.ndarray:
    r"""
    Convert a lower triangle matrix from a file to a symmetric square matrix.

    Parameters
    ----------
    file_path : str
        Path to the input text file containing the lower triangle matrix.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    np.ndarray
        Symmetric square matrix with diagonal filled with 1.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain a valid lower triangle matrix.
    FileNotFoundError
        If the specified file does not exist.

    Notes
    -----
    This function assumes that the input file contains a valid lower triangle matrix
    with each row on a new line and elements separated by the specified delimiter.
    For .npz files, it loads the first array key in the file.

    Examples
    --------
    >>> # Assuming 'lower_triangle.txt' contains (values separated by tabs):
    >>> # 1.0
    >>> # 0.1 1.0
    >>> # 0.2 0.4 1.0
    >>> # 0.3 0.5 0.6 1.0
    >>> matrix = load_ld_matrix('lower_triangle.txt')
    >>> print(matrix)
    [[1.  0.1 0.2 0.3]
     [0.1 1.  0.4 0.5]
     [0.2 0.4 1.  0.6]
     [0.3 0.5 0.6 1. ]]
    """
    if file_path.endswith(".npz"):
        with np.load(file_path) as data:
            ld_file_key = data.files[0]
            matrix = data[ld_file_key].astype(np.float32)
        return np.nan_to_num(matrix, nan=0.0)
    lower_triangle = read_lower_triangle(file_path, delimiter)

    # Create the symmetric matrix
    symmetric_matrix = lower_triangle + lower_triangle.T

    # Fill the diagonal with 1
    np.fill_diagonal(symmetric_matrix, 1)

    # convert to float32
    symmetric_matrix = symmetric_matrix.astype(np.float32)

    # Replace any NaNs with 0 to avoid propagating missing LD values
    symmetric_matrix = np.nan_to_num(symmetric_matrix, nan=0.0)
    return symmetric_matrix

`read_lower_triangle(file_path, delimiter='\t')` ¶

Read a lower triangle matrix from a file.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the input text file containing the lower triangle matrix.	required
`delimiter`	`str`	Delimiter used in the input file, by default "\t".	`'\t'`

Returns:

Type	Description
`ndarray`	Lower triangle matrix.

Raises:

Type	Description
`ValueError`	If the input file is empty or does not contain a valid lower triangle matrix.
`FileNotFoundError`	If the specified file does not exist.

Notes

This function reads a lower triangular matrix in which row i (1-based) contains exactly i elements: the entries from the first column through the diagonal. Entries above the diagonal are filled with zeros.
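
A minimal sketch of the expected file layout and the row-length check, using an in-memory string in place of a real file. The sample values are made up; the parsing mirrors the behaviour described above.

```python
import io

import numpy as np

# Three tab-separated rows: row i (1-based) holds i values.
text = "1.0\n0.1\t1.0\n0.2\t0.4\t1.0\n"

rows = [
    [float(x) for x in line.strip().split("\t")]
    for line in io.StringIO(text)
    if line.strip()
]

n = len(rows)
lower = np.zeros((n, n))
for i, row in enumerate(rows):
    # Row i (0-based) must hold exactly i + 1 values.
    if len(row) != i + 1:
        raise ValueError(f"Row {i + 1}: expected {i + 1} values, got {len(row)}.")
    lower[i, : len(row)] = row
```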

Source code in credtools/ldmatrix.py

def read_lower_triangle(file_path: str, delimiter: str = "\t") -> np.ndarray:
    r"""
    Read a lower triangle matrix from a file.

    Parameters
    ----------
    file_path : str
        Path to the input text file containing the lower triangle matrix.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    np.ndarray
        Lower triangle matrix.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain a valid lower triangle matrix.
    FileNotFoundError
        If the specified file does not exist.

    Notes
    -----
    This function reads a lower triangular matrix in which row i (1-based)
    contains exactly i elements: the entries from the first column through
    the diagonal. Entries above the diagonal are filled with zeros.
    """
    try:
        if file_path.endswith(".gz"):
            with gzip.open(file_path, "rt") as file:
                rows = [
                    list(map(float, line.strip().split(delimiter)))
                    for line in file
                    if line.strip()
                ]
        else:
            with open(file_path, "r") as file:
                rows = [
                    list(map(float, line.strip().split(delimiter)))
                    for line in file
                    if line.strip()
                ]
    except FileNotFoundError:
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")

    if not rows:
        raise ValueError("The input file is empty.")

    n = len(rows)
    lower_triangle = np.zeros((n, n))

    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(
                f"Invalid number of elements in row {i + 1}. Expected {i + 1}, got {len(row)}."
            )
        lower_triangle[i, : len(row)] = row

    return lower_triangle

`sort_alleles(ld)` ¶

Sort alleles in the LD map in alphabetical order. Change the sign of the LD matrix if the alleles are swapped.

Parameters:

Name	Type	Description	Default
`ld`	`LDMatrix`	LDMatrix object containing the Variant IDs and the LD matrix.	required

Returns:

Type	Description
`LDMatrix`	LDMatrix object containing the Variant IDs and the LD matrix with alleles sorted.

Notes

This function ensures consistent allele ordering by:

Sorting alleles alphabetically (A1 <= A2)
Flipping the sign of LD correlations for variants where alleles were swapped
Maintaining diagonal elements as 1.0

This is important for consistent merging across different datasets.
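
The sign-flipping rule can be sketched independently of `LDMatrix`. The sketch below uses an outer product of per-variant signs instead of the in-place row and column updates in the source, so it is an equivalent formulation rather than the library's code; the data are made up.

```python
import numpy as np
import pandas as pd

ld_map = pd.DataFrame({"A1": ["A", "T", "G"], "A2": ["G", "C", "C"]})
r = np.array(
    [
        [1.0, 0.1, 0.2],
        [0.1, 1.0, 0.3],
        [0.2, 0.3, 1.0],
    ]
)

# Variants whose A1 sorts after A2 need their alleles swapped.
swapped = (ld_map["A1"] > ld_map["A2"]).to_numpy()

# Flip the sign of the matching rows and columns. An entry between two
# swapped variants is flipped twice and so keeps its original sign.
sign = np.where(swapped, -1.0, 1.0)
r_sorted = r * np.outer(sign, sign)

# Alleles in alphabetical order, one row per variant.
sorted_alleles = np.sort(ld_map[["A1", "A2"]].to_numpy(), axis=1)
```

Here the second and third variants are both swapped, so their mutual correlation stays at 0.3 while their correlations with the first variant change sign. The diagonal is unaffected because each diagonal entry is multiplied by the same sign twice.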

Examples:

>>> map_df = pd.DataFrame({
...     'SNPID': ['1-1000-A-G', '1-2000-C-T'],
...     'CHR': [1, 1],
...     'BP': [1000, 2000],
...     'A1': ['A', 'T'],
...     'A2': ['G', 'C']
... })
>>> r_matrix = np.array([[1. , 0.1],
...                      [0.1, 1. ]])
>>> ld = LDMatrix(map_df, r_matrix)
>>> sorted_ld = sort_alleles(ld)
>>> print(sorted_ld.map)
    SNPID       CHR    BP A1 A2
0   1-1000-A-G    1  1000  A  G
1   1-2000-C-T    1  2000  C  T
>>> print(sorted_ld.r)
array([[ 1. , -0.1],
        [-0.1,  1. ]])

Source code in credtools/ldmatrix.py

def sort_alleles(ld: LDMatrix) -> LDMatrix:
    """
    Sort alleles in the LD map in alphabetical order. Change the sign of the LD matrix if the alleles are swapped.

    Parameters
    ----------
    ld : LDMatrix
        LDMatrix object containing the Variant IDs and the LD matrix.

    Returns
    -------
    LDMatrix
        LDMatrix object containing the Variant IDs and the LD matrix with alleles sorted.

    Notes
    -----
    This function ensures consistent allele ordering by:

    1. Sorting alleles alphabetically (A1 <= A2)
    2. Flipping the sign of LD correlations for variants where alleles were swapped
    3. Maintaining diagonal elements as 1.0

    This is important for consistent merging across different datasets.

    Examples
    --------
    >>> map_df = pd.DataFrame({
    ...     'SNPID': ['1-1000-A-G', '1-2000-C-T'],
    ...     'CHR': [1, 1],
    ...     'BP': [1000, 2000],
    ...     'A1': ['A', 'T'],
    ...     'A2': ['G', 'C']
    ... })
    >>> r_matrix = np.array([[1. , 0.1],
    ...                      [0.1, 1. ]])
    >>> ld = LDMatrix(map_df, r_matrix)
    >>> sorted_ld = sort_alleles(ld)
    >>> print(sorted_ld.map)
        SNPID       CHR    BP A1 A2
    0   1-1000-A-G    1  1000  A  G
    1   1-2000-C-T    1  2000  C  T
    >>> print(sorted_ld.r)
    array([[ 1. , -0.1],
            [-0.1,  1. ]])
    """
    ld_df = ld.r.copy()
    ld_map = ld.map.copy()
    ld_map[["sort_a1", "sort_a2"]] = np.sort(ld_map[[ColName.A1, ColName.A2]], axis=1)
    swapped_index = ld_map[ld_map[ColName.A1] != ld_map["sort_a1"]].index
    # Change the sign of the rows and columns of the LD matrix if the alleles are swapped
    ld_df[swapped_index] *= -1
    ld_df[:, swapped_index] *= -1
    np.fill_diagonal(ld_df, 1)

    ld_map[ColName.A1] = ld_map["sort_a1"]
    ld_map[ColName.A2] = ld_map["sort_a2"]
    ld_map.drop(columns=["sort_a1", "sort_a2"], inplace=True)
    return LDMatrix(ld_map, ld_df)

Summary Statistics¶

Functions for processing summary statistics data.

`check_colnames(df)` ¶

Check column names in the DataFrame and fill missing columns with None.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame to check for column names.	required

Returns:

Type	Description
`DataFrame`	DataFrame with all required columns, filling missing ones with None.

Notes

This function ensures that all required summary statistics columns are present in the DataFrame. Missing columns are added with None values, and the result contains only the required columns, in their standard order.
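
A minimal sketch of this behaviour. The list `SUMSTAT_COLS` is an assumption standing in for `ColName.sumstat_cols`; its contents are taken from the `load_sumstats` example output and may not match the library exactly.

```python
import pandas as pd

# Assumed column order, for illustration only.
SUMSTAT_COLS = ["SNPID", "CHR", "BP", "EA", "NEA", "EAF", "BETA", "SE", "P", "MAF", "RSID"]

df = pd.DataFrame({"CHR": [1], "BP": [1000], "P": [1e-9], "EXTRA": ["x"]})

out = df.copy()
for col in SUMSTAT_COLS:
    if col not in out.columns:
        out[col] = None
# Selecting by the list both reorders columns and drops anything extra.
out = out[SUMSTAT_COLS]
```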

Source code in credtools/sumstats.py

def check_colnames(df: pd.DataFrame) -> pd.DataFrame:
    """
    Check column names in the DataFrame and fill missing columns with None.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame to check for column names.

    Returns
    -------
    pd.DataFrame
        DataFrame with all required columns, filling missing ones with None.

    Notes
    -----
    This function ensures that all required summary statistics columns are present
    in the DataFrame. Missing columns are added with None values, and the result
    contains only the required columns, in their standard order.
    """
    outdf: pd.DataFrame = df.copy()
    for col in ColName.sumstat_cols:
        if col not in outdf.columns:
            outdf[col] = None
    return outdf[ColName.sumstat_cols]

`check_mandatory_cols(df)` ¶

Check if the DataFrame contains all mandatory columns.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The DataFrame to check for mandatory columns.	required

Returns:

Type	Description
`None`

Raises:

Type	Description
`ValueError`	If any mandatory columns are missing.

Notes

Mandatory columns are defined in ColName.mandatory_cols and typically include essential fields like chromosome, position, alleles, effect size, and p-value.
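
The check amounts to a set difference between the mandatory names and the columns present. In the sketch below, `MANDATORY` is an assumed stand-in for `ColName.mandatory_cols`; the actual list is defined by the library.

```python
import pandas as pd

# Assumed mandatory set, for illustration only.
MANDATORY = ["CHR", "BP", "EA", "NEA", "BETA", "SE", "P"]

df = pd.DataFrame({"CHR": [1], "BP": [1000], "EA": ["A"], "NEA": ["G"], "P": [1e-9]})

# Any name left over after the difference is a missing column.
missing = sorted(set(MANDATORY) - set(df.columns))
```

With this input, `missing` contains `BETA` and `SE`, which is the situation in which the function raises `ValueError`.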

Source code in credtools/sumstats.py

def check_mandatory_cols(df: pd.DataFrame) -> None:
    """
    Check if the DataFrame contains all mandatory columns.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to check for mandatory columns.

    Returns
    -------
    None

    Raises
    ------
    ValueError
        If any mandatory columns are missing.

    Notes
    -----
    Mandatory columns are defined in ColName.mandatory_cols and typically include
    essential fields like chromosome, position, alleles, effect size, and p-value.
    """
    outdf = df.copy()
    missing_cols = set(ColName.mandatory_cols) - set(outdf.columns)
    if missing_cols:
        raise ValueError(f"Missing mandatory columns: {missing_cols}")
    return None

`get_significant_snps(df, pvalue_threshold=5e-08, use_most_sig_if_no_sig=True)` ¶

Retrieve significant SNPs from the input DataFrame based on a p-value threshold.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input summary statistics containing SNP information.	required
`pvalue_threshold`	`float`	The p-value threshold for significance, by default 5e-8.	`5e-08`
`use_most_sig_if_no_sig`	`bool`	Whether to return the most significant SNP if no SNP meets the threshold, by default True.	`True`

Returns:

Type	Description
`DataFrame`	A DataFrame containing significant SNPs, sorted by p-value in ascending order.

Raises:

Type	Description
`ValueError`	If no significant SNPs are found and `use_most_sig_if_no_sig` is False, or if the DataFrame is empty.
`KeyError`	If required columns are not present in the input DataFrame.

Notes

If no SNPs meet the significance threshold and use_most_sig_if_no_sig is True, the function returns the SNP with the smallest p-value.
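
The fallback can be sketched as follows, with made-up data in which no SNP passes the threshold. The column names `SNPID` and `P` are written out literally here rather than taken from `ColName`.

```python
import pandas as pd

df = pd.DataFrame({"SNPID": ["rs1", "rs2", "rs3"], "P": [1e-4, 0.05, 1e-6]})
threshold = 5e-8

sig = df.loc[df["P"] <= threshold]
if sig.empty:
    # Nothing passes the threshold: fall back to the smallest p-value.
    sig = df.loc[df["P"] == df["P"].min()]
```

According to the source listing, the index is reset only when SNPs pass the threshold; in the fallback case the selected row keeps its original index.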

Examples:

>>> data = {
...     'SNPID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-9, 0.05, 1e-8]
... }
>>> df = pd.DataFrame(data)
>>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
>>> print(significant_snps)
  SNPID             P
0   rs1  1.000000e-09
1   rs3  1.000000e-08

Source code in credtools/sumstats.py

def get_significant_snps(
    df: pd.DataFrame,
    pvalue_threshold: float = 5e-8,
    use_most_sig_if_no_sig: bool = True,
) -> pd.DataFrame:
    """
    Retrieve significant SNPs from the input DataFrame based on a p-value threshold.

    Parameters
    ----------
    df : pd.DataFrame
        The input summary statistics containing SNP information.
    pvalue_threshold : float, optional
        The p-value threshold for significance, by default 5e-8.
    use_most_sig_if_no_sig : bool, optional
        Whether to return the most significant SNP if no SNP meets the threshold, by default True.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing significant SNPs, sorted by p-value in ascending order.

    Raises
    ------
    ValueError
        If no significant SNPs are found and `use_most_sig_if_no_sig` is False,
        or if the DataFrame is empty.
    KeyError
        If required columns are not present in the input DataFrame.

    Notes
    -----
    If no SNPs meet the significance threshold and `use_most_sig_if_no_sig` is True,
    the function returns the SNP with the smallest p-value.

    Examples
    --------
    >>> data = {
    ...     'SNPID': ['rs1', 'rs2', 'rs3'],
    ...     'P': [1e-9, 0.05, 1e-8]
    ... }
    >>> df = pd.DataFrame(data)
    >>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
    >>> print(significant_snps)
      SNPID             P
    0   rs1  1.000000e-09
    1   rs3  1.000000e-08
    """
    required_columns = {ColName.P, ColName.SNPID}
    missing_columns = required_columns - set(df.columns)
    if missing_columns:
        raise KeyError(
            f"The following required columns are missing from the DataFrame: {missing_columns}"
        )

    sig_df = df.loc[df[ColName.P] <= pvalue_threshold].copy()

    if sig_df.empty:
        if use_most_sig_if_no_sig:
            min_pvalue = df[ColName.P].min()
            sig_df = df.loc[df[ColName.P] == min_pvalue].copy()
            if sig_df.empty:
                raise ValueError("The DataFrame is empty. No SNPs available to select.")
            logger.debug(
                f"Using the most significant SNP: {sig_df.iloc[0][ColName.SNPID]}"
            )
            logger.debug(f"p-value: {sig_df.iloc[0][ColName.P]}")
        else:
            raise ValueError("No significant SNPs found.")
    else:
        sig_df.sort_values(by=ColName.P, inplace=True)
        sig_df.reset_index(drop=True, inplace=True)

    return sig_df

`load_sumstats(filename, if_sort_alleles=True, sep=None, nrows=None, skiprows=0, comment=None, gzipped=None)` ¶

Load summary statistics from a file.

Parameters:

Name	Type	Description	Default
`filename`	`str`	The path to the file containing the summary statistics. The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P.	required
`if_sort_alleles`	`bool`	Whether to sort alleles in alphabetical order, by default True.	`True`
`sep`	`Optional[str]`	The delimiter to use. If None, the delimiter is inferred from the file, by default None.	`None`
`nrows`	`Optional[int]`	Number of rows to read. If None, all rows are read, by default None.	`None`
`skiprows`	`int`	Number of lines to skip at the start of the file, by default 0.	`0`
`comment`	`Optional[str]`	Character marking the start of a comment; the rest of the line after it is ignored, by default None.	`None`
`gzipped`	`Optional[bool]`	Whether the file is gzipped. If None, it is inferred from the file extension, by default None.	`None`

Returns:

Type	Description
`DataFrame`	A DataFrame containing the loaded summary statistics.

Notes

The function performs the following operations:

Auto-detects file compression (gzip) from file extension
Auto-detects delimiter (tab, comma, or space) from file content
Loads the data using pandas.read_csv
Applies comprehensive data munging and quality control
Optionally sorts alleles for consistency

The delimiter is inferred from the first line read after `skiprows` lines are skipped: tab is preferred, then comma, then a single space. Quality control covers validation of chromosomes, positions, alleles, p-values, effect sizes, and allele frequencies.
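
The delimiter check can be sketched as a small standalone function. The name `sniff_delimiter` is hypothetical; the order of the tests (tab, then comma, then space) follows the source listing.

```python
def sniff_delimiter(header_line: str) -> str:
    """Hypothetical helper: pick a delimiter from the header line."""
    if "\t" in header_line:
        return "\t"
    if "," in header_line:
        return ","
    # Neither tab nor comma found: assume space-separated.
    return " "
```

Because tab is tested first, a tab-separated file whose values contain commas is still read as tab-separated. Pass `sep` explicitly when the header is ambiguous.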

Examples:

>>> # Load summary statistics with automatic format detection
>>> sumstats = load_sumstats('gwas_results.txt.gz')
>>> print(f"Loaded {len(sumstats)} variants")
Loaded 1000000 variants

>>> # Load with specific parameters
>>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
>>> print(sumstats.columns.tolist())
['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']

Source code in credtools/sumstats.py

def load_sumstats(
    filename: str,
    if_sort_alleles: bool = True,
    sep: Optional[str] = None,
    nrows: Optional[int] = None,
    skiprows: int = 0,
    comment: Optional[str] = None,
    gzipped: Optional[bool] = None,
) -> pd.DataFrame:
    """
    Load summary statistics from a file.

    Parameters
    ----------
    filename : str
        The path to the file containing the summary statistics.
        The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P.
    if_sort_alleles : bool, optional
        Whether to sort alleles in alphabetical order, by default True.
    sep : Optional[str], optional
        The delimiter to use. If None, the delimiter is inferred from the file, by default None.
    nrows : Optional[int], optional
        Number of rows to read. If None, all rows are read, by default None.
    skiprows : int, optional
        Number of lines to skip at the start of the file, by default 0.
    comment : Optional[str], optional
        Character marking the start of a comment; the rest of the line after it is ignored, by default None.
    gzipped : Optional[bool], optional
        Whether the file is gzipped. If None, it is inferred from the file extension, by default None.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the loaded summary statistics.

    Notes
    -----
    The function performs the following operations:

    1. Auto-detects file compression (gzip) from file extension
    2. Auto-detects delimiter (tab, comma, or space) from file content
    3. Loads the data using pandas.read_csv
    4. Applies comprehensive data munging and quality control
    5. Optionally sorts alleles for consistency

    The delimiter is inferred from the first line read after `skiprows` lines are skipped:
    tab is preferred, then comma, then a single space. Quality control covers validation of
    chromosomes, positions, alleles, p-values, effect sizes, and allele frequencies.

    Examples
    --------
    >>> # Load summary statistics with automatic format detection
    >>> sumstats = load_sumstats('gwas_results.txt.gz')
    >>> print(f"Loaded {len(sumstats)} variants")
    Loaded 1000000 variants

    >>> # Load with specific parameters
    >>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
    >>> print(sumstats.columns.tolist())
    ['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']
    """
    # determine whether the file is gzipped
    if gzipped is None:
        gzipped = filename.endswith("gz")

    # read the first line of the file to determine the separator
    if sep is None:
        if gzipped:
            f = gzip.open(filename, "rt")

        else:
            f = open(filename, "rt")
        if skiprows > 0:
            for _ in range(skiprows):
                f.readline()
        line = f.readline()
        f.close()
        if "\t" in line:
            sep = "\t"
        elif "," in line:
            sep = ","
        else:
            sep = " "
    logger.debug(f"File {filename} is gzipped: {gzipped}")
    logger.debug(f"Separator is {sep}")
    logger.debug(f"loading data from {filename}")
    # load the data with the resolved separator and compression
    sumstats = pd.read_csv(
        filename,
        sep=sep,
        nrows=nrows,
        skiprows=skiprows,
        comment=comment,
        compression="gzip" if gzipped else None,
    )
    sumstats = munge(sumstats)
    logger.info(f"Loaded {len(sumstats)} rows sumstats from {filename}")
    if if_sort_alleles:
        sumstats = sort_alleles(sumstats)
    return sumstats

`make_SNPID_unique(sumstat, remove_duplicates=True, col_chr=ColName.CHR, col_bp=ColName.BP, col_ea=ColName.EA, col_nea=ColName.NEA, col_p=ColName.P)` ¶

Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.

Parameters:

Name	Type	Description	Default
`sumstat`	`DataFrame`	The input summary statistics containing SNP information.	required
`remove_duplicates`	`bool`	Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True.	`True`
`col_chr`	`str`	The column name for chromosome information, by default ColName.CHR.	`CHR`
`col_bp`	`str`	The column name for base-pair position information, by default ColName.BP.	`BP`
`col_ea`	`str`	The column name for effect allele information, by default ColName.EA.	`EA`
`col_nea`	`str`	The column name for non-effect allele information, by default ColName.NEA.	`NEA`
`col_p`	`str`	The column name for p-value information, by default ColName.P.	`P`

Returns:

Type	Description
`DataFrame`	The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets.

Raises:

Type	Description
`KeyError`	If required columns are missing from the input DataFrame.
`ValueError`	If the input DataFrame is empty or becomes empty after processing.

Notes

This function constructs a unique SNPID by concatenating chromosome, base-pair position, and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of multiple summary statistics without the need for extensive duplicate comparisons.

The unique SNPID format: "chr-bp-sortedEA-sortedNEA"

If duplicates are found and remove_duplicates is False, the first occurrence keeps its SNPID and each later occurrence receives a "-N" suffix, where N counts the repeats (1, 2, ...).
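
A minimal sketch of the identifier construction and the duplicate suffix, with made-up data. It builds the suffix with `Series.where` rather than the string replacement used in the source, so treat it as an equivalent illustration, not the library's code.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "CHR": [1, 1, 2],
        "BP": [12345, 12345, 67890],
        "EA": ["G", "A", "G"],
        "NEA": ["A", "G", "A"],
    }
)

# Sort each allele pair so that A/G and G/A map to the same identifier.
pairs = [sorted(p) for p in zip(df["EA"], df["NEA"])]
snpid = pd.Series(
    [f"{c}-{b}-{a1}-{a2}" for c, b, (a1, a2) in zip(df["CHR"], df["BP"], pairs)],
    index=df.index,
)

# Later occurrences of a repeated identifier get "-1", "-2", ...
occurrence = snpid.groupby(snpid).cumcount()
suffix = ("-" + occurrence.astype(str)).where(occurrence > 0, "")
unique_id = snpid + suffix
```

The first two rows describe the same variant with the alleles in opposite order, so they share the base identifier `1-12345-A-G` and the second one is suffixed with `-1`.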

Examples:

>>> data = {
...     'CHR': ['1', '1', '2'],
...     'BP': [12345, 12345, 67890],
...     'EA': ['A', 'A', 'G'],
...     'NEA': ['G', 'G', 'A'],
...     'RSID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-5, 1e-6, 1e-7]
... }
>>> df = pd.DataFrame(data)
>>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
>>> print(unique_df)
    SNPID       CHR     BP EA NEA RSID         P
0  1-12345-A-G    1  12345  A   G  rs2  1.000000e-06
1  2-67890-A-G    2  67890  G   A  rs3  1.000000e-07

Source code in credtools/sumstats.py

def make_SNPID_unique(
    sumstat: pd.DataFrame,
    remove_duplicates: bool = True,
    col_chr: str = ColName.CHR,
    col_bp: str = ColName.BP,
    col_ea: str = ColName.EA,
    col_nea: str = ColName.NEA,
    col_p: str = ColName.P,
) -> pd.DataFrame:
    """
    Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.

    Parameters
    ----------
    sumstat : pd.DataFrame
        The input summary statistics containing SNP information.
    remove_duplicates : bool, optional
        Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True.
    col_chr : str, optional
        The column name for chromosome information, by default ColName.CHR.
    col_bp : str, optional
        The column name for base-pair position information, by default ColName.BP.
    col_ea : str, optional
        The column name for effect allele information, by default ColName.EA.
    col_nea : str, optional
        The column name for non-effect allele information, by default ColName.NEA.
    col_p : str, optional
        The column name for p-value information, by default ColName.P.

    Returns
    -------
    pd.DataFrame
        The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets.

    Raises
    ------
    KeyError
        If required columns are missing from the input DataFrame.
    ValueError
        If the input DataFrame is empty or becomes empty after processing.

    Notes
    -----
    This function constructs a unique SNPID by concatenating chromosome, base-pair position,
    and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of
    multiple summary statistics without the need for extensive duplicate comparisons.

    The unique SNPID format: "chr-bp-sortedEA-sortedNEA"

    If duplicates are found and `remove_duplicates` is False, the first occurrence keeps
    its SNPID and each later occurrence receives a "-N" suffix, where N counts the
    repeats (1, 2, ...).

    Examples
    --------
    >>> data = {
    ...     'CHR': ['1', '1', '2'],
    ...     'BP': [12345, 12345, 67890],
    ...     'EA': ['A', 'A', 'G'],
    ...     'NEA': ['G', 'G', 'A'],
    ...     'RSID': ['rs1', 'rs2', 'rs3'],
    ...     'P': [1e-5, 1e-6, 1e-7]
    ... }
    >>> df = pd.DataFrame(data)
    >>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
    >>> print(unique_df)
        SNPID       CHR     BP EA NEA RSID         P
    0  1-12345-A-G    1  12345  A   G  rs2  1.000000e-06
    1  2-67890-A-G    2  67890  G   A  rs3  1.000000e-07
    """
    required_columns = {
        col_chr,
        col_bp,
        col_ea,
        col_nea,
    }
    missing_columns = required_columns - set(sumstat.columns)
    if missing_columns:
        raise KeyError(
            f"The following required columns are missing from the DataFrame: {missing_columns}"
        )

    if sumstat.empty:
        raise ValueError("The input DataFrame is empty.")

    df = sumstat.copy()

    # Sort alleles to ensure unique representation (EA <= NEA)
    allele_df = df[[col_ea, col_nea]].apply(
        lambda row: sorted([row[col_ea], row[col_nea]]), axis=1, result_type="expand"
    )
    allele_df.columns = [col_ea, col_nea]

    # Create unique SNPID
    df[ColName.SNPID] = (
        df[col_chr].astype(str)
        + "-"
        + df[col_bp].astype(str)
        + "-"
        + allele_df[col_ea]
        + "-"
        + allele_df[col_nea]
    )

    # move SNPID to the first column
    cols = df.columns.tolist()
    cols.insert(0, cols.pop(cols.index(ColName.SNPID)))
    df = df[cols]

    n_duplicated = df.duplicated(subset=[ColName.SNPID]).sum()

    if remove_duplicates and n_duplicated > 0:
        logger.debug(f"Number of duplicated SNPs: {n_duplicated}")
        if col_p in df.columns:
            # Sort by p-value to keep the SNP with the smallest p-value
            df.sort_values(by=col_p, inplace=True)
        df.drop_duplicates(subset=[ColName.SNPID], keep="first", inplace=True)
        # Sort DataFrame by chromosome and base-pair position
        df.sort_values(by=[col_chr, col_bp], inplace=True)
        df.reset_index(drop=True, inplace=True)
    elif n_duplicated > 0 and not remove_duplicates:
        logger.warning(
            """Duplicated SNPs detected. To remove duplicates, set `remove_duplicates=True`.
            Change the Unique SNP identifier to make it unique."""
        )
        # Append an occurrence counter to repeated identifiers: the first occurrence keeps
        # 1-12345-A-G, later ones become 1-12345-A-G-1, 1-12345-A-G-2, etc.
        dup_tail = "-" + df.groupby(ColName.SNPID).cumcount().astype(str)
        dup_tail = dup_tail.str.replace("-0", "")
        df[ColName.SNPID] = df[ColName.SNPID] + dup_tail

    logger.debug("Unique SNPIDs have been successfully created.")
    logger.debug(f"Total unique SNPs: {len(df)}")

    return df

`munge(df)` ¶

Munge the summary statistics DataFrame by performing a series of transformations.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input DataFrame containing summary statistics.	required

Returns:

Type	Description
`DataFrame`	The munged DataFrame with necessary transformations applied.

Raises:

Type	Description
`ValueError`	If any mandatory columns are missing.

Notes

This function performs comprehensive data cleaning and standardization:

Validates mandatory columns are present
Removes entirely missing columns
Cleans chromosome and position data
Validates and standardizes allele information
Creates unique SNP identifiers
Validates p-values, effect sizes, and standard errors
Processes allele frequencies
Handles rsID information if present

The function applies strict quality control and may remove variants that don't meet validation criteria.

Source code in credtools/sumstats.py

def munge(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge the summary statistics DataFrame by performing a series of transformations.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing summary statistics.

    Returns
    -------
    pd.DataFrame
        The munged DataFrame with necessary transformations applied.

    Raises
    ------
    ValueError
        If any mandatory columns are missing.

    Notes
    -----
    This function performs comprehensive data cleaning and standardization:

    1. Validates mandatory columns are present
    2. Removes entirely missing columns
    3. Cleans chromosome and position data
    4. Validates and standardizes allele information
    5. Creates unique SNP identifiers
    6. Validates p-values, effect sizes, and standard errors
    7. Processes allele frequencies
    8. Handles rsID information if present

    The function applies strict quality control and may remove variants
    that don't meet validation criteria.
    """
    check_mandatory_cols(df)
    outdf = df.copy()
    outdf = rm_col_allna(outdf)
    outdf = munge_chr(outdf)
    outdf = munge_bp(outdf)
    outdf = munge_allele(outdf)
    outdf = make_SNPID_unique(outdf)
    outdf = munge_pvalue(outdf)
    outdf = outdf.sort_values(by=[ColName.CHR, ColName.BP])
    outdf = munge_beta(outdf)
    outdf = munge_se(outdf)
    if ColName.EAF in outdf.columns:
        outdf = munge_eaf(outdf)
        outdf[ColName.MAF] = outdf[ColName.EAF]
        outdf = munge_maf(outdf)
    if ColName.RSID in outdf.columns:
        outdf = munge_rsid(outdf)
    outdf = check_colnames(outdf)
    return outdf

`munge_allele(df)` ¶

Munge allele columns.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with allele columns.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged allele columns.

Notes

This function:

1. Removes rows with missing allele values
2. Converts alleles to uppercase
3. Validates alleles contain only valid DNA bases (A, C, G, T)
4. Removes variants where effect allele equals non-effect allele

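Rows with invalid alleles and variants whose effect allele equals the non-effect allele are removed. Note that the final EA != NEA filter in the source below does not itself emit a log message.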
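The allele rules listed above can be sketched with plain pandas. This is an illustration under assumptions, not the library's validation helper: the column labels `EA` and `NEA` are placeholders, and multi-base alleles such as `AT` are assumed to be acceptable as long as every character is A, C, G, or T.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "EA": ["a", "AT", "G", "N", None],
        "NEA": ["g", "A", "G", "T", "C"],
    }
)

# 1. Drop rows with a missing allele.
out = toy.dropna(subset=["EA", "NEA"]).copy()

# 2. Uppercase both allele columns.
for col in ["EA", "NEA"]:
    out[col] = out[col].str.upper()

# 3. Keep rows where both alleles consist only of A, C, G, T.
valid = out["EA"].str.fullmatch("[ACGT]+") & out["NEA"].str.fullmatch("[ACGT]+")
out = out[valid]

# 4. Drop rows where the two alleles are identical.
out = out[out["EA"] != out["NEA"]]

print(out)
```

Only the first two rows survive: the `G`/`G` row has identical alleles, `N` is not a valid base, and the last row has a missing effect allele.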

Source code in credtools/sumstats.py

def munge_allele(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge allele columns.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with allele columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged allele columns.

    Notes
    -----
    This function:

    1. Removes rows with missing allele values
    2. Converts alleles to uppercase
    3. Validates alleles contain only valid DNA bases (A, C, G, T)
    4. Removes variants where effect allele equals non-effect allele

    Invalid alleles and monomorphic variants are removed and logged.
    """
    validate = _get_validate_and_clean_column()
    _transform_allele = _get_transform_allele()
    outdf = df.copy()
    for col in [ColName.EA, ColName.NEA]:
        outdf = validate(
            df=outdf,
            col_name=col,
            col_type=ColType.EA,
            allow_na=ColAllowNA.EA,
            transform_func=_transform_allele,
        )
    outdf = outdf[outdf[ColName.EA] != outdf[ColName.NEA]]
    return outdf

`munge_beta(df)` ¶

Munge beta column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with beta column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged beta column.

Notes

This function:

1. Converts beta values to numeric type
2. Removes rows with missing beta values
3. Converts to appropriate data type

Invalid beta values are removed and logged.
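A minimal sketch of the coerce-then-drop pattern described above, assuming a column labelled `BETA` (a placeholder) and using `pandas.to_numeric`. The library's own helper may differ in detail.

```python
import pandas as pd

toy = pd.DataFrame({"BETA": ["0.12", "-0.5", "NA", None, "abc"]})

# Unparseable entries become NaN instead of raising an error.
beta = pd.to_numeric(toy["BETA"], errors="coerce")

# Rows whose beta could not be parsed (or was missing) are dropped.
out = toy.assign(BETA=beta).dropna(subset=["BETA"])

print(out)
```

No range check is applied here, which matches the call in the source below: `munge_beta` passes neither `min_val` nor `max_val`.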

Source code in credtools/sumstats.py

def munge_beta(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge beta column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with beta column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged beta column.

    Notes
    -----
    This function:

    1. Converts beta values to numeric type
    2. Removes rows with missing beta values
    3. Converts to appropriate data type

    Invalid beta values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.BETA,
        col_type=ColType.BETA,
        allow_na=ColAllowNA.BETA,
    )

`munge_bp(df)` ¶

Munge position column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with position column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged position column.

Notes

This function:

1. Removes rows with missing position values
2. Converts position to numeric type
3. Validates positions are within acceptable range (exclusive: > 0, < 300M)
4. Converts to appropriate data type

Invalid position values are removed and logged.
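The exclusive range check can be illustrated with a small helper. `within` is a hypothetical function written for this example; its keyword names mirror the `min_val`, `max_val`, `exclude_min`, and `exclude_max` arguments seen in the source below, and the bounds 0 and 300,000,000 are taken from the note above.

```python
import pandas as pd


def within(values, min_val=None, max_val=None, exclude_min=False, exclude_max=False):
    """Return a boolean mask for values inside the requested range."""
    mask = pd.Series(True, index=values.index)
    if min_val is not None:
        mask &= (values > min_val) if exclude_min else (values >= min_val)
    if max_val is not None:
        mask &= (values < max_val) if exclude_max else (values <= max_val)
    return mask


bp = pd.Series([0, 1, 12345, 300_000_000, 299_999_999])
keep = within(bp, 0, 300_000_000, exclude_min=True, exclude_max=True)

print(bp[keep].tolist())
```

With both bounds exclusive, a position of exactly 0 or exactly 300,000,000 is rejected.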

Source code in credtools/sumstats.py

def munge_bp(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge position column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with position column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged position column.

    Notes
    -----
    This function:

    1. Removes rows with missing position values
    2. Converts position to numeric type
    3. Validates positions are within acceptable range (exclusive: > 0, < 300M)
    4. Converts to appropriate data type

    Invalid position values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.BP,
        col_type=ColType.BP,
        min_val=ColRange.BP_MIN,
        max_val=ColRange.BP_MAX,
        allow_na=ColAllowNA.BP,
        exclude_min=True,
        exclude_max=True,
    )

`munge_chr(df)` ¶

Munge chromosome column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with chromosome column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged chromosome column.

Notes

This function:

1. Removes rows with missing chromosome values
2. Converts chromosome to string and removes 'chr' prefix
3. Converts X chromosome to numeric value (23)
4. Validates chromosome values are within acceptable range
5. Converts to appropriate data type

Invalid chromosome values are removed and logged.
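The chromosome steps can be sketched as follows. `transform_chr` is a hypothetical stand-in for the library's transform, and the accepted range 1 to 23 is an assumption, since the actual values of `ColRange.CHR_MIN` and `ColRange.CHR_MAX` are not shown on this page.

```python
import pandas as pd


def transform_chr(value):
    """Strip a leading 'chr' and map X to 23 (hypothetical stand-in)."""
    text = str(value)
    if text.lower().startswith("chr"):
        text = text[3:]
    return "23" if text.upper() == "X" else text


raw = pd.Series(["chr1", "22", "chrX", "X", "MT", 7])

# Anything that is not a number after the transform becomes NaN.
as_num = pd.to_numeric(raw.map(transform_chr), errors="coerce")

# Assumed range: autosomes 1-22 plus X encoded as 23.
kept = as_num[(as_num >= 1) & (as_num <= 23)].astype(int)

print(kept.tolist())
```

In this sketch `MT` is dropped because it has no numeric encoding; whether the library handles other contigs differently is not stated here.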

Source code in credtools/sumstats.py

def munge_chr(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge chromosome column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with chromosome column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged chromosome column.

    Notes
    -----
    This function:

    1. Removes rows with missing chromosome values
    2. Converts chromosome to string and removes 'chr' prefix
    3. Converts X chromosome to numeric value (23)
    4. Validates chromosome values are within acceptable range
    5. Converts to appropriate data type

    Invalid chromosome values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.CHR,
        col_type=ColType.CHR,
        min_val=ColRange.CHR_MIN,
        max_val=ColRange.CHR_MAX,
        allow_na=ColAllowNA.CHR,
        transform_func=_get_transform_chr(),
    )

`munge_eaf(df)` ¶

Munge effect allele frequency column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with effect allele frequency column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged effect allele frequency column.

Notes

This function:

1. Converts EAF values to numeric type
2. Removes rows with missing EAF values
3. Validates EAF values are within range [0, 1] (inclusive)
4. Converts to appropriate data type

Invalid EAF values are removed and logged.

Source code in credtools/sumstats.py

def munge_eaf(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge effect allele frequency column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with effect allele frequency column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged effect allele frequency column.

    Notes
    -----
    This function:

    1. Converts EAF values to numeric type
    2. Removes rows with missing EAF values
    3. Validates EAF values are within range [0, 1] (inclusive)
    4. Converts to appropriate data type

    Invalid EAF values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.EAF,
        col_type=ColType.EAF,
        min_val=ColRange.EAF_MIN,
        max_val=ColRange.EAF_MAX,
        allow_na=ColAllowNA.EAF,
    )

`munge_maf(df)` ¶

Munge minor allele frequency column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with minor allele frequency column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged minor allele frequency column.

Notes

This function:

1. Converts MAF values to numeric type
2. Removes rows with missing MAF values
3. Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
4. Validates MAF values are within acceptable range
5. Converts to appropriate data type

Invalid MAF values are removed and logged.
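The folding step (frequencies above 0.5 become 1 - frequency) is easy to check on a few values. This is a sketch with `Series.where`, not the library's `_transform_maf`.

```python
import pandas as pd

eaf = pd.Series([0.2, 0.5, 0.9, 1.0])

# Keep values up to 0.5 as they are; replace the rest with 1 - frequency.
maf = eaf.where(eaf <= 0.5, 1 - eaf)

print(maf.round(3).tolist())
```

After folding, every value lies in the interval from 0 to 0.5, so a frequency of 0.9 and a frequency of 0.1 yield the same minor allele frequency.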

Source code in credtools/sumstats.py

def munge_maf(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge minor allele frequency column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with minor allele frequency column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged minor allele frequency column.

    Notes
    -----
    This function:

    1. Converts MAF values to numeric type
    2. Removes rows with missing MAF values
    3. Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
    4. Validates MAF values are within acceptable range
    5. Converts to appropriate data type

    Invalid MAF values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.MAF,
        col_type=ColType.MAF,
        min_val=ColRange.MAF_MIN,
        max_val=ColRange.MAF_MAX,
        allow_na=ColAllowNA.MAF,
        transform_func=_transform_maf,
    )

`munge_pvalue(df)` ¶

Munge p-value column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with p-value column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged p-value column.

Notes

This function:

1. Converts p-values to numeric type
2. Removes rows with missing p-values
3. Validates p-values are within acceptable range (exclusive: > 0, < 1)
4. Converts to appropriate data type

Invalid p-values are removed and logged.
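One practical consequence of the exclusive lower bound: a p-value too small for a double-precision float underflows to 0.0 when parsed and is then rejected. The plain-Python sketch below assumes the column is parsed as a 64-bit float; `keep_pvalue` is a hypothetical helper, not part of the library.

```python
def keep_pvalue(raw):
    """Return True when raw parses to a float strictly between 0 and 1."""
    try:
        p = float(raw)
    except (TypeError, ValueError):
        return False
    return 0.0 < p < 1.0


raw_values = ["0.05", "1e-300", "1e-400", "0", "1", "NA", None]
kept = [value for value in raw_values if keep_pvalue(value)]

print(kept)
```

Here `1e-300` is kept, while `1e-400` is parsed as 0.0 and dropped together with the literal `0` and `1`.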

Source code in credtools/sumstats.py

def munge_pvalue(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge p-value column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with p-value column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged p-value column.

    Notes
    -----
    This function:

    1. Converts p-values to numeric type
    2. Removes rows with missing p-values
    3. Validates p-values are within acceptable range (exclusive: > 0, < 1)
    4. Converts to appropriate data type

    Invalid p-values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.P,
        col_type=ColType.P,
        min_val=ColRange.P_MIN,
        max_val=ColRange.P_MAX,
        allow_na=ColAllowNA.P,
        exclude_min=True,
        exclude_max=True,
    )

`munge_rsid(df)` ¶

Munge rsID column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with rsID column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged rsID column.

Notes

This function converts the rsID column to the appropriate data type as defined in ColType.RSID.

Source code in credtools/sumstats.py

def munge_rsid(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge rsID column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with rsID column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged rsID column.

    Notes
    -----
    This function converts the rsID column to the appropriate data type
    as defined in ColType.RSID.
    """
    outdf = df.copy()
    outdf[ColName.RSID] = outdf[ColName.RSID].astype(ColType.RSID)
    return outdf

`munge_se(df)` ¶

Munge standard error column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with standard error column.	required

Returns:

Type	Description
`DataFrame`	DataFrame with munged standard error column.

Notes

This function:

1. Converts standard error values to numeric type
2. Removes rows with missing standard error values
3. Validates standard errors are positive (exclusive: > 0)
4. Converts to appropriate data type

Invalid standard error values are removed and logged.

Source code in credtools/sumstats.py

def munge_se(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge standard error column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with standard error column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged standard error column.

    Notes
    -----
    This function:

    1. Converts standard error values to numeric type
    2. Removes rows with missing standard error values
    3. Validates standard errors are positive (exclusive: > 0)
    4. Converts to appropriate data type

    Invalid standard error values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.SE,
        col_type=ColType.SE,
        min_val=ColRange.SE_MIN,
        allow_na=ColAllowNA.SE,
        exclude_min=True,
    )

`rm_col_allna(df)` ¶

Remove columns from the DataFrame that are entirely NA.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The DataFrame from which to remove columns.	required

Returns:

Type	Description
`DataFrame`	A DataFrame with columns that are entirely NA removed.

Notes

This function also converts empty strings to None before checking for all-NA columns. Columns that contain only missing values are dropped to reduce memory usage and improve processing efficiency.
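The same effect can be sketched in two pandas calls. This is an illustrative variant rather than the library code: it replaces empty strings with `np.nan` instead of `None`, and it uses `dropna(axis=1, how="all")` instead of an explicit loop. The column names are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "SNPID": ["a", "b"],
        "INFO": ["", ""],      # only empty strings
        "N": [None, None],     # only missing values
        "P": [0.1, ""],        # partly filled, so it is kept
    }
)

# Treat empty strings as missing, then drop columns with no values at all.
out = toy.replace("", np.nan).dropna(axis=1, how="all")

print(list(out.columns))
```

A column is removed only when every entry is missing, so `P` survives with one missing value.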

Source code in credtools/sumstats.py

def rm_col_allna(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove columns from the DataFrame that are entirely NA.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame from which to remove columns.

    Returns
    -------
    pd.DataFrame
        A DataFrame with columns that are entirely NA removed.

    Notes
    -----
    This function also converts empty strings to None before checking for
    all-NA columns. Columns that contain only missing values are dropped
    to reduce memory usage and improve processing efficiency.
    """
    outdf = df.copy()
    outdf = outdf.replace("", None)
    for col in outdf.columns:
        if outdf[col].isnull().all():
            logger.debug(f"Remove column {col} because it is all NA.")
            outdf.drop(col, axis=1, inplace=True)
    return outdf

`sort_alleles(df)` ¶

Sort EA and NEA in alphabetical order. Change the sign of beta if EA is not sorted as the first allele.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame with allele columns.	required

Returns:

Type	Description
`DataFrame`	DataFrame with sorted allele columns.

Notes

This function ensures consistent allele ordering by:

1. Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
2. Flipping the sign of beta if alleles were swapped
3. Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)

This standardization is important for:

- Consistent merging across datasets
- Meta-analysis compatibility
- LD matrix alignment
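A small worked example of the swap rule, written with NumPy and pandas. It follows the same idea as the source below, but the column labels `EA`, `NEA`, `BETA`, and `EAF` are placeholders for the library's `ColName` constants, and the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "EA": ["G", "A"],
        "NEA": ["A", "C"],
        "BETA": [0.2, -0.1],
        "EAF": [0.3, 0.6],
    }
)

# Sort each allele pair alphabetically, row by row.
sorted_pair = np.sort(toy[["EA", "NEA"]].to_numpy(dtype=object), axis=1)

# A row was swapped when its effect allele is no longer in first place.
swapped = toy["EA"].to_numpy(dtype=object) != sorted_pair[:, 0]

out = toy.copy()
out["BETA"] = np.where(swapped, -toy["BETA"], toy["BETA"])
out["EAF"] = np.where(swapped, 1 - toy["EAF"], toy["EAF"])
out["EA"], out["NEA"] = sorted_pair[:, 0], sorted_pair[:, 1]

print(out)
```

The first row (G/A) is swapped to A/G, so its beta becomes -0.2 and its EAF becomes 0.7; the second row (A/C) is already sorted and is unchanged.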

Source code in credtools/sumstats.py

def sort_alleles(df: pd.DataFrame) -> pd.DataFrame:
    """
    Sort EA and NEA in alphabetical order. Change the sign of beta if EA is not sorted as the first allele.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with allele columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with sorted allele columns.

    Notes
    -----
    This function ensures consistent allele ordering by:

    1. Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
    2. Flipping the sign of beta if alleles were swapped
    3. Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)

    This standardization is important for:
    - Consistent merging across datasets
    - Meta-analysis compatibility
    - LD matrix alignment
    """
    outdf = df.copy()
    outdf[["sorted_a1", "sorted_a2"]] = np.sort(
        outdf[[ColName.EA, ColName.NEA]], axis=1
    )
    outdf[ColName.BETA] = np.where(
        outdf[ColName.EA] == outdf["sorted_a1"],
        outdf[ColName.BETA],
        -outdf[ColName.BETA],
    )
    if ColName.EAF in outdf.columns:
        outdf[ColName.EAF] = np.where(
            outdf[ColName.EA] == outdf["sorted_a1"],
            outdf[ColName.EAF],
            1 - outdf[ColName.EAF],
        )
    outdf[ColName.EA] = outdf["sorted_a1"]
    outdf[ColName.NEA] = outdf["sorted_a2"]
    outdf.drop(columns=["sorted_a1", "sorted_a2"], inplace=True)
    return outdf
