Core Objects

Use these objects when you want to load data yourself and call CREDTOOLS from Python.

Locus Objects

Class for the input data of the fine-mapping analysis.

Locus(popu, cohort, sample_size, sumstats, locus_start, locus_end, ld=None, if_intersect=False)

Locus class to represent a genomic locus with associated summary statistics and linkage disequilibrium (LD) matrix.

Parameters:

Name Type Description Default
popu str

Population code. e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"].

required
cohort str

Cohort name.

required
sample_size int

Sample size.

required
sumstats DataFrame

Summary statistics DataFrame.

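required
locus_start int

Fixed start position for the locus.

required
locus_end int

Fixed end position for the locus.

required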
ld Optional[LDMatrix]

LD matrix, by default None.

None
if_intersect bool

Whether to intersect the LD matrix and summary statistics file, by default False.

False

Attributes:

Name Type Description
original_sumstats DataFrame

The original summary statistics file.

sumstats DataFrame

The processed summary statistics file.

ld LDMatrix

The LD matrix object.

chrom int

Chromosome.

start int

Start position of the locus.

end int

End position of the locus.

n_snps int

Number of SNPs in the locus.

prefix str

The prefix combining population and cohort.

locus_id str

Unique identifier for the locus.

is_matched bool

Whether the LD matrix and summary statistics file are matched.

lambda_s Optional[float]

The lambda_s parameter estimated by the estimate_s_rss function; None if not calculated.

Notes

If no LD matrix is provided, only the ABF method can be used for fine-mapping.

Parameters:

Name Type Description Default
popu str

Population code. e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"].

required
cohort str

Cohort name.

required
sample_size int

Sample size.

required
sumstats DataFrame

Summary statistics DataFrame.

required
locus_start int

Fixed start position for the locus.

required
locus_end int

Fixed end position for the locus.

required
ld Optional[LDMatrix]

LD matrix, by default None.

None
if_intersect bool

Whether to intersect the LD matrix and summary statistics file, by default False.

False
Warnings

If no LD matrix is provided, a warning is logged that only the ABF method can be used.

Source code in credtools/locus.py
def __init__(
    self,
    popu: str,
    cohort: str,
    sample_size: int,
    sumstats: pd.DataFrame,
    locus_start: int,
    locus_end: int,
    ld: Optional[LDMatrix] = None,
    if_intersect: bool = False,
) -> None:
    """
    Initialize the Locus object.

    Parameters
    ----------
    popu : str
        Population code. e.g. "EUR". Choose from ["AFR", "AMR", "EAS", "EUR", "SAS"].
    cohort : str
        Cohort name.
    sample_size : int
        Sample size.
    sumstats : pd.DataFrame
        Summary statistics DataFrame.
    locus_start : int
        Fixed start position for the locus.
    locus_end : int
        Fixed end position for the locus.
    ld : Optional[LDMatrix], optional
        LD matrix, by default None.
    if_intersect : bool, optional
        Whether to intersect the LD matrix and summary statistics file, by default False.

    Warnings
    --------
    If no LD matrix is provided, a warning is logged that only ABF method can be used.
    """
    self.sumstats = sumstats
    self._original_sumstats = self.sumstats.copy()
    self._popu = popu
    self._cohort = cohort
    self._sample_size = sample_size
    self._locus_start = locus_start
    self._locus_end = locus_end
    self.lambda_s = None
    if ld:
        self.ld = ld
        if if_intersect:
            inters = intersect_sumstat_ld(self)
            self.sumstats = inters.sumstats
            self.ld = inters.ld
    else:
        logger.warning("LD matrix and map file not found. Can only run ABF method.")
        self.ld = LDMatrix(pd.DataFrame(), np.array([]))

chrom property

Get the chromosome.

cohort property

Get the cohort name.

end property

Get the end position.

is_matched property

Check if the LD matrix and sumstats file are matched.

locus_id property

Get the locus ID.

n_snps property

Get the number of SNPs.

original_sumstats property

Get the original sumstats file.

popu property

Get the population code.

prefix property

Get the prefix of the locus.

sample_size property

Get the sample size.

start property

Get the start position.

__repr__()

Return a string representation of the Locus object.

Returns:

Type Description
str

String representation of the Locus object.

Source code in credtools/locus.py
def __repr__(self) -> str:
    """
    Return a string representation of the Locus object.

    Returns
    -------
    str
        String representation of the Locus object.
    """
    return f"Locus(popu={self.popu}, cohort={self.cohort}, sample_size={self.sample_size}, chr={self.chrom}, start={self.start}, end={self.end}, sumstats={self.sumstats.shape}, ld={self.ld.r.shape})"

copy()

Copy the Locus object.

Returns:

Type Description
Locus

A copy of the Locus object.

Source code in credtools/locus.py
def copy(self) -> "Locus":
    """
    Copy the Locus object.

    Returns
    -------
    Locus
        A copy of the Locus object.
    """
    new_locus = Locus(
        self.popu,
        self.cohort,
        self.sample_size,
        self.sumstats.copy(),
        self._locus_start,
        self._locus_end,
        self.ld.copy(),
        if_intersect=False,
    )
    new_locus.lambda_s = self.lambda_s
    return new_locus

LocusSet(loci)

LocusSet class to represent a set of genomic loci.

Parameters:

Name Type Description Default
loci List[Locus]

List of Locus objects.

required

Attributes:

Name Type Description
loci List[Locus]

List of Locus objects.

n_loci int

Number of loci.

chrom int

Chromosome number.

start int

Start position of the locus.

end int

End position of the locus.

locus_id str

Unique identifier for the locus.

Raises:

Type Description
ValueError

If the chromosomes of the loci are not the same.

Source code in credtools/locus.py
def __init__(self, loci: List[Locus]) -> None:
    """
    Initialize the LocusSet object.

    Parameters
    ----------
    loci : List[Locus]
        List of Locus objects.
    """
    self.loci = loci

chrom property

Get the chromosome.

Returns:

Type Description
int

Chromosome number.

Raises:

Type Description
ValueError

If the chromosomes of the loci are not the same.

end property

Get the end position.

locus_id property

Get the locus ID.

n_loci property

Get the number of loci.

start property

Get the start position.

__repr__()

Return a string representation of the LocusSet object.

Returns:

Type Description
str

String representation of the LocusSet object.

Source code in credtools/locus.py
def __repr__(self) -> str:
    """
    Return a string representation of the LocusSet object.

    Returns
    -------
    str
        String representation of the LocusSet object.
    """
    return (
        f"LocusSet(\n n_loci={len(self.loci)}, chrom={self.chrom}, start={self.start}, end={self.end}, locus_id={self.locus_id} \n"
        + "\n".join([locus.__repr__() for locus in self.loci])
        + "\n"
        + ")"
    )

copy()

Copy the LocusSet object.

Returns:

Type Description
LocusSet

A copy of the LocusSet object.

Source code in credtools/locus.py
def copy(self) -> "LocusSet":
    """
    Copy the LocusSet object.

    Returns
    -------
    LocusSet
        A copy of the LocusSet object.
    """
    return LocusSet([locus.copy() for locus in self.loci])

check_loci_info(loci_info)

Check and validate loci information DataFrame.

Parameters:

Name Type Description Default
loci_info DataFrame

DataFrame containing loci information.

required

Returns:

Type Description
DataFrame

Validated and type-corrected loci_info DataFrame.

Raises:

Type Description
ValueError

If required columns are missing, data types are incorrect, or locus_id/boundary consistency checks fail.

Notes

This function performs the following checks:

  1. Ensures all required columns are present
  2. Validates and converts data types
  3. Checks that loci with same locus_id have same chr, start, end
  4. Validates chromosome, start, and end values
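
The boundary-consistency check (check 3 above) can be illustrated with a small pandas sketch. The table is made-up data, and the snippet only mirrors the groupby/nunique pattern visible in the source listing; it does not call CREDTOOLS itself.

```python
import pandas as pd

# Made-up loci table: locus "L2" is listed with two different end positions.
loci_info = pd.DataFrame(
    {
        "locus_id": ["L1", "L1", "L2", "L2"],
        "chr": [1, 1, 2, 2],
        "start": [1000, 1000, 5000, 5000],
        "end": [2000, 2000, 6000, 6500],
    }
)

# Count distinct chr/start/end values per locus_id; a count above 1 in any
# of the three columns means the locus has inconsistent boundaries.
boundaries = loci_info.groupby("locus_id")[["chr", "start", "end"]].nunique()
inconsistent = boundaries[(boundaries > 1).any(axis=1)].index.tolist()
print(inconsistent)  # ['L2']
```

In check_loci_info, a non-empty result at this step raises ValueError and names the offending locus_id values.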

Source code in credtools/locus.py
def check_loci_info(loci_info: pd.DataFrame) -> pd.DataFrame:
    """
    Check and validate loci information DataFrame.

    Parameters
    ----------
    loci_info : pd.DataFrame
        DataFrame containing loci information.

    Returns
    -------
    pd.DataFrame
        Validated and type-corrected loci_info DataFrame.

    Raises
    ------
    ValueError
        If required columns are missing, data types are incorrect,
        or locus_id/boundary consistency checks fail.

    Notes
    -----
    This function performs the following checks:
    1. Ensures all required columns are present
    2. Validates and converts data types
    3. Checks that loci with same locus_id have same chr, start, end
    4. Validates chromosome, start, and end values
    """
    loci_info = loci_info.copy()

    # Check for required columns
    required_cols = [
        "prefix",
        "popu",
        "cohort",
        "sample_size",
        "chr",
        "start",
        "end",
        "locus_id",
    ]
    missing_cols = [col for col in required_cols if col not in loci_info.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    # Type checking and conversion
    try:
        # Convert numeric columns
        loci_info["sample_size"] = loci_info["sample_size"].astype(int)
        loci_info["chr"] = loci_info["chr"].astype(int)
        loci_info["start"] = loci_info["start"].astype(int)
        loci_info["end"] = loci_info["end"].astype(int)

        # Ensure string columns are strings
        loci_info["prefix"] = loci_info["prefix"].astype(str)
        loci_info["popu"] = loci_info["popu"].astype(str)
        loci_info["cohort"] = loci_info["cohort"].astype(str)
        loci_info["locus_id"] = loci_info["locus_id"].astype(str)

    except (ValueError, TypeError) as e:
        raise ValueError(f"Failed to convert data types: {e}")

    # Validate values
    if (loci_info["sample_size"] <= 0).any():
        raise ValueError("Sample size must be positive")

    if (loci_info["chr"] <= 0).any() or (loci_info["chr"] > 25).any():
        raise ValueError("Chromosome must be between 1 and 25")

    if (loci_info["start"] <= 0).any():
        raise ValueError("Start position must be positive")

    if (loci_info["end"] <= loci_info["start"]).any():
        raise ValueError("End position must be greater than start position")

    # Check for duplicates in popu+cohort+locus_id combination
    if loci_info.duplicated(subset=["popu", "cohort", "locus_id"]).any():
        raise ValueError("Each popu+cohort+locus_id combination must be unique")

    # Check consistency: same locus_id must have same chr, start, end
    locus_boundaries = loci_info.groupby("locus_id")[["chr", "start", "end"]].nunique()
    inconsistent_loci = locus_boundaries[(locus_boundaries > 1).any(axis=1)]

    if not inconsistent_loci.empty:
        raise ValueError(
            f"Inconsistent boundaries for locus_id(s): {inconsistent_loci.index.tolist()}. "
            "Each locus_id must have consistent chr, start, end values across all rows."
        )

    return loci_info

intersect_sumstat_ld(locus)

Intersect the Variant IDs in the LD matrix and the sumstats file.

Parameters:

Name Type Description Default
locus Locus

Locus object containing LD matrix and summary statistics.

required

Returns:

Type Description
Locus

Locus object containing the intersected LD matrix and sumstats file.

Raises:

Type Description
ValueError

If LD matrix not found or no common Variant IDs found between the LD matrix and the sumstats file.

Warnings

If only a few common Variant IDs are found (≤ 10), a warning is logged.

Notes

This function performs the following operations:

  1. Checks if LD matrix and summary statistics are already matched
  2. Finds common SNP IDs between LD matrix and summary statistics
  3. Subsets both datasets to common variants
  4. Reorders data to maintain consistency
  5. Returns a new Locus object with intersected data
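
Steps 2 to 4 above amount to selecting the same variants, in the same order, from both the summary statistics and the LD matrix. The following is a minimal NumPy sketch with made-up variant IDs and correlations; plain lists and arrays stand in for the Locus and LDMatrix objects, so it is an illustration of the idea rather than the library's implementation.

```python
import numpy as np

# Made-up inputs: variant IDs in LD-map order, a matching 3x3 LD matrix,
# and summary-statistics IDs already sorted by position.
ld_ids = ["1-100-A-G", "1-200-C-T", "1-300-A-C"]
r = np.array(
    [
        [1.0, 0.2, 0.5],
        [0.2, 1.0, 0.1],
        [0.5, 0.1, 1.0],
    ]
)
sumstat_ids = ["1-100-A-G", "1-300-A-C", "1-400-G-T"]

# Keep summary-statistics variants that also appear in the LD map.
ld_set = set(ld_ids)
common = [v for v in sumstat_ids if v in ld_set]
if not common:
    raise ValueError("No common variant IDs found.")

# Look up each common variant's row/column in the LD matrix, then subset
# rows and columns with the same index so both stay aligned.
position = {v: i for i, v in enumerate(ld_ids)}
idx = np.array([position[v] for v in common])
r_common = r[idx, :][:, idx]
print(common)
print(r_common)
```
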
Source code in credtools/locus.py
def intersect_sumstat_ld(locus: Locus) -> Locus:
    """
    Intersect the Variant IDs in the LD matrix and the sumstats file.

    Parameters
    ----------
    locus : Locus
        Locus object containing LD matrix and summary statistics.

    Returns
    -------
    Locus
        Locus object containing the intersected LD matrix and sumstats file.

    Raises
    ------
    ValueError
        If LD matrix not found or no common Variant IDs found between the LD matrix and the sumstats file.

    Warnings
    --------
    If only a few common Variant IDs are found (≤ 10), a warning is logged.

    Notes
    -----
    This function performs the following operations:

    1. Checks if LD matrix and summary statistics are already matched
    2. Finds common SNP IDs between LD matrix and summary statistics
    3. Subsets both datasets to common variants
    4. Reorders data to maintain consistency
    5. Returns a new Locus object with intersected data
    """
    if locus.ld is None:
        raise ValueError("LD matrix not found.")
    if locus.is_matched:
        logger.info("The LD matrix and sumstats file are matched.")
        return locus
    ldmap = locus.ld.map.copy()
    r = locus.ld.r.copy()
    sumstats = locus.sumstats.copy()
    sumstats = sumstats.sort_values([ColName.CHR, ColName.BP], ignore_index=True)
    intersec_sumstats = sumstats[
        sumstats[ColName.SNPID].isin(ldmap[ColName.SNPID])
    ].copy()
    intersec_variants = intersec_sumstats[ColName.SNPID].to_numpy()
    if len(intersec_variants) == 0:
        raise ValueError(
            f"No common Variant IDs found between the LD matrix and the sumstats file for locus {locus.locus_id}."
        )
    elif len(intersec_variants) <= 10:
        logger.warning(
            f"Only a few common Variant IDs found between the LD matrix and the sumstats file(<= 10) for locus {locus.locus_id}."
        )
    ldmap["idx"] = ldmap.index
    ldmap.set_index(ColName.SNPID, inplace=True, drop=False)
    ldmap = ldmap.loc[intersec_variants].copy()
    intersec_index = ldmap["idx"].to_numpy()
    r = r[intersec_index, :][:, intersec_index]
    intersec_sumstats.reset_index(drop=True, inplace=True)
    ldmap.drop("idx", axis=1, inplace=True)
    ldmap = ldmap.reset_index(drop=True)
    intersec_ld = LDMatrix(ldmap, r)
    logger.info(
        "Intersected the Variant IDs in the LD matrix and the sumstats file. "
        f"Number of common Variant IDs: {len(intersec_index)}"
    )
    return Locus(
        locus.popu,
        locus.cohort,
        locus.sample_size,
        intersec_sumstats,
        locus._locus_start,
        locus._locus_end,
        intersec_ld,
        if_intersect=False,
    )

load_locus(prefix, popu, cohort, sample_size, locus_start, locus_end, if_intersect=False, calculate_lambda_s=False, **kwargs)

Load the input data of the fine-mapping analysis.

Parameters:

Name Type Description Default
prefix str

Prefix of the input files.

required
popu str

Population of the input data.

required
cohort str

Cohort of the input data.

required
sample_size int

Sample size of the input data.

required
locus_start int

Fixed start position for the locus.

required
locus_end int

Fixed end position for the locus.

required
if_intersect bool

Whether to intersect the input data with the LD matrix, by default False.

False
calculate_lambda_s bool

Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.

False
**kwargs Any

Additional keyword arguments passed to loading functions.

{}

Returns:

Type Description
Locus

Locus object containing the input data.

Raises:

Type Description
ValueError

If the required input files are not found.

Notes

The function looks for files with the following patterns:

  • Summary statistics: {prefix}.sumstat or {prefix}.sumstats.gz
  • LD matrix: {prefix}.ld or {prefix}.ld.npz
  • LD map: {prefix}.ldmap or {prefix}.ldmap.gz

All three files are required; load_locus raises ValueError if any of them is missing.
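
The lookup order described above (the first listed name, then the alternative) can be sketched with a small helper. `first_existing` is a hypothetical name used only for illustration; it is not part of CREDTOOLS, and unlike load_locus it returns None instead of raising.

```python
import os
import tempfile
from typing import Optional, Sequence


def first_existing(prefix: str, suffixes: Sequence[str]) -> Optional[str]:
    """Return the first path prefix+suffix that exists, or None."""
    for suffix in suffixes:
        path = f"{prefix}{suffix}"
        if os.path.exists(path):
            return path
    return None


# Demonstrate with a temporary directory holding only a compressed sumstats file.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "EUR_study1")
    open(f"{prefix}.sumstats.gz", "w").close()

    sumstats_path = first_existing(prefix, [".sumstat", ".sumstats.gz"])
    ld_path = first_existing(prefix, [".ld", ".ld.npz"])
    found_sumstats = os.path.basename(sumstats_path)
    ld_missing = ld_path is None
    print(found_sumstats, ld_missing)  # EUR_study1.sumstats.gz True
```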

Examples:

>>> locus = load_locus('EUR_study1', 'EUR', 'study1', 50000, locus_start=1000000, locus_end=2000000)
>>> print(f"Loaded locus with {locus.n_snps} SNPs")
Loaded locus with 10000 SNPs
Source code in credtools/locus.py
def load_locus(
    prefix: str,
    popu: str,
    cohort: str,
    sample_size: int,
    locus_start: int,
    locus_end: int,
    if_intersect: bool = False,
    calculate_lambda_s: bool = False,
    **kwargs: Any,
) -> Locus:
    """
    Load the input data of the fine-mapping analysis.

    Parameters
    ----------
    prefix : str
        Prefix of the input files.
    popu : str
        Population of the input data.
    cohort : str
        Cohort of the input data.
    sample_size : int
        Sample size of the input data.
    locus_start : int
        Fixed start position for the locus.
    locus_end : int
        Fixed end position for the locus.
    if_intersect : bool, optional
        Whether to intersect the input data with the LD matrix, by default False.
    calculate_lambda_s : bool, optional
        Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.
    **kwargs : Any
        Additional keyword arguments passed to loading functions.

    Returns
    -------
    Locus
        Locus object containing the input data.

    Raises
    ------
    ValueError
        If the required input files are not found.

    Notes
    -----
    The function looks for files with the following patterns:

    - Summary statistics: {prefix}.sumstat or {prefix}.sumstats.gz
    - LD matrix: {prefix}.ld or {prefix}.ld.npz
    - LD map: {prefix}.ldmap or {prefix}.ldmap.gz

    All files are required for proper functioning.

    Examples
    --------
    >>> locus = load_locus('EUR_study1', 'EUR', 'study1', 50000, locus_start=1000000, locus_end=2000000)
    >>> print(f"Loaded locus with {locus.n_snps} SNPs")
    Loaded locus with 10000 SNPs
    """
    if os.path.exists(f"{prefix}.sumstat"):
        sumstats_path = f"{prefix}.sumstat"
    elif os.path.exists(f"{prefix}.sumstats.gz"):
        sumstats_path = f"{prefix}.sumstats.gz"
    else:
        raise ValueError("Sumstats file not found.")

    sumstats = load_sumstats(sumstats_path, if_sort_alleles=True, **kwargs)
    if os.path.exists(f"{prefix}.ld"):
        ld_path = f"{prefix}.ld"
    elif os.path.exists(f"{prefix}.ld.npz"):
        ld_path = f"{prefix}.ld.npz"
    else:
        raise ValueError("LD matrix file not found.")
    if os.path.exists(f"{prefix}.ldmap"):
        ldmap_path = f"{prefix}.ldmap"
    elif os.path.exists(f"{prefix}.ldmap.gz"):
        ldmap_path = f"{prefix}.ldmap.gz"
    else:
        raise ValueError("LD map file not found.")
    ld = load_ld(ld_path, ldmap_path, if_sort_alleles=True, **kwargs)

    locus = Locus(
        popu,
        cohort,
        sample_size,
        sumstats,
        locus_start,
        locus_end,
        ld=ld,
        if_intersect=if_intersect,
    )

    if calculate_lambda_s:
        try:
            # Import here to avoid circular imports
            from credtools.qc import estimate_s_rss

            locus.lambda_s = estimate_s_rss(locus)
            logger.info(
                f"Calculated lambda_s for locus {locus.locus_id}: {locus.lambda_s}"
            )
        except Exception as e:
            logger.warning(
                f"Failed to calculate lambda_s for locus {locus.locus_id}: {e}"
            )
            locus.lambda_s = None

    return locus

load_locus_set(locus_info, if_intersect=False, calculate_lambda_s=False, **kwargs)

Load the input data of the fine-mapping analysis for multiple loci.

Parameters:

Name Type Description Default
locus_info DataFrame

DataFrame containing the locus information with required columns: ['prefix', 'popu', 'cohort', 'sample_size', 'chr', 'start', 'end', 'locus_id'].

required
if_intersect bool

Whether to intersect the input data with the LD matrix, by default False.

False
calculate_lambda_s bool

Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.

False
**kwargs Any

Additional keyword arguments passed to load_locus function.

{}

Returns:

Type Description
LocusSet

LocusSet object containing the input data.

Raises:

Type Description
ValueError

If required columns are missing or if the combination of popu and cohort is not unique.

Notes

The locus_info DataFrame must contain the following columns:

  • prefix: File prefix for each locus
  • popu: Population code
  • cohort: Cohort name
  • sample_size: Sample size for the cohort
  • chr: Chromosome number
  • start: Start position of the locus
  • end: End position of the locus
  • locus_id: Locus identifier

All rows must have the same chr, start, end, locus_id values (representing the same locus).
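
As an illustration of these requirements, the sketch below builds a two-cohort table for one locus and checks the single-locus constraint with plain pandas. The prefixes, coordinates, and locus_id are made-up values, and the checks are a stand-alone approximation of what load_locus_set enforces, not a call into CREDTOOLS.

```python
import pandas as pd

# Made-up table: two cohorts covering the same locus on chromosome 1.
locus_info = pd.DataFrame(
    {
        "prefix": ["EUR_study1", "EAS_study2"],
        "popu": ["EUR", "EAS"],
        "cohort": ["study1", "study2"],
        "sample_size": [50000, 30000],
        "chr": [1, 1],
        "start": [1000000, 1000000],
        "end": [2000000, 2000000],
        "locus_id": ["locus1", "locus1"],
    }
)

# Every row must describe the same locus ...
same_locus = all(
    locus_info[col].nunique() == 1 for col in ["chr", "start", "end", "locus_id"]
)
# ... and each popu+cohort pair may appear only once.
unique_cohorts = not locus_info.duplicated(subset=["popu", "cohort"]).any()
print(same_locus, unique_cohorts)  # True True
```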

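Examples:

>>> locus_info = pd.DataFrame({
...     'prefix': ['EUR_study1', 'EAS_study2'],
...     'popu': ['EUR', 'EAS'],
...     'cohort': ['study1', 'study2'],
...     'sample_size': [50000, 30000],
...     'chr': [1, 1],
...     'start': [1000000, 1000000],
...     'end': [2000000, 2000000],
...     'locus_id': ['locus1', 'locus1']
... })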
>>> locus_set = load_locus_set(locus_info)
>>> print(f"Loaded {locus_set.n_loci} loci")
Loaded 2 loci
Source code in credtools/locus.py
def load_locus_set(
    locus_info: pd.DataFrame,
    if_intersect: bool = False,
    calculate_lambda_s: bool = False,
    **kwargs: Any,
) -> LocusSet:
    """
    Load the input data of the fine-mapping analysis for multiple loci.

    Parameters
    ----------
    locus_info : pd.DataFrame
        DataFrame containing the locus information with required columns:
        ['prefix', 'popu', 'cohort', 'sample_size', 'chr', 'start', 'end', 'locus_id'].
    if_intersect : bool, optional
        Whether to intersect the input data with the LD matrix, by default False.
    calculate_lambda_s : bool, optional
        Whether to calculate lambda_s parameter using estimate_s_rss function, by default False.
    **kwargs : Any
        Additional keyword arguments passed to load_locus function.

    Returns
    -------
    LocusSet
        LocusSet object containing the input data.

    Raises
    ------
    ValueError
        If required columns are missing or if the combination of popu and cohort is not unique.

    Notes
    -----
    The locus_info DataFrame must contain the following columns:

    - prefix: File prefix for each locus
    - popu: Population code
    - cohort: Cohort name
    - sample_size: Sample size for the cohort
    - chr: Chromosome number
    - start: Start position of the locus
    - end: End position of the locus
    - locus_id: Locus identifier

    All rows must have the same chr, start, end, locus_id values (representing the same locus).

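    Examples
    --------
    >>> locus_info = pd.DataFrame({
    ...     'prefix': ['EUR_study1', 'EAS_study2'],
    ...     'popu': ['EUR', 'EAS'],
    ...     'cohort': ['study1', 'study2'],
    ...     'sample_size': [50000, 30000],
    ...     'chr': [1, 1],
    ...     'start': [1000000, 1000000],
    ...     'end': [2000000, 2000000],
    ...     'locus_id': ['locus1', 'locus1']
    ... })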
    >>> locus_set = load_locus_set(locus_info)
    >>> print(f"Loaded {locus_set.n_loci} loci")
    Loaded 2 loci
    """
    # Check and validate the locus_info DataFrame
    locus_info = check_loci_info(locus_info)

    # Check that all rows have the same chr, start, end (same locus)
    if len(locus_info["chr"].unique()) > 1:
        raise ValueError("All rows must have the same chromosome")
    if len(locus_info["start"].unique()) > 1:
        raise ValueError("All rows must have the same start position")
    if len(locus_info["end"].unique()) > 1:
        raise ValueError("All rows must have the same end position")
    if len(locus_info["locus_id"].unique()) > 1:
        raise ValueError("All rows must have the same locus_id")

    # Additional check for load_locus_set: popu+cohort must be unique within this single locus
    if locus_info.duplicated(subset=["popu", "cohort"]).any():
        raise ValueError(
            "Each popu+cohort combination must be unique within a single locus"
        )

    loci = []
    for i, row in locus_info.iterrows():
        loci.append(
            load_locus(
                row["prefix"],
                row["popu"],
                row["cohort"],
                row["sample_size"],
                int(row["start"]),
                int(row["end"]),
                if_intersect,
                calculate_lambda_s,
                **kwargs,
            )
        )
    return LocusSet(loci)

LD Matrices

Functions for reading and converting lower triangle matrices.

LDMatrix(map_df, r)

Class to store the LD matrix and the corresponding Variant IDs.

Parameters:

Name Type Description Default
map_df DataFrame

DataFrame containing the Variant IDs.

required
r ndarray

LD matrix.

required

Attributes:

Name Type Description
map DataFrame

DataFrame containing the Variant IDs.

r ndarray

LD matrix.

Raises:

Type Description
ValueError

If the number of rows in the map file does not match the number of rows in the LD matrix.

Source code in credtools/ldmatrix.py
def __init__(self, map_df: pd.DataFrame, r: np.ndarray) -> None:
    """
    Initialize the LDMatrix object.

    Parameters
    ----------
    map_df : pd.DataFrame
        DataFrame containing the Variant IDs.
    r : np.ndarray
        LD matrix.

    Raises
    ------
    ValueError
        If the number of rows in the map file does not match the number of rows in the LD matrix.
    """
    self.map = map_df
    self.r = r
    self.__check_length()

__check_length()

Check if the number of rows in the map file matches the number of rows in the LD matrix.

Raises:

Type Description
ValueError

If the number of rows in the map file does not match the number of rows in the LD matrix.

Source code in credtools/ldmatrix.py
def __check_length(self) -> None:
    """
    Check if the number of rows in the map file matches the number of rows in the LD matrix.

    Raises
    ------
    ValueError
        If the number of rows in the map file does not match the number of rows in the LD matrix.
    """
    if len(self.map) != len(self.r):
        raise ValueError(
            "The number of rows in the map file does not match the number of rows in the LD matrix."
        )

__repr__()

Return a string representation of the LDMatrix object.

Returns:

Type Description
str

String representation showing the shapes of map and r.

Source code in credtools/ldmatrix.py
def __repr__(self) -> str:
    """
    Return a string representation of the LDMatrix object.

    Returns
    -------
    str
        String representation showing the shapes of map and r.
    """
    return f"LDMatrix(map={self.map.shape}, r={self.r.shape})"

copy()

Return a copy of the LDMatrix object.

Returns:

Type Description
LDMatrix

A copy of the LDMatrix object.

Source code in credtools/ldmatrix.py
def copy(self) -> "LDMatrix":
    """
    Return a copy of the LDMatrix object.

    Returns
    -------
    LDMatrix
        A copy of the LDMatrix object.
    """
    return LDMatrix(self.map.copy(), self.r.copy())

load_ld(ld_path, map_path, delimiter='\t', if_sort_alleles=True)

Read LD matrices and Variant IDs from files. Pair each matrix with its corresponding Variant IDs.

Parameters:

Name Type Description Default
ld_path str

Path to the input text file containing the lower triangle matrix or .npz file.

required
map_path str

Path to the input text file containing the Variant IDs.

required
delimiter str

Delimiter used in the input file, by default "\t".

'\t'
if_sort_alleles bool

Sort alleles in the LD map in alphabetical order and change the sign of the LD matrix if the alleles are swapped, by default True.

True

Returns:

Type Description
LDMatrix

Object containing the LD matrix and the Variant IDs.

Raises:

Type Description
ValueError

If the number of variants in the map file does not match the number of rows in the LD matrix.

Notes

Future enhancements planned:

  • Support for npz files (partially implemented)
  • Support for plink bin4 format
  • Support for ldstore bcor format

The function validates that the LD matrix and map file have consistent dimensions and optionally sorts alleles for consistent representation.
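
Two ideas in this description can be illustrated in isolation: expanding a lower-triangle matrix into a full symmetric one, and flipping the sign of correlations when a variant's alleles are swapped. The NumPy sketch below uses made-up values and reflects an assumption about the general technique; the internals of load_ld_matrix and sort_alleles are not shown in this section and may differ.

```python
import numpy as np

# Made-up lower-triangle LD values (upper triangle filled with zeros).
lower = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.3, 1.0, 0.0],
        [-0.2, 0.6, 1.0],
    ]
)

# Mirror the lower triangle into the upper triangle; subtract the diagonal
# once because it is counted in both `lower` and `lower.T`.
full = lower + lower.T - np.diag(np.diag(lower))

# Suppose sorting alleles swapped A1/A2 for the second variant only.
swapped = np.array([False, True, False])
sign = np.where(swapped, -1.0, 1.0)

# Flip the sign of that variant's row and column. Its diagonal entry is
# multiplied by -1 twice, so it stays 1.
flipped = full * np.outer(sign, sign)
print(flipped)
```

The sign flip follows from the fact that swapping the effect allele of one variant negates its correlation with every other variant while leaving its self-correlation unchanged.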

Examples:

>>> ld_matrix = load_ld('data.ld', 'data.ldmap')
>>> print(f"Loaded LD matrix with {ld_matrix.r.shape[0]} variants")
Loaded LD matrix with 1000 variants
Source code in credtools/ldmatrix.py
def load_ld(
    ld_path: str, map_path: str, delimiter: str = "\t", if_sort_alleles: bool = True
) -> LDMatrix:
    r"""
    Read LD matrices and Variant IDs from files. Pair each matrix with its corresponding Variant IDs.

    Parameters
    ----------
    ld_path : str
        Path to the input text file containing the lower triangle matrix or .npz file.
    map_path : str
        Path to the input text file containing the Variant IDs.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".
    if_sort_alleles : bool, optional
        Sort alleles in the LD map in alphabetical order and change the sign of the
        LD matrix if the alleles are swapped, by default True.

    Returns
    -------
    LDMatrix
        Object containing the LD matrix and the Variant IDs.

    Raises
    ------
    ValueError
        If the number of variants in the map file does not match the number of rows in the LD matrix.

    Notes
    -----
    Future enhancements planned:

    - Support for npz files (partially implemented)
    - Support for plink bin4 format
    - Support for ldstore bcor format

    The function validates that the LD matrix and map file have consistent dimensions
    and optionally sorts alleles for consistent representation.

    Examples
    --------
    >>> ld_matrix = load_ld('data.ld', 'data.ldmap')
    >>> print(f"Loaded LD matrix with {ld_matrix.r.shape[0]} variants")
    Loaded LD matrix with 1000 variants
    """
    ld_df = load_ld_matrix(ld_path, delimiter)
    logger.info(f"Loaded LD matrix with shape {ld_df.shape} from '{ld_path}'.")
    map_df = load_ld_map(map_path, delimiter)
    logger.info(f"Loaded map file with shape {map_df.shape} from '{map_path}'.")
    if ld_df.shape[0] != map_df.shape[0]:
        raise ValueError(
            "The number of variants in the map file does not match the number of rows in the LD matrix.\n"
            f"Number of variants in the map file: {map_df.shape[0]}, number of rows in the LD matrix: {ld_df.shape[0]}"
            f"ld_path: {ld_path}, map_path: {map_path}"
        )
    ld = LDMatrix(map_df, ld_df)
    if if_sort_alleles:
        ld = sort_alleles(ld)

    return ld

load_ld_map(map_path, delimiter='\t')

Read Variant IDs from a file.

Parameters:

Name Type Description Default
map_path str

Path to the input text file containing the Variant IDs.

required
delimiter str

Delimiter used in the input file, by default "\t".

'\t'

Returns:

Type Description
DataFrame

DataFrame containing the Variant IDs with columns CHR, BP, A1, A2, and SNPID.

Raises:

Type Description
ValueError

If the input file is empty or does not contain the required columns.

Notes

This function assumes that the input file contains the required columns:

  • Chromosome (CHR)
  • Base pair position (BP)
  • Allele 1 (A1)
  • Allele 2 (A2)

The function performs data cleaning including:

  • Converting chromosome and position to appropriate types
  • Validating alleles are valid DNA bases (A, C, G, T)
  • Removing variants where A1 == A2
  • Creating unique SNPID identifiers
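
The cleaning steps can be sketched in plain Python. The `CHR-BP-allele-allele` identifier format with alphabetically sorted alleles is inferred from the example output in this section (for instance `2-3000-C-T` for alleles T and C); `make_snpid` and `clean_map_rows` are hypothetical helper names, not CREDTOOLS functions.

```python
from typing import Dict, List

VALID_BASES = set("ACGT")


def make_snpid(chrom: int, bp: int, a1: str, a2: str) -> str:
    """Build an ID from chromosome, position, and alphabetically sorted alleles."""
    first, second = sorted([a1, a2])
    return f"{chrom}-{bp}-{first}-{second}"


def clean_map_rows(rows: List[Dict]) -> List[Dict]:
    """Drop rows with invalid or identical alleles and attach a SNPID."""
    cleaned = []
    for row in rows:
        a1, a2 = row["A1"].upper(), row["A2"].upper()
        # Keep only alleles made of A, C, G, T, and require A1 != A2.
        if not (set(a1) <= VALID_BASES and set(a2) <= VALID_BASES) or a1 == a2:
            continue
        snpid = make_snpid(int(row["CHR"]), int(row["BP"]), a1, a2)
        cleaned.append({"SNPID": snpid, **row})
    return cleaned


rows = [
    {"CHR": 1, "BP": 1000, "A1": "A", "A2": "G"},
    {"CHR": 2, "BP": 3000, "A1": "T", "A2": "C"},
    {"CHR": 2, "BP": 4000, "A1": "A", "A2": "A"},  # dropped: A1 == A2
    {"CHR": 2, "BP": 5000, "A1": "N", "A2": "C"},  # dropped: invalid base
]
snpids = [row["SNPID"] for row in clean_map_rows(rows)]
print(snpids)  # ['1-1000-A-G', '2-3000-C-T']
```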

Examples:

>>> # Create sample map file
>>> contents = "CHR\tBP\tA1\tA2\n1\t1000\tA\tG\n1\t2000\tC\tT\n2\t3000\tT\tC"
>>> with open('map.txt', 'w') as file:
...     file.write(contents)
>>> df = load_ld_map('map.txt')
>>> print(df)
    SNPID       CHR    BP A1 A2
0   1-1000-A-G    1  1000  A  G
1   1-2000-C-T    1  2000  C  T
2   2-3000-C-T    2  3000  T  C
Source code in credtools/ldmatrix.py
def load_ld_map(map_path: str, delimiter: str = "\t") -> pd.DataFrame:
    r"""
    Read Variant IDs from a file.

    Parameters
    ----------
    map_path : str
        Path to the input text file containing the Variant IDs.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    pd.DataFrame
        DataFrame containing the Variant IDs with columns CHR, BP, A1, A2, and SNPID.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain the required columns.

    Notes
    -----
    This function assumes that the input file contains the required columns:

    - Chromosome (CHR)
    - Base pair position (BP)
    - Allele 1 (A1)
    - Allele 2 (A2)

    The function performs data cleaning including:

    - Converting chromosome and position to appropriate types
    - Validating alleles are valid DNA bases (A, C, G, T)
    - Removing variants where A1 == A2
    - Creating unique SNPID identifiers

    Examples
    --------
    >>> # Create sample map file
    >>> contents = "CHR\tBP\tA1\tA2\n1\t1000\tA\tG\n1\t2000\tC\tT\n2\t3000\tT\tC"
    >>> with open('map.txt', 'w') as file:
    ...     file.write(contents)
    >>> df = load_ld_map('map.txt')
    >>> print(df)
        SNPID       CHR    BP A1 A2
    0   1-1000-A-G    1  1000  A  G
    1   1-2000-C-T    1  2000  C  T
    2   2-3000-C-T    2  3000  T  C
    """
    # TODO: use REF/ALT instead of A1/A2
    map_df = pd.read_csv(map_path, sep=delimiter)
    missing_cols = [col for col in ColName.map_cols if col not in map_df.columns]
    if missing_cols:
        raise ValueError(f"Missing columns in the input file: {missing_cols}")
    outdf = munge_chr(map_df)
    outdf = munge_bp(outdf)
    for col in [ColName.A1, ColName.A2]:
        pre_n = outdf.shape[0]
        # copy so the column assignment below does not write to a view of the filtered frame
        outdf = outdf[outdf[col].notnull()].copy()
        outdf[col] = outdf[col].astype(str).str.upper()
        outdf = outdf[outdf[col].str.match(r"^[ACGT]+$")]
        after_n = outdf.shape[0]
        logger.debug(f"Remove {pre_n - after_n} rows because of invalid {col}.")
    outdf = outdf[outdf[ColName.A1] != outdf[ColName.A2]]
    outdf = make_SNPID_unique(
        outdf, col_ea=ColName.A1, col_nea=ColName.A2, remove_duplicates=False
    )
    outdf.reset_index(drop=True, inplace=True)
    # TODO: check if allele frequency is available
    return outdf

load_ld_matrix(file_path, delimiter='\t')

Convert a lower triangle matrix from a file to a symmetric square matrix.

Parameters:

Name Type Description Default
file_path str

Path to the input text file containing the lower triangle matrix.

required
delimiter str

Delimiter used in the input file, by default "\t".

'\t'

Returns:

Type Description
ndarray

Symmetric square matrix with diagonal filled with 1.

Raises:

Type Description
ValueError

If the input file is empty or does not contain a valid lower triangle matrix.

FileNotFoundError

If the specified file does not exist.

Notes

This function assumes that the input file contains a valid lower triangle matrix with each row on a new line and elements separated by the specified delimiter. For .npz files, it loads the first array key in the file.
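
The symmetrization step can be sketched with NumPy as follows. This is a minimal illustration rather than the library's implementation, and the helper name `symmetrize_lower` is hypothetical.

```python
import numpy as np


def symmetrize_lower(lower: np.ndarray) -> np.ndarray:
    """Mirror a lower-triangular matrix into a symmetric one (illustrative helper)."""
    # Adding the transpose fills the upper triangle; the diagonal is doubled here...
    full = lower + lower.T
    # ...so it is reset to 1, the correlation of each variant with itself.
    np.fill_diagonal(full, 1.0)
    # Missing LD values are replaced with 0 rather than propagated as NaN.
    return np.nan_to_num(full.astype(np.float32), nan=0.0)


lower = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.1, 1.0, 0.0],
        [np.nan, 0.4, 1.0],
    ]
)
full = symmetrize_lower(lower)
```

Replacing NaN with 0 treats a missing pair as uncorrelated, which is a modelling choice worth keeping in mind when the input LD file has gaps.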

Examples:

>>> # Assuming 'lower_triangle.txt' contains (values separated by tabs):
>>> # 1.0
>>> # 0.1 1.0
>>> # 0.2 0.4 1.0
>>> # 0.3 0.5 0.6 1.0
>>> matrix = load_ld_matrix('lower_triangle.txt')
>>> print(matrix)
[[1.  0.1 0.2 0.3]
 [0.1 1.  0.4 0.5]
 [0.2 0.4 1.  0.6]
 [0.3 0.5 0.6 1. ]]
Source code in credtools/ldmatrix.py
def load_ld_matrix(file_path: str, delimiter: str = "\t") -> np.ndarray:
    r"""
    Convert a lower triangle matrix from a file to a symmetric square matrix.

    Parameters
    ----------
    file_path : str
        Path to the input text file containing the lower triangle matrix.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    np.ndarray
        Symmetric square matrix with diagonal filled with 1.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain a valid lower triangle matrix.
    FileNotFoundError
        If the specified file does not exist.

    Notes
    -----
    This function assumes that the input file contains a valid lower triangle matrix
    with each row on a new line and elements separated by the specified delimiter.
    For .npz files, it loads the first array key in the file.

    Examples
    --------
    >>> # Assuming 'lower_triangle.txt' contains (values separated by tabs):
    >>> # 1.0
    >>> # 0.1 1.0
    >>> # 0.2 0.4 1.0
    >>> # 0.3 0.5 0.6 1.0
    >>> matrix = load_ld_matrix('lower_triangle.txt')
    >>> print(matrix)
    [[1.  0.1 0.2 0.3]
     [0.1 1.  0.4 0.5]
     [0.2 0.4 1.  0.6]
     [0.3 0.5 0.6 1. ]]
    """
    if file_path.endswith(".npz"):
        with np.load(file_path) as data:
            ld_file_key = data.files[0]
            matrix = data[ld_file_key].astype(np.float32)
        return np.nan_to_num(matrix, nan=0.0)
    lower_triangle = read_lower_triangle(file_path, delimiter)

    # Create the symmetric matrix
    symmetric_matrix = lower_triangle + lower_triangle.T

    # Fill the diagonal with 1
    np.fill_diagonal(symmetric_matrix, 1)

    # convert to float32
    symmetric_matrix = symmetric_matrix.astype(np.float32)

    # Replace any NaNs with 0 to avoid propagating missing LD values
    symmetric_matrix = np.nan_to_num(symmetric_matrix, nan=0.0)
    return symmetric_matrix

read_lower_triangle(file_path, delimiter='\t')

Read a lower triangle matrix from a file.

Parameters:

Name Type Description Default
file_path str

Path to the input text file containing the lower triangle matrix.

required
delimiter str

Delimiter used in the input file, by default "\t".

'\t'

Returns:

Type Description
ndarray

Lower triangle matrix.

Raises:

Type Description
ValueError

If the input file is empty or does not contain a valid lower triangle matrix.

FileNotFoundError

If the specified file does not exist.

Notes

This function reads a lower triangular matrix in which row i (1-based) contains exactly i elements: the entries from the first column up to and including the diagonal. Blank lines are skipped.
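
The row-length rule can be sketched in plain Python. The function below parses text held in memory instead of a file, and its name `parse_lower_triangle` is hypothetical; it is an illustration under those assumptions, not the library code.

```python
from typing import List


def parse_lower_triangle(text: str, delimiter: str = "\t") -> List[List[float]]:
    """Parse lower-triangle text into a zero-padded square list of lists (illustrative)."""
    rows = [
        [float(x) for x in line.strip().split(delimiter)]
        for line in text.splitlines()
        if line.strip()
    ]
    if not rows:
        raise ValueError("The input is empty.")
    n = len(rows)
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        # Row i (0-based) must hold exactly i + 1 values: columns 0..i.
        if len(row) != i + 1:
            raise ValueError(f"Row {i + 1}: expected {i + 1} values, got {len(row)}.")
        square[i][: len(row)] = row
    return square


text = "1.0\n0.1\t1.0\n0.2\t0.4\t1.0\n"
square = parse_lower_triangle(text)
```

A file saved with spaces instead of tabs would fail the `float` conversion under the default delimiter, so the `delimiter` argument has to match the file.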

Source code in credtools/ldmatrix.py
def read_lower_triangle(file_path: str, delimiter: str = "\t") -> np.ndarray:
    r"""
    Read a lower triangle matrix from a file.

    Parameters
    ----------
    file_path : str
        Path to the input text file containing the lower triangle matrix.
    delimiter : str, optional
        Delimiter used in the input file, by default "\t".

    Returns
    -------
    np.ndarray
        Lower triangle matrix.

    Raises
    ------
    ValueError
        If the input file is empty or does not contain a valid lower triangle matrix.
    FileNotFoundError
        If the specified file does not exist.

    Notes
    -----
    This function reads a lower triangular matrix in which row i (1-based)
    contains exactly i elements: the entries from the first column up to and
    including the diagonal. Blank lines are skipped.
    """
    try:
        if file_path.endswith(".gz"):
            with gzip.open(file_path, "rt") as file:
                rows = [
                    list(map(float, line.strip().split(delimiter)))
                    for line in file
                    if line.strip()
                ]
        else:
            with open(file_path, "r") as file:
                rows = [
                    list(map(float, line.strip().split(delimiter)))
                    for line in file
                    if line.strip()
                ]
    except FileNotFoundError:
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")

    if not rows:
        raise ValueError("The input file is empty.")

    n = len(rows)
    lower_triangle = np.zeros((n, n))

    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(
                f"Invalid number of elements in row {i + 1}. Expected {i + 1}, got {len(row)}."
            )
        lower_triangle[i, : len(row)] = row

    return lower_triangle

sort_alleles(ld)

Sort alleles in the LD map in alphabetical order. Change the sign of the LD matrix if the alleles are swapped.

Parameters:

Name Type Description Default
ld LDMatrix

LDMatrix object containing the Variant IDs and the LD matrix.

required

Returns:

Type Description
LDMatrix

LDMatrix object containing the Variant IDs and the LD matrix with alleles sorted.

Notes

This function ensures consistent allele ordering by:

  1. Sorting alleles alphabetically (A1 <= A2)
  2. Flipping the sign of LD correlations for variants where alleles were swapped
  3. Maintaining diagonal elements as 1.0

This is important for consistent merging across different datasets.
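
The reason for the sign flip can be checked numerically. In the sketch below, genotypes are simulated allele counts (an assumption made purely for illustration); swapping A1 and A2 at one variant recodes its count g as 2 - g, which negates its correlation with every other variant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated allele counts (0, 1, or 2 copies of A1) for two variants in 200 samples.
g1 = rng.integers(0, 3, size=200)
g2 = np.clip(g1 + rng.integers(-1, 2, size=200), 0, 2)

r_original = np.corrcoef(g1, g2)[0, 1]
# Swapping A1 and A2 at the second variant recodes its counts as 2 - g2.
r_swapped = np.corrcoef(g1, 2 - g2)[0, 1]
```

If both variants are swapped, the two sign changes cancel, which is why the source flips the rows and then the columns of the swapped indices.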

Examples:

>>> map_df = pd.DataFrame({
...     'SNPID': ['1-1000-A-G', '1-2000-C-T'],
...     'CHR': [1, 1],
...     'BP': [1000, 2000],
...     'A1': ['A', 'T'],
...     'A2': ['G', 'C']
... })
>>> r_matrix = np.array([[1. , 0.1],
...                      [0.1, 1. ]])
>>> ld = LDMatrix(map_df, r_matrix)
>>> sorted_ld = sort_alleles(ld)
>>> print(sorted_ld.map)
    SNPID       CHR    BP A1 A2
0   1-1000-A-G    1  1000  A  G
1   1-2000-C-T    1  2000  C  T
>>> print(sorted_ld.r)
[[ 1.  -0.1]
 [-0.1  1. ]]
Source code in credtools/ldmatrix.py
def sort_alleles(ld: LDMatrix) -> LDMatrix:
    """
    Sort alleles in the LD map in alphabetical order. Change the sign of the LD matrix if the alleles are swapped.

    Parameters
    ----------
    ld : LDMatrix
        LDMatrix object containing the Variant IDs and the LD matrix.

    Returns
    -------
    LDMatrix
        LDMatrix object containing the Variant IDs and the LD matrix with alleles sorted.

    Notes
    -----
    This function ensures consistent allele ordering by:

    1. Sorting alleles alphabetically (A1 <= A2)
    2. Flipping the sign of LD correlations for variants where alleles were swapped
    3. Maintaining diagonal elements as 1.0

    This is important for consistent merging across different datasets.

    Examples
    --------
    >>> map_df = pd.DataFrame({
    ...     'SNPID': ['1-1000-A-G', '1-2000-C-T'],
    ...     'CHR': [1, 1],
    ...     'BP': [1000, 2000],
    ...     'A1': ['A', 'T'],
    ...     'A2': ['G', 'C']
    ... })
    >>> r_matrix = np.array([[1. , 0.1],
    ...                      [0.1, 1. ]])
    >>> ld = LDMatrix(map_df, r_matrix)
    >>> sorted_ld = sort_alleles(ld)
    >>> print(sorted_ld.map)
        SNPID       CHR    BP A1 A2
    0   1-1000-A-G    1  1000  A  G
    1   1-2000-C-T    1  2000  C  T
    >>> print(sorted_ld.r)
    [[ 1.  -0.1]
     [-0.1  1. ]]
    """
    ld_df = ld.r.copy()
    ld_map = ld.map.copy()
    ld_map[["sort_a1", "sort_a2"]] = np.sort(ld_map[[ColName.A1, ColName.A2]], axis=1)
    swapped_index = ld_map[ld_map[ColName.A1] != ld_map["sort_a1"]].index
    # Change the sign of the rows and columns the LD matrix if the alleles are swapped
    ld_df[swapped_index] *= -1
    ld_df[:, swapped_index] *= -1
    np.fill_diagonal(ld_df, 1)

    ld_map[ColName.A1] = ld_map["sort_a1"]
    ld_map[ColName.A2] = ld_map["sort_a2"]
    ld_map.drop(columns=["sort_a1", "sort_a2"], inplace=True)
    return LDMatrix(ld_map, ld_df)

Summary Statistics

Functions for processing summary statistics data.

check_colnames(df)

Check column names in the DataFrame and fill missing columns with None.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame to check for column names.

required

Returns:

Type Description
DataFrame

DataFrame with all required columns, filling missing ones with None.

Notes

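This function ensures that all required summary statistics columns are present in the DataFrame. Missing columns are added with None values, and the returned DataFrame contains only the columns listed in ColName.sumstat_cols, in that order.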
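
A minimal pandas sketch of this behaviour follows. The list `REQUIRED_COLS` is a hypothetical stand-in for `ColName.sumstat_cols`, whose actual contents are not shown on this page.

```python
import pandas as pd

# Hypothetical column list; the real one lives in ColName.sumstat_cols.
REQUIRED_COLS = ["SNPID", "CHR", "BP", "EA", "NEA", "BETA", "SE", "P"]

df = pd.DataFrame({"CHR": [1], "BP": [1000], "P": [1e-9], "EXTRA": ["x"]})
out = df.copy()
for col in REQUIRED_COLS:
    if col not in out.columns:
        out[col] = None  # placeholder for a missing column
# Selecting by the required list also drops extras and fixes the column order.
out = out[REQUIRED_COLS]
```

Any column that should survive this step therefore has to be part of the required list.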

Source code in credtools/sumstats.py
def check_colnames(df: pd.DataFrame) -> pd.DataFrame:
    """
    Check column names in the DataFrame and fill missing columns with None.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame to check for column names.

    Returns
    -------
    pd.DataFrame
        DataFrame with all required columns, filling missing ones with None.

    Notes
    -----
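    This function ensures that all required summary statistics columns are present
    in the DataFrame. Missing columns are added with None values, and the returned
    DataFrame contains only the columns listed in ColName.sumstat_cols, in that order.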
    """
    outdf: pd.DataFrame = df.copy()
    for col in ColName.sumstat_cols:
        if col not in outdf.columns:
            outdf[col] = None
    return outdf[ColName.sumstat_cols]

check_mandatory_cols(df)

Check if the DataFrame contains all mandatory columns.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to check for mandatory columns.

required

Returns:

Type Description
None

Raises:

Type Description
ValueError

If any mandatory columns are missing.

Notes

Mandatory columns are defined in ColName.mandatory_cols and typically include essential fields like chromosome, position, alleles, effect size, and p-value.
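
The check amounts to a set difference, sketched below. `MANDATORY_COLS` is a hypothetical stand-in for `ColName.mandatory_cols`, and `missing_mandatory` is an illustrative helper, not part of CREDTOOLS.

```python
from typing import Iterable, List

# Hypothetical list; the real one is defined in ColName.mandatory_cols.
MANDATORY_COLS = ["CHR", "BP", "EA", "NEA", "BETA", "SE", "P"]


def missing_mandatory(columns: Iterable[str]) -> List[str]:
    """Return mandatory column names absent from `columns`, sorted (illustrative)."""
    return sorted(set(MANDATORY_COLS) - set(columns))


missing = missing_mandatory(["CHR", "BP", "EA", "NEA", "P"])
```

Running a check like this on a file header before loading gives a chance to rename columns instead of hitting the ValueError.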

Source code in credtools/sumstats.py
def check_mandatory_cols(df: pd.DataFrame) -> None:
    """
    Check if the DataFrame contains all mandatory columns.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame to check for mandatory columns.

    Returns
    -------
    None

    Raises
    ------
    ValueError
        If any mandatory columns are missing.

    Notes
    -----
    Mandatory columns are defined in ColName.mandatory_cols and typically include
    essential fields like chromosome, position, alleles, effect size, and p-value.
    """
    # only the column names are inspected, so no copy of the data is needed
    missing_cols = set(ColName.mandatory_cols) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing mandatory columns: {missing_cols}")
    return None

get_significant_snps(df, pvalue_threshold=5e-08, use_most_sig_if_no_sig=True)

Retrieve significant SNPs from the input DataFrame based on a p-value threshold.

Parameters:

Name Type Description Default
df DataFrame

The input summary statistics containing SNP information.

required
pvalue_threshold float

The p-value threshold for significance, by default 5e-8.

5e-08
use_most_sig_if_no_sig bool

Whether to return the most significant SNP if no SNP meets the threshold, by default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing significant SNPs, sorted by p-value in ascending order.

Raises:

Type Description
ValueError

If no significant SNPs are found and use_most_sig_if_no_sig is False, or if the DataFrame is empty.

KeyError

If required columns are not present in the input DataFrame.

Notes

If no SNPs meet the significance threshold and use_most_sig_if_no_sig is True, the function returns the SNP with the smallest p-value.
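
The fallback can be sketched with pandas as below. The data are invented for illustration, and the snippet mirrors the selection logic only, without the logging or error handling.

```python
import pandas as pd

df = pd.DataFrame({"SNPID": ["rs1", "rs2", "rs3"], "P": [1e-4, 0.05, 1e-6]})
threshold = 5e-8

sig = df.loc[df["P"] <= threshold]
if sig.empty:
    # Fallback: keep the row(s) sharing the smallest p-value.
    sig = df.loc[df["P"] == df["P"].min()]
```

Because the fallback compares against the minimum value, SNPs tied for the smallest p-value are all returned.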

Examples:

>>> data = {
...     'SNPID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-9, 0.05, 1e-8]
... }
>>> df = pd.DataFrame(data)
>>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
>>> print(significant_snps)
    SNPID         P
0    rs1  1.000000e-09
1    rs3  1.000000e-08
Source code in credtools/sumstats.py
def get_significant_snps(
    df: pd.DataFrame,
    pvalue_threshold: float = 5e-8,
    use_most_sig_if_no_sig: bool = True,
) -> pd.DataFrame:
    """
    Retrieve significant SNPs from the input DataFrame based on a p-value threshold.

    Parameters
    ----------
    df : pd.DataFrame
        The input summary statistics containing SNP information.
    pvalue_threshold : float, optional
        The p-value threshold for significance, by default 5e-8.
    use_most_sig_if_no_sig : bool, optional
        Whether to return the most significant SNP if no SNP meets the threshold, by default True.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing significant SNPs, sorted by p-value in ascending order.

    Raises
    ------
    ValueError
        If no significant SNPs are found and `use_most_sig_if_no_sig` is False,
        or if the DataFrame is empty.
    KeyError
        If required columns are not present in the input DataFrame.

    Notes
    -----
    If no SNPs meet the significance threshold and `use_most_sig_if_no_sig` is True,
    the function returns the SNP with the smallest p-value.

    Examples
    --------
    >>> data = {
    ...     'SNPID': ['rs1', 'rs2', 'rs3'],
    ...     'P': [1e-9, 0.05, 1e-8]
    ... }
    >>> df = pd.DataFrame(data)
    >>> significant_snps = get_significant_snps(df, pvalue_threshold=5e-8)
    >>> print(significant_snps)
        SNPID         P
    0    rs1  1.000000e-09
    1    rs3  1.000000e-08
    """
    required_columns = {ColName.P, ColName.SNPID}
    missing_columns = required_columns - set(df.columns)
    if missing_columns:
        raise KeyError(
            f"The following required columns are missing from the DataFrame: {missing_columns}"
        )

    sig_df = df.loc[df[ColName.P] <= pvalue_threshold].copy()

    if sig_df.empty:
        if use_most_sig_if_no_sig:
            min_pvalue = df[ColName.P].min()
            sig_df = df.loc[df[ColName.P] == min_pvalue].copy()
            if sig_df.empty:
                raise ValueError("The DataFrame is empty. No SNPs available to select.")
            # use the module-level logger, consistent with the rest of this module
            logger.debug(
                f"Using the most significant SNP: {sig_df.iloc[0][ColName.SNPID]}"
            )
            logger.debug(f"p-value: {sig_df.iloc[0][ColName.P]}")
        else:
            raise ValueError("No significant SNPs found.")
    else:
        sig_df.sort_values(by=ColName.P, inplace=True)
        sig_df.reset_index(drop=True, inplace=True)

    return sig_df

load_sumstats(filename, if_sort_alleles=True, sep=None, nrows=None, skiprows=0, comment=None, gzipped=None)

Load summary statistics from a file.

Parameters:

Name Type Description Default
filename str

The path to the file containing the summary statistics. The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P.

required
if_sort_alleles bool

Whether to sort alleles in alphabetical order, by default True.

True
sep Optional[str]

The delimiter to use. If None, the delimiter is inferred from the file, by default None.

None
nrows Optional[int]

Number of rows to read. If None, all rows are read, by default None.

None
skiprows int

Number of lines to skip at the start of the file, by default 0.

0
comment Optional[str]

Character that marks the start of a comment; the rest of the line after it is not parsed, by default None.

None
gzipped Optional[bool]

Whether the file is gzipped. If None, it is inferred from the file extension, by default None.

None

Returns:

Type Description
DataFrame

A DataFrame containing the loaded summary statistics.

Notes

The function performs the following operations:

  1. Auto-detects file compression (gzip) from file extension
  2. Auto-detects delimiter (tab, comma, or space) from file content
  3. Loads the data using pandas.read_csv
  4. Applies comprehensive data munging and quality control
  5. Optionally sorts alleles for consistency

The function infers the delimiter if not provided and handles gzipped files automatically. Comprehensive quality control is applied including validation of chromosomes, positions, alleles, p-values, effect sizes, and frequencies.
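
The delimiter detection can be sketched with the standard library alone. The helper name `sniff_delimiter` and the temporary demo files are assumptions made for illustration; the precedence (tab, then comma, then space) follows the source listing for this function.

```python
import gzip
import os
import tempfile


def sniff_delimiter(path: str, skiprows: int = 0) -> str:
    """Guess the delimiter from the first unskipped line (illustrative helper)."""
    opener = gzip.open if path.endswith("gz") else open
    with opener(path, "rt") as handle:
        for _ in range(skiprows):
            handle.readline()
        header = handle.readline()
    # Tab wins over comma, and a single space is the fallback.
    if "\t" in header:
        return "\t"
    if "," in header:
        return ","
    return " "


tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(csv_path, "wt") as handle:
    handle.write("CHR,BP,EA,NEA\n1,1000,A,G\n")
tsv_path = os.path.join(tmp_dir, "demo.tsv")
with open(tsv_path, "w") as handle:
    handle.write("# comment line\nCHR\tBP\n1\t1000\n")

csv_sep = sniff_delimiter(csv_path)
tsv_sep = sniff_delimiter(tsv_path, skiprows=1)
```

Only one line is inspected, so a leading comment line that is not skipped can lead to the wrong guess; passing `sep` explicitly avoids that.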

Examples:

>>> # Load summary statistics with automatic format detection
>>> sumstats = load_sumstats('gwas_results.txt.gz')
>>> print(f"Loaded {len(sumstats)} variants")
Loaded 1000000 variants
>>> # Load with specific parameters
>>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
>>> print(sumstats.columns.tolist())
['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']
Source code in credtools/sumstats.py
def load_sumstats(
    filename: str,
    if_sort_alleles: bool = True,
    sep: Optional[str] = None,
    nrows: Optional[int] = None,
    skiprows: int = 0,
    comment: Optional[str] = None,
    gzipped: Optional[bool] = None,
) -> pd.DataFrame:
    """
    Load summary statistics from a file.

    Parameters
    ----------
    filename : str
        The path to the file containing the summary statistics.
        The header must contain the column names: CHR, BP, EA, NEA, EAF, BETA, SE, P.
    if_sort_alleles : bool, optional
        Whether to sort alleles in alphabetical order, by default True.
    sep : Optional[str], optional
        The delimiter to use. If None, the delimiter is inferred from the file, by default None.
    nrows : Optional[int], optional
        Number of rows to read. If None, all rows are read, by default None.
    skiprows : int, optional
        Number of lines to skip at the start of the file, by default 0.
    comment : Optional[str], optional
        Character that marks the start of a comment; the rest of the line after it is not parsed, by default None.
    gzipped : Optional[bool], optional
        Whether the file is gzipped. If None, it is inferred from the file extension, by default None.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the loaded summary statistics.

    Notes
    -----
    The function performs the following operations:

    1. Auto-detects file compression (gzip) from file extension
    2. Auto-detects delimiter (tab, comma, or space) from file content
    3. Loads the data using pandas.read_csv
    4. Applies comprehensive data munging and quality control
    5. Optionally sorts alleles for consistency

    The function infers the delimiter if not provided and handles gzipped files automatically.
    Comprehensive quality control is applied including validation of chromosomes, positions,
    alleles, p-values, effect sizes, and frequencies.

    Examples
    --------
    >>> # Load summary statistics with automatic format detection
    >>> sumstats = load_sumstats('gwas_results.txt.gz')
    >>> print(f"Loaded {len(sumstats)} variants")
    Loaded 1000000 variants

    >>> # Load with specific parameters
    >>> sumstats = load_sumstats('gwas_results.csv', sep=',', nrows=10000)
    >>> print(sumstats.columns.tolist())
    ['SNPID', 'CHR', 'BP', 'EA', 'NEA', 'EAF', 'BETA', 'SE', 'P', 'MAF', 'RSID']
    """
    # determine whether the file is gzipped
    if gzipped is None:
        gzipped = filename.endswith("gz")

    # read the first line of the file to determine the separator
    if sep is None:
        opener = gzip.open if gzipped else open
        # the context manager closes the file even if reading fails
        with opener(filename, "rt") as f:
            # skip leading lines, then read the header line
            for _ in range(skiprows):
                f.readline()
            line = f.readline()
        if "\t" in line:
            sep = "\t"
        elif "," in line:
            sep = ","
        else:
            sep = " "
    logger.debug(f"File {filename} is gzipped: {gzipped}")
    logger.debug(f"Separator is {sep}")
    logger.debug(f"loading data from {filename}")
    # load the data with the detected or user-supplied separator
    sumstats = pd.read_csv(
        filename,
        sep=sep,
        nrows=nrows,
        skiprows=skiprows,
        comment=comment,
        compression="gzip" if gzipped else None,
    )
    sumstats = munge(sumstats)
    logger.info(f"Loaded {len(sumstats)} rows sumstats from {filename}")
    if if_sort_alleles:
        sumstats = sort_alleles(sumstats)
    return sumstats

make_SNPID_unique(sumstat, remove_duplicates=True, col_chr=ColName.CHR, col_bp=ColName.BP, col_ea=ColName.EA, col_nea=ColName.NEA, col_p=ColName.P)

Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.

Parameters:

Name Type Description Default
sumstat DataFrame

The input summary statistics containing SNP information.

required
remove_duplicates bool

Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True.

True
col_chr str

The column name for chromosome information, by default ColName.CHR.

CHR
col_bp str

The column name for base-pair position information, by default ColName.BP.

BP
col_ea str

The column name for effect allele information, by default ColName.EA.

EA
col_nea str

The column name for non-effect allele information, by default ColName.NEA.

NEA
col_p str

The column name for p-value information, by default ColName.P.

P

Returns:

Type Description
DataFrame

The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets.

Raises:

Type Description
KeyError

If required columns are missing from the input DataFrame.

ValueError

If the input DataFrame is empty or becomes empty after processing.

Notes

This function constructs a unique SNPID by concatenating chromosome, base-pair position, and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of multiple summary statistics without the need for extensive duplicate comparisons.

The unique SNPID format: "chr-bp-sortedEA-sortedNEA"

If duplicates are found and remove_duplicates is False, the first occurrence keeps its SNPID unchanged and each later occurrence receives a numeric suffix ("-1", "-2", and so on) so that all identifiers are unique.
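
The identifier format and the duplicate suffix can be sketched in plain Python. `build_snpids` is a hypothetical helper that mimics the behaviour with `remove_duplicates=False`; it does not use pandas and is not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_snpids(variants: List[Tuple[str, int, str, str]]) -> List[str]:
    """Build 'chr-bp-allele1-allele2' IDs with sorted alleles (illustrative helper)."""
    seen: Dict[str, int] = defaultdict(int)
    ids = []
    for chrom, bp, ea, nea in variants:
        first, second = sorted([ea, nea])
        snpid = f"{chrom}-{bp}-{first}-{second}"
        count = seen[snpid]
        seen[snpid] += 1
        # First occurrence keeps the plain ID; repeats get -1, -2, ...
        ids.append(snpid if count == 0 else f"{snpid}-{count}")
    return ids


ids = build_snpids([("1", 12345, "A", "G"), ("1", 12345, "G", "A"), ("2", 67890, "T", "C")])
```

Because the alleles are sorted only inside the identifier, two datasets that report the same variant with opposite effect alleles still map to the same SNPID.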

Examples:

>>> data = {
...     'CHR': ['1', '1', '2'],
...     'BP': [12345, 12345, 67890],
...     'EA': ['A', 'A', 'G'],
...     'NEA': ['G', 'G', 'A'],
...     'RSID': ['rs1', 'rs2', 'rs3'],
...     'P': [1e-5, 1e-6, 1e-7]
... }
>>> df = pd.DataFrame(data)
>>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
>>> print(unique_df)
    SNPID       CHR     BP EA NEA RSID         P
0  1-12345-A-G    1  12345  A   G  rs2  1.000000e-06
1  2-67890-A-G    2  67890  G   A  rs3  1.000000e-07
Source code in credtools/sumstats.py
def make_SNPID_unique(
    sumstat: pd.DataFrame,
    remove_duplicates: bool = True,
    col_chr: str = ColName.CHR,
    col_bp: str = ColName.BP,
    col_ea: str = ColName.EA,
    col_nea: str = ColName.NEA,
    col_p: str = ColName.P,
) -> pd.DataFrame:
    """
    Generate unique SNP identifiers to facilitate the combination of multiple summary statistics datasets.

    Parameters
    ----------
    sumstat : pd.DataFrame
        The input summary statistics containing SNP information.
    remove_duplicates : bool, optional
        Whether to remove duplicated SNPs, keeping the one with the smallest p-value, by default True.
    col_chr : str, optional
        The column name for chromosome information, by default ColName.CHR.
    col_bp : str, optional
        The column name for base-pair position information, by default ColName.BP.
    col_ea : str, optional
        The column name for effect allele information, by default ColName.EA.
    col_nea : str, optional
        The column name for non-effect allele information, by default ColName.NEA.
    col_p : str, optional
        The column name for p-value information, by default ColName.P.

    Returns
    -------
    pd.DataFrame
        The summary statistics DataFrame with unique SNPIDs, suitable for merging with other datasets.

    Raises
    ------
    KeyError
        If required columns are missing from the input DataFrame.
    ValueError
        If the input DataFrame is empty or becomes empty after processing.

    Notes
    -----
    This function constructs a unique SNPID by concatenating chromosome, base-pair position,
    and sorted alleles (EA and NEA). This unique identifier allows for efficient merging of
    multiple summary statistics without the need for extensive duplicate comparisons.

    The unique SNPID format: "chr-bp-sortedEA-sortedNEA"

    If duplicates are found and `remove_duplicates` is False, the first occurrence keeps
    its SNPID unchanged and each later occurrence receives a numeric suffix ("-1", "-2",
    and so on) so that all identifiers are unique.

    Examples
    --------
    >>> data = {
    ...     'CHR': ['1', '1', '2'],
    ...     'BP': [12345, 12345, 67890],
    ...     'EA': ['A', 'A', 'G'],
    ...     'NEA': ['G', 'G', 'A'],
    ...     'RSID': ['rs1', 'rs2', 'rs3'],
    ...     'P': [1e-5, 1e-6, 1e-7]
    ... }
    >>> df = pd.DataFrame(data)
    >>> unique_df = make_SNPID_unique(df, remove_duplicates=True)
    >>> print(unique_df)
        SNPID       CHR     BP EA NEA RSID         P
    0  1-12345-A-G    1  12345  A   G  rs2  1.000000e-06
    1  2-67890-A-G    2  67890  G   A  rs3  1.000000e-07
    """
    required_columns = {
        col_chr,
        col_bp,
        col_ea,
        col_nea,
    }
    missing_columns = required_columns - set(sumstat.columns)
    if missing_columns:
        raise KeyError(
            f"The following required columns are missing from the DataFrame: {missing_columns}"
        )

    if sumstat.empty:
        raise ValueError("The input DataFrame is empty.")

    df = sumstat.copy()

    # Sort alleles to ensure unique representation (EA <= NEA)
    allele_df = df[[col_ea, col_nea]].apply(
        lambda row: sorted([row[col_ea], row[col_nea]]), axis=1, result_type="expand"
    )
    allele_df.columns = [col_ea, col_nea]

    # Create unique SNPID
    df[ColName.SNPID] = (
        df[col_chr].astype(str)
        + "-"
        + df[col_bp].astype(str)
        + "-"
        + allele_df[col_ea]
        + "-"
        + allele_df[col_nea]
    )

    # move SNPID to the first column
    cols = df.columns.tolist()
    cols.insert(0, cols.pop(cols.index(ColName.SNPID)))
    df = df[cols]

    n_duplicated = df.duplicated(subset=[ColName.SNPID]).sum()

    if remove_duplicates and n_duplicated > 0:
        logger.debug(f"Number of duplicated SNPs: {n_duplicated}")
        if col_p in df.columns:
            # Sort by p-value to keep the SNP with the smallest p-value
            df.sort_values(by=col_p, inplace=True)
        df.drop_duplicates(subset=[ColName.SNPID], keep="first", inplace=True)
        # Sort DataFrame by chromosome and base-pair position
        df.sort_values(by=[col_chr, col_bp], inplace=True)
        df.reset_index(drop=True, inplace=True)
    elif n_duplicated > 0 and not remove_duplicates:
        logger.warning(
            """Duplicated SNPs detected. To remove duplicates, set `remove_duplicates=True`.
            Change the Unique SNP identifier to make it unique."""
        )
        # Change the Unique SNP identifier to make it unique. add a number to the end of the SNP identifier
        #  for example, 1-12345-A-G to 1-12345-A-G-1, 1-12345-A-G-2, etc. no alteration to the original SNP identifier
        dup_tail = "-" + df.groupby(ColName.SNPID).cumcount().astype(str)
        dup_tail = dup_tail.str.replace("-0", "")
        df[ColName.SNPID] = df[ColName.SNPID] + dup_tail

    logger.debug("Unique SNPIDs have been successfully created.")
    logger.debug(f"Total unique SNPs: {len(df)}")

    return df

munge(df)

Munge the summary statistics DataFrame by performing a series of transformations.

Parameters:

Name Type Description Default
df DataFrame

The input DataFrame containing summary statistics.

required

Returns:

Type Description
DataFrame

The munged DataFrame with necessary transformations applied.

Raises:

Type Description
ValueError

If any mandatory columns are missing.

Notes

This function performs comprehensive data cleaning and standardization:

  1. Validates mandatory columns are present
  2. Removes columns in which every value is missing
  3. Cleans chromosome and position data
  4. Validates and standardizes allele information
  5. Creates unique SNP identifiers
  6. Validates p-values, effect sizes, and standard errors
  7. Processes effect allele frequencies and, when EAF is present, derives MAF from it
  8. Handles rsID information if present

The function applies strict quality control and may remove variants that don't meet validation criteria.

Source code in credtools/sumstats.py
def munge(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge the summary statistics DataFrame by performing a series of transformations.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame containing summary statistics.

    Returns
    -------
    pd.DataFrame
        The munged DataFrame with necessary transformations applied.

    Raises
    ------
    ValueError
        If any mandatory columns are missing.

    Notes
    -----
    This function performs comprehensive data cleaning and standardization:

    1. Validates mandatory columns are present
    2. Removes entirely missing columns
    3. Cleans chromosome and position data
    4. Validates and standardizes allele information
    5. Creates unique SNP identifiers
    6. Validates p-values, effect sizes, and standard errors
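    7. Processes effect allele frequencies and, when EAF is present, derives MAF from it
    8. Handles rsID information if present

    The function applies strict quality control and may remove variants
    that don't meet validation criteria. Rows are sorted by chromosome
    and position during processing.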
    """
    check_mandatory_cols(df)
    outdf = df.copy()
    outdf = rm_col_allna(outdf)
    outdf = munge_chr(outdf)
    outdf = munge_bp(outdf)
    outdf = munge_allele(outdf)
    outdf = make_SNPID_unique(outdf)
    outdf = munge_pvalue(outdf)
    outdf = outdf.sort_values(by=[ColName.CHR, ColName.BP])
    outdf = munge_beta(outdf)
    outdf = munge_se(outdf)
    if ColName.EAF in outdf.columns:
        outdf = munge_eaf(outdf)
        # Seed MAF with EAF; munge_maf then folds values above 0.5 to 1 - frequency.
        outdf[ColName.MAF] = outdf[ColName.EAF]
        outdf = munge_maf(outdf)
    if ColName.RSID in outdf.columns:
        outdf = munge_rsid(outdf)
    outdf = check_colnames(outdf)
    return outdf
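
The first step, the mandatory-column check that raises `ValueError`, might look roughly like the sketch below. The column names and the helper `check_required` are hypothetical; the real names come from `ColName` and the real check is `check_mandatory_cols`, neither of which is reproduced here.

```python
import pandas as pd

# Hypothetical column names; the real ones are defined by ColName in credtools.
MANDATORY_COLS = ["CHR", "BP", "EA", "NEA", "BETA", "SE", "P"]


def check_required(df: pd.DataFrame, required=MANDATORY_COLS) -> None:
    """Raise ValueError listing every required column that is absent."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing mandatory columns: {missing}")


complete = pd.DataFrame(columns=MANDATORY_COLS)
check_required(complete)  # passes silently

incomplete = complete.drop(columns=["SE", "P"])
try:
    check_required(incomplete)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Failing before any transformation runs means the caller sees every missing column in one error rather than hitting a `KeyError` midway through the pipeline.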

munge_allele(df)

Munge allele columns.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with allele columns.

required

Returns:

Type Description
DataFrame

DataFrame with munged allele columns.

Notes

This function:

  1. Removes rows with missing allele values
  2. Converts alleles to uppercase
  3. Validates alleles contain only valid DNA bases (A, C, G, T)
  4. Removes variants where effect allele equals non-effect allele

Invalid alleles and monomorphic variants are removed and logged.

Source code in credtools/sumstats.py
def munge_allele(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge allele columns.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with allele columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged allele columns.

    Notes
    -----
    This function:

    1. Removes rows with missing allele values
    2. Converts alleles to uppercase
    3. Validates alleles contain only valid DNA bases (A, C, G, T)
    4. Removes variants where effect allele equals non-effect allele

    Invalid alleles and monomorphic variants are removed and logged.
    """
    validate = _get_validate_and_clean_column()
    _transform_allele = _get_transform_allele()
    outdf = df.copy()
    for col in [ColName.EA, ColName.NEA]:
        outdf = validate(
            df=outdf,
            col_name=col,
            col_type=ColType.EA,
            allow_na=ColAllowNA.EA,
            transform_func=_transform_allele,
        )
    outdf = outdf[outdf[ColName.EA] != outdf[ColName.NEA]]
    return outdf
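
The four allele steps listed in the notes can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's validator: the column names `EA` and `NEA` are stand-ins for `ColName.EA` and `ColName.NEA`, and multi-base alleles (indels) are assumed to be acceptable as long as they contain only A, C, G, and T.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "EA": ["a", "G", "N", "T", None],
        "NEA": ["g", "G", "A", "TCA", "C"],
    }
)

# 1. Drop rows with a missing allele, 2. upper-case both columns.
cleaned = df.dropna(subset=["EA", "NEA"]).copy()
for col in ["EA", "NEA"]:
    cleaned[col] = cleaned[col].str.upper()

# 3. Keep alleles made up entirely of A, C, G, T (assumption: indels allowed).
valid = cleaned["EA"].str.fullmatch(r"[ACGT]+") & cleaned["NEA"].str.fullmatch(r"[ACGT]+")

# 4. Drop rows where the effect and non-effect allele are identical.
distinct = cleaned["EA"] != cleaned["NEA"]

cleaned = cleaned[valid & distinct].reset_index(drop=True)
print(cleaned)
```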

munge_beta(df)

Munge beta column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with beta column.

required

Returns:

Type Description
DataFrame

DataFrame with munged beta column.

Notes

This function:

  1. Converts beta values to numeric type
  2. Removes rows with missing beta values
  3. Converts to appropriate data type

Invalid beta values are removed and logged.

Source code in credtools/sumstats.py
def munge_beta(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge beta column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with beta column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged beta column.

    Notes
    -----
    This function:

    1. Converts beta values to numeric type
    2. Removes rows with missing beta values
    3. Converts to appropriate data type

    Invalid beta values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.BETA,
        col_type=ColType.BETA,
        allow_na=ColAllowNA.BETA,
    )
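
What "convert to numeric, drop missing, cast" can mean in practice is sketched below. It assumes a column named `BETA` and a `float64` target type; the actual dtype is whatever `ColType.BETA` specifies, and the real work is done by the shared validation helper.

```python
import pandas as pd

df = pd.DataFrame({"BETA": ["0.12", "-0.05", "NA", None, "1e-3"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
beta = pd.to_numeric(df["BETA"], errors="coerce")
n_invalid = int(beta.isna().sum())

# Drop the rows that could not be parsed, then cast to the assumed dtype.
cleaned = df.assign(BETA=beta).dropna(subset=["BETA"])
cleaned = cleaned.astype({"BETA": "float64"}).reset_index(drop=True)
print(n_invalid, cleaned["BETA"].tolist())
```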

munge_bp(df)

Munge position column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with position column.

required

Returns:

Type Description
DataFrame

DataFrame with munged position column.

Notes

This function:

  1. Removes rows with missing position values
  2. Converts position to numeric type
  3. Validates positions are within acceptable range (exclusive: > 0, < 300M)
  4. Converts to appropriate data type

Invalid position values are removed and logged.

Source code in credtools/sumstats.py
def munge_bp(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge position column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with position column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged position column.

    Notes
    -----
    This function:

    1. Removes rows with missing position values
    2. Converts position to numeric type
    3. Validates positions are within acceptable range (exclusive: > 0, < 300M)
    4. Converts to appropriate data type

    Invalid position values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.BP,
        col_type=ColType.BP,
        min_val=ColRange.BP_MIN,
        max_val=ColRange.BP_MAX,
        allow_na=ColAllowNA.BP,
        exclude_min=True,
        exclude_max=True,
    )
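
The exclusive range check (`exclude_min=True`, `exclude_max=True`) corresponds to a strict inequality on both ends. A minimal sketch, assuming a column named `BP` and the bounds documented above (greater than 0, less than 300M); the constant names here are illustrative rather than the values of `ColRange`:

```python
import pandas as pd

# Bounds as documented in the notes; the names are illustrative.
BP_MIN, BP_MAX = 0, 300_000_000

df = pd.DataFrame({"BP": [0, 1, 12345, 299_999_999, 300_000_000, -5]})

# inclusive="neither" makes both ends exclusive: BP_MIN < BP < BP_MAX.
in_range = df["BP"].between(BP_MIN, BP_MAX, inclusive="neither")
cleaned = df[in_range].reset_index(drop=True)
print(cleaned["BP"].tolist())
```

Both boundary values, 0 and 300,000,000, are rejected along with the negative position.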

munge_chr(df)

Munge chromosome column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with chromosome column.

required

Returns:

Type Description
DataFrame

DataFrame with munged chromosome column.

Notes

This function:

  1. Removes rows with missing chromosome values
  2. Converts chromosome to string and removes 'chr' prefix
  3. Converts X chromosome to numeric value (23)
  4. Validates chromosome values are within acceptable range
  5. Converts to appropriate data type

Invalid chromosome values are removed and logged.

Source code in credtools/sumstats.py
def munge_chr(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge chromosome column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with chromosome column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged chromosome column.

    Notes
    -----
    This function:

    1. Removes rows with missing chromosome values
    2. Converts chromosome to string and removes 'chr' prefix
    3. Converts X chromosome to numeric value (23)
    4. Validates chromosome values are within acceptable range
    5. Converts to appropriate data type

    Invalid chromosome values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.CHR,
        col_type=ColType.CHR,
        min_val=ColRange.CHR_MIN,
        max_val=ColRange.CHR_MAX,
        allow_na=ColAllowNA.CHR,
        transform_func=_get_transform_chr(),
    )
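
The chromosome transformation described in the notes can be sketched as follows. This is not the library's `_get_transform_chr()`; it assumes a column named `CHR` and an accepted range of 1 to 23, which is a guess consistent with X being mapped to 23 but is not confirmed by this page.

```python
import pandas as pd

df = pd.DataFrame({"CHR": ["chr1", "22", "X", "chrX", "MT", None, 7]})

chrom = df["CHR"].dropna().astype(str)
chrom = chrom.str.replace(r"^chr", "", regex=True)  # "chr1" -> "1"
chrom = chrom.replace({"X": "23"})                  # X -> 23
chrom = pd.to_numeric(chrom, errors="coerce")       # "MT" -> NaN

# Assumed valid range 1..23 (inclusive); NaN fails the check and is dropped.
chrom = chrom[chrom.between(1, 23)].astype(int)
print(chrom.tolist())
```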

munge_eaf(df)

Munge effect allele frequency column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with effect allele frequency column.

required

Returns:

Type Description
DataFrame

DataFrame with munged effect allele frequency column.

Notes

This function:

  1. Converts EAF values to numeric type
  2. Removes rows with missing EAF values
  3. Validates EAF values are within range [0, 1] (inclusive)
  4. Converts to appropriate data type

Invalid EAF values are removed and logged.

Source code in credtools/sumstats.py
def munge_eaf(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge effect allele frequency column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with effect allele frequency column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged effect allele frequency column.

    Notes
    -----
    This function:

    1. Converts EAF values to numeric type
    2. Removes rows with missing EAF values
    3. Validates EAF values are within range [0, 1] (inclusive)
    4. Converts to appropriate data type

    Invalid EAF values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.EAF,
        col_type=ColType.EAF,
        min_val=ColRange.EAF_MIN,
        max_val=ColRange.EAF_MAX,
        allow_na=ColAllowNA.EAF,
    )

munge_maf(df)

Munge minor allele frequency column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with minor allele frequency column.

required

Returns:

Type Description
DataFrame

DataFrame with munged minor allele frequency column.

Notes

This function:

  1. Converts MAF values to numeric type
  2. Removes rows with missing MAF values
  3. Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
  4. Validates MAF values are within acceptable range
  5. Converts to appropriate data type

Invalid MAF values are removed and logged.

Source code in credtools/sumstats.py
def munge_maf(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge minor allele frequency column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with minor allele frequency column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged minor allele frequency column.

    Notes
    -----
    This function:

    1. Converts MAF values to numeric type
    2. Removes rows with missing MAF values
    3. Converts frequencies > 0.5 to 1 - frequency (to ensure minor allele)
    4. Validates MAF values are within acceptable range
    5. Converts to appropriate data type

    Invalid MAF values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.MAF,
        col_type=ColType.MAF,
        min_val=ColRange.MAF_MIN,
        max_val=ColRange.MAF_MAX,
        allow_na=ColAllowNA.MAF,
        transform_func=_transform_maf,
    )
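
Step 3, folding frequencies above 0.5 onto the minor allele, is a one-line transformation. The sketch below is an illustration and not the library's `_transform_maf`:

```python
import pandas as pd

eaf = pd.Series([0.10, 0.50, 0.80, 1.00])

# Fold onto the minor allele: anything above 0.5 becomes 1 - frequency.
maf = eaf.where(eaf <= 0.5, 1 - eaf)
print(maf.round(3).tolist())
```

This is equivalent to taking the element-wise minimum of `eaf` and `1 - eaf`, so the result can never exceed 0.5.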

munge_pvalue(df)

Munge p-value column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with p-value column.

required

Returns:

Type Description
DataFrame

DataFrame with munged p-value column.

Notes

This function:

  1. Converts p-values to numeric type
  2. Removes rows with missing p-values
  3. Validates p-values are within acceptable range (exclusive: > 0, < 1)
  4. Converts to appropriate data type

Invalid p-values are removed and logged.

Source code in credtools/sumstats.py
def munge_pvalue(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge p-value column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with p-value column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged p-value column.

    Notes
    -----
    This function:

    1. Converts p-values to numeric type
    2. Removes rows with missing p-values
    3. Validates p-values are within acceptable range (exclusive: > 0, < 1)
    4. Converts to appropriate data type

    Invalid p-values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.P,
        col_type=ColType.P,
        min_val=ColRange.P_MIN,
        max_val=ColRange.P_MAX,
        allow_na=ColAllowNA.P,
        exclude_min=True,
        exclude_max=True,
    )

munge_rsid(df)

Munge rsID column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with rsID column.

required

Returns:

Type Description
DataFrame

DataFrame with munged rsID column.

Notes

This function converts the rsID column to the appropriate data type as defined in ColType.RSID.

Source code in credtools/sumstats.py
def munge_rsid(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge rsID column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with rsID column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged rsID column.

    Notes
    -----
    This function converts the rsID column to the appropriate data type
    as defined in ColType.RSID.
    """
    outdf = df.copy()
    outdf[ColName.RSID] = outdf[ColName.RSID].astype(ColType.RSID)
    return outdf

munge_se(df)

Munge standard error column.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with standard error column.

required

Returns:

Type Description
DataFrame

DataFrame with munged standard error column.

Notes

This function:

  1. Converts standard error values to numeric type
  2. Removes rows with missing standard error values
  3. Validates standard errors are positive (exclusive: > 0)
  4. Converts to appropriate data type

Invalid standard error values are removed and logged.

Source code in credtools/sumstats.py
def munge_se(df: pd.DataFrame) -> pd.DataFrame:
    """
    Munge standard error column.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with standard error column.

    Returns
    -------
    pd.DataFrame
        DataFrame with munged standard error column.

    Notes
    -----
    This function:

    1. Converts standard error values to numeric type
    2. Removes rows with missing standard error values
    3. Validates standard errors are positive (exclusive: > 0)
    4. Converts to appropriate data type

    Invalid standard error values are removed and logged.
    """
    return _get_validate_and_clean_column()(
        df=df,
        col_name=ColName.SE,
        col_type=ColType.SE,
        min_val=ColRange.SE_MIN,
        allow_na=ColAllowNA.SE,
        exclude_min=True,
    )

rm_col_allna(df)

Remove columns from the DataFrame that are entirely NA.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame from which to remove columns.

required

Returns:

Type Description
DataFrame

A DataFrame with columns that are entirely NA removed.

Notes

This function also converts empty strings to None before checking for all-NA columns. Columns that contain only missing values are dropped to reduce memory usage and improve processing efficiency.

Source code in credtools/sumstats.py
def rm_col_allna(df: pd.DataFrame) -> pd.DataFrame:
    """
    Remove columns from the DataFrame that are entirely NA.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame from which to remove columns.

    Returns
    -------
    pd.DataFrame
        A DataFrame with columns that are entirely NA removed.

    Notes
    -----
    This function also converts empty strings to None before checking for
    all-NA columns. Columns that contain only missing values are dropped
    to reduce memory usage and improve processing efficiency.
    """
    outdf = df.copy()
    outdf = outdf.replace("", None)
    for col in outdf.columns:
        if outdf[col].isnull().all():
            logger.debug(f"Remove column {col} because it is all NA.")
            outdf.drop(col, axis=1, inplace=True)
    return outdf
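
The same effect can be reproduced on a toy frame. This sketch swaps in `np.nan` and `DataFrame.dropna(axis=1, how="all")` for the explicit loop in the source above; the column names are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "SNPID": ["1-100-A-G", "1-200-C-T"],
        "RSID": ["", ""],          # only empty strings
        "INFO": [np.nan, np.nan],  # only NaN
        "EAF": [0.3, np.nan],      # partly missing, so it is kept
    }
)

# Treat empty strings as missing, then drop columns with no observed value.
cleaned = df.replace("", np.nan).dropna(axis=1, how="all")
print(list(cleaned.columns))
```

A column that is only partly missing, such as `EAF` here, survives; only columns without a single observed value are dropped.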

sort_alleles(df)

Sort EA and NEA alphabetically. When the two alleles are swapped, flip the sign of beta and, if EAF is present, replace it with 1 - EAF.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with allele columns.

required

Returns:

Type Description
DataFrame

DataFrame with sorted allele columns.

Notes

This function ensures consistent allele ordering by:

  1. Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
  2. Flipping the sign of beta if alleles were swapped
  3. Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)

This standardization is important for:

  - Consistent merging across datasets
  - Meta-analysis compatibility
  - LD matrix alignment

Source code in credtools/sumstats.py
def sort_alleles(df: pd.DataFrame) -> pd.DataFrame:
    """
    Sort EA and NEA alphabetically. When the two alleles are swapped, flip the sign of beta and, if EAF is present, replace it with 1 - EAF.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with allele columns.

    Returns
    -------
    pd.DataFrame
        DataFrame with sorted allele columns.

    Notes
    -----
    This function ensures consistent allele ordering by:

    1. Sorting effect allele (EA) and non-effect allele (NEA) alphabetically
    2. Flipping the sign of beta if alleles were swapped
    3. Adjusting effect allele frequency (EAF) if alleles were swapped (EAF = 1 - EAF)

    This standardization is important for:
    - Consistent merging across datasets
    - Meta-analysis compatibility
    - LD matrix alignment
    """
    outdf = df.copy()
    outdf[["sorted_a1", "sorted_a2"]] = np.sort(
        outdf[[ColName.EA, ColName.NEA]], axis=1
    )
    outdf[ColName.BETA] = np.where(
        outdf[ColName.EA] == outdf["sorted_a1"],
        outdf[ColName.BETA],
        -outdf[ColName.BETA],
    )
    if ColName.EAF in outdf.columns:
        outdf[ColName.EAF] = np.where(
            outdf[ColName.EA] == outdf["sorted_a1"],
            outdf[ColName.EAF],
            1 - outdf[ColName.EAF],
        )
    outdf[ColName.EA] = outdf["sorted_a1"]
    outdf[ColName.NEA] = outdf["sorted_a2"]
    outdf.drop(columns=["sorted_a1", "sorted_a2"], inplace=True)
    return outdf
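
A small worked example makes the swap rule concrete. The sketch below follows the same logic as the source above but uses literal column names (`EA`, `NEA`, `BETA`, `EAF`) as stand-ins for the `ColName` constants.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "EA": ["G", "A", "T"],
        "NEA": ["A", "C", "C"],
        "BETA": [0.20, -0.10, 0.05],
        "EAF": [0.30, 0.40, 0.90],
    }
)

# Sort each (EA, NEA) pair alphabetically, row by row.
sorted_pairs = np.sort(df[["EA", "NEA"]].to_numpy(dtype=str), axis=1)

# A row is "swapped" when its effect allele is no longer first after sorting.
swapped = df["EA"].to_numpy(dtype=str) != sorted_pairs[:, 0]

out = df.copy()
out["BETA"] = np.where(swapped, -df["BETA"], df["BETA"])    # flip effect direction
out["EAF"] = np.where(swapped, 1 - df["EAF"], df["EAF"])    # frequency of the new EA
out["EA"], out["NEA"] = sorted_pairs[:, 0], sorted_pairs[:, 1]
print(out)
```

Rows one and three are swapped (G/A becomes A/G, T/C becomes C/T), so their beta changes sign and their EAF becomes 1 - EAF; the middle row is already in order and is left unchanged.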