Validate LD reference

The validation of the LD reference is performed to facilitate the calculation of the LD matrix. However, if you intend to use LD-free fine-mapping methods, you can disregard this step.

The validate-ldref method in EasyFinemap quickly validates the PLINK bfile format. The main steps involved are as follows:

If the input is not already separated by chromosome, the PLINK bfile is split by chromosome.
Variants with a minor allele count below a specified threshold (default is 10) are filtered out.
Multiallelic variants are removed.
Duplicate variants are removed.
Variant IDs are converted to Unique SNP IDs for easy matching of summary statistics with variants in the LD reference. In EasyFinemap, the Unique SNP ID format used is chr-bp-sorted(EA,NEA).

Let's demonstrate the usage of validate-ldref with an example dataset.

Download the example data:

git clone https://github.com/Jianhua-Wang/easyfinemap.git
cd easyfinemap/exampledata
ls

The exampledata directory contains bfiles with the prefix EUR.chr21 for chromosome 21, EUR.chr22 for chromosome 22, and EUR.chr21-22, which is a merged file of chromosomes 21 and 22.

validate-ldref supports both split-by-chromosome bfiles (such as those directly converted from chromosome-separated VCF files from 1000 Genomes) and non-split bfiles. The specific command is as follows:

easyfinemap validate-ldref ./EUR.chr21-22 EUR.valid

For split-by-chromosome bfiles, use the {chrom} wildcard to represent the chromosome number.

easyfinemap validate-ldref ./EUR.chr{chrom} EUR.valid

Other parameters:

$ easyfinemap validate-ldref -h
────────────────────────────────── EasyFinemap ───────────────────────────────────
                                  Version: 0.3.9
                               Author: Jianhua Wang
                          Email: jianhua.mert@gmail.com

 Usage: easyfinemap validate-ldref [OPTIONS] LDREF_PATH OUTPREFIX

 Validate the LD reference file.

╭─ Arguments ────────────────────────────────────────────────────────────────────╮
│ *    ldref_path      TEXT  The path to the LD reference file. [default: None]  │
│                            [required]                                          │
│ *    outprefix       TEXT  The output prefix. [default: None] [required]       │
╰────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────╮
│ --file-type  -f      TEXT     The file type of the LD reference file.          │
│                               [default: plink]                                 │
│ --mac        -m      INTEGER  The minor allele count threshold. [default: 10]  │
│ --threads    -t      INTEGER  The number of threads. [default: 1]              │
│ --help       -h               Show this message and exit.                      │
╰────────────────────────────────────────────────────────────────────────────────╯

The -m option allows you to change the minor allele count threshold for filtering. It should be set to a value greater than 0 because a minor allele count of 0 can cause errors in LD matrix calculations. The -t option specifies the number of threads. Setting it higher can speed up the process. Parallelization is performed by chromosome, so it should not exceed the total number of chromosomes.