File formats
Summary Statistics
easyfinemap requires only one input file: the GWAS summary statistics. Summary statistics have no standard format; they are essentially tabular files in which each row holds the association results for one SNP. To simplify processing and speed up analysis, we have designed a unified format based on tabix. The file must have exactly 10 columns and must be tabix-indexed on CHR and BP.
The 10 columns in the file are as follows:
- CHR: Chromosome (integer)
- BP: Base pair position (integer)
- rsID: rsID of the SNP (string, allows null values)
- EA: Effect allele (string)
- NEA: Non-effect allele (string)
- EAF: Effect allele frequency (float, allows null values)
- MAF: Minor allele frequency (float, allows null values)
- BETA: Effect size (float)
- SE: Standard error (float, non-zero values)
- P: p-value (positive float)
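The column constraints listed above can be checked row by row before indexing. The following is a minimal sketch, not easyfinemap's own validation: the function name `check_row`, the set of tokens treated as null, and the error messages are all assumptions made for illustration.

```python
import math

# Column names and constraints follow the list above. Everything else
# (function name, null tokens, messages) is hypothetical.
COLUMNS = ["CHR", "BP", "rsID", "EA", "NEA", "EAF", "MAF", "BETA", "SE", "P"]
NULLABLE = {"rsID", "EAF", "MAF"}
NULL_TOKENS = {"", "NA", "nan", "."}  # assumed encodings of a null value


def check_row(fields):
    """Return a list of problems for one tab-split row (empty if none)."""
    if len(fields) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} columns, got {len(fields)}"]
    row = dict(zip(COLUMNS, fields))

    # Only rsID, EAF and MAF are allowed to be null.
    problems = [
        f"{name} must not be null"
        for name in COLUMNS
        if row[name] in NULL_TOKENS and name not in NULLABLE
    ]
    if problems:
        return problems

    # CHR and BP are integers.
    for name in ("CHR", "BP"):
        try:
            int(row[name])
        except ValueError:
            problems.append(f"{name} must be an integer")

    # The remaining numeric columns are floats with extra constraints.
    for name in ("EAF", "MAF", "BETA", "SE", "P"):
        if row[name] in NULL_TOKENS:
            continue  # a permitted null, already screened above
        try:
            value = float(row[name])
        except ValueError:
            problems.append(f"{name} must be a float")
            continue
        if not math.isfinite(value):
            problems.append(f"{name} must be finite")
        elif name == "SE" and value == 0:
            problems.append("SE must be non-zero")
        elif name == "P" and value <= 0:
            problems.append("P must be positive")
    return problems
```

A row passes when `check_row(line.rstrip("\n").split("\t"))` returns an empty list. The sketch assumes a tab-delimited file, which is what tabix expects by default.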
By following this format and indexing CHR and BP using tabix, you can ensure compatibility and efficient processing of the GWAS summary statistics file with easyfinemap.
Users can easily convert summary statistics from other formats into this format using Smunger.
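For readers preparing a file by hand, tabix indexing on CHR and BP generally involves three steps: sort the records by chromosome and position, compress with bgzip, and index with tabix. The sketch below only sorts records and builds the command lines; it does not run them. It assumes the file has a single header line (hence `-S 1`) and that CHR and BP are the first and second columns, as in the list above. The function names are hypothetical, and easyfinemap or Smunger may handle these steps differently.

```python
def sort_key(line):
    """Numeric (CHR, BP) key, so that chromosome 2 sorts before 10."""
    chrom, bp = line.split("\t", 2)[:2]
    return int(chrom), int(bp)


def sort_records(lines):
    """Sort data lines (header excluded) by chromosome, then position."""
    return sorted(lines, key=sort_key)


def index_commands(path):
    """Build the bgzip and tabix invocations for a sorted file.

    -s 1: chromosome is column 1; -b 2 / -e 2: position is column 2;
    -S 1: skip one header line (an assumption about the file layout).
    """
    return [
        ["bgzip", "-f", path],
        ["tabix", "-f", "-s", "1", "-b", "2", "-e", "2", "-S", "1", path + ".gz"],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once bgzip and tabix are installed and on the `PATH`.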
LD reference (Optional)
To perform LD-based fine-mapping, users need to provide individual-level genotype data from which LD can be calculated. Ideally, these genotypes should come from the same sample as the summary statistics. In practice, however, it is common to use a publicly available reference panel such as 1000 Genomes (1000G). Since easyfinemap uses PLINK v1.9 to calculate LD, the genotype data must be in PLINK's bfile format.
Users need to split the genotype data by chromosome and convert it to the PLINK bfile format. Then, they can use the easyfinemap validate-ldref command to format the LD reference.
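The per-chromosome split can be done with PLINK's `--chr` and `--make-bed` options. The sketch below builds one PLINK command per autosome from a genome-wide bfile; it does not run anything. The function name, the `plink` executable name, and the `<prefix>.chr<N>` output naming are assumptions for illustration; check the help output of easyfinemap validate-ldref for the naming it actually expects.

```python
def split_commands(bfile_prefix, out_prefix, chromosomes=range(1, 23)):
    """Build one PLINK 1.9 command per chromosome.

    bfile_prefix: path prefix of the genome-wide .bed/.bim/.fam files.
    out_prefix:   prefix for per-chromosome output (naming is hypothetical).
    """
    return [
        [
            "plink",
            "--bfile", bfile_prefix,
            "--chr", str(chrom),
            "--make-bed",
            "--out", f"{out_prefix}.chr{chrom}",
        ]
        for chrom in chromosomes
    ]
```

For example, `split_commands("1000G.EUR", "ldref/EUR")` yields 22 command lists, each of which can be run with `subprocess.run(cmd, check=True)`.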