IDmap Methods

To assess variation in nonsynonymous SNPs (nsSNPs) among major geographically structured populations, we used Wright's F-statistics to estimate the autosomal genome-wide fixation index (F_ST)¹ at both the SNP and gene levels.

Wright's F-statistics, introduced by Sewall Wright in 1951², are fundamental tools in population genetics for quantifying genetic variation within and among populations. The three statistics, F_IS, F_ST, and F_IT, describe the genetic structure of populations at different levels. F_IS, the inbreeding coefficient, measures the deviation from random mating within subpopulations, expressed as the reduction in heterozygosity of an individual relative to its subpopulation. F_ST, the fixation index, quantifies genetic differentiation among subpopulations, indicating the extent of allele frequency divergence due to genetic drift or selection. F_IT, the overall inbreeding coefficient, measures the reduction in heterozygosity of an individual relative to the total population, and therefore combines the effects of non-random mating within subpopulations and differentiation among them. Together, these metrics help to disentangle the effects of evolutionary forces such as mutation, migration, selection, and drift on the genetic structure of populations.
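For reference, the three statistics are linked by a standard identity. In the notation below, H_I denotes the observed heterozygosity of individuals (a symbol introduced here for illustration only), H_S the average expected heterozygosity within subpopulations, and H_T the expected heterozygosity of the total population:

$$F_{IS} = \frac{H_S - H_I}{H_S}, \qquad F_{IT} = \frac{H_T - H_I}{H_T}, \qquad (1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$$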

Figure: Diagram illustrating how the F_ST value is estimated. The F_ST statistic quantifies genetic differentiation between subpopulations within a larger metapopulation, measuring the proportion of genetic variation attributable to differences between subpopulations. Higher F_ST values indicate greater genetic differentiation and more limited gene flow between subpopulations. F_ST is calculated by comparing the average expected heterozygosity within subpopulations (H_S) with the expected heterozygosity of the metapopulation as a whole (H_T).

In this study, genetic differentiation for both autosomal protein-coding genes and nsSNP loci was measured using one of Wright's F-statistics, the fixation index F_ST¹,². F_ST quantifies the proportion of the total genetic variance in allele frequencies that is attributable to differences among subpopulations³ and is calculated as follows:

$$F_{ST} = \frac{H_T - H_S}{H_T}$$

where H_T is the expected heterozygosity of the total (pooled) population and H_S is the average expected heterozygosity within subpopulations.
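The heterozygosity comparison described above can be illustrated numerically. The following is a minimal Python sketch, assuming a single biallelic site, Hardy-Weinberg expected heterozygosity 2p(1 - p), and equal subpopulation weights by default. The function names and example frequencies are hypothetical and are not part of the PopGenome workflow used in this study.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_from_frequencies(freqs, sizes=None):
    """Return F_ST = (H_T - H_S) / H_T for one biallelic site.

    freqs: frequency of the same allele in each subpopulation.
    sizes: optional relative subpopulation weights (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]

    # H_S: weighted mean of the within-subpopulation heterozygosities.
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, freqs))

    # H_T: heterozygosity at the pooled (weighted mean) allele frequency.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = expected_heterozygosity(p_bar)

    if h_t == 0.0:
        return float("nan")  # monomorphic in the total population, F_ST undefined
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    print(fst_from_frequencies([0.5, 0.5]))  # identical frequencies: no differentiation
    print(fst_from_frequencies([0.2, 0.8]))  # moderate divergence, approximately 0.36
    print(fst_from_frequencies([0.0, 1.0]))  # alternate alleles fixed: complete differentiation
```

Identical allele frequencies give F_ST = 0, whereas fixation of alternate alleles in the two subpopulations gives F_ST = 1, matching the interpretation that higher values indicate stronger differentiation.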

Comparing the average heterozygosity within subpopulations with the heterozygosity of the metapopulation allowed us to estimate genetic differentiation among populations. The R package PopGenome⁴ was used to estimate both global F_ST and pairwise F_ST values between subpopulations. VCF files for the 22 autosomes from the 1000 Genomes Project Phase 3 were used as input. The genome-wide data were then split into genes based on genomic positions using the 'splitting.data()' function. Global F_ST was calculated using the 'F_ST.stats()' function with mode = "nucleotide". Pairwise F_ST values were extracted from 'nuc.F_ST.pairwise'. Site-specific F_ST values were obtained by setting 'subsites = "nonsyn", site.F_ST = TRUE'.
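The gene-level and pairwise steps can be sketched in the same spirit. The Python example below is an illustrative stand-in and not PopGenome's implementation: it assigns SNPs to genes by genomic position and then computes a per-gene F_ST for every pair of populations as a ratio of sums of (H_T - H_S) and H_T over the sites in each gene. The population labels, gene names, coordinates, and allele frequencies are all hypothetical, and the estimator that PopGenome applies in "nucleotide" mode may differ from this heterozygosity-based version.

```python
from itertools import combinations


def site_components(p1, p2):
    """Return (H_T - H_S, H_T) for one biallelic site in two equally weighted populations."""
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    return h_t - h_s, h_t


def assign_to_genes(positions, genes):
    """Map each gene name to the indices of the SNPs lying within [start, end]."""
    return {
        name: [i for i, pos in enumerate(positions) if start <= pos <= end]
        for name, (start, end) in genes.items()
    }


def pairwise_gene_fst(freqs, positions, genes):
    """Per-gene F_ST for every pair of populations (ratio of sums over sites)."""
    by_gene = assign_to_genes(positions, genes)
    result = {}
    for gene, indices in by_gene.items():
        for pop_a, pop_b in combinations(sorted(freqs), 2):
            numerator = denominator = 0.0
            for i in indices:
                num, den = site_components(freqs[pop_a][i], freqs[pop_b][i])
                numerator += num
                denominator += den
            # Genes with no variable site for this pair have an undefined F_ST.
            result[(gene, pop_a, pop_b)] = (
                numerator / denominator if denominator > 0 else float("nan")
            )
    return result


if __name__ == "__main__":
    # Hypothetical toy data: three SNPs, two genes, three populations.
    positions = [100, 150, 900]
    genes = {"GENE1": (50, 200), "GENE2": (800, 1000)}
    freqs = {
        "POP_A": [0.5, 0.2, 0.0],
        "POP_B": [0.5, 0.8, 1.0],
        "POP_C": [0.5, 0.2, 0.0],
    }
    for key, value in sorted(pairwise_gene_fst(freqs, positions, genes).items()):
        print(key, value)
```

Summing numerators and denominators across sites before dividing keeps nearly monomorphic sites from dominating the gene-level value. This is a design choice of the sketch rather than a statement about how the values in this study were computed.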

References:
1. Holsinger, K.E. & Weir, B.S. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 10, 639-650 (2009).
2. Wright, S. The genetical structure of populations. Ann Eugen 15, 323-354 (1951).
3. Hudson, R.R., Slatkin, M. & Maddison, W.P. Estimation of levels of gene flow from DNA sequence data. Genetics 132, 583-589 (1992).
4. Pfeifer, B., Wittelsbürger, U., Ramos-Onsins, S.E. & Lercher, M.J. PopGenome: an efficient Swiss army knife for population genomic analyses in R. Mol Biol Evol 31, 1929-1936 (2014).