Estimating optimal sample size for genetic differentiation
using analytical and bootstrap techniques
Juan F. Fernandez-M.
Department of Biology
University of Missouri, St. Louis
St. Louis, MO  63121-4499
email:  S997022@admiral.umsl.edu

Precision in the analysis of the distribution of genetic diversity, and in the estimation of gene flow rates among populations, is constrained by the sampling design, i.e., the number of populations and the number of individuals per population. Few attempts have been made to determine analytically a sample size large enough to yield statistically significant estimates of genetic differentiation or gene flow e.g., . Here, I use the methods proposed by  to estimate the optimal sample size per population when the total sample size is held constant, based on the premise of minimizing the variance of Gst from known genetic data. Allozyme data from five loci (AAT-2, AAT-3, DIA-1, DIA-2, and MNR-2) of Sassafras albidum (Lauraceae), sampled in 36 subpopulations from the Missouri Ozarks, were analyzed using the program HaploDiv (Petit 1995). Although the program is intended for haploid data, it yields results close to those of a diploid procedure if the species is outcrossed (Petit, pers. comm.). The original sample sizes were between 24 and 48 individuals per population.
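
As a reminder of the quantity whose variance is being minimized, the following is a minimal sketch of Nei's classical definition, Gst = (Ht - Hs) / Ht, computed from per-population allele frequencies. It is not the HaploDiv procedure and does not include any sample-size bias correction; the function names and the example frequencies are hypothetical and chosen only for illustration.

```python
def gene_diversity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def nei_gst(pop_freqs):
    """Nei's Gst = (Ht - Hs) / Ht with all populations weighted equally.

    pop_freqs: list of per-population allele-frequency lists,
    all listing the alleles in the same order.
    """
    n_pops = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # Hs: mean of the within-population gene diversities
    hs = sum(gene_diversity(f) for f in pop_freqs) / n_pops
    # Ht: gene diversity of the mean allele frequencies across populations
    mean_freqs = [sum(f[a] for f in pop_freqs) / n_pops for a in range(n_alleles)]
    ht = gene_diversity(mean_freqs)
    # A monomorphic locus (Ht = 0) carries no differentiation signal
    return (ht - hs) / ht if ht > 0 else 0.0


# Hypothetical two-allele locus scored in three populations (not data from this study)
example = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
print(round(nei_gst(example), 4))
```

Populations with identical frequencies give Gst = 0, and populations fixed for different alleles give Gst = 1, which brackets the values reported for the loci in this study.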

Only the loci that showed significant genetic differentiation (MNR-2, Gst = 0.3389; DIA-2, Gst = 0.0991) were useful in estimating the optimal sample size. The results indicate that 4 diploid individuals per population for the MNR-2 locus, and 9 for the DIA-2 locus, are enough to detect population differentiation at those loci.

For the weakly differentiated loci (AAT-2, AAT-3, and DIA-1), a resampling analysis was performed, simulating the 36 populations with a constant sample size per population (n = 10, 20, ..., 100), to estimate empirically when the bootstrap 95% confidence interval on Gst approached the observed value for the total data. The simulated samplings suggested: 1) that at least 20 individuals per population are required for the 95% confidence interval to overlap the true mean Gst; 2) that the estimated variance stabilizes when the sample size is greater than 30 individuals per population; and 3) that the estimator that most closely approaches the true value is the Gst estimator proposed by Pons and Chaouche (1995). For a locus-by-locus analysis, the least differentiated locus will determine the minimum sample requirements.
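
The resampling idea can be sketched roughly as follows. This is an illustrative simulation under stated assumptions, not the analysis actually run for this study: allele frequencies for 36 populations are invented, gene copies are drawn at random from those frequencies (2n copies for n diploid individuals), Gst is computed with Nei's uncorrected formula rather than the Pons and Chaouche (1995) estimator, and the interval is a simple percentile interval. All names (`gst_from_counts`, `resample_gst`, `pop_freqs`) are hypothetical.

```python
import random


def gst_from_counts(counts):
    """Nei's uncorrected Gst = (Ht - Hs) / Ht from per-population allele counts."""
    freqs = [[c / sum(pop) for c in pop] for pop in counts]
    n_pops = len(freqs)
    hs = sum(1.0 - sum(p * p for p in f) for f in freqs) / n_pops
    mean = [sum(f[a] for f in freqs) / n_pops for a in range(len(freqs[0]))]
    ht = 1.0 - sum(p * p for p in mean)
    return (ht - hs) / ht if ht > 0 else 0.0


def resample_gst(pop_freqs, n_individuals, n_reps, rng):
    """Percentile 95% interval of Gst for n diploid individuals per population."""
    n_alleles = len(pop_freqs[0])
    values = []
    for _ in range(n_reps):
        counts = []
        for f in pop_freqs:
            # Draw 2n gene copies from this population's allele frequencies
            draws = rng.choices(range(n_alleles), weights=f, k=2 * n_individuals)
            counts.append([draws.count(a) for a in range(n_alleles)])
        values.append(gst_from_counts(counts))
    values.sort()
    lower = values[int(0.025 * (n_reps - 1))]
    upper = values[int(0.975 * (n_reps - 1))]
    return lower, upper


rng = random.Random(1)
# Invented frequencies for a weakly differentiated two-allele locus in 36 populations
pop_freqs = [[p, 1.0 - p] for p in (rng.uniform(0.4, 0.6) for _ in range(36))]
for n in (10, 20, 50, 100):
    lower, upper = resample_gst(pop_freqs, n, 200, rng)
    print(n, round(lower, 4), round(upper, 4))
```

Because the uncorrected estimate is biased upward at small sample sizes, the simulated intervals shrink toward the underlying value as n grows, which is the kind of behaviour that motivates comparing estimators across sample sizes.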