3.2. Single locus analyses

3.2.1. Basic allele count information

Information relevant to individual loci is reported. Sample size and allele counts will differ among loci if not all individuals were typed at each locus. Untyped individuals are those for which one or both alleles were not reported. The alleles are listed in order of descending frequency (and count) in the left-hand column, and are sorted by name in the right-hand column. The number of distinct alleles, k, is also reported.

Example 3.2. Basic locus information sample output

I. Single Locus Analyses

1. Locus: A

1.1. Allele Counts [A]
Untyped individuals: 2
Sample Size (n): 45
Allele Count (2n): 90
Distinct alleles (k): 10

Counts ordered by frequency   | Counts ordered by name        
Name      Frequency (Count)   | Name      Frequency (Count)   
0201      0.21111   19        | 0101      0.13333   12        
0301      0.15556   14        | 0201      0.21111   19        
0101      0.13333   12        | 0210      0.10000   9         
2501      0.12222   11        | 0218      0.10000   9         
0210      0.10000   9         | 0301      0.15556   14        
0218      0.10000   9         | 2501      0.12222   11        
3204      0.08889   8         | 3204      0.08889   8         
6901      0.04444   4         | 6814      0.03333   3         
6814      0.03333   3         | 6901      0.04444   4         
7403      0.01111   1         | 7403      0.01111   1         
Total     1.00000   90        | Total     1.00000   90        
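The quantities in this table can be reproduced from raw genotype data with a few lines of Python. The following is a minimal sketch, not PyPop's own code: the input layout (one pair of allele names per individual, with None marking an unreported allele), the function name allele_summary and the toy data are all assumptions made for illustration, and allele names are compared as strings.

```python
from collections import Counter

# Hypothetical input: one (allele1, allele2) pair per individual;
# None marks an allele that was not reported.
genotypes = [("0201", "0301"), ("0101", "0201"), ("0201", None), ("2501", "0101")]

def allele_summary(genotypes):
    """Return the basic per-locus counts described in the text."""
    # An individual is untyped if one or both alleles are missing.
    typed = [g for g in genotypes if None not in g]
    counts = Counter(allele for g in typed for allele in g)
    total = sum(counts.values())  # 2n
    # Descending count, ties broken by allele name.
    by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    by_name = sorted(counts.items())
    return {
        "untyped": len(genotypes) - len(typed),
        "n": len(typed),
        "2n": total,
        "k": len(counts),
        "by_frequency": [(a, c / total, c) for a, c in by_freq],
        "by_name": [(a, c / total, c) for a, c in by_name],
    }

summary = allele_summary(genotypes)
```

Each entry of by_frequency and by_name is a (name, frequency, count) triple, matching the three columns of the sample output.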

When there is no information for a locus, a message indicating the lack of data is displayed.

Sample output:

4. Locus: DRA
 No data for this locus!

3.2.2. Chi-square test for deviation from Hardy-Weinberg proportions (HWP)

For each locus, the observed genotype counts are compared to those expected under Hardy-Weinberg proportions (HWP). A triangular matrix reports observed and expected genotype counts. If the matrix is wider than 80 characters, the output is split into several sections. Each cell contains the observed and expected counts for a given genotype in the format observed/expected.

Example 3.3. Sample output of Hardy-Weinberg genotype table

6.2. HardyWeinberg [DQA1]
Table of genotypes, format of each cell is: observed/expected.

0201 8/5.1
0301 4/4.0 1/0.8
0401 3/6.9 1/2.7 6/2.3
0501 8/9.9 5/3.8 5/6.7 6/4.8
      0201  0301  0401  0501
                             [Cols: 1 to 4]
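The expected counts in each cell follow from the allele frequencies: under HWP a homozygote for an allele with frequency p is expected n·p² times and a heterozygote for alleles with frequencies p and q is expected 2·n·p·q times. The sketch below recomputes them from the observed counts in the sample table. The dictionary layout and the function name expected_counts are illustrative assumptions, not part of PyPop.

```python
from itertools import combinations_with_replacement

# Observed genotype counts transcribed from the sample table (locus DQA1).
observed = {
    ("0201", "0201"): 8,
    ("0201", "0301"): 4, ("0301", "0301"): 1,
    ("0201", "0401"): 3, ("0301", "0401"): 1, ("0401", "0401"): 6,
    ("0201", "0501"): 8, ("0301", "0501"): 5, ("0401", "0501"): 5,
    ("0501", "0501"): 6,
}

def expected_counts(observed):
    """Expected genotype counts under Hardy-Weinberg proportions."""
    n = sum(observed.values())  # number of individuals
    allele_counts = {}
    for (a, b), count in observed.items():
        # A homozygote contributes two copies of the same allele.
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
    freq = {a: c / (2 * n) for a, c in allele_counts.items()}
    expected = {}
    for a, b in combinations_with_replacement(sorted(freq), 2):
        factor = 1 if a == b else 2
        expected[(a, b)] = factor * n * freq[a] * freq[b]
    return expected

expected = expected_counts(observed)
```

Rounded to one decimal place, the values in expected agree with the right-hand number of each cell in the table above.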

The values in this matrix are used to test hypotheses of deviation from HWP. The output also includes the chi-square statistic, the number of degrees of freedom and the associated p-value for several classes of genotypes, summarized in the following table:

Example 3.4. Sample output of HW genotype classes

                      Observed    Expected  Chi-square   DoF   p-value   
            Common         N/A         N/A        4.65     1  0.0310* 1  
  Lumped genotypes         N/A         N/A        1.17     1  0.2797  2
   Common + lumped         N/A         N/A        5.82     1  0.0158* 3
   All homozygotes          21       13.01        4.91     1  0.0268* 4
 All heterozygotes          26       33.99        1.88     1  0.1706  5
Common heterozygotes by allele                                        6
              0201          15       20.78        1.61        0.2050      
              0301          10       10.47        0.02        0.8850      
              0401           9       16.31        3.28        0.0703      
              0501          18       20.43        0.29        0.5915      

Common genotypes                                                      7
         0201:0201           8        5.11        1.63        0.2014      
         0201:0401           3        6.93        2.23        0.1358      
         0201:0501           8        9.89        0.36        0.5472      
         0401:0501           5        6.70        0.43        0.5109      
             Total          24       28.63

Explanation of each genotype class


Common. The result for goodness of fit to HWP using only the genotypes whose expected counts are at least lumpBelow (the common genotypes). In the output shown throughout this example, lumpBelow is equal to 5.

If the dataset contains no genotypes with expected counts greater than or equal to lumpBelow, then there are no common genotypes and the following message is reported:

   No common genotypes; chi-square cannot be calculated

The analysis of common genotypes may lead to a situation where there are fewer classes (genotypes) than allele frequencies to estimate. This means that the analysis cannot be performed (degrees of freedom < 1). In such a case the following message is reported, explaining why the analysis could not be performed:

   Too many parameters for chi-square test.

To obviate this as much as possible, only alleles which occur in common genotypes are used in the calculation of degrees of freedom.


Lumped genotypes. The result for goodness of fit to HWP for the pooled set of genotypes that individually have expected counts below lumpBelow.

The pooling procedure is designed to avoid carrying out the chi-square goodness of fit test in cases where there are low expected counts, which could lead to spurious rejection of HWP. However, in certain cases it may not be possible to carry out this pooling approach. The interpretation of results based on lumped genotypes will depend on the particular genotypes that are combined in this class.

If the sum of expected counts in the lumped class is less than lumpBelow, then the test for the lumped genotypes cannot be calculated and the following message is reported:

   The total number of expected genotypes is less than 5

This may be remedied by combining rare alleles and recalculating the overall chi-square value and degrees of freedom. (This would require appropriate manipulation of the data set by hand and is not a feature of PyPop.)


Common + lumped. The result for goodness of fit to HWP for both the common and the lumped genotypes.
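The first three rows of the summary table can be recomputed from the genotype table as a check on the definitions above. This is a minimal sketch under stated assumptions rather than PyPop's implementation: the variable and function names are illustrative, the p-value helper is valid only for one degree of freedom (the value reported in the sample output), and the degrees of freedom themselves are not derived here.

```python
import math
from itertools import combinations_with_replacement

LUMP_BELOW = 5  # value of lumpBelow used in the sample output

# Observed genotype counts transcribed from the sample table (locus DQA1).
observed = {
    ("0201", "0201"): 8,
    ("0201", "0301"): 4, ("0301", "0301"): 1,
    ("0201", "0401"): 3, ("0301", "0401"): 1, ("0401", "0401"): 6,
    ("0201", "0501"): 8, ("0301", "0501"): 5, ("0401", "0501"): 5,
    ("0501", "0501"): 6,
}
n = sum(observed.values())
allele_counts = {}
for (a, b), count in observed.items():
    allele_counts[a] = allele_counts.get(a, 0) + count
    allele_counts[b] = allele_counts.get(b, 0) + count

def expected(a, b):
    """Expected count of genotype a:b under HWP."""
    e = allele_counts[a] * allele_counts[b] / (2 * n)
    return e / 2 if a == b else e

def p_value_1df(chi2):
    """Upper-tail chi-square probability; this closed form holds only for 1 DoF."""
    return math.erfc(math.sqrt(chi2 / 2))

# Split all possible genotypes into common (expected >= lumpBelow) and lumped.
all_genotypes = combinations_with_replacement(sorted(allele_counts), 2)
common = {g: (observed.get(g, 0), expected(*g))
          for g in all_genotypes if expected(*g) >= LUMP_BELOW}
lumped_obs = n - sum(o for o, _ in common.values())
lumped_exp = n - sum(e for _, e in common.values())

chi_common = sum((o - e) ** 2 / e for o, e in common.values())
# The lumped test is only defined when the pooled expected count reaches lumpBelow.
chi_lumped = ((lumped_obs - lumped_exp) ** 2 / lumped_exp
              if lumped_exp >= LUMP_BELOW else None)
```

For the sample data, chi_common and chi_lumped round to the values in the Common and Lumped genotypes rows, and their sum rounds to the Common + lumped value.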


All homozygotes. The result for goodness of fit to HWP for the pooled set of homozygous genotypes.


All heterozygotes. The result for goodness of fit to HWP for the pooled set of heterozygous genotypes.
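The two pooled classes can be illustrated with the DQA1 sample data. In this sketch the allele counts and homozygote counts are read off the genotype table shown earlier; the variable names are assumptions made for illustration only.

```python
# Values transcribed or derived from the DQA1 sample genotype table.
n = 47  # typed individuals
allele_counts = {"0201": 31, "0301": 12, "0401": 21, "0501": 30}
observed_homozygotes = {"0201": 8, "0301": 1, "0401": 6, "0501": 6}

obs_hom = sum(observed_homozygotes.values())
obs_het = n - obs_hom

# Under HWP the expected number of homozygotes is n * sum(p_i^2).
exp_hom = sum(n * (c / (2 * n)) ** 2 for c in allele_counts.values())
exp_het = n - exp_hom

# Single-class chi-square contribution: (observed - expected)^2 / expected.
chi_hom = (obs_hom - exp_hom) ** 2 / exp_hom
chi_het = (obs_het - exp_het) ** 2 / exp_het
```

Rounded to two decimal places, these reproduce the All homozygotes and All heterozygotes rows of the sample output.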


Common heterozygotes. The common heterozygotes by allele section summarizes the observed and expected number of counts of all heterozygotes carrying a specific allele with expected value ≥ lumpBelow.
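For an allele with frequency p, the expected number of heterozygotes carrying it is 2·n·p·(1 − p), and the observed number is the allele count minus twice its homozygote count. A minimal sketch using the DQA1 sample data follows; the function name heterozygotes_by_allele and the data layout are illustrative assumptions, not PyPop's API.

```python
LUMP_BELOW = 5  # value of lumpBelow used in the sample output

# Values transcribed or derived from the DQA1 sample genotype table.
n = 47
allele_counts = {"0201": 31, "0301": 12, "0401": 21, "0501": 30}
observed_homozygotes = {"0201": 8, "0301": 1, "0401": 6, "0501": 6}

def heterozygotes_by_allele(allele):
    """Observed count, expected count and chi-square for heterozygotes carrying allele."""
    c = allele_counts[allele]
    p = c / (2 * n)
    # Copies of the allele that are not in homozygotes sit in heterozygotes.
    observed = c - 2 * observed_homozygotes[allele]
    expected = 2 * n * p * (1 - p)
    chi2 = (observed - expected) ** 2 / expected
    return observed, expected, chi2

# Keep only alleles whose expected heterozygote count reaches lumpBelow.
rows = {a: heterozygotes_by_allele(a) for a in allele_counts
        if heterozygotes_by_allele(a)[1] >= LUMP_BELOW}
```

With the sample data all four alleles pass the lumpBelow filter, and the rounded values in rows match the "Common heterozygotes by allele" block of the output.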


Common genotypes. The common genotypes by genotype section lists observed, expected, chi-square and p-values for all observed genotypes with expected values ≥ lumpBelow.

3.2.3. Exact test for deviation from HWP

If enabled in the configuration file, the exact test for deviations from HWP will be output. The exact test uses the method of Guo & Thompson (1992). The p-value provided describes how probable the observed set of genotypes is, with respect to a large sample of other genotypic configurations (conditioned on the same allele frequencies and 2n). p-values lower than 0.05 can be interpreted as evidence that the sample does not fit HWP. In addition, those individual genotypes deviating significantly (p-values < 0.05) from expected HWP as computed with the Chen and "diff" measures are reported.

There are two implementations of this test. The first is the gthwe implementation, originally due to Guo & Thompson and later modified by John Chen. The second is Arlequin's implementation (Schneider et al., 2000).

Example 3.5. Sample output for exact test using gthwe

6.3. Guo and Thompson HardyWeinberg output [DQA1]
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error: 0.0009431 
p-value (overall): 0.0537

Example 3.6. Sample output for exact test using the Arlequin implementation

6.4. Guo and Thompson HardyWeinberg output(Arlequin's implementation) [DQA1]
Observed heterozygosity: 0.553190
Expected heterozygosity: 0.763900
Std. deviation: 0.000630
Dememorization steps: 100172
p-value: 0.0518

Note that the output of the Arlequin implementation is slightly different, and the only directly comparable value between the two implementations is the p-value. These p-values may differ slightly, but should agree to within one significant figure.
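To make the idea of conditioning on the allele frequencies concrete, the sketch below estimates an exact-test p-value by plain Monte Carlo permutation: the 2n allele copies are shuffled and re-paired, and the p-value is the fraction of shuffled samples whose probability under HWP is no larger than that of the observed sample. This is deliberately simpler than the Markov chain sampler used by gthwe and Arlequin and is not their code; the function names, the number of samples and the seed are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

# Observed genotype counts transcribed from the sample table (locus DQA1).
observed = {
    ("0201", "0201"): 8,
    ("0201", "0301"): 4, ("0301", "0301"): 1,
    ("0201", "0401"): 3, ("0301", "0401"): 1, ("0401", "0401"): 6,
    ("0201", "0501"): 8, ("0301", "0501"): 5, ("0401", "0501"): 5,
    ("0501", "0501"): 6,
}
genotypes = [g for g, count in observed.items() for _ in range(count)]

def log_prob(genotypes):
    """Log probability of a genotype configuration given the allele counts.

    Terms that depend only on n and the allele counts are dropped, because
    they are identical for every shuffled sample.
    """
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    n_het = sum(c for (a, b), c in counts.items() if a != b)
    return n_het * math.log(2) - sum(math.lgamma(c + 1) for c in counts.values())

def monte_carlo_hwp(genotypes, n_samples=2000, seed=0):
    """Estimate the exact-test p-value by shuffling and re-pairing alleles."""
    rng = random.Random(seed)
    alleles = [allele for g in genotypes for allele in g]
    observed_lp = log_prob(genotypes)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(alleles)
        shuffled = list(zip(alleles[0::2], alleles[1::2]))
        # Small tolerance so that ties with the observed sample are counted.
        if log_prob(shuffled) <= observed_lp + 1e-9:
            hits += 1
    return hits / n_samples

p_value = monte_carlo_hwp(genotypes)
```

Because the estimate is stochastic, it will vary with the seed and the number of samples, in the same way that the two implementations above report slightly different p-values.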

3.2.4. The Ewens-Watterson homozygosity test of neutrality

For each locus, we implement the Ewens-Watterson homozygosity test of neutrality (Ewens 1972; Watterson 1978). We use the term observed homozygosity to denote the homozygosity statistic (F), computed as the sum of the squared allele frequencies. This value is compared to the expected homozygosity which is computed by simulation under neutrality/equilibrium expectations, for the same sample size (2n) and number of unique alleles (k). Note that the homozygosity F statistic, $F = \sum_{i=1}^{k} p_i^2$ (where $p_i$ is the frequency of the i-th allele), is often referred to as the expected homozygosity (with expectation referring to HWP) to distinguish it from the observed proportion of homozygotes. We avoid referring to the observed F statistic as the "observed expected homozygosity" (to simplify and hopefully avoid confusion) since the homozygosity test of neutrality is concerned with comparisons of observed results to expectations under neutrality. Both the observed statistic (based on the actual data) and expected statistic (based on simulations under neutrality) used in this test are computed as the sum of the squared allele frequencies.
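As a worked check, the observed F statistic for locus A can be computed directly from the allele counts listed in the basic allele count output earlier in this section. The variable names are illustrative only.

```python
# Allele counts for locus A, transcribed from the allele count sample output.
allele_counts = {
    "0201": 19, "0301": 14, "0101": 12, "2501": 11, "0210": 9,
    "0218": 9, "3204": 8, "6901": 4, "6814": 3, "7403": 1,
}
total = sum(allele_counts.values())  # 2n

# Observed homozygosity F: the sum of the squared allele frequencies.
observed_f = sum((count / total) ** 2 for count in allele_counts.values())
```

Rounded to four decimal places, observed_f equals the "Observed F" value of 0.1326 shown in the sample output for this test.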

The normalized deviate of the homozygosity (Fnd) is the difference between the observed homozygosity and expected homozygosity, divided by the square root of the variance of the expected homozygosity (also obtained by simulations; Salamon et al. (1999)). Significant negative normalized deviates imply observed homozygosity values lower than expected homozygosity, in the direction of balancing selection. Significant positive values are in the direction of directional selection.
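The normalized deviate is a simple standardization, sketched below with the rounded values from the Slatkin sample output. The function name normalized_deviate is an assumption; because the inputs are rounded to four decimal places, the result agrees with the reported Fnd only to about two decimal places.

```python
import math

def normalized_deviate(observed_f, expected_f, variance_f):
    """Fnd = (observed F - expected F) / sqrt(variance of expected F)."""
    return (observed_f - expected_f) / math.sqrt(variance_f)

# Rounded values taken from the sample output for locus A.
fnd = normalized_deviate(0.1326, 0.2654, 0.0083)
```

The negative sign indicates observed homozygosity below the neutral expectation, that is, in the direction of balancing selection.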

The p-value in the last row of the output is the probability of obtaining a homozygosity F statistic under neutral evolution that is less than or equal to the observed F statistic. It is computed based on the null distribution of homozygosity F values simulated under neutrality/equilibrium conditions for the same sample size (2n) and number of unique alleles (k). For a one-tailed test of the null hypothesis of neutrality against the alternative of balancing selection, p-values less than 0.05 are considered significant at the 0.05 level. For a two-tailed test against the alternative of either balancing or directional selection, p-values less than 0.025 or greater than 0.975 can be considered significant at the 0.05 level.
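The one-tailed and two-tailed decision rules described above can be written out explicitly. The helper below is a hypothetical illustration of those rules and is not part of PyPop's output or API.

```python
def interpret_p_value(p, alpha=0.05):
    """Apply the one-tailed and two-tailed rules to a homozygosity-test p-value.

    Returns a pair: whether the one-tailed test against balancing selection
    is significant, and the direction flagged by the two-tailed test (or None).
    """
    one_tailed_balancing = p < alpha
    if p < alpha / 2:
        two_tailed = "balancing"
    elif p > 1 - alpha / 2:
        two_tailed = "directional"
    else:
        two_tailed = None
    return one_tailed_balancing, two_tailed

# p-value reported for locus A in the Slatkin sample output.
result = interpret_p_value(0.0029)
```

A p-value such as 0.03 illustrates the difference between the two rules: it is significant in the one-tailed test but not in the two-tailed test at the same overall level.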

Example 3.7. Sample output of homozygosity test from Monte-Carlo implementation

The standard implementation of the test uses a Monte-Carlo implementation of the exact test written by Slatkin (Slatkin 1994; Slatkin 1996). A Markov-chain Monte Carlo method is used to obtain the null distribution of the homozygosity statistic under neutrality. The reported p-values are one-tailed (against the alternative of balancing selection), but can be interpreted for a two-tailed test by considering either extreme of the distribution (< 0.025 or > 0.975) at the 0.05 level.

1.6. Slatkin's implementation of EW homozygosity test of neutrality [A]
Observed F: 0.1326, Expected F: 0.2654, Variance in F: 0.0083
Normalized deviate of F (Fnd): -1.4603, p-value of F: 0.0029**


The version of this test based on tables of simulated percentiles of the Ewens-Watterson statistics is now disabled by default, and its use is deprecated in favor of the Slatkin exact test described above. However, some older PyPop runs may include this output, so it is documented here for completeness. This version differs from the Markov chain Monte Carlo version described above in that the data are simulated under neutrality to obtain the required statistics.

Example 3.8. Sample output of homozygosity test from simulation look-up tables (disabled by default)

1.4. Ewens-Watterson homozygosity test of neutrality [A]
Observed F: 0.1326, Expected F: 0.2651, Normalized deviate (Fnd): -1.4506
p-value range: 0.0000 < p <= 0.0100 *