Analytical Framework for Identifying and Differentiating Recent Hitchhiking and Severe Bottleneck Effects from Multi-Locus DNA Sequence Data

Ori Sargsyan

doi:10.1371/journal.pone.0037588

Abstract

Hitchhiking and severe bottleneck effects shape the dynamics of genetic diversity of a population by inducing homogenization at a single locus and at the genome-wide scale, respectively. As a result, identification and differentiation of the signatures of such events from DNA sequence data at a single locus is challenging. This paper develops an analytical framework for identifying and differentiating recent homogenization events at multiple neutral loci in low recombination regions. The dynamics of genetic diversity at a locus after a recent homogenization event is modeled according to the infinite-sites mutation model and the Wright-Fisher model of reproduction with constant population size. In this setting, I derive analytical expressions for the distribution, mean, and variance of the number of polymorphic sites in a random sample of DNA sequences from a locus affected by a recent homogenization event. Based on this framework, three likelihood-ratio based tests are presented for identifying and differentiating recent homogenization events at multiple loci. Lastly, I apply the framework to two data sets. First, I consider human DNA sequences from four non-coding loci on different chromosomes to infer the evolutionary history of modern human populations. The results suggest, in particular, that recent homogenization events at the loci are identifiable when the effective human population size is 50000 or greater but not when it is 10000, and that the estimated times of the recent homogenization events agree with the “Out of Africa” hypothesis. Second, I use HIV DNA sequences from HIV-1-infected patients to infer the times of HIV seroconversions. The estimates are contrasted with other estimates derived as the mid-time point between the last HIV-negative and first HIV-positive screening tests. The results show that significant discrepancies can exist between the estimates.

Citation: Sargsyan O (2012) Analytical Framework for Identifying and Differentiating Recent Hitchhiking and Severe Bottleneck Effects from Multi-Locus DNA Sequence Data. PLoS ONE 7(5): e37588. https://doi.org/10.1371/journal.pone.0037588

Editor: David Caramelli, University of Florence, Italy

Received: March 6, 2012; Accepted: April 21, 2012; Published: May 25, 2012

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: This work was supported by the United States Department of Energy through the LANL/LDRD Program and by the National Institutes of Health [grant number 5R01AI08752002]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The author has declared that no competing interests exist.

Introduction

Hitchhiking and severe bottleneck effects leave similar signatures on the population genome by resetting the molecular clock. However, their impacts at the genome level are on different scales. The hitchhiking effect has a local signature because recombination breaks down linkage disequilibrium between sites on the genome; consequently, the locus completely linked to a site under positive selection becomes homogeneous in the population [1]. In contrast, after a relatively quick recovery from a severe bottleneck, a population becomes homogeneous genome-wide. Identifying and differentiating such recent events at a single locus can be challenging because both processes leave similar signatures on the genetic diversity at a single locus. Thus, multi-locus DNA sequence data can be a powerful source for this purpose.

After a recent homogenization event at a neutral locus, the accumulated genetic diversity at the locus and the elapsed time are positively correlated, assuming a constant molecular clock. To quantify the relation between genetic diversity at a neutral locus in a low recombination region and the time elapsed since a recent homogenization event, Griffiths [2], Tajima [3], and Perlitz and Stephan [4] used the Wright-Fisher reproduction model with constant population size and the infinite-sites model [5] for the dynamics of genetic diversity at the locus. They derived analytical expressions for the expected number of polymorphic sites in a sample of DNA sequences from such a locus. Although this framework is computationally efficient for inferring the elapsed time, it is applicable only to a single locus.

Simulation-based inference methods have been developed for the same problem to include an exponential population growth model and full polymorphism data in samples of DNA sequences [6], [7]. Although such methods have the flexibility to include more complex evolutionary scenarios, they are more computationally intensive.

I consider the same setting as in [2], [3], and [4] to develop an analytical framework for identifying and differentiating recent homogenization events at multiple neutral loci in low recombination regions. The loci are considered to be evolving independently, for example, when the loci are on different chromosomes or on the same chromosome but far apart. I derive an analytical expression for the probability distribution of the number of polymorphic sites in a sample of DNA sequences. Based on this, I describe likelihood-ratio based tests for identifying and differentiating recent homogenization events at multiple loci. I apply the framework to two data sets. First, I use human DNA sequence data to infer the evolutionary history and origin of modern human populations. Second, I use HIV DNA sequences sampled from HIV-1-infected patients to infer the times of HIV seroconversions.

Methods

The population genetic model

Genetic diversity at a neutral locus, in a low recombination region, affected by a recent homogenization event is a result of mutations accumulated at the locus since the homogenization event. To model the dynamics of genetic diversity at such a locus after the homogenization event, this paper combines the infinite-sites mutation model and the Wright-Fisher reproduction model with constant population size. The parameters in the model represent the effective population size , the elapsed time since the last homogenization event, mutation rate per generation per sequence, and the (effective) generation time .

Variation in a sample of DNA sequences drawn from a population evolving according to this model can be described as a combination of genealogical and mutation processes. The genealogical process traces ancestral lineages of the sample back in time until the recent homogenization event at time and stops earlier if the most recent common ancestor of the sample is more recent than the homogenization event. When is large and the time in this process is measured in generations, the genealogical process can be approximated by a coalescent process derived from the standard coalescent [8]–[12]. Here is a scaled population size at the locus, determined by and the type of the chromosome on which the locus is located: is equal to for the case of a haploid population; for a diploid population with males and females, is equal to , , , or if the locus is on the Y chromosome, the X chromosome, an autosomal chromosome, or the mitochondrial DNA, respectively. In this process the ancestral lineages of the sample are traced until time and mutations are added on the branches of the genealogy as independent Poisson processes with rates equal to , . In the infinite-sites model, each mutation occurs at a nucleotide site that has not been mutated before.
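The genealogical and mutation processes described above can be illustrated with a small Monte Carlo sketch. It is not the analytical method of this paper, and its scaling is an assumption: time is taken in coalescent units, k lineages coalesce at rate k(k-1)/2, and mutations fall on each branch at rate theta/2, where `theta`, `t`, and the function names are hypothetical labels introduced only for this illustration.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals in [0, mean)."""
    count = 0
    clock = rng.expovariate(1.0)
    while clock < mean:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def simulate_polymorphic_sites(n, theta, t, rng):
    """One draw of the number of polymorphic sites in a sample of n sequences.

    Assumed scaling (not taken from the paper): time in coalescent units,
    k lineages coalesce at rate k*(k-1)/2, mutations occur at rate theta/2
    per branch. Lineages are traced back until time t (the homogenization
    event) or until the most recent common ancestor, whichever comes first.
    """
    k = n
    elapsed = 0.0
    total_length = 0.0
    while k > 1:
        wait = rng.expovariate(k * (k - 1) / 2.0)
        if elapsed + wait >= t:
            # The homogenization event truncates the genealogy here.
            total_length += k * (t - elapsed)
            break
        total_length += k * wait
        elapsed += wait
        k -= 1
    # Infinite-sites model: every mutation creates a new polymorphic site.
    return poisson_draw(theta / 2.0 * total_length, rng)


if __name__ == "__main__":
    rng = random.Random(2012)
    draws = [simulate_polymorphic_sites(15, 5.0, 0.1, rng) for _ in range(10000)]
    print("mean number of polymorphic sites:", sum(draws) / len(draws))
```

Setting `t` to `math.inf` recovers the standard coalescent, while a very small `t` leaves almost no time for coalescence, so the sample behaves like independent lineages.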

Results

Probability distribution of the number of polymorphic sites in a sample of DNA sequences

Under the model described above, the probability distribution of the number of polymorphic sites in a sample of DNA sequences can be represented as (1), where is the total length of the genealogy of sequences. This equation suggests that the probability can be expressed through the derivatives of the moment generating function of , defined as : (2)

Griffiths [2] derived an analytical expression for , but it cannot easily be used to derive expressions for the derivatives of . In the following lemma, I derive an expression for that makes it easy to derive analytical expressions for the derivatives of . Note that this expression also allows one to invert the moment generating function and to derive an analytic expression for the density function of . The latter result is presented in the lemma of Text S1.

Lemma 1. The moment generating function can be represented as (3), where the coefficients are determined by the following recurrence relations with initial conditions: (4)–(7).

The proof of Lemma 1 is provided in Text S1.

Note that the coefficients satisfy the identities (8)–(10). The identities are used in the proof of Lemma 1, and also for identifying numerical instability issues in the computation of based on (4)–(7) when decimal approximations are used instead of exact computations. The identities can be proved by combining mathematical induction with (4)–(7) (details not shown).

Expression (3) is used to derive expressions for the derivatives of , but for computational purposes they are modified to derive numerically stable expressions. The following procedure is applied to the expressions to solve the instability issue: for each , , the terms with factor are combined together and the common term is factored out. For example, a numerically stable expression for is given by (11). The numerical instability of expression (3) is illustrated in Figure 1.

Figure 1. Illustration of numerical instability of the expression (3).

The moment generating function is plotted for the same range of values of in red and blue dots by using the expressions (11) and (3), respectively. The numerical instability of the expression (3) is obvious because the values of must be between 0 and 1 for any positive .

https://doi.org/10.1371/journal.pone.0037588.g001

To derive a numerically stable expression for by using (2), expressions are first derived for the derivatives of with respect to by using Lemma 1 and the identity (12). After applying the numerical stabilization procedure (described above) to these expressions, a numerically stable expression for the probability distribution is obtained as (13). I have implemented formula (13) and all the other formulas in this paper in a program written in Mathematica [13]. The program is used to carry out all the calculations and visualizations in this paper. The program uses Mathematica's ability to do exact computations with fractions, thereby avoiding numerical instability issues. The program is available from the author on request.

Note that when is large, the following approximation holds for the probability distribution : (14). The right side of the approximation corresponds to the probability distribution of the number of polymorphic sites in a sample of sequences under a “simple” model where the ancestral lineages of the sample are traced until time without coalescence. Note that population size is not a factor in this model because the right side of the above approximation can be represented as
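The “simple” model can be sketched numerically. Without coalescence the genealogy is a star of independent lineages, so the number of polymorphic sites is Poisson with mean equal to the total branch length times the mutation rate. The parameterization below is an assumption made for illustration (time in generations, mutation rate per sequence per generation), and the function name and arguments are hypothetical; the exact form of (14) is not reproduced here.

```python
import math


def simple_model_pmf(k, n, mu, generations):
    """P(k polymorphic sites) when n lineages never coalesce.

    Assumed parameterization: each of the n lineages has length
    `generations`, and mutations arise at rate `mu` per sequence per
    generation, so the total count is Poisson with mean n * mu * generations.
    The population size does not appear anywhere in the calculation.
    """
    mean = n * mu * generations
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    # Work on the log scale to avoid overflow for large k.
    log_p = k * math.log(mean) - mean - math.lgamma(k + 1)
    return math.exp(log_p)


if __name__ == "__main__":
    for k in range(5):
        print(k, simple_model_pmf(k, n=15, mu=0.001, generations=100))
```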

Mean and variance of the number of polymorphic sites in a sample of DNA sequences

In previous studies [2]–[4], expressions have been derived for the mean number of polymorphic sites in a sample of DNA sequences from a locus affected by a recent homogenization event. An expression for the variance is also derived in [4], but this expression is implicit because it includes integral expressions. Using an approach similar to that of the previous section, I derive numerically stable expressions for computing the mean and variance of the number of polymorphic sites in a sample of DNA sequences from such a locus. The conditional probability distribution of when is given is Poisson with a mean of . The mean and variance of can therefore be expressed through the first and second moments of . Expressions for the first and second moments of are derived by taking the first and second derivatives of (3) with respect to and evaluating them at . After applying the numerical stabilization procedure (described in the previous section) to these expressions, the first and second moments of can be computed using the following formulas: where is defined as where is
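Because the number of polymorphic sites is conditionally Poisson given the total length of the genealogy, its mean and variance follow from the moments of that length through the laws of total expectation and total variance. The sketch below shows only this generic step; the argument names are hypothetical, `rate` stands for the Poisson rate per unit of genealogy length, and the stabilized moment formulas of this paper are not reproduced.

```python
def polymorphic_site_moments(rate, mean_length, var_length):
    """Mean and variance of a Poisson count with a random mean.

    If S | L ~ Poisson(rate * L), then
        E[S]   = rate * E[L]
        Var[S] = rate * E[L] + rate**2 * Var[L].
    """
    mean_sites = rate * mean_length
    var_sites = rate * mean_length + rate ** 2 * var_length
    return mean_sites, var_sites


if __name__ == "__main__":
    # Illustrative check with two lineages under the standard coalescent,
    # assuming rate = theta / 2, E[L] = 2 and Var[L] = 4 in coalescent units.
    theta = 3.0
    print(polymorphic_site_moments(theta / 2.0, 2.0, 4.0))
```

For two lineages this gives a mean of theta and a variance of theta + theta squared, the classical result for the number of segregating sites in a sample of size two.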

Three tests for identifying and differentiating recent homogenization events at multiple loci

Using the probabilistic framework developed above, three likelihood-ratio based tests are considered in this section for identifying and differentiating recent homogenization events at independently evolving multiple neutral loci in low recombination regions.

Test I.

To identify a recent homogenization event at a locus based on the number of polymorphic sites in a sample of DNA sequences, the hypothesis versus is considered. The null hypothesis represents a case in which the ancestral population was evolving according to the Wright-Fisher model with constant population size. The null hypothesis can be tested by defining minus twice the log of the likelihood-ratio statistic as and comparing it with a distribution with d.f.
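The mechanics of such a likelihood-ratio test can be sketched as follows. The reference distribution and its degrees of freedom did not survive in the text above, so the sketch assumes a chi-square distribution with 1 d.f.; that choice, the function names, and the example log-likelihood values are assumptions for illustration only.

```python
import math


def likelihood_ratio_statistic(loglik_null, loglik_alternative):
    """Minus twice the log of the likelihood ratio (floored at zero)."""
    return max(0.0, -2.0 * (loglik_null - loglik_alternative))


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 d.f.

    Uses P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods under the null and alternative.
    statistic = likelihood_ratio_statistic(-12.4, -9.1)
    print("statistic:", statistic, "p-value:", chi2_sf_1df(statistic))
```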

Note that corresponds to the probability distribution of the number of polymorphic sites in a sample of DNA sequences when the genealogy of the sample is modeled by the standard coalescent and the infinite-sites model is assumed for mutations. Tavaré [14] derived an expression for , which also follows from (13) by taking to : (15)
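For the standard-coalescent limit, the distribution of the number of polymorphic sites can also be computed numerically by a route different from Tavaré's closed form: while k lineages remain, the number of mutations before the next coalescence is geometric, and the total is a sum of independent geometric counts. The sketch below uses that convolution. The scaling (mutation rate theta/2 per lineage in coalescent units) and the function name are assumptions introduced for illustration.

```python
def standard_coalescent_pmf(n, theta, max_sites):
    """P(S = j) for j = 0..max_sites under the standard coalescent.

    While k lineages remain, the next event is a coalescence with
    probability p = (k - 1) / (k - 1 + theta), so the number of mutations
    in that epoch is geometric with success probability p. The total
    number of polymorphic sites is the sum over k = 2..n.
    """
    pmf = [1.0] + [0.0] * max_sites
    for k in range(2, n + 1):
        p = (k - 1) / (k - 1 + theta)
        geometric = [p * (1.0 - p) ** j for j in range(max_sites + 1)]
        # Convolution truncated at max_sites; entries up to max_sites are exact.
        pmf = [
            sum(pmf[i] * geometric[j - i] for i in range(j + 1))
            for j in range(max_sites + 1)
        ]
    return pmf


if __name__ == "__main__":
    probabilities = standard_coalescent_pmf(n=15, theta=5.0, max_sites=60)
    print("P(S = 25) =", probabilities[25])
```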

Test II.

Suppose we know, for example from other studies, that a recent homogenization event occurred at time and we want to determine whether this event affected a locus of interest. Symbolically, the following hypothesis can be stated. The null hypothesis can be tested by comparing minus twice the log of the likelihood-ratio statistic with a distribution with d.f., where .

Based on this approximation, for each a confidence interval of is determined by solving the equation with respect to ; is the critical value of the distribution. Note that when is 0, then and is the solution of the above equation. A confidence interval of is . One can use a similar approach to estimate a confidence interval for when inferring the elapsed time of a recent severe bottleneck event based on DNA sequence data from independently evolving multiple neutral loci. Another approach for this case is described below.

Test III.

For a case of independent neutral loci, let the loci be labeled from 1 to , and be the number of polymorphic sites in a sample of sequences at locus , . To test if the multiple loci are affected by the same recent homogenization event, the following hypothesis is considered: where is the time elapsed since a recent homogenization event at locus . The null hypothesis can be tested by comparing the statistic with a distribution with d.f., where is the scaled population size at locus ; is the scaled mutation rate at locus , and is the mutation rate per generation per sequence at locus .

Inferring the time of a recent severe bottleneck event based on polymorphism data at multiple loci

The following steps can be taken to infer the time of a recent severe bottleneck event from DNA sequence data at independently evolving multiple neutral loci in low recombination regions. The likelihood function for such a data set can be computed as a product of likelihood functions from each locus by using formula (13). Thus, in the case of independent loci, and polymorphic sites in a sample of sequences at locus , , the maximum likelihood estimator of can be derived by solving the equation with respect to , where is the derivative of with respect to . It is assumed that the scaled mutation rate and the scaled population size at locus are known.
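The multi-locus estimation step has a closed form when the per-locus likelihood is replaced by the “simple” no-coalescence approximation, which is easier to show than formula (13). The sketch below makes that substitution explicitly, so it is a stand-in and not the estimator of this paper; it assumes time in generations and mutation rates per sequence per generation, and all names are hypothetical.

```python
import math


def simple_model_loglik(t, site_counts, sample_sizes, mutation_rates):
    """Joint log-likelihood of t over independent loci (simple model).

    Locus i contributes a Poisson log-probability with mean
    sample_sizes[i] * mutation_rates[i] * t.
    """
    total = 0.0
    for k, n, mu in zip(site_counts, sample_sizes, mutation_rates):
        mean = n * mu * t
        total += k * math.log(mean) - mean - math.lgamma(k + 1)
    return total


def simple_model_mle(site_counts, sample_sizes, mutation_rates):
    """Root of the score equation: total sites divided by total exposure."""
    exposure = sum(n * mu for n, mu in zip(sample_sizes, mutation_rates))
    return sum(site_counts) / exposure


if __name__ == "__main__":
    counts, sizes, rates = [3, 5], [10, 10], [0.001, 0.003]
    t_hat = simple_model_mle(counts, sizes, rates)
    print("estimated elapsed time (generations):", t_hat)
```

With the full likelihood the score equation generally has no closed-form root and would be solved numerically, but the structure (a sum of per-locus terms) is the same.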

To estimate a % confidence interval of , the Central Limit Theorem based approximation can be used when the following conditions hold: (1) the number of loci is large; (2) the loci are on the same type of chromosome (as a result ); (3) samples of DNA sequences from each locus have the same size (); (4) the lengths of the sequences from the loci are equal . Thus, the % confidence interval of can be computed as where is the critical value from the standard normal distribution; is the observed Fisher information, which can be computed using the formula. For evaluating the above expression, numerically stable expressions for the first and second derivatives of with respect to can be derived by using (13) and the numerical stabilization procedure.
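A normal-approximation interval of this kind can be sketched generically: the observed Fisher information is minus the second derivative of the log-likelihood at the maximum, and the half-width of the interval is the normal critical value divided by its square root. The sketch below approximates the second derivative by a central finite difference instead of the analytical derivatives of (13); the function names and the example log-likelihood are assumptions for illustration.

```python
import math


def observed_information(loglik, t_hat, relative_step=1e-4):
    """Minus the second derivative of loglik at t_hat (central difference)."""
    h = relative_step * max(1.0, abs(t_hat))
    second = (loglik(t_hat + h) - 2.0 * loglik(t_hat) + loglik(t_hat - h)) / h ** 2
    return -second


def wald_interval(loglik, t_hat, z=1.959963984540054):
    """Interval t_hat +/- z / sqrt(observed information); z defaults to 95%."""
    information = observed_information(loglik, t_hat)
    if information <= 0.0:
        raise ValueError("log-likelihood is not concave at the estimate")
    half_width = z / math.sqrt(information)
    return t_hat - half_width, t_hat + half_width


if __name__ == "__main__":
    # Hypothetical Poisson log-likelihood: 8 sites, total exposure 0.04.
    def example_loglik(t):
        return 8.0 * math.log(0.04 * t) - 0.04 * t

    print(wald_interval(example_loglik, 200.0))
```

The lower limit of such an interval can be negative when the information is small, which is one reason the text restricts this approximation to data from many loci.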

Application of the method for inferring recent homogenization events from human genome

Anthropological and archeological data strongly support the “Out of Africa” hypothesis for the origin and evolutionary history of modern humans [15]–[19]. The hypothesis posits two major events: Homo sapiens (ancestors of modern humans) emerged in Africa between 150,000 and 200,000 years ago (kya) and dispersed to other regions of the world sometime before 50,000 years before present (yr B.P.). Studies based on mitochondrial and Y-chromosome data support this hypothesis [20]–[28]. However, studies based on DNA sequence data from coding and non-coding loci on autosomal and X chromosomes show that the most recent common ancestors of the β-globin gene [29], the X chromosome gene for the pyruvate dehydrogenase E1 α-subunit [30], and the non-coding loci 22q11.2 [31], 17q23 [32], Xq13.3 [33] are much older than 200,000 yr B.P. These inferences are based on the framework of the standard coalescent, in which the effective human population size and the mutation rate per nucleotide site per generation are considered to be 10000 [34] and [35]–[37], respectively.

In contrast to this approach, I use the framework developed in this paper to analyze some of the data sets used in the studies mentioned above. I apply the framework to DNA sequences from four non-coding loci (22q11.2, 17q23, Xq13.3, YAP) in low recombination regions on chromosomes 22, 17, X, and Y to identify and differentiate recent homogenization events associated with the “Out of Africa” hypothesis. The data sets are published in [25], [31]–[33], respectively, and their summary is in Table 1. First, I consider commonly accepted estimates for values of the parameters in the model: the effective human population size to be 10000; the mutation rate per nucleotide site per generation to be ; the human (effective) generation time to be 20 years. The mutation rate per generation per sequence at each locus is computed as , where is the length of the DNA sequences at the locus. When Test I is applied with this set of parameter values to each of the four data sets, the power to detect a recent homogenization event at any of the loci is very weak (the p-values are close to 1, data not shown). In this case the maximum likelihood estimates for the elapsed times of recent homogenization events at the loci are much older than 200,000 yr B.P. (Table 2). Thus, these estimates disagree with the “Out of Africa” hypothesis.

Table 1. Summary of the DNA sequence data sets from loci 22q11.2, 17q23, Xq13.3, and YAP.

https://doi.org/10.1371/journal.pone.0037588.t001

Table 2. Estimates for the elapsed times since a recent homogenization event for each of the four loci.

https://doi.org/10.1371/journal.pone.0037588.t002

To explore another possibility, I also consider the human effective population size to be 50000 based on the following observations: (1) Some studies [38]–[40] estimated the effective human population size to be a few times larger than 10000. (2) The maxima of the likelihood functions favor the case over the case for all the data sets. Thus, I consider the values of and to be the same as above but . After applying Test I to the data sets from each locus, the likelihood-ratio tests rejected the null hypotheses at significance level; the results are in Table 3. The results suggest that the standard coalescent framework is inadequate to describe the data sets for this set of parameter values, and that recent homogenization events have affected the four loci. The maximum likelihood estimates (see Table 2) of the elapsed times agree with the times of the two major events.

Table 3. The values of minus twice the log of the likelihood-ratio statistic for the data sets from each of the four loci.

https://doi.org/10.1371/journal.pone.0037588.t003

In this case, the likelihood functions of the data sets would not change dramatically as gets larger than 50000 because they behave in large regime. The maximum likelihood estimates of , when , are in Table 2. These estimates show that the estimated elapsed times would not change dramatically if the human effective population size were taken to be greater than 50,000.

For this set of parameter values, I use Test III to differentiate the recent homogenization events at the four loci. The results of the tests are in Table 4. These results suggest that the four loci have not been affected by the same homogenization event: the p-values are less than 0.05 for the data sets from African and Non-African populations. The locus Xq13.3 is significantly younger than the locus 17q23, in particular for the Non-African population, which suggests that the locus Xq13.3 has been affected by recent positive selection or by a recent bottleneck in the Non-African female population. Using Tajima's test [41] and Fu and Li's tests [42], Zhao et al. [31] also observed that the diversity at locus Xq13.3 significantly deviates from the Wright-Fisher neutral model.

Table 4. The values of minus twice the log of the likelihood-ratio statistic for Test III.

https://doi.org/10.1371/journal.pone.0037588.t004

Application of the method for inferring the times of HIV seroconversions in HIV-1-infected patients

Usually, after a few weeks of HIV infection, plasma viraemia in an infected patient declines rapidly as a result of a primary immune response, which coincides with HIV seroconversion [43], [44]. In particular, the HIV envelope gene at this time point shows no diversity [45]. To examine the utility of the framework developed in this paper, I use DNA sequence data from HIV-1 envelope genes published in [46] to infer the times of HIV seroconversions in nine HIV-1-infected patients. The sequences are sampled from the patients at the first HIV-positive screening tests. The sequences are 650 nucleotides long; a summary of the data is in Table 5.

Table 5. Summary of Shankarappa et al.'s [46] data.

https://doi.org/10.1371/journal.pone.0037588.t005

To check the consistency of the data sets with the infinite-sites mutation model and with the absence of intra-locus recombination, the following conditions are tested: (a) Each polymorphic site is a result of a single mutation event, that is, only two nucleotide states occur at each polymorphic site. (b) All pairs of sites in the sample of DNA sequences pass the four-gamete test [47]–[49]. Seven of the nine data sets (all except those from patients 2 and 5) satisfy conditions (a) and (b). The data sets from patients 2 and 5 are inconsistent with conditions (a) and (b), respectively. However, the two data sets are not excluded from the analysis because the inconsistencies in these data sets are a result of two mutations and some recombination events, respectively.
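Conditions (a) and (b) can be checked mechanically on an alignment. The following is a minimal sketch, assuming the sequences are aligned strings of equal length with insertions and deletions already removed; the function names are hypothetical and this is not the program used in this paper.

```python
from itertools import combinations


def polymorphic_columns(sequences):
    """Return the alignment columns in which more than one state occurs."""
    length = len(sequences[0])
    columns = [tuple(seq[i] for seq in sequences) for i in range(length)]
    return [column for column in columns if len(set(column)) > 1]


def all_sites_biallelic(sequences):
    """Condition (a): every polymorphic site shows exactly two states."""
    return all(len(set(column)) == 2 for column in polymorphic_columns(sequences))


def passes_four_gamete_test(sequences):
    """Condition (b): no pair of biallelic sites shows all four gametes.

    Under the infinite-sites model without recombination, at most three of
    the four possible two-site haplotypes can appear in a sample.
    """
    biallelic = [c for c in polymorphic_columns(sequences) if len(set(c)) == 2]
    for first, second in combinations(biallelic, 2):
        if len(set(zip(first, second))) == 4:
            return False
    return True


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "TCGT"]
    print(all_sites_biallelic(sample), passes_four_gamete_test(sample))
```

Triallelic sites are skipped by the pairwise check and reported only by the biallelic check, which mirrors the separation between conditions (a) and (b).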

I consider the following values for the parameters in the model: population size equal to the viral load at the sampling time point, mutation rate per nucleotide site per generation equal to , and the number of nucleotides at the locus equal to 650. All insertions and deletions are excluded from the analysis. For this set of parameter values, I applied Test I to the data from each patient; the null hypotheses are rejected at the 5% significance level in favor of recent homogenization events. For each patient the maximum likelihood and 95% confidence interval estimates of (in coalescent units) are in Table 6. These estimates can be converted to years by using the equation , in which the effective HIV generation time is considered to be equal to 1 or 2 days [50]–[52].

Table 6. The estimates of the seroconversion times (in coalescent units) in the nine patients.

https://doi.org/10.1371/journal.pone.0037588.t006

These estimates are contrasted with the estimates provided by Shankarappa et al. [46], who estimated the time of HIV seroconversion for each of the patients as the mid-time point between the last HIV-negative and first HIV-positive screening tests. The comparison between the estimates (see Figure 2) shows that for some of the data sets the two estimates disagree substantially.

Figure 2. Comparison of two estimates of the seroconversion time for each of the nine patients.

The effective generation times in (A) and (B) are taken to be 1 day and 2 days, respectively. Maximum likelihood and 95% confidence interval estimates of the time of HIV seroconversion in years since the first HIV-positive screening test are shown as filled dots and error bars, respectively. Empty circles represent the mid-point estimates of the seroconversion times [46].

https://doi.org/10.1371/journal.pone.0037588.g002

The observed disagreements are robust with respect to (data not shown): when is larger than the viral load, the likelihood functions do not change because of large regime. I have also applied the above estimation method by considering equal to one tenth of the viral load. The results show that the observed discrepancies also hold for this case. Note that the viral load represents approximately of the total amount of the virus in an HIV-infected person since there is a total of 5 liters of blood in the body of an average adult.

Discussion

The analytical method developed in this paper is a trade-off between computational efficiency and complexity of the underlying evolutionary model. Using multi-locus DNA sequence data, the method allows identification and differentiation of the signatures of recent severe bottleneck and hitchhiking effects in a computationally efficient way. However, the method uses the number of polymorphic sites instead of full polymorphism data in samples of DNA sequences, and it is constrained by the assumptions of the constant-size Wright-Fisher reproduction model and the infinite-sites model. In contrast, coalescent-based simulation methods can include full polymorphism data [7], various demographic scenarios [6], and finite-sites mutation models [53], but at a higher computational cost. Before such computationally more expensive methods are used, the method presented here could be a helpful guide for analyzing multi-locus DNA sequence data.

To illustrate the behavior of the likelihood function for small and large , I used the program to plot the likelihood function of (see Figure 3) for a sample of 15 DNA sequences with 25 polymorphic sites when and . The behavior of the likelihood function can be explained based on the process that traces ancestral lineages of the sample back in time. When tracing lineages back in time, coalescent and mutation events occur one at a time with rates and , respectively. Thus, when is large, mutation events occur more often than coalescent events back in time, so for a given number of polymorphic sites the recent homogenization event is more likely to be before the most recent common ancestor of the sample. This also explains the approximation (14). In contrast, when is small, the sample is more likely to have the most recent common ancestor before the recent homogenization event. Similarly, as gets larger the sample is more likely to have the most recent common ancestor before the homogenization event, hence the likelihood function has a limit (see (15)).

Figure 3. The likelihood function of the elapsed time for two values of .

For a sample of 15 DNA sequences with 25 polymorphic sites at a locus, the likelihood function of the elapsed time is plotted for the values of and in red and blue, respectively.

https://doi.org/10.1371/journal.pone.0037588.g003

The computational efficiency of the method makes it practical to explore various values of the model parameters and to assess their impact on the inference. The application of the method to the human data shows that the inferences about the evolutionary history of modern human populations are dramatically different when the effective human population size is equal to 10000 and when it is 50000 or greater. The HIV data analysis shows that the observed discrepancies between estimates of HIV seroconversion times in the patients can be a result of the assumption that the effective HIV generation time is the same for all the patients. To better assess this assumption, frequent HIV screening tests could be used to determine the times of HIV seroconversion in HIV patients, and this method could then be applied to explore the variability of effective HIV-1 generation times between HIV patients.

As the analysis of the human DNA sequence data shows, the method developed in this paper does not have enough power to give an estimate for the effective human population size. Although the method suggests very large values of as maximum likelihood estimates for some of the human data sets when and are considered unknown, this does not mean that the “simple” model () is an appropriate model for explaining the data sets, because the site frequency spectrum of a sample of DNA sequences under the simple model consists only of singletons, and Zhao et al. [31] observed an excess of singletons and doubletons for all the data sets. Note that under the model considered in this paper the expected site frequency spectrum in samples of DNA sequences changes continuously with respect to , for example when the effective population size changes continuously. The two extreme ends of the expected site frequency spectrum under this model are described by the standard coalescent and by the “simple” model, for small and very large values of , respectively. Under the standard coalescent the expected site frequency spectrum covers a wide range of allele frequencies. Thus, as () increases the expected number of low-frequency alleles increases.

Supporting Information

Text S1.

https://doi.org/10.1371/journal.pone.0037588.s001

(PDF)

Acknowledgments

I would like to thank the anonymous reviewer #1 for his/her helpful suggestions. This work is dedicated to Professor Simon Tavaré on the occasion of his 60th birthday.

Author Contributions

Conceived and designed the experiments: OS. Performed the experiments: OS. Analyzed the data: OS. Contributed reagents/materials/analysis tools: OS. Wrote the paper: OS.

References

1. Smith JM, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genet Res 23: 23–35.
- View Article
- Google Scholar
2. Griffiths R (1981) Transient distribution of the number of segregating sites in a neutral infinite-sites model with no recombination. J Appl Prob 18: 42–51.
- View Article
- Google Scholar
3. Tajima F (1989) The effect of change in population size on DNA polymorphism. Genetics 123: 597–601.
- View Article
- Google Scholar
4. Perlitz M, Stephan W (1997) The mean and variance of the number of segregating sites since the last hitchhiking event. J Math Biol 36: 1–23.
- View Article
- Google Scholar
5. Watterson GA (1975) On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7: 256–76.
- View Article
- Google Scholar
6. Jakobsson M, Hagenblad J, Tavaré S, Säll T, Halldén C, et al. (2006) A unique recent origin of the allotetraploid species Arabidopsis suecica: Evidence from nuclear DNA markers. Mol Biol Evol 23: 1217–31.
- View Article
- Google Scholar
7. Galtier N, Depaulis F, Barton NH (2000) Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155: 981–7.
8. Kingman JFC (1982) On the genealogy of large populations. Journal of Applied Probability 19A: 27–43.
9. Kingman JFC (1982) The coalescent. Stochastic Processes and their Applications 13: 235–48.
10. Kingman JFC (1982) Exchangeability and the evolution of large populations. In: Koch G, Spizzichino F, editors. Exchangeability in Probability and Statistics, North Holland Publishing Company. pp. 97–112.
11. Hudson R (1983) Testing the constant-rate neutral allele model with protein sequence data. Evolution 37: 203–17.
12. Tajima F (1983) Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–60.
13. Wolfram Research, Inc. (2007) Mathematica. Version 6.0, Champaign, IL.
14. Tavaré S (1984) Line-of-descent and genealogical processes, and their applications in population genetics models. Theor Popul Biol 26: 119–64.
15. Stringer C (2002) Modern human origins: progress and prospects. Phil Trans R Soc Lond B 357: 563–79.
16. Mellars P (2004) Neanderthals and the modern human colonization of Europe. Nature 432: 461–5.
17. Forster P (2004) Ice ages and the mitochondrial DNA chronology of human dispersals: a review. Phil Trans R Soc Lond B 359: 255–64.
18. Forster P, Matsumura S (2005) Did early humans go north or south? Science 308: 965–6.
19. Mellars P (2006) A new radiocarbon revolution and the dispersal of modern humans in Eurasia. Nature 439: 931–5.
20. Cann R, Stoneking M, Wilson A (1987) Mitochondrial DNA and human evolution. Nature 325: 31–6.
21. Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408: 708–13.
22. Maca-Meyer N, Gonzalez A, Larruga J, Flores C, Cabrera V (2001) Major genomic mitochondrial lineages delineate early human expansions. BMC Genet 2: 13.
23. Vigilant L, Stoneking M, Harpending H, Hawkes K, Wilson A (1991) African populations and the evolution of human mitochondrial DNA. Science 253: 1503–7.
24. Pakendorf B, Stoneking M (2005) Mitochondrial DNA and human evolution. Annu Rev Genomics Hum Genet 6: 165–83.
25. Hammer MF (1995) A recent common ancestry for Human Y chromosomes. Nature 378: 376–8.
26. Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16: 1791–8.
27. Thomson R, Pritchard JK, Shen P, Oefner PJ, Feldman MW (2000) Recent common ancestry of human Y chromosomes: Evidence from DNA sequence data. Proc Natl Acad Sci USA 97: 7360–5.
28. Jobling M, Tyler-Smith C (2003) The human Y chromosome: an evolutionary marker comes of age. Nature Rev Genet 4: 598–612.
29. Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, et al. (1997) Archaic African and Asian lineages in the genetic ancestry of modern humans. Am J Hum Genet 60: 772–89.
30. Harris EE, Hey J (1999) X chromosome evidence for ancient human histories. Proc Natl Acad Sci USA 96: 3320–4.
31. Zhao Z, Jin L, Fu YX, Ramsay M, Jenkins T, et al. (2000) Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. PNAS 97: 11354–8.
32. Rieder MJ, Taylor SL, Clark AG, Nickerson DA (1999) Sequence variation in the human angiotensin converting enzyme. Nat Genet 22: 59–62.
33. Kaessmann H, Heißig F, von Haeseler A, Pääbo S (1999) DNA sequence variation in a non-coding region of low recombination on the human X chromosome. Nat Genet 22: 78–81.
34. Takahata N (1993) Allelic genealogy and human evolution. Mol Biol Evol 10: 2–22.
35. Kondrashov AS, Crow JF (1993) A molecular approach to estimating the human deleterious mutation rate. Hum Mutat 2: 229–34.
36. Drake JW, Charlesworth B, Charlesworth D, Crow JF (1998) Rates of spontaneous mutation. Genetics 148: 1667–86.
37. Nachman MW, Crowell SL (2000) Estimate of the mutation rate per nucleotide in humans. Genetics 156: 297–304.
38. Ayala FJ (1996) HLA sequence polymorphism and the origin of humans. Science 274: 1554.
39. Ayala FJ (1995) The myth of Eve: molecular biology and human origins. Science 270: 1930–6.
40. Sherry ST, Harpending HC, Batzer MA, Stoneking M (1997) Alu evolution in human populations: using the coalescent to estimate effective population size. Genetics 147: 1977–82.
41. Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–95.
42. Fu YX, Li WH (1993) Statistical tests of neutrality of mutations. Genetics 133: 693–709.
43. Weber J (2001) The pathogenesis of HIV-1 infection. Br Med Bull 58: 61–72.
44. Ariyoshi K, Harwood E, Chiengsong-Popov R, Weber J (1992) Is clearance of HIV-1 viraemia at seroconversion mediated by neutralising antibodies? The Lancet 340: 1257–8.
45. Holmes EC, Zhang L, Simmonds P, Ludlam C, et al. (1992) Convergent and divergent sequence evolution in the surface envelope glycoprotein of human immunodeficiency virus type 1 within a single infected patient. Proc Natl Acad Sci USA 89: 4835–9.
46. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, et al. (1999) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73: 10489–502.
47. Buneman P (1971) The recovery of trees from measures of dissimilarity. In: Hodson FR, Kendall D, Tautu P, editors. Mathematics in the Archaeological and Historical Sciences, Edinburgh University Press. pp. 387–95.
48. Hudson R, Kaplan N (1985) Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–64.
49. Gusfield D (1991) Efficient algorithms for inferring evolutionary trees. Networks 21: 19–28.
50. Perelson AS, Neumann AU, Markowitz M, Leonard JM, Ho DD (1996) HIV-1 dynamics in vivo: Virion clearance rate, infected cell life-span, and viral generation time. Science 271: 1582–6.
51. Perelson AS, Nelson PW (1999) Mathematical analysis of HIV-1 dynamics in vivo. SIAM Rev 41: 3–44.
52. Rodrigo AG, Shpaer EG, Delwart EL, Iversen AK, Gallo MV, et al. (1999) Coalescent estimates of HIV-1 generation time in vivo. Proc Natl Acad Sci USA 96: 2187–91.
53. Yang Z (1996) Statistical properties of a DNA sample under the finite-sites model. Genetics 144: 1941–50.

