Reader Comments

Major problem in the estimates of the rate of gene family extinctions

Posted by LaurentDuret on 27 Jul 2007 at 13:04 GMT

Gene families are groups of homologous genes that are likely to have highly similar functions. Differences in family size due to lineage-specific gene duplication and gene loss may provide clues to the evolutionary forces that have shaped mammalian genomes. Here we analyze the gene families contained within the whole genomes of human, chimpanzee, mouse, rat, and dog. In total we find that more than half of the 9,990 families present in the mammalian common ancestor have either expanded or contracted along at least one lineage. Additionally, we find that a large number of families are completely lost from one or more mammalian genomes, and a similar number of gene families have arisen subsequent to the mammalian common ancestor. Along the lineage leading to modern humans we infer the gain of 689 genes and the loss of 86 genes since the split from chimpanzees, including changes likely driven by adaptive natural selection. Our results imply that humans and chimpanzees differ by at least 6% (1,418 of 22,000 genes) in their complement of genes, which stands in stark contrast to the oft-cited 1.5% difference between orthologous nucleotide sequences. This genomic “revolving door” of gene gain and loss represents a large number of genetic differences separating humans from our closest relatives.
http://plosone.org/article/info:doi/10.1371/journal.pone.0000085#article1.front1.article-meta1.abstract1.p1

This paper addresses an important question: what are the rate and pattern of evolution of the gene repertoire in mammals? Indeed, whereas the evolutionary forces shaping the rate of sequence evolution have been well studied, little is known about the frequency of gene losses or gene gains.

The problem with the paper is that the identification of gene losses and gene creations (or duplications) relies exclusively on the analysis of the content of Ensembl gene families. An Ensembl gene family that includes only human genes is counted as a gene family "creation" in the human branch. Conversely, a gene family that is present in chimpanzee and dog but does not include any human gene is counted as "extinct" in human.

The difficulty is that the absence of a given gene from an Ensembl family may result from several artefactual situations:

a- the gene exists but is located in a region that has not been sequenced (or correctly assembled) yet
b- the gene exists but has not been identified (annotated) yet
c- the gene exists but was not classified in the gene family because the clustering criteria (sequence similarity, length of the alignment) that were used to define Ensembl gene families were too stringent


Although the authors discuss these possible artefacts in their paper, I am not convinced when they claim that these artefacts should have little impact on their conclusion.

As a control for the reliability of their analyses, I looked at the 49 gene families that were considered to have been lost in the human lineage ("extinctions" in their Table 2). From supplementary Table S2, I retrieved all the gene families that contain at least one chimp sequence and at least one non-primate sequence but no human sequence. Each of these 49 gene families is represented by a single gene in chimp:

FID chimp human mouse rat dog
ENSF00000002436 1 0 24 9 0
ENSF00000002900 1 0 2 2 2
ENSF00000003534 1 0 1 1 1
ENSF00000003743 1 0 2 2 1
ENSF00000004000 1 0 1 1 1
ENSF00000004811 1 0 1 1 0
ENSF00000004836 1 0 1 1 1
ENSF00000004840 1 0 1 2 1
ENSF00000005029 1 0 1 1 1
ENSF00000005367 1 0 1 1 1
ENSF00000005368 1 0 1 1 1
ENSF00000005776 1 0 1 1 1
ENSF00000006245 1 0 1 1 1
ENSF00000006438 1 0 1 1 1
ENSF00000006709 1 0 1 1 0
ENSF00000006835 1 0 1 1 1
ENSF00000007144 1 0 1 1 1
ENSF00000007553 1 0 1 1 1
ENSF00000007676 1 0 2 2 1
ENSF00000007697 1 0 1 1 1
ENSF00000007845 1 0 1 2 1
ENSF00000007989 1 0 2 1 1
ENSF00000008030 1 0 1 1 1
ENSF00000008484 1 0 0 1 1
ENSF00000008589 1 0 1 1 1
ENSF00000008702 1 0 1 1 1
ENSF00000009151 1 0 1 1 1
ENSF00000009267 1 0 1 1 1
ENSF00000009414 1 0 1 1 1
ENSF00000009416 1 0 1 1 1
ENSF00000009499 1 0 1 0 0
ENSF00000009609 1 0 0 1 1
ENSF00000009610 1 0 1 1 0
ENSF00000009800 1 0 1 1 2
ENSF00000009884 1 0 1 1 1
ENSF00000009934 1 0 1 1 1
ENSF00000010085 1 0 1 1 1
ENSF00000010169 1 0 1 1 0
ENSF00000010256 1 0 1 1 1
ENSF00000010448 1 0 1 1 1
ENSF00000010502 1 0 1 1 0
ENSF00000010519 1 0 1 1 1
ENSF00000010549 1 0 1 1 1
ENSF00000010665 1 0 1 1 1
ENSF00000010678 1 0 1 1 2
ENSF00000011177 1 0 0 0 1
ENSF00000011186 1 0 0 1 1
ENSF00000011190 1 0 0 1 1
ENSF00000011513 1 0 1 1 1
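
The selection criterion used above (at least one chimp sequence, at least one non-primate sequence, no human sequence) can be sketched in a few lines of Python. This is only an illustration and not the procedure actually used: the column order is assumed to match the listing above, the function names are made up for the example, and the two rows labelled HYPOTHETICAL are invented to show families that the filter rejects.

```python
from typing import NamedTuple


class FamilyCounts(NamedTuple):
    fid: str
    chimp: int
    human: int
    mouse: int
    rat: int
    dog: int


def parse_counts(text):
    """Parse whitespace-separated rows: FID chimp human mouse rat dog."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "FID":  # skip blank lines and the header
            continue
        rows.append(FamilyCounts(fields[0], *(int(x) for x in fields[1:6])))
    return rows


def is_candidate_human_loss(row):
    """True if chimp >= 1, human == 0 and at least one non-primate gene."""
    non_primate = row.mouse + row.rat + row.dog
    return row.chimp >= 1 and row.human == 0 and non_primate >= 1


# Two rows copied from the listing above, plus two invented rows.
SAMPLE = """
FID chimp human mouse rat dog
ENSF00000002436 1 0 24 9 0
ENSF00000011177 1 0 0 0 1
HYPOTHETICAL_A 1 1 1 1 1
HYPOTHETICAL_B 1 0 0 0 0
"""

candidates = [r.fid for r in parse_counts(SAMPLE) if is_candidate_human_loss(r)]
print(candidates)
```

HYPOTHETICAL_A is rejected because it contains a human gene, and HYPOTHETICAL_B because it has no non-primate member.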


Then I extracted the corresponding chimp proteins from Ensembl release 41 using BioMart:

http://oct2006.archive.ensembl.org/Multi/martview

The 49 chimp genes correspond to 77 proteins (some genes encode alternative splice variants).

Then I downloaded all human proteins annotated in Ensembl release 41:

ftp://ftp.ensembl.org/pub/release-41/homo_sapiens_41_36c/data/fasta/pep/


Finally, I BLASTed the 77 chimp proteins against the human proteome (Ensembl release 41). Each of these chimp proteins has a very strong match in human: the average identity at the protein level is 99%, with a minimum of 86%. Thus, none of these 49 gene families has been lost in the human lineage.
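
For readers who want to repeat this kind of check, the summary step (best human hit per chimp protein, then mean and minimum identity) might look like the sketch below. It rests on assumptions not stated in this comment: that the BLAST results are available in the standard 12-column tabular format (query, subject, percent identity, ..., e-value, bit score), and that the best hit is the one with the highest bit score. The protein identifiers and scores in the sample are invented.

```python
from statistics import mean


def best_hits(blast_tabular):
    """Map each query to (subject, percent_identity, bitscore) of its top hit.

    Assumes 12 tab-separated columns per line:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    best = {}
    for line in blast_tabular.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        query, subject = f[0], f[1]
        pident, bitscore = float(f[2]), float(f[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


def summarize(best, queries):
    """Mean and minimum identity of best hits, plus queries with no hit."""
    identities = [best[q][1] for q in queries if q in best]
    return {
        "mean": mean(identities) if identities else None,
        "min": min(identities) if identities else None,
        "unmatched": [q for q in queries if q not in best],
    }


# Invented sample rows; real identifiers would be Ensembl protein IDs.
rows = [
    ("chimpP1", "humanP1", 99.5, 400, 2, 0, 1, 400, 1, 400, "1e-200", 800),
    ("chimpP1", "humanP2", 60.0, 380, 152, 3, 5, 384, 9, 388, "1e-50", 200),
    ("chimpP2", "humanP3", 86.0, 300, 42, 1, 1, 300, 1, 300, "1e-120", 500),
]
SAMPLE = "\n".join("\t".join(str(x) for x in row) for row in rows)

summary = summarize(best_hits(SAMPLE), ["chimpP1", "chimpP2", "chimpP3"])
print(summary)
```

Reporting the unmatched queries matters here: a chimp protein with no human hit at all would be the only real candidate for a lost family.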

In conclusion, the rate of gene family extinction in the human lineage (Table 2) appears to be overestimated: all 49 of the reported extinctions (100%) are artefacts. It is likely that similar problems also affect the numbers given for other species.

Note that I am not saying that gene losses and gene gains are not important for species evolution. Demuth and colleagues may well be correct when they say that the rate of evolution of the gene repertoire is high (or higher than had been appreciated). However, they have not performed all the controls that would have been necessary to assess the reliability of their results.

Laurent Duret