Conceived and designed the experiments: DW JAE. Performed the experiments: JAE DW MW AH DBR SY MF JCV. Analyzed the data: JAE DW MW AH DBR SY. Contributed reagents/materials/analysis tools: JCV. Wrote the paper: JAE DW. Ideas and discussion: MF JCV. Built microbial genome database: MW. Analyzed sequences linked to RecA and RpoB clusters: DBR. Analysis of distributions of sequences in GOS data: AH.
The authors have declared that no competing interests exist.
Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.
We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) Expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the
Of the novel
During the last 30 years, technological advances in nucleic acid sequencing have led to revolutionary changes in our perception of the evolutionary relationships among all species as visualized in the
For microbial organisms, this approach was restricted to the minority that could be grown in pure culture in the laboratory until Norm Pace and colleagues showed that one could sequence rRNAs directly from environmental samples
PCR-based studies have now characterized microbes from diverse habitats and have provided many fundamental new insights into microbial diversity. For example, we now realize that, in most environments, the culturable microbes represent but a small fraction of those present. Furthermore, phylogenetic analysis of the rRNA genes thus found enables one to assign those sequences to groups within the bacterial, archaeal, or eukaryotic domains of life (or to viral groups), a process known as
Although rRNA PCR studies have provided a major foundation for today's environmental microbiology, this approach is not without its limitations. Notably, the “universal” primers are not truly universal. Even the best-designed ones fail to amplify the targeted genes in some lineages while preferentially amplifying those in others
Due to these factors, the community has faced a quandary regarding the characterization of uncultured organisms. Although rRNA analysis is extraordinarily powerful, the window it provides into the microbial world is clearly imperfect. Additional major branches in the tree of life might exist that have been missed due to the limitations of rRNA PCR. Resolving this question required ways to clone rRNA genes without the biases introduced by PCR, as well as unbiased methods for obtaining data on other genes from uncultured species. Fortunately, both are now provided by metagenomic analysis.
The application of metagenomic analysis has accelerated the rapid rate of advancement in the study of uncultured microbes that began with the advent of rRNA analysis (e.g.,
Previous usage of metagenomic data for phylogenetic typing of organisms focused primarily on assigning metagenomic sequences to specific known groups of organisms (e.g., see
We sought here to address a single question:
Since the largest metagenomic data sets available when this work was begun came from the Sorcerer II Global Ocean Sampling Expedition (GOS)
Using this approach, we examined the entire GOS data set of 14,689 putative ss-rRNA sequences and identified 18 sequences that met our multiple criteria for potentially being deep branching (JCVI reads: 1098241, 1092963341190, 1091140405652, 318, 1105333456790, 1103242712700, 1105499913772, 1103242587147, 1108829508267, 1092959443067, 1092405960359, 1092402545613, 1093018267888, 1095522122248, 1093018199876, 1092351161318, 1092381601933, 1095527007809). Most importantly, these 18 could not be assigned to any of the three domains by STAP and they are each positioned near a domain separation point in a maximum likelihood tree that includes representatives from all three domains. However, more detailed examination of those alignments and trees disclosed problems in the alignments for all of the candidate novel sequences. Some alignments were of low quality due to the insertion of too many gaps in the novel sequence. Alignment quality is critical to phylogenetic analysis because the alignment is a hypothesis concerning the homology (common ancestry) of the residues at the same position in each of the aligned sequences (
These difficulties served to confirm that the methods available (or at least the ones we were using for this high throughput approach) were not robust enough to identify novel ss-rRNA genes in an automated manner. Most of our problems were inherent in attempting to generate high quality alignments of short sequences that are only distantly related to known ss-rRNA genes. Furthermore, alignment of novel rRNA sequences can be challenging because often it is the secondary and tertiary structure of the molecule, rather than the primary sequence, that is highly conserved. Our attempts to improve the alignments based on
Even when one has high quality rRNA gene sequence alignments, phylogenetic analysis involving very deep branches in the tree of life can still be difficult due to inherent complications, such as convergent evolution due to GC content effects
Due to the difficulties discussed above, we turned to protein-coding genes in our search for novel branches in the tree of life. To take the place of the ss-rRNA genes, we needed a protein-coding gene that was both universal and widely studied. For our initial test we chose the
Our question became:
In total, 4677 RecA superfamily members were identified from published microbial genomes
Using this method, 23 clusters containing more than two protein sequences were identified (
All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR1, Phage SAR1, Phage SAR2, Unknown 1, and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group in the
Cluster ID | Corresponding Subfamily (see | Corresponding Group in Lin | Comments | GOS Only | Number of GOS Sequences |
1 | RecA | RecA | | | 2830 |
11 | RecA-like SAR1 | n/a | Novel | + | 10 |
5 | Phage SAR2 | n/a | Novel | + | 68 |
4 | Phage UvsX | n/a | | | 73 |
2 | Phage SAR1 | n/a | Found in cyanophage by subsequent sequencing | + | 824 |
15 | Unknown 1 | | Novel | + | 6 |
14 | XRCC3/SpB | Radb-XRCC3 | | | 0 |
20 | XRCC3/SpB | Radb-XRCC3 | | | 0 |
22 | Rad57 | Radb-XRCC2 | | | 0 |
6 | Rad51C | Radb-Rad51C | | | 1 |
8 | Rad51B | Radb-Rad51B | | | 2 |
10 | Rad51D | Radb-Rad51D | | | 0 |
16 | RadB | Radb-RadB | | | 0 |
17 | RadB | Radb-RadB | | | 0 |
21 | RadB | Radb-RadB | | | 0 |
12 | RadB | Radb-RadB | | | 0 |
3 | RadA/DMC1/Rad51 | Rada | | | 101 |
13 | RadA/DMC1/Rad51 | Rada | | | 0 |
9** | Unknown 2 | n/a | Representatives found in Archaea by subsequent sequencing | + | 19 |
18 | XRCC2 | Radb-XRCC2 | | | 0 |
| RecA | RecA | RecA fragment | + | 29 |
| RecA | RecA | RecA fragment | + | 5 |
| RecA | RecA | RecA fragment | + | 3 |
A Lek protein clustering method was applied to all RecA superfamily members retrieved from the NRAA database, microbial genomes, and the GOS data set. The 23 clusters containing more than two sequences are listed. Clusters that contain only sequences from the GOS data set are noted as “GOS only.” When a cluster can be mapped to a RecA subfamily identified by Lin
*These clusters of RecA fragments from the GOS data set were not included in the phylogenetic tree (
**Although cluster 9 contained only GOS sequences at the time of the initial analysis, it was subsequently found to include marine archaeal homologs from more recent genome sequencing projects.
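The bookkeeping behind this table (keeping clusters with more than two sequences and flagging those made up solely of GOS sequences) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes cluster membership is already available as a mapping from cluster ID to (sequence ID, source) pairs, and the function name `summarize_clusters`, the source labels, and the toy IDs are all hypothetical.

```python
from collections import Counter


def summarize_clusters(clusters, min_size=3):
    """Keep clusters with at least `min_size` members and flag GOS-only ones.

    `clusters` maps a cluster ID to a list of (sequence_id, source) pairs.
    The source labels ("GOS", "NRAA", "genome") are assumed for illustration.
    """
    summary = {}
    for cluster_id, members in clusters.items():
        if len(members) < min_size:  # "more than two sequences" means >= 3
            continue
        sources = Counter(source for _, source in members)
        summary[cluster_id] = {
            "n_gos": sources["GOS"],
            "gos_only": sources["GOS"] == len(members),
        }
    return summary


# Toy input with made-up sequence IDs (not real accessions).
toy = {
    "A": [("s1", "GOS"), ("s2", "GOS"), ("s3", "GOS")],
    "B": [("s4", "GOS"), ("s5", "NRAA"), ("s6", "genome")],
    "C": [("s7", "GOS"), ("s8", "GOS")],  # only two members, so dropped
}
result = summarize_clusters(toy)
```

A "GOS only" flag computed this way is only as current as the reference databases at the time of analysis, which is why a cluster can later lose the flag once new genomes are sequenced.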
Based on the clusters and the tree structure, we divided the RecA superfamily into the 15 major groupings labeled in the tree (
What do these novel RecA-related subfamilies and sequences represent? Given their high degree of sequence similarity to proteins in the RecA superfamily, all of which are known to play some role in homologous recombination, it is likely that the members of these new subfamilies are also involved in homologous recombination.
What can we say about the organisms that were the sources of these novel sequences? Two of the five novel subfamilies (
Subfamily | RecA Accession | Accession of Linked Gene | Assembly ID | Neighboring Gene Description | Taxonomy Assignment |
Phage-SAR1 | 1096700853217 | 1096700853219 | 1096627374158 | gp43 | Viruses/Phages |
Phage-SAR1 | 1096701673303 | 1096701673301 | 1096627382978 | T4-like DNA polymerase | Viruses/Phages |
Phage-SAR1 | 1096701673303 | 1096701673305 | 1096627382978 | T4-like DNA primase-helicase | Viruses/Phages |
Phage-SAR2 | 1096697847133 | 1096697847135 | 1096627014936 | GDP-mannose 4,6-dehydratase | Bacteria |
Phage-SAR2 | 1096697847133 | 1096697847149 | 1096627014936 | methyltransferase FkbM | Bacteria |
Unknown2 | 1096695533559 | 1096695533561 | 1096528150039 | ATP-dependent helicase | Archaea |
Unknown2 | 1096698308433 | 1096698308421 | 1096627021375 | ATP-dependent RNA helicase | Archaea |
Unknown2 | 1096698308433 | 1096698308423 | 1096627021375 | replication factor A | Archaea |
Unknown2 | 1096698308433 | 1096698308425 | 1096627021375 | S-adenosylmethionine synthetase | Bacteria |
Unknown2 | 1096698308433 | 1096698308427 | 1096627021375 | cobalt-precorrin-6A synthase | Archaea |
Unknown2 | 1096698308433 | 1096698308429 | 1096627021375 | NADH ubiquinone dehydrogenase | Bacteria |
Unknown2 | 1096698308433 | 1096698308431 | 1096627021375 | CbiG protein | Bacteria |
Unknown2 | 1096698308433 | 1096698308443 | 1096627021375 | ATP-binding protein of ABC transporter | Bacteria |
Unknown2 | 1096698308433 | 1096698308435 | 1096627021375 | chaperone protein dnaJ | Eukaryota |
Unknown2 | 1096698308433 | 1096698308445 | 1096627021375 | small nuclear riboprotein protein snRNP | Archaea |
Unknown2 | 1096699819041 | 1096699819039 | 1096627295379 | S-adenosylmethionine synthetase | Bacteria |
Unknown2 | 1096699819041 | 1096699819043 | 1096627295379 | replication factor A | Bacteria |
Unknown2 | 1096699819041 | 1096699819047 | 1096627295379 | snRNP Sm-like protein | Archaea |
Unknown2 | 1096686533379 | 1096686533339 | 1096627390330 | ATP-dependent helicase | Archaea |
Unknown2 | 1096686533379 | 1096686533341 | 1096627390330 | deoxyribodipyrimidine photolyase-related | Bacteria |
Unknown2 | 1096686533379 | 1096686533343 | 1096627390330 | Glycyl-tRNA synthetase alpha2 dimer | Archaea |
Unknown2 | 1096686533379 | 1096686533345 | 1096627390330 | RNA-binding protein | Bacteria |
Unknown2 | 1096686533379 | 1096686533347 | 1096627390330 | cobyrinic acid a,c-diamide synthase | Archaea |
Unknown2 | 1096686533379 | 1096686533349 | 1096627390330 | deoxyribodipyrimidine photolyase | Archaea |
Unknown2 | 1096686533379 | 1096686533351 | 1096627390330 | DNA primase small subunit | Archaea |
Unknown2 | 1096686533379 | 1096686533353 | 1096627390330 | cobalt-precorrin-6A synthase | Archaea |
Unknown2 | 1096686533379 | 1096686533355 | 1096627390330 | cobalamin biosynthesis CbiG | Bacteria |
Unknown2 | 1096686533379 | 1096686533359 | 1096627390330 | DNA primase large subunit | Archaea |
Unknown2 | 1096686533379 | 1096686533361 | 1096627390330 | aldo/keto reductase | Bacteria |
Unknown2 | 1096686533379 | 1096686533365 | 1096627390330 | AP endonuclease | Archaea |
Unknown2 | 1096686533379 | 1096686533369 | 1096627390330 | ATP-dependent helicase | Archaea |
Unknown2 | 1096686533379 | 1096686533371 | 1096627390330 | translation initiation factor 2 alpha subunit | Archaea |
Unknown2 | 1096686533379 | 1096686533373 | 1096627390330 | translation initiation factor 2 alpha subunit | Archaea |
Unknown2 | 1096686533379 | 1096686533375 | 1096627390330 | sirohydrochlorin cobaltochelatase CbiXL | Bacteria |
Unknown2 | 1096686533379 | 1096686533377 | 1096627390330 | glutamate racemase | Bacteria |
Unknown2 | 1096686533379 | 1096686533383 | 1096627390330 | glycosyl transferase | Eukaryota |
Unknown2 | 1096686533379 | 1096686533387 | 1096627390330 | deoxyribodipyrimidine photolyase | Bacteria |
Unknown2 | 1096686533379 | 1096686533389 | 1096627390330 | AP endonuclease | Archaea |
Unknown2 | 1096686533379 | 1096686533393 | 1096627390330 | cbiC protein | Archaea |
Unknown2 | 1096686533379 | 1096686533399 | 1096627390330 | deoxyribodipyrimidine photolyase | Bacteria |
Unknown2 | 1096686533379 | 1096686533405 | 1096627390330 | cob(I)alamin adenosyltransferase | Bacteria |
Unknown2 | 1096686533379 | 1096686533407 | 1096627390330 | Phosphohydrolase | Bacteria |
Unknown2 | 1096686533379 | 1096686533409 | 1096627390330 | glycyl-tRNA synthetase | Archaea |
Unknown2 | 1096686533379 | 1096686533415 | 1096627390330 | 30S ribosomal protein S6 | Archaea |
Unknown2 | 1096686533379 | 1096686533421 | 1096627390330 | nuclease | Archaea |
Unknown2 | 1096686533379 | 1096686533423 | 1096627390330 | phosphohydrolase | Bacteria |
Unknown2 | 1096686533379 | 1096686533427 | 1096627390330 | cobalt-precorrin-3 methylase | Archaea |
Unknown2 | 1096686533379 | 1096686533429 | 1096627390330 | universal stress family protein | Bacteria |
Unknown2 | 1096686533379 | 1096686533473 | 1096627390330 | aryl-alcohol dehydrogenases related oxidoreductases | Eukaryota |
Unknown2 | 1096686533379 | 1096686533505 | 1096627390330 | snRNP Sm-like protein Chain A | Eukaryota |
Unknown2 | 1096689280551 | 1096689280549 | 1096627650434 | S-adenosylmethionine synthetase | Bacteria |
RecA-like SAR1 | 1096683378299 | 1096683378297 | 1096627289467 | DNA polymerase III alpha subunit | Bacteria |
Unknown1 | 1096694953057 | 1096694953059 | 1096520459783 | FKBP-type peptidyl-prolyl cis-trans isomerase | Archaea |
Unknown1 | 1096665977449 | 1096665977451 | 1096627520210 | single-stranded DNA binding protein | Viruses/Phages |
Unknown1 | 1096682182125 | 1096682182127 | 1096628394294 | DNA polymerase I | Bacteria |
Five RecA subfamilies were identified as being novel (i.e., only seen in metagenomic data) in our initial analyses. GOS metagenome assemblies that encode members of these subfamilies were identified and the genes neighboring the novel RecAs were characterized. The neighboring gene descriptions are based on the top BLASTP hits against the NRAA database; taxonomy assignments are based on their closest neighbor in phylogenetic trees built from the top NRAA BLASTP hits.
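To summarize the mixed taxonomic signal on a single assembly, the per-gene assignments can simply be tallied. The sketch below does this for assembly 1096627021375 using the nine rows listed for it in the table above; the function name `tally_taxonomy` is hypothetical, and a tally of this kind is a summary of the table, not a taxonomic determination.

```python
from collections import Counter

# (linked gene accession, taxonomy assignment) for assembly 1096627021375,
# copied from the table above.
rows = [
    ("1096698308421", "Archaea"),
    ("1096698308423", "Archaea"),
    ("1096698308425", "Bacteria"),
    ("1096698308427", "Archaea"),
    ("1096698308429", "Bacteria"),
    ("1096698308431", "Bacteria"),
    ("1096698308443", "Bacteria"),
    ("1096698308435", "Eukaryota"),
    ("1096698308445", "Archaea"),
]


def tally_taxonomy(rows):
    """Count how many neighboring genes were assigned to each taxon."""
    return Counter(taxon for _, taxon in rows)


counts = tally_taxonomy(rows)
for taxon, n in counts.most_common():
    print(f"{taxon}: {n} of {len(rows)} neighboring genes")
```

For this assembly the archaeal and bacterial assignments are evenly split, which illustrates why nearest-neighbor assignments of individual genes cannot by themselves settle the origin of the assembly.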
This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to known archaeal genes (e.g., DNA primase, translation initiation factor 2,
The
The results from the
In total, for further analysis we identified 1875 RpoB homologs from the GOS data set plus 784 known sequences from published microbial genomes
Nine of the 17 clusters contain only GOS sequences. Two of these (clusters 1 and 11) were determined to correspond to fragments of bacterial
Representatives were then selected from the remaining clusters and used to build the RpoB superfamily tree (
All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels.
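Bootstrap support values of this kind are conventionally obtained by resampling alignment columns with replacement and rebuilding the tree from each pseudo-replicate. The following is a minimal sketch of the resampling step only, assuming the alignment is held as a dict of equal-length strings. The function name and toy sequences are hypothetical, and tree building (done with PHYML in this study) is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: alignment columns drawn with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # The same column indices are used for every sequence so that each
    # sampled column stays intact across the alignment.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


# Toy alignment with invented sequences.
toy_alignment = {"seqA": "MKV-LA", "seqB": "MRVELA", "seqC": "MKI-LG"}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

The support value for a branch is then the percentage of the 100 replicate trees in which that branch appears.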
The largest number of homologs from the GOS data (1602 sequences) map to the Bacteria and Plastids RpoB clade, while the second largest number (181 sequences) group with the archaeal and eukaryotic clades. The relatedness of archaeal and eukaryotic RNA polymerases is consistent with previous observations
Two of the RpoB subfamilies include only GOS sequences:
That comparable results were obtained from both our
The ultimate question concerning the novel subfamilies that we found is: what is their origin? Lacking both visual observation and complete genomes, we do not currently have an answer. One trivial possibility is that they are artifacts of some kind (see
Assuming the sequences are in fact real, we offer four possible biological explanations for their phylogenetic novelty. First, they could represent recombinants in which domains from different known subfamilies have been mixed together to create a new form (e.g., perhaps the N-terminus of bacterial RecA was mixed with the C-terminus of a Rad51D). We consider this unlikely because the phylogenetic uniqueness for each group appears to be spread throughout the length of the proteins. A second possibility is that the novel sequences could represent paralogs resulting from ancient duplications within these gene families (and that these genes now reside in otherwise unexceptional evolutionary lineages). We consider this extremely unlikely. Given the absence of representatives of these subfamilies from the sequenced genomes now available from dozens of the Eukaryota and Archaea and from hundreds of the Bacteria, this non-parsimonious explanation would require parallel loss of such ancient paralogs in most lineages in the tree of life, with gene retention in only a few organisms.
A third possibility is that the genes from novel subfamilies come from novel heretofore uncharacterized viruses. Given that the known viral world represents but a small fraction of the total extant diversity, and given some of the unexpected discoveries coming from viral genomics recently, this is entirely possible. For example, viruses have been characterized with markedly larger genomes that contain not only more genes, but genes previously found only in cellular organisms
It has not escaped our notice that the characteristics of these novel sequences are consistent with the possibility that they come from a new (i.e., fourth) major branch of cellular organisms on the tree of life. That is, their phylogenetic novelty could indicate phylogenetic novelty of the organisms from which they come. Clearly, confirmation or refutation of this possibility requires follow-up studies, such as determining the source of these novel, deeply branching sequences (e.g., cellular organisms or viruses). Then, depending on the answers obtained, more targeted metagenomics or single-cell studies may help determine whether the novelty extends to all genes in the genome or is seen for only a few gene families.
Whatever the explanation for the novel sequences reported here, this discovery of new, deeply branching clades of housekeeping genes suggests that environmental metagenomics has the potential to provide striking insights into phylogenetic diversity, insights that complement those derived from rRNA studies. In the future we plan to explore more metagenomic data sets using an expanded collection of phylogenetic markers. Additional gene family classification and analysis tools, such as Markov clustering (MCL
A data set of 340 representative ss-rRNA sequences from all three domains was prepared. These sequences represented 134 eukaryotic, 186 bacterial, and 20 archaeal species. Alignments for these 340 sequences were extracted from the European Ribosomal RNA database
Homologs of RecA and RpoB were retrieved from the GenBank NRAA database (
The 522 RecA homologs retrieved from the GenBank NRAA database (
The same approach was used to cluster the 784 RpoB homologs from the NRAA database and published microbial genomes
For both the RecA and RpoB superfamily analysis, the cutoff values for the BLASTP search and the Lek clustering were chosen such that the clusters produced were reasonably comparable to the annotation of the sequences (e.g., RecAs in one cluster, Rad51 in another).
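One way to make "reasonably comparable to the annotation" concrete is to score, for each candidate cutoff, how often the members of a cluster share the same annotated family. The sketch below computes such a purity score. It is an illustrative assumption about how the comparison could be quantified, not the procedure used in this study, and the sequence IDs, labels, and cluster assignments are invented.

```python
from collections import Counter, defaultdict


def cluster_purity(assignments, annotations):
    """Fraction of annotated sequences whose label matches the majority
    label of the cluster they were placed in."""
    by_cluster = defaultdict(list)
    for seq_id, cluster_id in assignments.items():
        if seq_id in annotations:  # unannotated sequences are ignored
            by_cluster[cluster_id].append(annotations[seq_id])
    total = sum(len(labels) for labels in by_cluster.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority / total


# Hypothetical annotations; "g1" stands for an unannotated metagenomic read.
annotations = {"p1": "RecA", "p2": "RecA", "p3": "Rad51", "p4": "Rad51", "p5": "RadB"}
strict = {"p1": 1, "p2": 1, "p3": 2, "p4": 2, "p5": 3, "g1": 4}
loose = {"p1": 1, "p2": 1, "p3": 1, "p4": 1, "p5": 1, "g1": 1}

strict_purity = cluster_purity(strict, annotations)
loose_purity = cluster_purity(loose, annotations)
```

Purity alone rewards over-splitting, since a cluster of one is always pure, so in practice it would have to be weighed against the number of clusters produced.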
Representative amino acid sequences from each of the RecA and RpoB clusters were selected manually and then aligned by MUSCLE
Five RecA subfamilies (corresponding to sequences in clusters 2, 5, 9, 11, and 15) contain only GOS sequences (i.e., they were novel, metagenomic-only subfamilies) and also contain complete genes (i.e., they were not made up of only sequence fragments). In total, these clusters contain 24 metagenomic RecA homologs. We examined the 24 GOS assemblies that encode these RecA homologs. From these we retrieved 559 putative protein-encoding genes. Of these 24 assemblies, 12 contained a combined total of 55 genes with BLASTP hits in the NRAA database (E-value cutoff of 1e-5). We assigned gene functions to the 55 genes based on their top BLASTP hits. For each of these 55 genes, a phylogenetic tree was built by QuickTree
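The step of keeping only genes with a BLASTP hit at or below the 1e-5 E-value cutoff, and taking the top hit for annotation, can be sketched as below. The output format is an assumption: the text does not say how the BLASTP results were stored, so the sketch assumes the standard 12-column tab-delimited layout (as produced by BLAST+ with `-outfmt 6`), in which the E-value is the 11th column. The function name and the example rows are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text


def top_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue)} for the best hit passing the cutoff.

    Assumes 12-column tabular BLAST output: query ID in column 1,
    subject ID in column 2, E-value in column 11.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented example rows; query and subject IDs are placeholders.
example = "\n".join([
    "geneA\tsubj1\t45.0\t200\t110\t0\t1\t200\t5\t204\t1e-40\t160",
    "geneA\tsubj2\t30.0\t180\t126\t0\t1\t180\t9\t188\t1e-12\t70",
    "geneB\tsubj3\t25.0\t90\t67\t1\t3\t92\t10\t99\t0.002\t35",
])
hits = top_hits(example)
```

In this toy input, `geneA` keeps its strongest hit and `geneB` is dropped because its only hit has an E-value above the cutoff.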
Assembly 1096627390330, the largest of the 12 assemblies, was analyzed further. Translation in all six frames yielded 114 potential ORFs. Functions could be assigned to 33 of the 114 based on similarity to genes in the NRAA database using BLASTP. A gene map (
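Six-frame translation of an assembly can be sketched in a few lines. The version below is an illustration under stated assumptions, not the tool used in this study: it uses the standard genetic code, treats an ORF as any stop-to-stop stretch, and takes a minimum length `min_len` whose default is arbitrary (the text does not state the threshold applied). All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=50):
    """Yield (frame, peptide) for stop-to-stop stretches of >= min_len residues."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for peptide in translate(seq[offset:]).split("*"):
                if len(peptide) >= min_len:
                    yield f"{strand}{offset + 1}", peptide


# Tiny invented sequence, with a low threshold so that something is reported.
orfs = list(six_frame_orfs("ATGAAATAG", min_len=2))
```

The number of potential ORFs recovered depends strongly on the length threshold and on whether a start codon is required, so counts from a sketch like this are not expected to match the 114 reported here.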
We have made the following data and protocols available to the public: (1) GOS and reference sequences for RecA and RpoB; (2) Subfamilies of RecA and RpoB (
Cluster ID | Corresponding Subfamily (see | Comments | GOS Only? | Number of GOS Sequences |
7 | Bacteria and Plastids | | | 1602 |
4 | Bacteria and Plastids | | | 0 |
12 | Bacteria and Plastids | | | 0 |
8 | Unknown 1 | | + | 4 |
6 | “Killer Plasmids” | | | 0 |
17 | Rpa2/Rpb2/Rpc2/Archaea | Includes most eukaryotic (nuclear) and archaeal superfamily members | | 181 |
2 | Rpa2 | | | 0 |
14 | Archaea | | | 0 |
3 | Unknown 2 | | + | 3 |
13 | Pox Viruses | | | 0 |
| n/a | Partial sequences likely from bacteria | + | 6 |
| n/a | Partial sequences likely from bacteria | + | 2 |
| n/a | Partial sequences likely from eukaryotes | + | 4 |
| n/a | Partial sequences likely from eukaryotes | + | 4 |
| n/a | Partial sequences likely from eukaryotes | + | 3 |
| n/a | Partial sequences likely from eukaryotes | + | 5 |
5** | n/a | Not analyzed further because only two representatives identified | + | 2 |
A Lek clustering method was applied to all RpoB superfamily members retrieved from the NRAA database, microbial genome projects, and the GOS data set. Clusters that contain only sequences from the GOS data set are marked “+” in the “GOS Only?” column.
*Clusters 1, 9, 10, 11, 15, and 16 contain only sequence fragments from the GOS data set; though possibly novel they were omitted from further analysis.
**Cluster 5 contains only two sequences. Though both are from the GOS (IDs 1096695464231 and 1096681823525) and may represent a novel RpoB subfamily, this group was excluded from further analysis because we restricted analyses to groups with three or more sequences.
We acknowledge Jonathan Badger for help with informatics, Merry Youle for help with manuscript editing, past and present members of the Eisen Lab, TIGR, and JCVI for helpful discussions, and the reviewers and editors for helpful comments.