Development and use of primer sets to amplify nucleic acid sequences of interest is fundamental to studies spanning many life science disciplines. As such, the validation of primer sets is essential. Several computer programs have been created to aid in the initial selection of primer sequences that may or may not require multiple nucleotide combinations (i.e., degeneracies). Conversely, validation of primer specificity has remained largely unchanged for several decades, and there are currently few available programs that allow for an evaluation of primers containing degenerate nucleotide bases. To fill this gap, we developed the program De-MetaST, which performs an in silico amplification using user-defined nucleotide sequence dataset(s) and primer sequences that may contain degenerate bases. The program returns an output file that contains the in silico amplicons. When De-MetaST is paired with NCBI’s BLAST (De-MetaST-BLAST), the program also returns the top 10 nr NCBI database hits for each recovered in silico amplicon. While the original motivation for development of this search tool was degenerate primer validation using the wealth of nucleotide sequences available in environmental metagenome and metatranscriptome databases, this search tool has potential utility in many data mining applications.
Citation: Gulvik CA, Effler TC, Wilhelm SW, Buchan A (2012) De-MetaST-BLAST: A Tool for the Validation of Degenerate Primer Sets and Data Mining of Publicly Available Metagenomes. PLoS ONE 7(11): e50362. doi:10.1371/journal.pone.0050362
Editor: David L. Kirchman, University of Delaware, United States of America
Received: June 13, 2012; Accepted: October 24, 2012; Published: November 26, 2012
Copyright: © 2012 Gulvik et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: TCE was supported by an REU award from the National Science Foundation (NSF) (MCB1112001 to SWW). SWW and AB acknowledge NSF award OCE1061352 for support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
PCR is one of the most fundamental and powerful molecular biology tools available. PCR primer sets that contain degenerate bases allow for the amplification of homologous sequences and have been used in various applications, including genetic diversity analyses. Several software packages that use a nucleotide or amino acid alignment of the genetic target are available to aid in the initial development of degenerate primer sets (e.g., Amplicon, CODEHOP, DEFOG, DePiCt, HYDEN, MAD-DPD, PhiSiGns, and Primaclade). In addition, manual identification of conserved regions from aligned sequences generated using software such as ARB, ClustalX, and MEGA is also common practice. Once candidate primers are developed, thermodynamic properties and self-complementarity tests can be obtained using online tools (e.g., OligoCalc).
Despite the utility and common use of degenerate primers, there are no software programs specifically designed to facilitate validation of their specificity. The most common practice for initial validation of degenerate primers is direct sequence analysis of PCR amplicons. This can be both laborious and costly, and does not take advantage of the ever-increasing volume of publicly available nucleotide data, including data derived from natural samples. In fact, environmental metagenomes and metatranscriptomes (e.g., CAMERA [http://camera.calit2.net/] and MG-RAST [http://metagenomics.anl.gov/]) are especially attractive reference databases against which to perform in silico tests en masse to identify sequences a degenerate primer set might amplify.
To address this gap in available bioinformatic tools, we have developed a program termed De-MetaST. This program accepts degenerate primers and searches metagenome and metatranscriptome datasets to retrieve in silico PCR amplicons. When paired with BLAST, the output provides the most homologous sequences in GenBank for each recovered in silico amplicon. In this report, we provide an overview of the program and outline its utility as a tool to validate the specificity of degenerate primer sets. This program is designed to be user-friendly for non-bioinformatics specialists and is publicly available, as are screencast video tutorials demonstrating installation and implementation.
Design and Program Overview
De-MetaST is written in C++ and is provided as an executable wrapper to include BLAST (De-MetaST-BLAST) as well as an independent executable (De-MetaST). The function of De-MetaST is to implement a search routine based on bitwise comparisons. Initial steps translate the degenerate nucleotide sequences of each primer, as well as their reverse complement sequences, into unique and specific binary representations. This approach facilitates rapid searches of large databases that are also transformed into binary representations. The specific computational steps of De-MetaST are outlined in Figure S1.
Figure 1. De-MetaST transformation of nucleotide sequences into a binary representation.
The binary representation for each of the 16 possible nucleotide character inputs is shown in the upper box. The lower box provides an example of the transformation using a mock primer sequence. Spaced gaps are shown for instructional purposes and do not occur in the De-MetaST search routine. doi:10.1371/journal.pone.0050362.g001
How De-MetaST Works
The De-MetaST program first converts the input primer sequences into 4-digit binary codes, one for each of the 16 accepted nucleotide characters: A, T, C, G, B, D, H, K, M, N (or X), R, S, V, W, and Y (Figure 1). Then, each sequence read within a user-defined, FASTA-formatted database is converted to 4-digit binary codes and scanned using a bitwise search operation for the presence of both primer sequences in the appropriate orientation. Little memory is required for this step because each sequence read is individually transformed to binary and immediately scanned for the primer sequences. The program searches using both the original user-supplied primers and the reverse complements of those sequences. This latter search ensures identification of target sequences regardless of whether the database sequence read represents the sense or antisense strand. The search feature also allows a single primer to serve as both the forward and reverse primer. When primers identify their respective target(s) within a sequence read, the nucleotide sequence delimited by the two primers, termed the in silico amplicon, is retrieved. The primer(s) yielding each amplicon are reported in the output. De-MetaST is written to parse in silico amplicons >5000 bp into a separate FASTA-formatted file that is not subjected to BLASTx; users can modify this length restriction by editing the code. All in silico amplicons provided in the output represent the sense strand in a 5′ to 3′ orientation. Thus, when positive hits are made to reads representing antisense strands, the reverse complements of those reads are generated. Any identifying features (e.g., unique read number) as well as the file name for each predicted hit are recovered. Although developed to accept degenerate primers, De-MetaST also accepts non-degenerate primers.
Furthermore, the nucleotide query database(s) may themselves contain sequence reads with degenerate or ambiguous nucleotides (e.g., N). Finally, De-MetaST accepts multiple primer sets as input; the in silico amplicons from each set are output into separate FASTA files. Because De-MetaST accepts degeneracies in the input primer sequences, it requires absolute conservation in the target sequences; it does not allow for any mismatches between the primer sequence and target. In this way, the user controls the level of primer specificity.
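The search routine described above can be illustrated with a minimal Python sketch (De-MetaST itself is written in C++, and this is not its source code). Several details are assumptions made for illustration only: the assignment of bits to bases (Figure 1 may use a different ordering), the rule that a database base matches when it shares at least one bit with the primer position, the inclusion of the primer sites in the returned amplicon, and the function names `encode`, `find_sites`, and `in_silico_amplicons`. The sketch also omits the case in which a single primer serves as both forward and reverse primer.

```python
# Hypothetical 4-bit masks, one bit per unambiguous base (assumed ordering).
A, C, G, T = 1, 2, 4, 8
IUPAC = {
    "A": A, "C": C, "G": G, "T": T,
    "R": A | G, "Y": C | T, "S": C | G, "W": A | T,
    "K": G | T, "M": A | C,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T, "X": A | C | G | T,
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVNX", "TGCAYRSWMKVHDBNX")


def encode(seq):
    """Translate a nucleotide string into a list of 4-bit masks."""
    return [IUPAC[ch] for ch in seq.upper()]


def reverse_complement(seq):
    """Reverse complement that also handles degenerate characters."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(primer, read_codes):
    """Start offsets where every primer position shares a bit with the read.

    No mismatches are tolerated: a single zero result of the bitwise AND
    rejects the site.
    """
    p = encode(primer)
    n = len(p)
    return [i for i in range(len(read_codes) - n + 1)
            if all(p[j] & read_codes[i + j] for j in range(n))]


def in_silico_amplicons(fwd, rev, read):
    """Amplicons found on either strand, reported 5' to 3' on the sense strand."""
    rev_rc = reverse_complement(rev)  # how the reverse primer site appears on the sense strand
    hits = []
    for strand in (read.upper(), reverse_complement(read)):
        codes = encode(strand)  # each read is encoded, scanned, then discarded
        for start in find_sites(fwd, codes):
            for end in find_sites(rev_rc, codes):
                if end >= start + len(fwd):  # primer sites must not overlap
                    hits.append(strand[start:end + len(rev)])
    return hits
```

With the mock primers `ACGYTA` (forward) and `TTRCCA` (reverse), a read containing `ACGCTA`, a spacer, and `TGGCAA` yields one amplicon, and the same amplicon is returned if the read is supplied as its reverse complement. Changing the `C` under the `Y` position to `A` removes the hit, which reflects the requirement for absolute conservation in the target.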
Figure 2. Flowchart outlining De-MetaST-BLAST user actions and corresponding computational processes.
Fwd, Forward; Rev, Reverse; NCBI, National Center for Biotechnology Information. doi:10.1371/journal.pone.0050362.g002
De-MetaST Paired with BLAST
Once the database sequence files have been queried for predicted PCR amplicons, each in silico amplicon is subjected to a BLASTx analysis, which translates the nucleotide sequence in all six frames and queries each translation against the non-redundant (nr) NCBI protein database. The top 10 BLASTx hits for each amplicon are formatted as an XML file. The final step of De-MetaST-BLAST compiles all of the meta-information of the BLASTx results for each amplicon retrieved (e.g., hit accession number, E-value, predicted function, nucleotide sequence, database file name, the primer combination that retrieved the amplicon, unique read number) into a single, tab-delimited TXT file. The BLASTx results file can also be exported in XLS format for direct use in Microsoft Excel or another suitable program. A graphical overview of the De-MetaST-BLAST workflow is shown in Figure 2.
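The exact command issued by the De-MetaST-BLAST wrapper is not given in the text. As a rough sketch, a BLAST+ `blastx` call that returns the top 10 hits as XML could be assembled as follows. The function name `blastx_command` and all file and database names are hypothetical; the flags themselves (`-query`, `-db`, `-remote`, `-outfmt 5`, `-max_target_seqs`, `-num_threads`, `-out`) are standard BLAST+ options.

```python
def blastx_command(query_fasta, out_xml, remote=True, local_db=None, threads=2):
    """Build an argument list for BLAST+ blastx (illustrative, not the wrapper's code)."""
    cmd = [
        "blastx",
        "-query", query_fasta,      # FASTA file of in silico amplicons
        "-outfmt", "5",             # 5 = XML output
        "-max_target_seqs", "10",   # keep the top 10 hits per amplicon
        "-out", out_xml,
    ]
    if remote:
        # Search the nr protein database on the NCBI servers.
        cmd += ["-db", "nr", "-remote"]
    else:
        # Search a custom local database; threads are only set for local runs here.
        cmd += ["-db", local_db, "-num_threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so no BLAST installation is needed.
    print(" ".join(blastx_command("amplicons.fasta", "hits.xml")))
```

The list can be passed to `subprocess.run` on a system where BLAST+ is installed. Amplicons longer than 5000 bp would be left out of the query file, following the length restriction described earlier.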
Table 1. boxB and 16S rRNA gene in silico amplicons identified in representative metagenomes using De-MetaST-BLAST. doi:10.1371/journal.pone.0050362.t001
Figure 3. Example of De-MetaST-BLAST output.
Text within the box denotes the spreadsheet output for a boxB primer set search against the Waseca Farm Soil Metagenome (AAFX01000000) that recovers two in silico amplicons. Column descriptors are shown in color; select columns have been truncated due to space constraints. For the “excision info” column, the first alphanumeric character reports the “hit” number within a read (i.e., “1” indicates it is the first in silico amplicon found within a single read). The subsequent alphanumeric characters denote the primer orientation yielding the amplicon (F = forward, R = reverse). Whether a unique read identifier is returned is contingent upon the database itself. doi:10.1371/journal.pone.0050362.g003
Results and Discussion
We have developed a computational method to generate in silico amplifications from degenerate primer sets searched against user-defined nucleotide databases. To illustrate the utility of De-MetaST-BLAST, we demonstrate its performance using a novel degenerate primer set designed for use on environmental samples. This primer set targets the bacterial boxB gene, which encodes the oxygenase component of a multi-enzyme epoxidase (EC 1.14.13) that is specific to a benzoate catabolic pathway. Three metagenome libraries representing different environments, library sizes, and DNA sequencing methods were searched and found to contain putative boxB amplicons of the appropriate size (300 bp) (Table 1). Figure 3 shows the typical output of De-MetaST-BLAST for one of those database searches, which includes, for each in silico amplicon, the top 10 BLASTx hits with their corresponding E-values and GenBank accession numbers.
Table 2. Runtime duration of De-MetaST. doi:10.1371/journal.pone.0050362.t002
To retrieve an in silico amplicon, the program requires both primers to match their respective targets in a single sequence read or sequence assembly (contig). Thus, an important consideration when selecting searchable databases is the average length of the sequence reads or assemblies they contain relative to the desired amplicon size. This concern may be alleviated as longer read sequencing technologies are developed and/or as sequence coverage and assembly algorithms improve. Interestingly, our analysis demonstrates that in silico amplicons of ~300 bp and ~190 bp, representing boxB and 16S rRNA gene amplicons, respectively, can be readily recovered from databases dominated by short read length sequences (e.g., AntarcticaAquatic; Table 1). In fact, the 44 boxB amplicons derived from the AntarcticaAquatic dataset were found in reads that ranged from 348–541 bp in length. This result suggests that sequence coverage, or depth, is also a contributing factor to in silico amplicon recovery. Incidentally, all of the in silico amplicons recovered in this demonstration run were found to be homologous to the desired target (E-value ≤1e−4).
In terms of data mining, De-MetaST can provide complementary sequence data for gene diversity studies. As the De-MetaST output provides the sequence from the same genetic positions as that derived from a companion clone library, downstream analysis, such as sequence alignment and subsequent phylogenetic analysis, is streamlined. Thus, in silico amplicons retrieved from existing sequence datasets can be readily compared to experimentally derived clone library sequences. Furthermore, as the nucleotide sequences targeted by the primers are returned in the De-MetaST output, users can draw on that information to further refine their primers according to a desired level of functional and/or phylogenetic specificity. The program also has utility beyond searches of environmental sequence databases. It can be used to query any nucleotide dataset, including those derived from single organisms. Thus, it has use in assessing the specificity of primers targeting multi-copy or homologous genes within a single organism or group of organisms.
Benchmarks and System Requirements
De-MetaST-BLAST has been developed for the long-term support (LTS) Ubuntu operating systems 10.04 LTS and 12.04 LTS. While De-MetaST does not make use of multi-core processors, BLAST does. Benchmarks were performed on an Intel i7-2600 processor (3.4 GHz quad-core, 8-thread) desktop using the developed degenerate boxB primer set against the Waseca Farm Soil metagenome (AAFX01000000). This search took approximately 11.7 s (Table 2). When the database size was artificially and incrementally increased up to five-fold (772 Mb) by replication of the original dataset, the processing time remained <1 min. Furthermore, to determine the effect of increased numbers of positive hits on run time, the libraries were seeded with additional sequences containing the target. Doubling of targets within the databases had no effect on run time (Table 2). In contrast to the relatively rapid processing speed of De-MetaST, the BLAST function can add significant processing time, particularly if a local custom database is used. As an example, for the initial benchmark search against the locally installed Farm Soil metagenome that recovered two hits, the BLASTx function added 39.3 s using two threads. Thus, computational requirements and processing speed are primarily dictated by BLAST. When BLAST is performed remotely, which is the default setting (see below), the return time depends on the availability and processing speeds of the NCBI servers.
Both De-MetaST and De-MetaST-BLAST can be run on any operating system with a C++ compiler (e.g., standard Windows and Mac OS). However, users would need to ensure the BLAST installation is compatible with their processor.
Availability of De-MetaST-BLAST
The De-MetaST package and the De-MetaST-BLAST wrapper are made freely available at http://sourceforge.net/p/de-metast-blast/ and http://code.google.com/p/de-metast-blast/. These files are also provided as supplemental information to this publication (File S1 and File S2). Along with the program, screencast tutorial videos describe how to install the necessary programs as well as implement the software package with the example dataset provided. The De-MetaST package is self-contained and has no external dependencies, except a C++ compiler, such as g++. De-MetaST-BLAST requires a local BLAST+ suite installation that supports direct query of the NCBI nr protein database using NCBI servers via the -remote option. However, the program can also be configured to query a custom local database. Both approaches are described in the tutorial videos provided. Installation of the De-MetaST program is estimated to take 5 min, and installation of the BLAST+ suite is estimated to take 3 min, excluding download and extraction times, which depend on the user’s internet speed and processing power.
It was recently predicted that the increasing amounts of metagenome sequences will likely serve as a valuable resource in evaluation of the coverage and specificity of previously developed primer sets. De-MetaST-BLAST will provide users with a useful tool in such evaluations. De-MetaST is designed to provide in silico amplicons generated by user-defined degenerate primers found within a user-defined nucleotide database. When paired with BLAST, the program returns the most homologous GenBank hits, which are useful in assessing the specificity of degenerate primers. However, the program does not evaluate PCR kinetics and efficiencies with degenerate primers. Thus, users are encouraged to consult appropriate references on the use and design of degenerate primers, including those that discuss the merits of utilizing base analogs (e.g., inosine) that can reduce the overall degeneracy of primers.
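For readers unfamiliar with the term, the overall degeneracy of a primer is the number of distinct non-degenerate oligonucleotides it represents, which is the product of the number of bases allowed at each position. The short Python sketch below computes this value and lists the individual sequences. It is an illustration only and is not part of De-MetaST; the names `degeneracy` and `expand` are hypothetical, and base analogs such as inosine are not modeled.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide codes and the bases each one allows.
BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    return prod(len(BASES[ch]) for ch in primer.upper())


def expand(primer):
    """List every non-degenerate sequence encoded by the primer."""
    return ["".join(combo) for combo in product(*(BASES[ch] for ch in primer.upper()))]


if __name__ == "__main__":
    print(degeneracy("GGNCCR"))  # 4 choices at N times 2 at R
    print(expand("AY"))
```

A mock primer such as `GGNCCR` has a degeneracy of 8 (four choices at N, two at R), so only one eighth of the oligonucleotides in the primer pool match any single target exactly.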
Computational procedures of De-MetaST are illustrated within the De-MetaST-BLAST wrapper.
Archive containing the source code for De-MetaST.
Archive containing the source code for De-MetaST-BLAST.
We thank Dr. Charles R. Budinoff for insightful discussions and Ashley M. Frank, P. Jackson Gainer, and W. Nathan Cude for providing valuable feedback on program installation and use.
Conceived and designed the experiments: CAG. Performed the experiments: CAG TCE. Analyzed the data: CAG AB. Contributed reagents/materials/analysis tools: AB. Wrote the paper: CAG SWW AB.
- 1. Jarman SN, Deagle BE, Gales NJ (2004) Group-specific polymerase chain reaction for DNA-based analysis of species diversity and identity in dietary samples. Molecular Ecology 13: 1313–1322.
- 2. Brown MV, Schwalbach MS, Hewson I, Fuhrman JA (2005) Coupling 16S-ITS rDNA clone libraries and automated ribosomal intergenic spacer analysis to show marine microbial diversity: development and application to a time series. Environmental Microbiology 7: 1466–1479.
- 3. Brennerova MV, Josefiova J, Brenner V, Pieper DH, Junca H (2009) Metagenomics reveals diversity and abundance of meta-cleavage pathways in microbial communities from soil highly contaminated with jet fuel under air-sparging bioremediation. Environmental Microbiology 11: 2216–2227.
- 4. El Azhari N, Bru D, Sarr A, Martin-Laurent F (2008) Estimation of the density of the protocatechuate-degrading bacterial community in soil by real-time PCR. European Journal of Soil Science 59: 665–673.
- 5. Schmalenberger A, Kertesz MA (2007) Desulfurization of aromatic sulfonates by rhizosphere bacteria: high diversity of the asfA gene. Environmental Microbiology 9: 535–545.
- 6. Mehta MP, Butterfield DA, Baross JA (2003) Phylogenetic diversity of nitrogenase (nifH) genes in deep-sea and hydrothermal vent environments of the Juan de Fuca ridge. Applied and Environmental Microbiology 69: 960–970.
- 7. Bürgmann H, Widmer F, Von Sigler W, Zeyer J (2004) New molecular screening tools for analysis of free-living diazotrophs in soil. Applied and Environmental Microbiology 70: 240–247.
- 8. Luton PE, Wayne JM, Sharp RJ, Riley PW (2002) The mcrA gene as an alternative to 16S rRNA in the phylogenetic analysis of methanogen populations in landfill. Microbiology 148: 3521–3530.
- 9. Chadhain SMN, Schaefer JK, Crane S, Zylstra GJ, Barkay T (2006) Analysis of mercuric reductase (merA) gene diversity in an anaerobic mercury-contaminated sediment enrichment. Environmental Microbiology 8: 1746–1752.
- 10. Wang GZ, Wang YR, Yang PL, Luo HY, Huang HQ, et al. (2010) Molecular detection and diversity of xylanase genes in alpine tundra soil. Applied Microbiology and Biotechnology 87: 1383–1393.
- 11. Wang LP, Wang WP, Lai QL, Shao ZZ (2010) Gene diversity of CYP153A and AlkB alkane hydroxylases in oil-degrading bacteria isolated from the Atlantic Ocean. Environmental Microbiology 12: 1230–1242.
- 12. Matteson AR, Loar SN, Bourbonniere RA, Wilhelm SW (2011) Molecular enumeration of an ecologically important cyanophage in a Laurentian Great Lake. Applied and Environmental Microbiology 77: 6772–6779.
- 13. Jarman SN (2004) Amplicon: software for designing PCR primers on aligned DNA sequences. Bioinformatics 20: 1644–1645.
- 14. Staheli JP, Boyce R, Kovarik D, Rose TM (2011) CODEHOP PCR and CODEHOP PCR primer design. Methods in molecular biology (Clifton, NJ) 687: 57–73.
- 15. Rose TM, Henikoff JG, Henikoff S (2003) CODEHOP (COnsensus-DEgenerate hybrid oligonucleotide primer) PCR primer design. Nucleic Acids Research 31: 3763–3766.
- 16. Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCallum CM, et al. (1998) Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic Acids Research 26: 1628–1635.
- 17. Fuchs T, Malecova B, Linhart C, Sharan R, Khen M, et al. (2002) DEFOG: A practical scheme for deciphering families of genes. Genomics 80: 295–302.
- 18. Wei X, Kuhn D, Narasimhan G (2003) Degenerate primer design via clustering. Proceedings of the IEEE Computer Society Bioinformatics Conference. Stanford, CA. 75–83.
- 19. Linhart C, Shamir R (2007) Degenerate primer design: theoretical analysis and the HYDEN program. Methods in molecular biology (Clifton, NJ) 402: 221–244.
- 20. Najafabadi HS, Saberj A, Torabi N, Chamankhah M (2008) MAD-DPD: designing highly degenerate primers with maximum amplification specificity. Biotechniques 44: 519–+.
- 21. Dwivedi B, Schmieder R, Goldsmith DB, Edwards RA, Breitbart M (2012) PhiSiGns: an online tool to identify signature genes in phages and design PCR primers for examining phage diversity. BMC Bioinformatics 13.
- 22. Gadberry MD, Malcomber ST, Doust AN, Kellogg EA (2005) Primaclade - a flexible tool to find conserved PCR primers across multiple species. Bioinformatics 21: 1263–1264.
- 23. Ludwig W, Strunk O, Westram R, Richter L, Meier H, et al. (2004) ARB: a software environment for sequence data. Nucleic Acids Research 32: 1363–1371.
- 24. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23: 2947–2948.
- 25. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, et al. (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution 28: 2731–2739.
- 26. Kwok S, Chang SY, Sninsky JJ, Wang A (1994) A guide to the design and use of mismatched and degenerate primers. Genome Research 3: S39–S47.
- 27. Junca H, Pieper DH (2004) Functional gene diversity analysis in BTEX contaminated soils by means of PCR-SSCP DNA fingerprinting: comparative diversity assessment against bacterial isolates and PCR-DNA clone libraries. Environmental Microbiology 6: 95–110.
- 28. Hendrickx B, Junca H, Vosahlova J, Lindner A, Ruegg I, et al. (2006) Alternative primer sets for PCR detection of genotypes involved in bacterial aerobic BTEX degradation: Distribution of the genes in BTEX degrading isolates and in subsurface soils of a BTEX contaminated industrial site. Journal of Microbiological Methods 64: 250–265.
- 29. Lyons JI, Newell SY, Buchan A, Moran MA (2003) Diversity of ascomycete laccase gene sequences in a southeastern US salt marsh. Microbial Ecology 45: 270–281.
- 30. Allen AE, Ward BB, Song BK (2005) Characterization of diatom (Bacillariophyceae) nitrate reductase genes and their detection in marine phytoplankton communities. Journal of Phycology 41: 95–104.
- 31. Malmstrom RR, Coe A, Kettler GC, Martiny AC, Frias-Lopez J, et al. (2010) Temporal dynamics of Prochlorococcus ecotypes in the Atlantic and Pacific oceans. ISME Journal 4: 1252–1264.
- 32. Kibbe WA (2007) OligoCalc: an online oligonucleotide properties calculator. Nucleic Acids Research 35: W43–W46.
- 33. Kirchman DL, Yu LY, Cottrell MT (2003) Diversity and abundance of uncultured Cytophaga-like bacteria in the Delaware Estuary. Applied and Environmental Microbiology 69: 6587–6596.
- 34. Ochsenreiter T, Selezi D, Quaiser A, Bonch-Osmolovskaya L, Schleper C (2003) Diversity and abundance of Crenarchaeota in terrestrial habitats studied by 16S RNA surveys and real time PCR. Environmental Microbiology 5: 787–797.
- 35. Stach JEM, Maldonado LA, Ward AC, Goodfellow M, Bull AT (2003) New primers for the class Actinobacteria: application to marine and terrestrial environments. Environmental Microbiology 5: 828–841.
- 36. Rotthauwe JH, Witzel KP, Liesack W (1997) The ammonia monooxygenase structural gene amoA as a functional marker: Molecular fine-scale analysis of natural ammonia-oxidizing populations. Applied and Environmental Microbiology 63: 4704–4712.
- 37. López-López A, Bartual SG, Stal L, Onyshchenko O, Rodríguez-Valera F (2005) Genetic analysis of housekeeping genes reveals a deep-sea ecotype of Alteromonas macleodii in the Mediterranean Sea. Environmental Microbiology 7: 649–659.
- 38. Sun S, Chen J, Li W, Altintas I, Lin A, et al. (2011) Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Research 39: D546–D551.
- 39. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
- 40. Rather LJ, Knapp B, Haehnel W, Fuchs G (2010) Coenzyme A-dependent aerobic metabolism of benzoate via epoxide formation. Journal of Biological Chemistry 285: 20615–20624.
- 41. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al. (2005) Comparative metagenomics of microbial communities. Science 308: 554–557.
- 42. Iwai S, Johnson TA, Chai B, Hashsham SA, Tiedje JM (2011) Comparison of the specificities and efficacies of primers for aromatic dioxygenase gene analysis of environmental samples. Applied and Environmental Microbiology 77: 3551–3557.
- 43. Kwok S, Chang SY, Sninsky JJ, Wang A (1994) A guide to the design and use of mismatched and degenerate primers. PCR-Methods and Applications 3: S39–S47.
- 44. Preston GM (1997) Cloning gene family members using the polymerase chain reaction with degenerate oligonucleotide primers. Methods in Molecular Biology (Clifton, N.J.) 69: 97–113.
- 45. Liu H, Nichols R (1994) PCR amplification using deoxyinosine to replace entire codon and at ambiguous positions. Biotechniques 16: 24–26.
- 46. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology 75: 7537–7541.
- 47. Muyzer G, Dewaal EC, Uitterlinden AG (1993) Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Applied and Environmental Microbiology 59: 695–700.