Reader Comments

DNA barcoding of fungi

Posted by dhickey on 05 Feb 2007 at 17:39 GMT

I share the concern of Nilsson et al. regarding the taxonomic reliability of DNA sequences in public databases, especially if these sequences are to be used as DNA barcodes. I would like to point out, however, that this concern is shared by the entire DNA barcoding community. Indeed, this is the reason that a special "BARCODE" keyword has been developed by GenBank. Only those sequences that meet a given level of taxonomic reliability will be assigned the keyword. Currently, there are about 6,000 sequences in GenBank that have been assigned this keyword.
Not only did Nilsson et al. use non-validated sequences for their analysis, they also used an unreliable method of finding related sequences in the database. While the BLAST algorithm is very powerful for quick database searches, it has been shown to be very unreliable for distinguishing between degrees of sequence relatedness (see Koski LB and Golding GB. (2001) The closest BLAST hit is often not the nearest neighbor. J Mol Evol. 52:540-542). In addition, the default gap penalties in BLAST are optimized for protein coding sequences and will give poor results for ITS sequences where indels are much more frequent.
In contrast to the findings of Nilsson et al. using ITS sequences, we have found that the cytochrome oxidase 1 barcode region (also gathered from the public databases) can provide a very high level of species identification among the fungi (Min XJ and Hickey, DA (2007) Assessing the effect of varying sequence length on DNA barcoding of fungi. Molec. Ecol. Notes. Published article online: 22-Jan-2007).
Finally, we would like to point out that recent studies of animal barcodes confirm that the main value of DNA barcoding will be in the assignment of specimens to species rather than in the building of molecular phylogenies that reflect species relationships (Hajibabaei M, Singer GA, and Hickey DA. (2006) Benchmarking DNA barcodes: An assessment using available primate sequences. Genome. 49: 851-854).

RE: DNA barcoding of fungi

RHNi replied to dhickey on 07 Feb 2007 at 15:21 GMT

>Indeed, this is the reason that a special "BARCODE" keyword has been
>developed by GenBank. Only those sequences that meet a given level of
>taxonomic reliability will be assigned the keyword.

I do not disagree with this - on the contrary, I believe it is the way forward. Indeed, such an approach underpins the whole of the UNITE initiative.



>Not only did Nilsson et al. use non-validated sequences for their analysis,

As the title of the paper suggests, we wanted to estimate the taxonomic reliability of all fungal [ITS] sequences in GenBank, not any particular subset of sequences. This reliability has been the subject of much speculation – with every opinion from “very poor” to “unexpectedly good” having been voiced in the debate – and we wanted to provide a reasonably objective estimate of it. It is, and I'm sure You agree, better to deduce through estimation than to suspect through inclination.

You could call the INSD fungal ITS sequences “non-validated”. You could also call them “primary data”, as many others do. Either way, estimates of their taxonomic reliability are now available.



>they also used an unreliable method of finding related sequences in the
>database. While the BLAST algorithm is very powerful for quick database
>searches, it has been shown to be very unreliable for distinguishing between
>degrees of sequence relatedness (see Koski LB and Golding GB. (2001) The
>closest BLAST hit is often not the nearest neighbor. J Mol Evol. 52:540-542).

I do not disagree. Indeed, with regard to BLAST and relatedness, You will find “...and are associated with a range of additional complications such that [BLAST's] use for taxonomic identification has been cautioned in recent years...” in the very Introduction of the article. Sequence identification through automated phylogenetic analysis is the default in the UNITE database.

I take it that You are referring to the “Sequences best matched by an identified sequence” estimate. As its name suggests, this estimate gives the proportion of sequences that are best matched [by BLAST] by an identified [to species level] sequence. It is not called “Sequences most closely related to an identified sequence” simply because, as You point out, that is not what we estimate.
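
As an illustration of what such a proportion measures, here is a minimal sketch in Python. It is not the code used in the article: the data layout (a dictionary of query IDs mapping to lists of hits), the use of a single numeric score to pick the best hit, and the boolean "identified to species level" flag are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical hit record: (subject ID, score, identified to species level?)
Hit = Tuple[str, float, bool]


def best_hit(hits: List[Hit]) -> Hit:
    """Return the hit with the highest score (assumed to be the 'best match')."""
    return max(hits, key=lambda hit: hit[1])


def proportion_best_matched_by_identified(results: Dict[str, List[Hit]]) -> float:
    """Fraction of queries whose best hit is a sequence identified to species level.

    Queries without any hit are left out of the denominator (an assumption).
    """
    hit_lists = [hits for hits in results.values() if hits]
    if not hit_lists:
        raise ValueError("no query has any hit")
    n_identified = sum(1 for hits in hit_lists if best_hit(hits)[2])
    return n_identified / len(hit_lists)


# Made-up example data: two of the three queries with hits are best matched
# by a sequence that carries a species name.
example_results = {
    "query1": [("subjA", 500.0, True), ("subjB", 480.0, False)],
    "query2": [("subjC", 300.0, False), ("subjD", 310.0, True)],
    "query3": [("subjE", 200.0, False)],
    "query4": [],
}
example_proportion = proportion_best_matched_by_identified(example_results)
```

Note that the sketch says nothing about relatedness: it only counts whose top-scoring match carries a species name, which is the distinction drawn above.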



>the default gap penalties in BLAST are optimized for protein coding sequences
>and will give poor results for ITS sequences where indels are much more
>frequent.

I do not disagree. But then again, to estimate the taxonomic reliability of the sequences, we employed only sequences of sufficient length where the two lexicographically heterospecific sequences were at least 98.5% identical over the whole length of the shorter of the two sequences (that is, the two sequences typically differed by, say, 5 bp).

While I share Your concerns about the way BLAST scores gaps in this context, I'm sure You will agree that the default gap penalties of BLAST have no impact on this estimate. No sequence pair was erroneously grouped together and used for comparison due to these.
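
The 98.5% criterion described above can be sketched as follows. This is a hedged illustration only: it assumes the two sequences have already been aligned to equal length with "-" as the gap character, and it counts identity as matching bases divided by the ungapped length of the shorter sequence. The function names and the 500 bp example sequences are invented for the example; the article may have computed identity differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity relative to the ungapped length of the shorter sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both sequences carry the same base (gaps never match).
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one base")
    return 100.0 * matches / shorter


def passes_threshold(aligned_a: str, aligned_b: str, threshold: float = 98.5) -> bool:
    """True if the pair is at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold


def substitute(seq: str, positions) -> str:
    """Return a copy of `seq` with a different base at each given position."""
    chars = list(seq)
    for i in positions:
        chars[i] = "C" if chars[i] != "C" else "A"
    return "".join(chars)


# Made-up 500 bp sequence and two variants of it.
reference = "ACGT" * 125
five_diffs = substitute(reference, [0, 100, 200, 300, 400])   # 5 substitutions
eight_diffs = substitute(reference, range(0, 32, 4))          # 8 substitutions
```

With these made-up 500 bp sequences, five substitutions leave the pair 99.0% identical and above the threshold, while eight substitutions give 98.4% and fall below it. No gap penalty enters the calculation in this sketch.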



>In contrast to the findings of Nilsson et al. using ITS sequences, we have
>found that the cytochrome oxidase 1 barcode region (also gathered from the
>public databases) can provide a very high level of species identification
>among the fungi (Min XJ and Hickey, DA (2007) Assessing the effect of
>varying sequence length on DNA barcoding of fungi. Molec. Ecol. Notes.
>Published article online: 22-Jan-2007).

You will find that we do not challenge the use of CO1 in the article, nor do we suggest that the ITS region is inappropriate for barcoding purposes. I therefore cannot follow You when You say “In contrast to the findings of Nilsson et al...”. What is intended?



>Finally, we would like to point out that recent studies of animal barcodes
>confirm that the main value of DNA barcoding will be in the assignment of
>specimens to species rather than in the building of molecular phylogenies
>that reflect species relationships (Hajibabaei M, Singer GA, and Hickey DA.
>(2006) Benchmarking DNA barcodes: An assessment using available primate
>sequences. Genome. 49: 851-854).

You will find that we do not challenge this in the article. The high degree of sequence variation needed for species identification should serve to preclude any such gene from being widely applicable for phylogenetic purposes (other than, I suppose, for limited taxonomic scopes). For similar reasons, you don't see many people approaching the large-scale phylogeny of the fungi through ITS-based analyses.

Sincerely,

Henrik Nilsson