Research Article

Mapping Accuracy of Short Reads from Massively Parallel Sequencing and the Implications for Quantitative Expression Profiling

  • Nicola Palmieri,

    Affiliation: Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Vienna, Austria

  • Christian Schlötterer

    Affiliation: Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Vienna, Austria

  • Published: July 28, 2009
  • DOI: 10.1371/journal.pone.0006323

Reader Comments (10)

Collaboration with tool authors required

Posted by idot on 13 Aug 2009 at 13:40 GMT

I think this article shows why tool evaluation should be a community effort conducted together with the tool authors themselves. Of course a tool should have sensible defaults. But the more versatile a tool is and the more requirements it is able to fulfill, the more esoteric its options become. To really evaluate a tool, one has to study it in more detail, and the same level of knowledge is necessary to use it properly.

I will just mention the bowtie settings used (I use bowtie myself sometimes to quickly map reads, and I am not affiliated with its authors):
-k 1, -n 3, -e 2000
-n is the maximum number of mismatches in the seed(!). -e 2000 sets a limit on a "quality-weighted hamming distance", not on the total number of mismatches. The latter would have been option -v <int>, where I don't think a maximum exists (as the authors state in the paper).
The random assignment of ambiguous reads could easily have been turned off with -m 1, as they did with the clc program (-r ignore).
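
To make the distinction concrete, here is a minimal Python sketch of the two measures. It is only an illustration, not bowtie's actual code: the function names and the example read, reference, and qualities are made up, and it ignores any rounding or capping of quality values that an aligner may apply.

```python
def mismatch_count(read, ref):
    """Plain Hamming distance: number of positions where read and reference differ."""
    return sum(1 for r, g in zip(read, ref) if r != g)


def quality_weighted_distance(read, ref, quals):
    """Sum of the Phred qualities at the mismatched positions only."""
    return sum(q for r, g, q in zip(read, ref, quals) if r != g)


# Made-up example: two mismatches, one at a low-quality base (Q10)
# and one at a high-quality base (Q40).
read = "ACGTACGTAC"
ref = "ACGAACGTTC"
quals = [30, 30, 30, 10, 30, 30, 30, 30, 40, 30]

print(mismatch_count(read, ref))                    # 2
print(quality_weighted_distance(read, ref, quals))  # 50
```

A cutoff on the first number treats every mismatch equally, while a cutoff on the second lets through more mismatches when they fall on low-quality bases, so the two settings are not interchangeable.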

No competing interests declared.

RE: Collaboration with tool authors required

lh3lh3 replied to idot on 14 Aug 2009 at 10:20 GMT

Agreed. To evaluate aligners, one must fully understand how each aligner works. Sometimes even the developer is not sure about the behavior of his or her own aligner, let alone others.

For bowtie, one can discard repetitive hits by running it with -m1, or by running it with --best -k2 and filtering later on. I think it is right to use -e. Using -v 3 or more is inefficient. For maq, one can simply set a threshold on mapping quality to discard repetitive hits.
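
The "filter later on" step and the mapping-quality threshold could look roughly like the Python sketch below. It rests on assumptions: the alignments are tab-separated text with the read name in the first column, a read with more than one reported hit appears on more than one line, and the mapping quality sits in a column whose index you pass in. The function names, column layout, and example rows are hypothetical and should be checked against the real output of each aligner.

```python
from collections import Counter


def unique_hits(lines, name_col=0):
    """Keep only alignment rows whose read name occurs exactly once."""
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    counts = Counter(row[name_col] for row in rows)
    return [row for row in rows if counts[row[name_col]] == 1]


def min_mapq(rows, qual_col, threshold):
    """Keep only rows whose mapping quality is at least `threshold`."""
    return [row for row in rows if int(row[qual_col]) >= threshold]


# Made-up alignment lines: read2 was reported twice, so it is ambiguous.
lines = [
    "read1\t+\tchr2L\t100\tACGT\tIIII",
    "read2\t+\tchr2L\t250\tACGT\tIIII",
    "read2\t-\tchr3R\t900\tACGT\tIIII",
    "read3\t-\tchrX\t42\tACGT\tIIII",
]
kept = unique_hits(lines)
print([row[0] for row in kept])  # ['read1', 'read3']

# Made-up rows with the mapping quality in the last column (index 3).
mapped = [["readA", "chr2L", "100", "37"], ["readB", "chr2L", "200", "0"]]
print([row[0] for row in min_mapq(mapped, qual_col=3, threshold=1)])  # ['readA']
```

Either way, ambiguous reads are dropped rather than assigned at random, which is what a comparison of mapping accuracy needs.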

In addition, in Table 2, the authors should map more than 1,000,000 reads to evaluate the speed. SeqMap and maq are highly inefficient when given only 100,000 reads. I do not know how CLC Bio NGS works.

Competing interests declared: Also writing aligners.