Conceived and designed the experiments: YS SS YL OH IP. Performed the experiments: YS SS YL IP. Analyzed the data: YS SS YL OH IP. Wrote the paper: YS SS OH IP.
The authors have declared that no competing interests exist.
Whole-genome sequencing represents a promising approach to pinpoint chemically induced mutations in genetic model organisms, thereby short-cutting time-consuming genetic mapping efforts.
We compare here the ability of two leading high-throughput platforms for paired-end deep sequencing, SOLiD (ABI) and Genome Analyzer (Illumina; “Solexa”), to achieve the goal of mutant detection. As a test case we used a mutant
In conclusion, whole-genome sequencing conducted by either platform is a viable approach for the identification of single-nucleotide variations in the
Genetically amenable model organisms have been extensively subjected to forward genetic screening approaches in which mutant individuals that are defective in a given biological process are isolated. Mutant isolation has traditionally been followed by time-consuming mapping procedures that localize the experimentally induced region to a specific locus. We have previously shown that sequencing with the Genome Analyzer (GA) by Illumina is capable of identifying a molecular lesion in a
In an effort to better inform the design, implementation and analysis of such genome-wide deep sequencing experiments, we now report sequencing of the same
We sequenced genomic DNA samples, isolated from the
Both SOLiD and GA reads were produced as paired ends, with SOLiD reads of 25 bp and GA reads of 35 bp. In total, 146 million SOLiD reads were mapped to the wild-type reference genome, representing an average depth-coverage of 33× (
Platform | SOLiD by Maq | GA by Maq | SOLiD by corona-lite |
Read size | 25 bp | 35 bp | 25 bp |
Total reads (million) | 256 | 125 | 256 |
Mapped (million reads/Gb) | 146/3.65 | 84.5/2.96 | 109/2.73 |
Good pairs (million)/percentage | 41.3/57% | 38.6/91% | 35.5/65% |
Single end mapped (million)/percentage | 37.3/26% | 2.23/2.6% | 35.2/32.6% |
Avg. depth-coverage | 33× | 28× | 27× |
Avg. depth-coverage from good reads | 25× | 25× | NA |
Mapped to different chromosomes (million)/percentage | 21.9/15% | 0.3/0.4% | NA |
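The "Mapped" row of the table can be cross-checked with simple arithmetic: mapped bases equal mapped reads times read length. The short sketch below does this for the three columns. The function names are ours, and the reading of the "Good pairs" percentage as twice the number of good pairs divided by the number of mapped reads is an interpretation that happens to agree with the tabulated values, not a definition stated in the text.

```python
# Values copied from the table above (mapped reads in millions, read length in bp).
MAPPED_READS_MILLION = {"SOLiD by Maq": 146.0, "GA by Maq": 84.5, "SOLiD by corona-lite": 109.0}
READ_LENGTH_BP = {"SOLiD by Maq": 25, "GA by Maq": 35, "SOLiD by corona-lite": 25}


def mapped_gigabases(reads_million: float, read_length_bp: int) -> float:
    """Total mapped sequence in Gb for a read count given in millions."""
    return reads_million * 1e6 * read_length_bp / 1e9


def fraction_in_good_pairs(good_pairs_million: float, mapped_reads_million: float) -> float:
    """Fraction of mapped reads that sit in good pairs (two reads per pair).

    Assumption: this is how the table's percentage is defined.
    """
    return 2.0 * good_pairs_million / mapped_reads_million


if __name__ == "__main__":
    for name, reads in MAPPED_READS_MILLION.items():
        gb = mapped_gigabases(reads, READ_LENGTH_BP[name])
        print(f"{name}: {gb:.2f} Gb mapped")
```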
The depth-coverage distributions of the entire genome from both SOLiD and GA sequencing are summarized in
The distribution of depth-coverage of the entire genome is shown for both SOLiD and GA. Poisson and gamma distributions with comparable mean values are superimposed on the observed distribution. Only reads that are mapped with no more than three mismatches and without inconsistent mate-pairs are counted in the depth-coverage calculation.
Depth-coverage | SOLiD | GA |
≥ 0 | 99.98% | 99.96% |
≥ 5 | 99.65% | 99.45% |
≥ 10 | 97.71% | 95.83% |
≥ 15 | 91.64% | 86.22% |
≥ 20 | 78.58% | 70.85% |
≥ 25 | 59.19% | 52.96% |
≥ 30 | 38.79% | 36.18% |
≥ 100 | 1.56% | 0.383% |
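A table of this kind can be derived from per-base depth values by counting, for each threshold, the fraction of positions whose depth reaches it. The following is a minimal sketch of that calculation; the function name and the toy depth values are illustrative and are not taken from the sequencing data.

```python
from typing import Dict, Iterable, Sequence


def fraction_at_or_above(depths: Sequence[int], thresholds: Iterable[int]) -> Dict[int, float]:
    """For each threshold, return the fraction of positions with depth >= threshold."""
    total = len(depths)
    if total == 0:
        raise ValueError("depths must contain at least one position")
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


if __name__ == "__main__":
    # Made-up per-base depths for ten positions, for illustration only.
    toy_depths = [0, 4, 7, 12, 18, 25, 25, 31, 40, 110]
    for threshold, fraction in fraction_at_or_above(toy_depths, [5, 10, 15, 20, 25, 30, 100]).items():
        print(f">= {threshold}: {fraction:.0%}")
```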
7385 variants were called from GA, and 5798 were called from SOLiD. We considered mapping errors and sequencing errors to improve the accuracy of variant calling. Specifically, we only considered variants that meet the following conditions:
There are at least two reads from both strands that contain the variant allele. Any lower threshold significantly increases the number of reported variants that are likely false positives, without adding many true positives (see below).
The average number of hits per read in the position is less than 1.1. This represents a conservative cutoff to avoid repeats and alleviate the mapping issues with the shorter SOLiD reads.
The depth-coverage is less than 60. That is, we filtered out variants at positions with >60× coverage, as those are suspected to lie within repeat regions.
GA reads only: The number of reads representing the wild-type allele is less than the number of reads representing the variant allele. This condition is based on our previous analysis and validation of the GA genome dataset
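The four conditions above can be expressed as a single predicate applied to each candidate variant. The sketch below is one possible encoding, not the code used in this study: the `VariantCall` record and its field names are hypothetical, and "at least two reads from both strands" is read as at least two variant reads on each strand.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-position summary; field names are illustrative."""
    position: int
    depth: int                  # total depth-coverage at the position
    variant_reads_forward: int  # variant-allele reads on the forward strand
    variant_reads_reverse: int  # variant-allele reads on the reverse strand
    wildtype_reads: int         # reads supporting the wild-type allele
    avg_hits_per_read: float    # average number of genomic hits per read


def passes_variant_filter(call: VariantCall, platform: str) -> bool:
    """Apply the filtering conditions listed above; platform is "SOLiD" or "GA"."""
    # Condition 1: at least two variant reads on each strand (our reading of the rule).
    if min(call.variant_reads_forward, call.variant_reads_reverse) < 2:
        return False
    # Condition 2: average hits per read below 1.1, to avoid repeats.
    if call.avg_hits_per_read >= 1.1:
        return False
    # Condition 3: depth-coverage below 60.
    if call.depth >= 60:
        return False
    # Condition 4 (GA only): wild-type reads must be fewer than variant reads.
    if platform == "GA":
        variant_reads = call.variant_reads_forward + call.variant_reads_reverse
        if call.wildtype_reads >= variant_reads:
            return False
    return True
```

Keeping the GA-specific rule behind an explicit platform argument makes it easy to run both datasets through the same function and compare the surviving sets.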
After this filtering, 901 genomic variants remained for SOLiD and 1094 for GA; 685 of them were shared by both platforms (
Variant set | SOLiD | GA | Common variants (confirmed true/confirmed false/repeats) |
Raw | 5798 | 7385 | 1689 (NA/NA/559) |
Filtered | 901 | 1094 | 685 (NA/NA/0) |
ChrV 4 Mb region, raw | 180 | 180 | 42 (32 |
ChrV 4 Mb region, filtered | 24 | 35 | 23 (23 |
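Counting the variants common to both platforms amounts to intersecting the two call sets on a shared key. A minimal sketch follows, assuming each call is keyed by chromosome and position; the key choice and the toy positions are ours and are not data from this study.

```python
from typing import Set, Tuple

# Assumed key for a variant call: (chromosome, position).
VariantKey = Tuple[str, int]


def shared_variants(solid_calls: Set[VariantKey], ga_calls: Set[VariantKey]) -> Set[VariantKey]:
    """Return the calls reported by both platforms."""
    return solid_calls & ga_calls


if __name__ == "__main__":
    # Made-up positions, for illustration only.
    solid = {("V", 1000), ("V", 2000), ("V", 3000)}
    ga = {("V", 2000), ("V", 3000), ("V", 4000)}
    common = shared_variants(solid, ga)
    print(f"SOLiD: {len(solid)}, GA: {len(ga)}, common: {len(common)}")
```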
Variants listed in
Position on chromosome V | GA variants called in Ref. | Found by SOLiD | Why not found by SOLiD? | Type of variant |
6302463 | YES | NO | Repeats | non-exonic |
6889636 | YES | NO | Low coverage | exonic, silent |
6889637 | YES | NO | Low coverage | exonic, amino-acid changing |
6956711 | YES | NO | Low coverage | exonic, amino-acid changing |
6956743 | YES | NO | Low coverage | exonic, amino-acid changing |
6956744 | YES | NO | Low coverage | exonic, amino-acid changing |
7245105 | YES | YES | | non-exonic |
7377580 | YES | YES | | non-exonic |
7403427 | YES | YES | | exonic, silent |
7430567 | YES | YES | | exonic, amino-acid changing |
7524635 | YES | YES | | non-exonic |
7546600 | YES | YES | | exonic, amino-acid changing |
7860248 | YES | NO | Repeats | non-exonic |
7953203 | NO | YES | | non-exonic |
8101405 | YES | NO | Mapping | non-exonic |
8571627 | YES | YES | | exonic, amino-acid changing |
8646873 | YES | YES | | non-exonic |
8657771 | YES | YES | | non-exonic |
8758179 | YES | YES | | exonic, amino-acid changing |
9059200 | YES | YES | | non-exonic |
9217870 | YES | YES | | non-exonic |
9218397 | YES | YES | | non-exonic |
9245971 | YES | YES | | exonic, amino-acid changing |
9376379 | YES | YES | | exonic, amino-acid changing |
9662867 | YES | NO | Not covered | non-exonic |
9663159 | YES | YES | | non-exonic |
9707449 | YES | YES | | non-exonic |
9846725 | YES | YES | | exonic, amino-acid changing |
9927293 | YES | YES | | non-exonic |
9928614 | YES | YES | | exonic, silent |
9986752 | YES | YES | | exonic, silent |
10234234 | YES | YES | | non-exonic |
10397711 | YES | YES | | non-exonic |
The nucleotide changes of the variants are shown in
This is the variant that is responsible for the mutant phenotype of
The causal mutation in the
Without any filtering, both platforms identified about 180 raw variants in the 4 Mb region, 42 of which were shared. Among the shared variants, 32 were confirmed by Sanger sequencing, 9 were left unvalidated, and 1 was confirmed false. Eight of the unvalidated variants and the 1 confirmed false variant were from repeat regions, as indicated by the average number of hits per read (>1.1). The other unvalidated variant might be a false positive because it appears to be heterozygous. It is notable that, in total, 16 variants from GA were confirmed to be false in our previous study, yet only one false positive was shared between datasets
We called small insertions and deletions (indels) using Maq. As with variant filtering, we discarded indels that were likely to arise from mapping errors or sequencing errors. We designed two sets of rules:
Normal filtering:
coverage <80, and
number of indel reads from each strand >1.
Liberal filtering:
coverage <80, and
number of indel reads from each strand ≥ 1.
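The two rule sets differ only in the minimum number of indel-supporting reads required on each strand. A minimal sketch of both filters is given here; the function and parameter names are illustrative and are not part of Maq.

```python
def passes_indel_filter(coverage: int, indel_reads_forward: int,
                        indel_reads_reverse: int, mode: str = "normal") -> bool:
    """Apply the normal or liberal indel filtering rules described above."""
    if mode == "normal":
        min_reads_per_strand = 2  # "> 1" reads from each strand
    elif mode == "liberal":
        min_reads_per_strand = 1  # ">= 1" read from each strand
    else:
        raise ValueError("mode must be 'normal' or 'liberal'")
    # Both rule sets require coverage below 80.
    if coverage >= 80:
        return False
    return min(indel_reads_forward, indel_reads_reverse) >= min_reads_per_strand
```

An indel supported by three forward reads and one reverse read, for example, fails the normal rules but passes the liberal ones.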
The indel results are summarized in
# of indels | SOLiD | GA | common |
normal filtering | 420 | 1280 | 374 |
liberal filtering | 782 | 1796 | 663 |
From the chrV 4 Mb region, 26 indels were reported in our previous study of the GA sequence run
For the GA, with the normal filtering rules described above, we obtained 29 indels, 25 of which were validated by manual re-sequencing and 4 of which were left unvalidated. One published indel was missed here due to low coverage from one DNA strand. With liberal filtering, we obtained all 26 confirmed indels and an additional 14 indels that were left unvalidated.
The SOLiD technology has built-in error-detection and correction. The corona-lite mapping tool (
We compared the ability of two high-throughput platforms to detect mutations isolated in forward genetic screens. We were able to find the causal mutation with both SOLiD and Solexa at similar average coverage. The SOLiD reads are shorter (25 bp versus 35 bp). This is likely the main reason for the lower fraction of mapped reads in good pairs, and one reason why a larger fraction of the genome is covered at more than 100×; this fraction mostly consists of regions with repeated 25-mers.
In the chrV 4 Mb region in which we had mapped
In practice, it is important to further reduce the cost of sequencing while keeping reasonable sensitivity and accuracy of mutation detection. Given the instrument and protocol, one way to do this is to reduce the overall depth-coverage to a minimum level. Based on our results, requiring at least two reads from both strands is a good basic criterion for calling variants. If we assume that the depth-coverage follows a gamma distribution, Gamma(6,
The x-axis is the average depth-coverage. The y-axis is the theoretical fraction of genome where potential variants can be detected under the assumptions described above. The red dot marks 95% sensitivity at 13× coverage.
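A curve of this kind can be approximated in closed form under a simple model. The sketch below is an illustration under assumptions of our own, which may differ from those behind the figure, so its numbers need not match the plotted curve: local depth follows a gamma distribution with shape 6 and the given mean; reads fall on each strand independently as Poisson counts with half the local depth as their mean; every read at a variant site carries the variant allele; and a variant is detectable when each strand has at least two reads.

```python
import math


def gamma_moment(shape: float, scale: float, power: int, decay: float) -> float:
    """E[x**power * exp(-decay * x)] for x ~ Gamma(shape, scale)."""
    log_ratio = math.lgamma(shape + power) - math.lgamma(shape)
    return math.exp(log_ratio) * scale ** power / (1.0 + decay * scale) ** (shape + power)


def detectable_fraction(mean_depth: float, shape: float = 6.0) -> float:
    """Fraction of the genome with >= 2 reads on each strand, under the stated model."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    # Per-strand depth x is half the local depth, so x ~ Gamma(shape, strand_scale).
    strand_scale = (mean_depth / 2.0) / shape
    # q(x) = P(fewer than 2 reads on one strand) = exp(-x) * (1 + x).
    expected_q = (gamma_moment(shape, strand_scale, 0, 1.0)
                  + gamma_moment(shape, strand_scale, 1, 1.0))
    # q(x)**2 = exp(-2x) * (1 + 2x + x**2).
    expected_q_squared = (gamma_moment(shape, strand_scale, 0, 2.0)
                          + 2.0 * gamma_moment(shape, strand_scale, 1, 2.0)
                          + gamma_moment(shape, strand_scale, 2, 2.0))
    # P(detect) = E[(1 - q)**2] = 1 - 2 E[q] + E[q**2].
    return 1.0 - 2.0 * expected_q + expected_q_squared


if __name__ == "__main__":
    for depth in (5, 10, 13, 20, 30):
        print(f"{depth}x: {detectable_fraction(depth):.3f}")
```

Because the gamma expectation of a polynomial times an exponential has a closed form, no numerical integration is needed; changing the shape parameter or the per-strand read threshold would require re-deriving the expanded terms.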
In conclusion, we found that SOLiD calls fewer false-positive variants than the GA, and the GA calls fewer false-negative variants than SOLiD. The tradeoff between tolerating false negatives and being able to follow up false candidates is therefore an important determinant for platform choice (summarized in
Feature | Preferred Platform |
Reducing false positives | SOLiD |
Reducing false negatives | GA |
Raw accuracy | SOLiD |
Mapping | GA |
Ease of library preparation | GA |
We emphasize that this work represents only a snapshot-comparison undertaken during a technological tornado. Multiple vendors, including those discussed here, but others as well
Genomic DNA preparation: Genomic
Sequencing runs were performed by Agencourt Bioscience Corporation (a Beckman Coulter Company) for the ABI SOLiD runs and by Illumina's in-house sequencing services for the GA sequence run, as described in
All reads were mapped using Maq. The maximum allowed number of mismatches per read was 3 for both platforms. This cutoff was selected to accommodate both mismatches due to true variants and mismatches due to errors, which are rare per bp but much more frequent per read. The maximum outer distance for a correct read pair was set to 5000 for SOLiD and 250 for GA. Other parameters were left at their defaults. The SOLiD reads were treated slightly differently from the GA reads in Maq. The -p option of the pileup function is supposed to output only the read pairs that are regarded as good, but none of the SOLiD pairs are regarded as good, because SOLiD reads are always FF or RR oriented, whereas only FR reads are regarded as good according to the man page entry for the mapview function. We therefore took all good paired SOLiD reads as well as single-end mapped reads and re-mapped them using Maq. For the GA reads, we generated the pileup with the -p option. Thus the only difference is that single-end reads are discarded in the GA pileup, which should have little impact on the variant and coverage analysis because they contributed only 2.6% of all mapped GA reads.
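The orientation issue described above can be illustrated with a platform-aware check for a good pair. This is a sketch of the idea only and not Maq's actual logic: the function name, the strand encoding ("F" or "R" for each mate, leftmost mate first), and the rule table are assumptions, while the maximum outer distances are the values given above.

```python
# Allowed mate orientations and maximum outer distance (bp) per platform.
# Orientations follow the description above; the encoding is illustrative.
PAIR_RULES = {
    "GA": {"orientations": {"FR"}, "max_outer_distance": 250},
    "SOLiD": {"orientations": {"FF", "RR"}, "max_outer_distance": 5000},
}


def is_good_pair(strand_first: str, strand_second: str,
                 outer_distance: int, platform: str) -> bool:
    """Return True if the mates have an expected orientation and span for the platform."""
    rules = PAIR_RULES[platform]
    orientation = strand_first + strand_second
    if orientation not in rules["orientations"]:
        return False
    return 0 < outer_distance <= rules["max_outer_distance"]
```

Under an FR-only rule every SOLiD pair would be rejected, which is why the SOLiD reads needed separate handling.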
We thank Marc Rubenfield at Agencourt Bioscience Corporation (a Beckman Coulter Company) for generating the ABI SOLiD data.