The authors have declared that no competing interests exist.
Conceived and designed the experiments: JMZ MS. Performed the experiments: JM SKS. Analyzed the data: JMZ DS. Wrote the paper: JMZ DS JM SKS MS.
While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
As sequencing costs drop, it is becoming cost-effective to sequence even whole genomes to a sufficient depth that random errors become insignificant. However, systematic sequencing errors (SSEs) and biases remain problematic even at high sequencing depths, so recent research has started to focus on understanding these SSEs and biases
(a) Observed variants in the reads can result from a variety of biological causes and sequencing errors and biases. (1) Random sequencing errors are relatively rare at any given position in the reference, and are generally reflected accurately in the reported base quality score from the instrument. (2) Biological variants that are included in the SNP database (e.g., dbSNP for humans) are excluded from the base quality score recalibration (BQSR), and therefore do not decrease the empirical quality scores. (3) RNA editing can occur at frequencies less than 50%, so it can be difficult to distinguish from SSEs. These observed variants are treated as SSEs by the BQSR algorithm, incorrectly decreasing their base quality scores and quality scores of similar bases in other locations in the genome. (4) Biological variants that are not in dbSNP are also treated as SSEs by the BQSR algorithm, again decreasing their and similar bases’ recalibrated quality scores. (5) Since variant bases are only seen on one strand, they are likely to be SSEs. In this case, the BQSR algorithm would decrease the quality scores of the dinucleotide on the forward strand (GG). (b) Example reads and the covariates for each base used by GATK BQSR. The red columns would be counted as errors when calculating empirical quality scores. (c) Schematic of the GATK BQSR process, in which reported quality scores from the instrument are adjusted (or “recalibrated”) using empirical quality scores associated with the covariates reported quality score, machine cycle, and dinucleotide context.
The first approach, SysCall, used a methyl-Seq dataset that had overlapping paired-end reads to detect SSEs depending on sequencing direction for the Illumina sequencer
The second approach, GATK, is a suite of tools used for variant calling based on the map-reduce framework
One attractive feature of the GATK BQSR package is that it compensates for biases in the machine’s estimation of base calls and their quality scores by lowering the base qualities of SSEs independently in each sample, so that it can correct run- or batch-specific systematic sequencing errors. This appears to contradict established metrological practice for quantity (or numerical) values (see
In order to find run-specific SSEs, GATK BQSR typically assumes that the sample contains neither mis-mapped/mis-aligned reads nor variants (except at the bases defined by dbSNP). Since many rare variants are not in dbSNP and some SSEs might be in dbSNP
A few previous methods have used external spike-ins to calibrate base or variant calls. The Ibis base caller for Illumina uses the PhiX spike-in provided by Illumina to achieve more accurate base calls and quality scores by correcting machine cycle biases
Here, we set out to test whether synthetic spike-in DNA or RNA standards with well-characterized sequence purities can be used in the GATK BQSR framework to improve detection and correction of SSEs in any sample without the need for a comprehensive and accurate SNP database.
A set of 96 DNA plasmids with 273 to 2022 base pair standard sequences inserted in a ∼2800 base pair vector is a prospective NIST Standard Reference Material and a product of the External RNA Control Consortium (ERCC). The DNA plasmids were designed to be templates for RNA spike-in standards for gene expression measurements. Although the spike-in standards are generally used for quantitation, the DNA plasmids were extensively characterized by Sanger, Illumina, SOLiD2, and SOLiD3 sequencing and thus are useful for characterizing SSEs as well. These spike-in standards could be used for characterizing SSEs both in DNA and RNA sequencing, but in this paper we focus on RNA spike-ins.
The base quality scores reported by the instrument are frequently not accurate measures of error rates, in part due to SSEs associated with covariates such as machine cycle and dinucleotide context. Base quality score recalibration is commonly used to compensate for SSEs by adjusting base qualities, using the empirical error rates measured for bases with specific covariate values. These empirical error rates can be converted to Phred-scaled quality scores, and the differences between these empirical base quality scores and the reported base quality scores are the “recalibration coefficients” used to recalibrate the base quality scores. Ideally, only highly pure bases (
The 78950 highly pure bases in the spike-in standards are much fewer than in the whole genome or transcriptome, but the coverage is much higher. Therefore, it is important to understand possible inaccuracies in calculating the GATK BQSR coefficients due to the limited size of the standards. Recalibration of base quality (BQ) scores is typically based on three covariates: reported quality score (RpQS), machine cycle (position in the read), and dinucleotide context (the base and the previous base). To achieve statistically relevant recalibration scores for the large number of combinations of these covariates, recalibration must be performed using a sufficient number of bases in the reference and sufficient coverage by reads mapping to the reference. To determine whether the coverage and number of bases in the spike-in standards are sufficient, we randomly removed mapped reads and/or reference bases from the recalibration calculations, and calculated the effect on the recalibration scores.
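The relationship between empirical error rates, Phred-scaled quality scores, and recalibration coefficients can be sketched as follows. This is a minimal illustration and not the GATK implementation: the function names, the input tuple layout, and the add-one smoothing of the error rate are assumptions made for the example.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def recalibration_coefficients(observations):
    """Return (empirical - reported) quality for each covariate group.

    `observations` is an iterable of
    (reported_q, cycle, dinucleotide, is_error) tuples, one per base
    aligned to a high-purity reference position (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for reported_q, cycle, dinuc, is_error in observations:
        group = (reported_q, cycle, dinuc)
        counts[group][0] += int(is_error)
        counts[group][1] += 1

    coefficients = {}
    for group, (errors, total) in counts.items():
        # Add-one smoothing is an assumption of this sketch; it avoids
        # log10(0) for groups in which no errors were observed.
        empirical_q = phred((errors + 1) / (total + 2))
        coefficients[group] = empirical_q - group[0]
    return coefficients
```

A negative coefficient means the instrument reported a quality score that is too high for that covariate group. Groups with few observations give unstable estimates, which is why the coverage and the number of reference bases matter.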
As expected, reducing coverage or the size of the spike-in standard “genome” results in some inaccuracies in BQSR (see
To demonstrate the relative performance of BQSR using the spike-in standards compared to the typical method using the whole genome, we compared the recalibration inaccuracies caused by the limited size and coverage of the spike-in standards to the inaccuracies caused by recalibrating based on the genome. For this comparison, the 96 RNA spike-in standard sequences were spiked into human genomic RNA samples. To obtain an upper limit on the errors introduced by the limited size and coverage of the spike-in standards, half of the bases were randomly selected from the spike-in standards (spike-in set A). Then, the mean absolute difference was calculated between the recalibration values obtained from the selected bases and the recalibration values obtained from the other half of the bases in the spike-in standards (spike-in set B). In addition, the mean absolute difference was calculated between recalibrating based on the genome and recalibrating based on the spike-in standards. These calculations were performed for samples with the standards spiked-in either at equal concentrations (Illumina-EP in
The errors due to the limitations of the spike-in standards are the mean absolute difference between the recalibration coefficients calculated from randomly selected 50% of the spike-in standard bases (ERCC Set A) and the opposite 50% of the bases (ERCC Set B). Because the mean absolute differences are lower for the spike-in standards, they serve as a reasonable proxy for accuracy of the recalibration coefficients. Differences are calculated for the base quality score reported from the instrument (RpQS), dinucleotide context (Dinuc), and machine cycle (Cycle). The differences are the mean ± SD (n = 4) for SOLiD4 with spike-in standards spiked-in in a large dynamic concentration range with 250–700× mean coverage (SOLiD-DR), and for Illumina HiSeq with spike-in standards spiked-in at equimolar concentrations with 5500–8500× mean coverage (Illumina-EP). The use of spike-in standards for recalibration significantly improves upon the traditional genome recalibration in all cases (p<10⁻⁴).
The accuracy of genome recalibration was significantly reduced (p<10⁻⁴) for reported quality scores (RpQS), dinucleotide context (Dinuc), and machine cycle (Cycle) for both Illumina-EP and SOLiD-DR. Genome recalibration was particularly more biased for Illumina-EP and for the dinucleotide and RpQS covariates. The biases for spike-in standard recalibration were very small for Illumina-EP both because the spike-in standards were at equimolar concentrations and because they had higher coverage. However, even when spiking in standards at a wide range of concentrations, as is often done for differential gene expression measurements
To understand the biases caused by recalibrating based on the whole genome, we performed a statistical comparison of recalibration scores obtained from the whole genome to those obtained from the spike-in standards, with reported quality score, machine cycle, and dinucleotide context as factors (
The differences in quality score recalibration values are calculated for each combination of reported quality score and cycle (a and c) or reported quality score and dinucleotide (b and d) for Illumina (a and b) and SOLiD (c and d). White blocks correspond to very large differences, generally with very few errors. The differences are (genome – spike-in standard), so blue (<0) indicates that genome recalibration would result in recalibrated quality scores that are too low, and yellow/red (>0) results in recalibrated quality scores that are too high. The p values for the differences are shown in
A likely explanation for the generally higher empirical error rates (or lower empirical quality scores) measured based on the whole genome is that dbSNP does not include all variants, especially variants found at a low population frequency. The variants not included in dbSNP, as well as non-reference bases resulting from RNA editing and mismapped reads (see
The particularly high error rates measured at CG dinucleotides for genome recalibration likely result from the higher mutation rate of methylated cytosines in CpG dinucleotides
It is interesting that the empirical error rates for genome recalibration are significantly lower for some dinucleotides at low reported BQs. While the reason for this difference is unclear, it is possible that it could be caused by reads that are mis-mapped to nearly homologous regions that contain the base error.
The primary goal of the GATK BQSR algorithm is to identify the association of SSEs with covariates such as dinucleotide and cycle, and compensate for them by adjusting base quality scores. We used GATK to measure sequencing platform-specific biases by calculating the mean differences between the empirical and reported base quality scores for each combination of reported quality score and cycle, or reported quality score and dinucleotide. As shown in
Blue (<0) indicates the reported quality scores are too high, and red/yellow (>0) indicates the reported quality scores are too low. The mean differences across all samples are calculated for each combination of reported quality score and cycle (a and c) or reported quality score and dinucleotide (b and d) for Illumina (a and b) and SOLiD (c and d), respectively.
In addition to dinucleotide and cycle biases, certain nucleotide changes have been associated with SSEs
We calculated the frequency of each nucleotide change for Illumina and SOLiD when recalibrating based on the genome or the standard (
The plots are annotated with transition/transversion (Ti/Tv) ratios, where random base changes result in Ti/Tv = 0.5, and biological mutations result in Ti/Tv >>0.5. To determine the significance of biological variants in the data, only bases with reported base quality scores above 30 are included in this analysis. All values are the mean ± SD of 2 samples with 2 biological replicates or of 4 sequenced samples with no replicates.
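The Ti/Tv = 0.5 expectation for random base changes follows from counting: of the 12 possible single-nucleotide changes, 4 are transitions (A↔G, C↔T) and 8 are transversions. A minimal sketch of the calculation is shown here. The function name and the input layout, a mapping from (reference base, observed base) to a count, are assumptions for illustration.

```python
# The four transitions: purine<->purine and pyrimidine<->pyrimidine.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(change_counts):
    """Return the transition/transversion ratio.

    `change_counts` maps (ref_base, alt_base) to the number of times
    that nucleotide change was observed (hypothetical input layout).
    """
    ti = 0
    tv = 0
    for (ref, alt), n in change_counts.items():
        if ref == alt:
            raise ValueError("ref and alt must differ")
        if (ref, alt) in TRANSITIONS:
            ti += n
        else:
            tv += n
    if tv == 0:
        raise ValueError("no transversions observed")
    return ti / tv


# Every possible change observed equally often: 4 transitions, 8 transversions.
uniform = {(r, a): 1 for r in "ACGT" for a in "ACGT" if r != a}
print(ti_tv_ratio(uniform))  # prints 0.5
```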
In
In this work, we have demonstrated the utility of synthetic spike-in standards to interrogate run-specific SSEs in DNA or RNA sequencing. These standards can be used within the GATK framework to recalibrate base quality scores when sequencing species that do not have a comprehensive SNP database. Even for human DNA, which has a very large SNP database, rare variants cause significant biases in the typical GATK recalibration process, which uses reads mapped to genomic bases not in dbSNP. These biases are particularly large for CpG dinucleotides, which have a high mutation rate, so GATK recalibration without spike-in standards significantly degrades variant base qualities at the base positions where variants are much more likely to occur.
It should be noted that some of the differences between recalibration based on the spike-in standards vs. based on the genome could result from SSEs or mapping errors in complex CpG-related sequence motifs in the genome that are not present in the spike-in standards. However, for this to be true, these sequencing motifs would have to cause the same SSEs in multiple sequencing platforms, since CpG dinucleotides have very low empirical quality scores for both Illumina and SOLiD in this work, and for Illumina GAIIx, Illumina HiSeq, SOLiD, and 454 in previous work
By using synthetic spike-in standards, the mean errors in recalibrated quality scores (for RpQS, cycle, and dinucleotide) can be significantly decreased to <0.5 (in quality score units) for 5 million 100-bp reads (or about 0.5% of a 30× human whole genome sequencing run). For RNA-sequencing, the synthetic spike-in standards significantly reduce biases even when spiked-in at a large dynamic range of concentrations for differential expression analysis and using fewer than 10⁶ 50-bp reads, and biases can be reduced even further when using the spike-ins at equimolar concentrations and at higher concentrations. In summary, we have demonstrated that adding synthetic spike-in standards can improve SSE recalibration within the current GATK framework for human sequencing, and they allow SSE recalibration for species without comprehensive SNP databases, even when only a small fraction of the total reads is dedicated to the synthetic spike-ins.
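The "about 0.5%" figure can be checked with simple arithmetic. The sketch below assumes a human genome size of roughly 3.1 × 10⁹ bases, a value that is not stated in the text. The function name is hypothetical.

```python
def spike_in_fraction(n_reads, read_length, genome_size, depth):
    """Fraction of a whole-genome run's bases taken up by spike-in reads."""
    spike_in_bases = n_reads * read_length   # bases spent on spike-ins
    run_bases = genome_size * depth          # bases in the whole run
    return spike_in_bases / run_bases


# 5 million 100-bp reads against a 30x run of an assumed 3.1 Gb genome.
fraction = spike_in_fraction(5_000_000, 100, 3.1e9, 30)
print(f"{fraction:.2%}")  # prints 0.54%
```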
The candidate NIST Standard Reference Material (SRM) 2374 DNA plasmids were used for DNA sequencing. These plasmids consist of ∼2800 bases of vector sequence with 273–2022 base standard sequence inserted in the vector. In addition, libraries of RNA were prepared from the 96 DNA plasmids in this SRM by
Sequencing of the DNA spike-in standards was performed on Illumina GA, SOLiD v2, and SOLiD v3+ using standard methods described in detail in
Mixtures of RNA from brain, liver, and muscle tissues were prepared, similar to those described previously
Lymphoblastoid cell lines (LCLs) were established from four human male subjects of Caucasian origin by Epstein-Barr Virus transformation of B-lymphocytes using standard procedures
Total RNA was extracted using the RNeasy Mini kit (Qiagen, MD). RNA integrity was assessed using the Agilent 2100 Bioanalyzer RNA 6000 Nano Assay (Agilent, CA). All samples had RIN (RNA Integrity Number) higher than 9.5 as measured by this assay. For each cell line, aliquots of 5 micrograms of total RNA were spiked with 3% by weight of an equimolar mix of the ERCC spike-in standards and RNA-Seq libraries were constructed using a modified version of the technique described in Marioni et al. (2008). The following notable changes were made to their protocol: RNA fragmentation was performed using the Covaris S2 AFA platform (Covaris, MA) and size selection of adapter-ligated cDNA was done using the Pippin Prep instrument (Sage Science, MA). RNA-Seq libraries were sequenced on an Illumina GAIIx DNA sequencer (Illumina, CA). Raw image data generated by the sequencer were processed using Illumina RTA software (version 1.8.70.0). One paired-end 101 bp lane of data was generated for each library, resulting in ≈100 million reads from each of the four samples (GEO Accession Number GSE36217).
To determine which bases in the spike-in standards to include in the BQSR, we used a Bayesian statistical model to calculate which bases had a >95% probability of having a purity ≥99%. Our statistical model is described in detail in
To determine the effect of coverage on the recalibration values, the mapped F3 reads in the bam file from the mate-pair library (MP) were downsampled using the Picard (v. 1.5) tool DownsampleSam.java, which randomly removes reads from the bam file. The reads were downsampled by factors of 3, 9, and 27, with the factor of 27 resulting in coverage of the spike-in standards similar to the Illumina and SOLiD2 libraries.
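The idea of random downsampling by a fixed factor can be illustrated with a short sketch. This is not the Picard code: the per-read keep-or-drop decision with probability 1/factor, the function name, and the seed parameter are assumptions made for the example.

```python
import random


def downsample(reads, factor, seed=0):
    """Keep each read independently with probability 1/factor.

    `reads` is any sequence of read records; `seed` makes the random
    selection reproducible. Both are hypothetical conveniences here.
    """
    rng = random.Random(seed)
    keep_probability = 1.0 / factor
    return [read for read in reads if rng.random() < keep_probability]


# Downsample the same (toy) read set by the three factors used above.
toy_reads = list(range(27_000))
for factor in (3, 9, 27):
    kept = downsample(toy_reads, factor)
    print(factor, len(kept))  # roughly 27000 / factor reads remain
```

Because each read is kept or dropped independently, the resulting coverage is only approximately the original coverage divided by the factor.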
Similarly, one might want to know whether a smaller set of spike-in standards (
To be consistent with how GATK performs recalibration, we aggregated the recalibration values by reported quality score and cycle or by reported quality score and dinucleotide, and then calculated the weighted mean absolute differences between the downsampled recalibration values and the original recalibration values (without downsampling). The values were weighted by the number of observations in each group to obtain the expected mean error in recalibration due to downsampling.
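The weighted mean absolute difference described above can be sketched as follows. The function name and the dictionary-based inputs are assumptions for illustration. The text does not state whether the observation counts come from the original or the downsampled data, so the caller supplies them here.

```python
def weighted_mean_abs_difference(original, downsampled, n_obs):
    """Observation-weighted mean |downsampled - original| over groups.

    All three arguments are dicts keyed by covariate group, for example
    (reported_q, cycle) or (reported_q, dinucleotide). `n_obs` gives the
    number of observations used as the weight for each group.
    """
    # Only groups present in both recalibration tables can be compared.
    shared = original.keys() & downsampled.keys()
    total_weight = sum(n_obs[g] for g in shared)
    if total_weight == 0:
        raise ValueError("no observations in shared groups")
    weighted_sum = sum(
        n_obs[g] * abs(downsampled[g] - original[g]) for g in shared
    )
    return weighted_sum / total_weight
```

Weighting by observation count keeps sparsely populated groups, whose recalibration values are noisy, from dominating the expected mean error.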
For all RNA-sequencing data, reads were mapped to the hg18 human reference with the ERCC spike-in standard sequences appended. Reads from SOLiD 3+ were mapped using the Whole Transcriptome pipeline in Bioscope (v. 1.3) with default parameters. Reads from Illumina HiSeq were mapped using BWA (v. 0.5.9) with default parameters. The resulting bam files were split into two bam files with reads mapping to hg18 or with reads mapping to the spike-in standards using the Picard tool ReorderSam. GATK was used with the same parameters as described above, except with the dbSNP v.132 vcf file (downloaded from the GATK website) for reads mapped to hg18, and with the modified vcf files containing randomly selected 50% of the bases in the spike-in standards, as described in
To calculate error rates for each type of nucleotide change, we used custom Perl scripts (included in the supporting information) to parse the samtools (v. 0.1.13) pileup output, excluding bases in dbSNP (v. 132) from the genome and including only high-purity bases in the spike-in standards. To minimize the influence of random errors, only bases with reported BQ scores of at least 30 were included in this analysis. For each nucleotide change, the mean errors using ERCCs or the genome for recalibration were statistically compared using paired t-tests with a pooled variance, accounting for multiple comparisons with a Bonferroni adjustment.
Sequencing data has been submitted to the Short Read Archive at NCBI via GEO, with accession numbers
We thank Leslie Biesecker for a careful reading of this manuscript and providing the RNA samples from the ClinSeq Project. We thank Matthias Roesslein for providing a Perl script to parse samtools pileup outputs, upon which the scripts in this work were based. Certain commercial equipment, instruments, or materials are identified in this document. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products identified are necessarily the best available for the purpose.