Conceived and designed the experiments: ZKA KDK J. Cools SA. Performed the experiments: ZKA KDK EG VG RV DP MP IL VB HC. Analyzed the data: ZKA KDK EG. Contributed reagents/materials/analysis tools: WGD HQ AU J. Cloos PV. Wrote the paper: ZKA KDK J. Cools SA.
The affiliation of WGD and HQ to Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH does not alter the authors’ adherence to all the PLoS ONE policies on sharing data and materials.
With the advent of whole-genome and whole-exome sequencing, high-quality catalogs of recurrently mutated cancer genes are becoming available for many cancer types. Increasing access to sequencing technology, including bench-top sequencers, provides the opportunity to re-sequence a limited set of cancer genes across a patient cohort with limited processing time. Here, we re-sequenced a set of cancer genes in T-cell acute lymphoblastic leukemia (T-ALL) using Nimblegen sequence capture coupled with Roche/454 technology. First, we investigated through a benchmark study how maximal sensitivity and specificity of mutation detection can be achieved. We tested nine combinations of different mapping and variant-calling methods, varied the variant-calling parameters, and compared the predicted mutations with a large independent validation set obtained by capillary re-sequencing. We found that the combination of two mapping algorithms, namely
Next generation sequencing (NGS) technologies have significantly improved our sequencing capacity in the past five years. They are now widely used for research purposes and are starting to find their way into clinical applications. Although whole genome and whole exome sequencing approaches are successfully implemented for mapping the genomic landscapes of many human diseases, they are not routine strategies for detecting molecular aberrations due to high costs and long turnaround times (run and analysis times). Targeted re-sequencing, on the other hand, is appealing in a clinical setting, given the lower sequencing costs, shorter sequencing time and simpler data analysis. Moreover, as the discovery of novel cancer genes by whole-exome sequencing will gradually saturate and converge into a set of commonly mutated genes in a particular cancer, the identification of these mutations can yield important diagnostic and prognostic information.
Although all these platforms require several days for library preparation and target enrichment, the Roche/454 technology offers the advantages of short run times and short data analysis times. In addition, the more restricted data output is also beneficial for turnaround time because fewer patient samples need to be collected to fill an entire sequencing run. Based on these advantages of the 454 platform for sequencing relatively small gene sets, we invested in optimizing bioinformatics pipelines for read mapping and variant calling of 454 reads, with the aim of applying them for both research and clinical purposes. We focused on T-cell acute lymphoblastic leukemia (T-ALL), an aggressive hematopoietic cancer caused by malignant transformation of developing T-cells
For accurate variant detection, we investigated several existing analysis pipelines and compared their performance. Although the companion software gsMapper is widely used in the analysis of 454 data
Here, we analyzed and compared nine different combinations of mapping and variant-calling algorithms and particularly investigated to what extent low coverage positions can be included in the variation calling process to increase the sensitivity of mutation detection. Next, we applied the optimized pipeline to identify mutations in a set of 58 cancer genes and 39 candidate genes across 18 T-ALL cell lines and 15 T-ALL patient samples, and identified recurrent mutations in both known and novel drivers.
The Roche companion software
Each pipeline was applied to the reads obtained from seven T-ALL cell lines and the performance of each pipeline was evaluated by Sanger re-sequencing of 210 candidate variants that were randomly taken from all predicted 8020 variants (containing both SNPs and mutations) from all pipelines. As a measure of the performance of each pipeline, we calculated the Matthews correlation coefficient (MCC), which is a measure of prediction accuracy that is calculated based on the number of successfully predicted true positives and true negatives found by Sanger sequencing (see
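The performance measures used in this benchmark can be illustrated with a minimal sketch. It assumes that, for a given pipeline, each Sanger-tested candidate is counted as a true positive (predicted and confirmed), false positive (predicted, not confirmed), false negative (not predicted, confirmed) or true negative (not predicted, not confirmed). The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    # Sensitivity: fraction of Sanger-confirmed variants that were predicted.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of Sanger-rejected candidates that were not predicted.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for one pipeline (illustration only).
sens, spec, mcc = confusion_metrics(tp=90, fp=10, tn=100, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

The MCC ranges from -1 to 1 and, unlike sensitivity or specificity alone, uses all four cells of the confusion matrix, which makes it a convenient single number for ranking pipelines.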
In NGS studies, the presence of duplicate reads (caused by a PCR amplification step during library preparation) is a potential source of false positive single nucleotide variant (SNV) prediction
Next, we further optimized the performance of each pipeline by varying the minimal required number of reads (depth of coverage, DoC) and the minimal required variant reads (variant allele frequency, VAF). Changes in DoC thresholds mainly affected the sensitivity, while varying VAF thresholds affected the predictions in terms of specificity (
(A) Different pipelines show different sensitivity and specificity. Varying DoC and VAF thresholds in the variant calling process has an additional effect on the predictions in terms of sensitivity and specificity, respectively. Each pipeline is represented with a different symbol and the performance of each pipeline (in terms of sensitivity and specificity) is plotted under varying DoC and VAF thresholds. Note that the X-axis represents the false positive rate (1-specificity). In this ROC plot, the closer a point lies to the upper left corner of the graph, the better the sensitivity and the specificity. Different colors of the symbols indicate the performance of the pipeline under changing VAF thresholds, and the two shaded boxes indicate the performance under changing DoC thresholds. The plot shows that (i) decreasing the DoC threshold increases the sensitivity of all pipelines as indicated with the blue dotted line; (ii) increasing the VAF threshold increases the specificity with a slight decrease in sensitivity as indicated (in the example of BLAT+VarScan pipeline) with the red dotted line; (iii) the BWA-SW+SSAHA2+Atlas-SNP2 pipeline has the best performance among all pipelines under DoC = 3 & VAF = 0.20 thresholds as indicated with the yellow arrow. The Roche pipeline is indicated with a black diamond shape since no parameter changes were performed on it, and SSAHA2+SAMTools and BWA-SW+SAMTools pipelines were colored grey since no VAF threshold changes were performed on them. (B) The Matthews correlation coefficient for each pipeline is shown for the optimal performance of that pipeline (
We applied the optimized pipeline determined above, consisting of the SSAHA2+BWA-SW combination for read mapping, and Atlas-SNP2 for variation calling, to identify mutations in a panel of 58 “cancer genes” across 18 T-ALL cell lines and 15 primary T-ALL patient samples. This set of genes consists of 13 T-ALL drivers (
Coding mutations in known cancer genes (A) and candidate genes (B) are indicated with different color codes. Panel A is further subdivided into (I) genes that are known to be drivers in T-ALL, and (II) the genes that have recurrent somatic mutations in various human cancers. The cell lines are located to the left of the table, and the patient samples are located to the right. Genes are ranked according to the frequency of protein altering mutations in the patient samples.
Sequence reads were mapped to the entire reference genome and those reads that map to the selected genes were retained. This resulted in 36% of reads that map to the target sequences on average, with an average coverage of 24.2X and 16.3X for cell lines and patient samples, respectively. Analysis of the sequence data revealed that exons with a very low coverage had a significantly higher GC-content compared to exons with higher coverage (p-value 2.2e-16), a finding consistent with a previously published study
Variation calling resulted in 836 distinct single nucleotide variants (SNVs) in known cancer genes across the 33 samples. Cell lines had significantly more SNVs in cancer genes than patient samples (p-value <0.001); on average 153 SNVs were detected per cell line and 117 per patient sample. 56% of the predicted SNVs were reported in dbSNP (
To validate the mutations found in cell lines, we compared our results with mutations determined by the Cancer Cell Line project
Thirteen of the 58 cancer genes have been linked specifically to T-ALL, and we identified protein altering mutations in at least one of these genes in all cell lines and in 10 patient samples (
We next determined if mutations in cancer genes could be identified that were previously not linked to T-ALL. We found several such mutations in T-ALL cell lines (
We identified several mutations in JAK2 and JAK3 in both cell lines and patient samples. All JAK kinases, except TYK2 (see below), are known oncogenes in leukemia and activating mutations and translocations affecting JAK1, JAK2 and JAK3 were described in multiple, mainly myeloid, hematologic malignancies
(A) Sanger sequencing chromatograms corresponding to confirmed JAK2/JAK3 variants. (B) Domain structure of JAK2 and JAK3 proteins with indication of novel detected variants. Non-somatic variants are indicated with an asterisk. (C) Sanger sequences showing examples of TYK2 variants detected in T-ALL cell lines or in leukemia patient samples. (D) Schematic representation of TYK2 protein structure with indication of all novel TYK2 variants detected in this study. Non-somatic variants are indicated with an asterisk.
Searching for novel T-ALL driver genes can be performed by whole-exome sequencing or other genome-wide approaches. Nevertheless, the Roche/454 platform combined with sequence capture could be useful in a candidate gene approach. In our targeted re-sequencing approach, 39 genes were included that were not causally linked to cancer, but were selected as candidate oncogenes or tumor suppressor genes, because of their function (e.g., tyrosine kinases and tyrosine phosphatases) or because family members had been implicated in cancer (e.g., TYK2 for the JAK family, TET1 because TET2 is a known cancer gene).
Interestingly, 4 of the 15 sequenced patient samples contain a variation in TET1. The
(A) Sanger sequencing chromatograms representing confirmed TET1 variants. (B) Schematic representation of TET1 protein structure with indication of all novel TET1 variants detected in this study. Variants detected in cell lines are depicted above the TET1 protein, variants detected in leukemia patient samples are below the TET1 protein. Non-somatic variants are indicated with an asterisk.
Mutations in tyrosine phosphatase genes, which act as negative regulators of tyrosine signaling, were identified in many T-ALL cell lines and also in several T-ALL patients. Additional mutations in SPRY genes, negative regulators of the RAS/MAPK pathway, were also detected. We identified a homozygous variation in
(A) Sanger sequencing chromatograms showing confirmed SPRY4 variants. (B) Domain structure of the SPRY4 protein with indication of novel detected variants.
Finally, we also identified several mutations in tyrosine kinases (IGF1R, TYK2, TNK1, and MST1R) and associated signaling proteins (IRS2, SOCS3), but the majority of these mutations were found in cell lines, while primary patient samples showed a much lower frequency of these mutations. The most frequently mutated gene across all cell lines and patient samples was the insulin receptor substrate 2 (IRS2) gene, showing non-synonymous coding mutations in 6 cell lines and in one patient sample. Also frequently mutated was TYK2, with mutations observed in 6 cell lines; one stop-gain variant and 5 non-synonymous coding variants. Although none of the 15 patient samples carried a mutation in TYK2, it could be present at low frequency in patients. To test this, we performed complementary sequencing of TYK2 in 93 T-ALL, 54 AML and 53 B-ALL patient samples. Despite the high frequency of TYK2 variations in T-ALL cell lines, TYK2 variants were detected only in 2 of 93 T-ALL and 1 of 54 AML cases
The mutation frequency of TYK2 in T-ALL cell lines compared to primary T-ALL samples was substantially different, with a high mutation rate of TYK2 in cell lines, but only a low mutation rate in primary samples. To determine if this could be due to the accumulation of TYK2 mutations during culturing of the cells, we sequenced TYK2 in different clones of the same T-ALL cell line
These data confirm important differences between cell lines and primary patient samples, which may reflect the accumulation of mutations during
Cell line | Tested variant | Result |
CCRF-CEM Cools lab | R1027H | present |
CCRF-CEM 2011 DSMZ (ACC240) | R1027H | present |
CCRF-CEM subclone 1 DSMZ | R1027H | present |
CCRF-CEM subclone 2 DSMZ | R1027H | present |
CCRF-CEM subclone 3 DSMZ | R1027H | present |
CCRF-CEM subclone 4 DSMZ | R1027H | present |
CCRF-CEM subclone 5 DSMZ | R1027H | present |
CCRF-CEM Cools lab | A35V | present |
CCRF-CEM 2011 DSMZ (ACC 240) | A35V | present |
CCRF-CEM subclone 1 DSMZ | A35V | absent |
CCRF-CEM subclone 2 DSMZ | A35V | absent |
CCRF-CEM subclone 3 DSMZ | A35V | absent |
CCRF-CEM subclone 4 DSMZ | A35V | absent |
CCRF-CEM subclone 5 DSMZ | A35V | absent |
KARPAS-45 Cools lab | Q830* | present |
KARPAS-45 2011 DSMZ (ACC105) | Q830* | present |
KARPAS-45 1994 DSMZ (ACC105) | Q830* | present |
JURKAT Cools lab | C192Y | present |
JURKAT 2011 DSMZ (ACC 282) | C192Y | absent |
JURKAT 1992 DSMZ (ACC 282) | C192Y | absent |
Presence of the TYK2 R1027H and A35V variants was tested in the CCRF-CEM cell line from our group (“CCRF-CEM Cools lab”) as well as in the CCRF-CEM cell line as it is currently sold by DSMZ (“CCRF-CEM 2011 DSMZ (ACC240)”) and in 5 different CCRF-CEM subclones that DSMZ collected over the years. Similarly, KARPAS-45 from the Cools lab and the KARPAS-45 lines obtained from DSMZ in 2011 and in 1994 were screened for presence of the TYK2 Q830* variant. JURKAT cells from the Cools lab as well as JURKAT provided by DSMZ in 2011 and 1992 were tested for the TYK2 C192Y variant.
This cell line has 4 copies of chromosome 19 containing TYK2. The height of the variant peak on the chromatogram suggests that only 1 copy of TYK2 contains the Q830* variant.
We demonstrated that the targeted sequencing approach with an optimized analysis setting can be used to identify oncogenic mutations. This approach could be of particular interest for the detection of point mutations in a set of important oncogenes and tumor suppressors or other disease related genes for diagnosis, prognosis prediction or therapy choice. Such information could be generated in a relatively short timeframe and with unprecedented detail. One of the major advantages over classical Sanger sequencing is the higher throughput of this method, which allows all exons of a gene set of this size to be sequenced easily. As such, full information is provided and rare variants or even previously undiscovered mutations in a particular gene can be detected. Indeed, of the 160 exonic and splice site variants (excluding the 61 synonymous variations) detected in the cell lines and patient samples across our panel of cancer genes, only 40 are found in the COSMIC database
To detect mutations using next-generation sequencing - either to replace or complement molecular diagnosis - standardized bioinformatics analysis pipelines with very high accuracy are required. Such a pipeline consists of a mapping algorithm to align the sequence reads to the reference genome, a variation calling algorithm to identify differences between the sample and the reference, and a variation filtering algorithm.
We compared multiple combinations of mapping and variation calling algorithms, and found that combining two mappers, namely SSAHA2 and BWA-SW, followed by Atlas-SNP2 yields the most accurate variation detection results. Combining two mapping algorithms filters out false positive variant predictions due to erroneous mapping, and the error model of Atlas-SNP2 enables the elimination of reads that have multiple best matches in the reference genome. We also found that additional data filters on depth of coverage and on variant allele frequency further increased both the sensitivity and specificity of variation detection.
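The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline itself: it assumes that a variant is retained only when it is predicted from both the SSAHA2 and the BWA-SW alignments, and that the DoC and VAF filters are applied to each of the two calls. The `Call` record, the function names and the example sites are hypothetical; the default thresholds (DoC = 3, VAF = 0.20) are the values reported for the best-performing setting.

```python
from typing import Dict, NamedTuple, Set, Tuple

# A variant site: (chromosome, 1-based position, alternate base).
Site = Tuple[str, int, str]


class Call(NamedTuple):
    depth: int      # total reads covering the site (depth of coverage, DoC)
    var_reads: int  # reads supporting the variant allele


def passes_filters(call: Call, min_doc: int, min_vaf: float) -> bool:
    """Apply the DoC and VAF thresholds to a single call."""
    if call.depth <= 0 or call.depth < min_doc:
        return False
    return call.var_reads / call.depth >= min_vaf


def consensus_calls(calls_a: Dict[Site, Call], calls_b: Dict[Site, Call],
                    min_doc: int = 3, min_vaf: float = 0.20) -> Set[Site]:
    """Keep sites called from both alignments that pass both filters.

    Assumption: 'combining two mappers' is modeled as the intersection of
    the two call sets; the study's exact merging rule may differ.
    """
    shared = calls_a.keys() & calls_b.keys()
    return {
        site for site in shared
        if passes_filters(calls_a[site], min_doc, min_vaf)
        and passes_filters(calls_b[site], min_doc, min_vaf)
    }


# Hypothetical calls from two mappers (illustration only).
ssaha2 = {("chr9", 100, "T"): Call(10, 5), ("chr9", 200, "A"): Call(12, 1),
          ("chr1", 50, "G"): Call(4, 4)}
bwasw = {("chr9", 100, "T"): Call(8, 4), ("chr9", 200, "A"): Call(12, 1)}
print(consensus_calls(ssaha2, bwasw))
```

In this toy example only the first site is retained: the second fails the VAF filter in both call sets, and the third is supported by a single mapper, which is the situation the two-mapper strategy is meant to flag as a possible mapping artifact.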
We encountered several technical limitations during data analysis. First, we had to remove duplicate reads introduced by PCR amplification steps during sample preparation since we noticed these were causing false positive SNV predictions. Second, we could only predict SNVs, while indels (small insertions and deletions) had to be ignored since our work (data not shown) and previous studies indicate that 454 reads are not suited for indel detection due to the large amount of false positive results
We believe that using a long read sequencing technology, such as Roche/454 or the more recent Pacific Biosciences, provides particular advantages with regard to both sensitivity and specificity of variation detection. First, long read alignment allows better distinction between highly similar genes in the genome. For example, one of the genes we re-sequenced was NOTCH1, a gene with multiple homologs (namely NOTCH2, NOTCH2NL, NOTCH3 and NOTCH4). However, we observed no reads mapping to any of these homologs, even though we mapped the reads to the entire genome. This indicates that both the sequence capture and the mapping were specific. On the other hand, we also encountered an example where the sequence capture was not specific. Namely, the PMS2 gene is one of the targeted genes in our study, yet we observed reads mapping to the PMS2 pseudogene, PMS2CL, which contains the first six exons of the PMS2 gene. Thanks to the use of long reads, this causes no problems for variation detection because for each gene the respective reads mapped
Second, mapping long reads to a reference genome is more robust towards extensive local variation, which can be present in particular genomic regions, or can be higher when samples are sequenced from a different ethnicity compared to the reference genome
We applied our analysis strategy to T-ALL by sequencing a set of 97 genes. This set consists of 58 known oncogenes and tumor suppressors in T-ALL and other cancers, and 39 genes selected via a candidate approach. Regarding the identification of variations in these genes using 454 sequencing and our optimized analysis pipeline, we reached 95% sensitivity and 93% specificity on a confirmation set of 210 variants validated by capillary sequencing. Furthermore, we detected 85.7% of the mutations reported in 11 cell lines that were also sequenced in the Cancer Cell Line project. High performance of our resequencing approach is also illustrated by the fact that we identified mutations in known candidate drivers in T-ALL that were included in the collection of known cancer genes such as
We detected mutations in several known cancer genes where a link to T-ALL has not been established yet, such as JAK3. Interestingly, a recent article confirmed the mutation status of this gene in the context of T-ALL
It is remarkable that more novel sequence variants are found per cell line sample than per patient, and that genes were in general more frequently mutated in cell lines than in patients. Excessive gene mutations can be explained by potential genomic instability of cells in culture, or can be caused by
It is nevertheless interesting to note that this tendency of higher mutation frequency in cell lines compared to patient samples does not extend to all analyzed genes. The most evident example is
In conclusion, we describe a method for fast re-sequencing of a moderate size gene set of 97 genes using 454 next generation sequencing equipment that would be suitable for implementation into the clinic. Our results show that this setting is useful to identify (i) known mutations in known driver genes; (ii) new mutations in known drivers; and (iii) oncogenes or tumor suppressors that had not previously been associated with a specific subtype of cancer based on a candidate gene approach.
The optimized data analysis pipeline, which was assembled from publicly available tools, slightly exceeded the performance of the Roche gsMapper software with 95% sensitivity and 93% specificity for SNV detection, and subsequent analysis of the Roche/454 data from the T-ALL cell lines and patient samples confirmed previously known oncogenes and tumor suppressors in T-ALL and identified previously unrecognized rare somatic mutations in
97 genes were selected for sequencing in this study. The gene set consists of genes that are known to be involved in oncogenesis of T-ALL (and other cancer types), and a large set of kinases and phosphatases due to their potential therapeutic value. In total, 56 of the selected genes have been causally implicated in cancer according to Census
All T-ALL cell lines originated from DSMZ (Braunschweig, Germany). Samples from patients with T-ALL (n = 93), acute myeloid leukemia (AML) (n = 54) and B-cell acute lymphoblastic leukemia (B-ALL) (n = 53), obtained at diagnosis, and remission samples from T-ALL patients (n = 42) were collected at the University Hospital Leuven and VU Medical Center Amsterdam. Diagnosis of T-ALL, AML or B-ALL was based on morphology, cytogenetics and immunophenotyping according to the World Health Organization and European Group for the Immunological Characterization of Leukemias (EGIL) criteria. Informed consent was obtained from all subjects and experiments were approved by the ethical committee of the University Hospital Leuven.
Preparation of a shot-gun DNA sequencing library and capture of the exons, with flanking intron junctions of 97 genes (
The performance of the alignment and variant calling algorithms was evaluated to determine the optimal method for analyzing 454 reads. Eight analysis pipelines were constructed from long read aligners BWA-SW
The pipelines were implemented and reviewed on 7 cell lines:
Initial SNV predictions were performed with the following settings:
SAMTools: with pileup -c command, with total coverage threshold of 3 and SNP quality threshold of 20
VarScan: with pileup2snp --min-coverage 3 --min-reads2 2 --min-avg-qual 15 --min-var-freq 0.01 --p-value 0.99
Atlas-SNP2: with total coverage threshold of 3
gsMapper: with
There must be at least 3 reads with the difference.
There must be both forward and reverse reads showing the difference, unless there are at least 5 reads with quality scores over 20 (or 30 if the difference involves a 5-mer or higher).
If the difference is a single-base overcall or undercall, then the reads with the differences must form the consensus of the sequenced reads.
The SNPs to be confirmed with capillary sequencing were selected from the predictions generated with these settings.
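The three read-support criteria listed above for gsMapper can be expressed as a single predicate, sketched below. This is not the Roche implementation: the data structure and parameter names are hypothetical, and "form the consensus" is interpreted here, as an assumption, to mean that the variant reads make up more than half of the reads covering the position.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VariantRead:
    forward: bool  # True if the read aligns to the forward strand
    quality: int   # base quality score at the differing position


def passes_read_support(reads: List[VariantRead], total_depth: int,
                        homopolymer_len: int = 1,
                        is_over_or_undercall: bool = False) -> bool:
    """Check the three read-support criteria for one candidate difference."""
    # Criterion 1: at least 3 reads must show the difference.
    if len(reads) < 3:
        return False
    # Criterion 2: both strands must be represented, unless at least 5 reads
    # have a quality above 20 (above 30 for a 5-mer or longer homopolymer).
    has_forward = any(r.forward for r in reads)
    has_reverse = any(not r.forward for r in reads)
    quality_cutoff = 30 if homopolymer_len >= 5 else 20
    high_quality = sum(1 for r in reads if r.quality > quality_cutoff)
    if not (has_forward and has_reverse) and high_quality < 5:
        return False
    # Criterion 3 (assumed interpretation): for a single-base overcall or
    # undercall, the variant reads must be the majority at that position.
    if is_over_or_undercall and 2 * len(reads) <= total_depth:
        return False
    return True


# Hypothetical example: three low-quality reads covering both strands.
example = [VariantRead(True, 10), VariantRead(True, 10), VariantRead(False, 10)]
print(passes_read_support(example, total_depth=20))
```

Writing the criteria this way makes explicit that the strand requirement and the quality requirement are alternatives, whereas the minimum of three variant reads always applies.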
Then, predictions from the pipelines were filtered with varying VAF and DoC thresholds. Two VAF thresholds (0.20 and 0.30) and two DoC thresholds (3 and 10) were used. SAMTools pipelines were also processed with
The performance of each pipeline was evaluated by Sanger resequencing of 210 variants that were sampled from the pooled set of all predicted variants from all pipelines
Whole genome amplified DNA (REPLI-g system, Qiagen, Hilden, Germany) from primary leukemia or remission samples was used as template for PCR amplification of indicated genes. PCR products were Sanger sequenced and inspected for the presence of sequence variants using Mutation Surveyor software (Softgenetics, State College, PA) and CLC DNA Workbench 6 (CLC Bio, Aarhus, Denmark). All variants that were detected in whole genome amplified material were subsequently validated in non-amplified original patient material. Primer sequences are available upon request.
Sequence data has been deposited at the European Genome-phenome Archive (EGA,