Conceived and designed the experiments: BBT AJB RT PMB SK MAB FMDLV DIS. Performed the experiments: RRL JG CBC CKM SJS AJB. Analyzed the data: BBT XX YWA YAS. Contributed reagents/materials/analysis tools: BBT KDO JLK EJM MWM SK DIS. Wrote the paper: BBT RRL FMDLV DIS. Project Management: ASS.
Some of the authors of this manuscript are or have been employees of Life Technologies Inc., which manufactures the sequencing instrument and some materials used in this work. This does not alter the authors' adherence to the PLoS ONE policies on sharing of data and materials.
Due to growing throughput and shrinking cost, massively parallel sequencing is rapidly becoming an attractive alternative to microarrays for the genome-wide study of gene expression and copy number alterations in primary tumors. The sequencing of transcripts (RNA-Seq) should offer several advantages over microarray-based methods, including the ability to detect somatic mutations and accurately measure allele-specific expression. To investigate these advantages we have applied a novel, strand-specific RNA-Seq method to tumors and matched normal tissue from three patients with oral squamous cell carcinomas. Additionally, to better understand the genomic determinants of the gene expression changes observed, we have sequenced the tumor and normal genomes of one of these patients. We demonstrate here that our RNA-Seq method accurately measures allelic imbalance and that measurement on the genome-wide scale yields novel insights into cancer etiology. As expected, the set of genes differentially expressed in the tumors is enriched for cell adhesion and differentiation functions, but, unexpectedly, the set of allelically imbalanced genes is also enriched for these same cancer-related functions. By comparing the transcriptomic perturbations observed in one patient to his underlying normal and tumor genomes, we find that allelic imbalance in the tumor is associated with copy number mutations and that copy number mutations are, in turn, strongly associated with changes in transcript abundance. These results support a model in which allele-specific deletions and duplications drive allele-specific changes in gene expression in the developing tumor.
The development of tools for measuring gene expression and structural variation across the entire genome has revolutionized our ability to characterize cancers at the molecular level. However, such tools have typically relied on microarray hybridization, which has limited sensitivity and is susceptible to the effects of cross-hybridization between homologous DNA fragments. The recent advent of massively parallel sequencing has provided a more powerful tool to study changes in transcriptomes and genomes, through what is termed RNA sequencing (RNA-Seq)
Cancers of the head and neck are the sixth most commonly observed cancers worldwide
Here we have paired a new strand-specific whole transcriptome library preparation method with massively parallel ligation sequencing to study the transcriptomes of three oral squamous cell carcinoma (OSCC) tumors and three matched normal tissues. With the resulting 60 Gb of sequence we performed two types of analyses. First, we examined differential expression of genes between tumor and normal tissue across the three patients and compared these results to those produced by microarray and RT-qPCR. The comparison reveals strong concordance between the methods, with RNA-Seq outperforming microarrays at measurement of the low abundance transcripts. Second, we investigated the extent and types of allelic imbalance (AI) observed between the tumor and normal tissues of the three patients. Here we focus on relative AI, which compares the ratio of the expression of two alleles in one sample (e.g., tumor tissue) to that in another sample (e.g., matched normal tissue). AI represents a convolution of genotype and expression level that can arise due to a number of different processes. Our analysis demonstrates the ability of RNA-Seq to accurately measure AI and the utility of AI for understanding cancer development. Unlike other methods, our RNA-Seq approach surveys strand-specific expression across the entire length of transcripts, allowing us to observe bidirectional promoter usage and improving our chances of covering the rare heterozygous SNPs needed for AI analysis.
We have also sequenced the tumor and normal genomes of one of the three patients and determined copy number changes present in the tumor genome. By comparing genomic and transcriptomic data from this patient, we observe that changes in gene dosage are strongly associated with changes in gene expression and allelic imbalance in this tumor. These data are consistent with a model in which allele-specific duplication and deletion drive allele-specific changes in gene expression
We sequenced rRNA-depleted total RNA extracted from tumor and matched normal tissue from three patients with oral squamous cell carcinoma (OSCC) (
Read counts listed in the middle section are expressed in millions (left column) or as a percentage of the total reads processed (right column) for each sample.
Given that RNA-Seq is still an emerging methodology, we wanted to ensure that our method provides gene expression measurements that are consistent with orthogonal technologies (e.g., RT-qPCR and microarray). We assessed concordance with these other methods at the level of differential gene expression between tumor and normal tissue samples (
We next investigated the biological significance of the tumor versus normal (TvN) expression profiles. First, by hierarchically clustering the log-transformed transcript expression levels in the six samples, we established that the same types of tissue from different patients are more similar than different types of tissue from the same patient (
(A) Transcript expression levels in each of the six samples were hierarchically clustered and, as expected, the normal and tumor tissues form tight clusters. Shades of blue indicate lowered expression, relative to the mean across samples, whereas shades of yellow indicate higher expression relative to the mean. (B) For each patient, gene expression in the tumor was compared to that in matched normal tissue. Pearson correlations indicate strong and significant (P<10^−16) similarity of differential transcript expression across the three patients. (C) A scatterplot, comparing differential transcript expression between patients 8 and 33.
To isolate the set of genes commonly mis-regulated in the development of OSCC, we took a rank-order approach (
(A–D) Examples of gene expression at four loci. Plotted across each locus is the normalized sequence coverage on both the plus (colored red) and minus (colored orange) strands, for the tumor and normal tissue of a particular patient. (A) MMP1 in patient 51; y-axis scale is 10 to 2000. (B) INHBA in patient 8; y-axis scale is 10 to 150. (C) HMGA2 and RPSAS52 in patient 8; y-axis scale is 10 to 300 for the plus strand and 10 to 100 for the minus strand. (D) CASQ1 in patient 33; y-axis scale is 10 to 100. (E–F) The most up-regulated (E) and down-regulated (F) genes in the tumors of the three patients were submitted for gene ontology (GO) analysis
To more systematically assess the biological functions of commonly mis-regulated genes, we performed gene ontology (GO) analysis
The results presented here are also consistent with existing knowledge of gene expression changes in OSCC development. For example, a meta-analysis of 41 head and neck squamous cell carcinoma (HNSCC; OSCC is one type of HNSCC) expression profiling studies identified 25 genes that were differentially expressed between tumor and normal tissue in nine or more of these studies
One advantage of our RNA-Seq method is that it allows us to address questions that were previously inaccessible on the genome-wide scale. An example is allelic imbalance (AI), which we define here as a difference in the nucleotide frequencies at a given transcriptome position between two tissues/conditions (
Allelic imbalance (AI) at the RNA level can arise in a number of ways, including through point mutation and changes in the relative expression of alleles (also known as allele-specific expression). (A) Illustrated is an example of two pre-existing alleles (G and T), one of which undergoes a linked promoter mutation (red asterisk) in the tumor, relative to the normal tissue. If, for example, the mutation alters a
In principle, our depth of sequencing is sufficient to detect AI in a large number of exons (
We next developed a method to detect relative AI between samples (
To assess whether or not the allelically imbalanced genes are biologically meaningful and related to the development of this cancer, we performed GO analysis
Of these 52 genes with AI in two or more patients, 23 appeared particularly interesting and are highlighted in
One particularly convincing example of AI is observed for transcripts of the adjacent IGF2 and H19 genes (
There are six sites in H19 (shaded orange) and one in IGF2 (shaded blue) on chromosome 11 that are apparently heterozygous in patient 8 (five were validated by as-qPCR). In normal tissue, most detectable expression of H19 is from one allele, as expected for this imprinted, maternally expressed gene. Unexpectedly, in the tumor, nearly all detected expression is from the other, presumably paternal, allele. Observed nucleotides are from dbSNP
In summary, our whole transcriptome sequencing approach allows detection of AI across a large number of genes. AI is a widespread phenomenon in oral cancer development, impacting genes known or likely to be involved in cancer etiology. AI's presence can be used effectively to identify key cancer development events, such as non-synonymous point mutations, loss of imprinting, loss of heterozygosity, and copy number changes resulting in up-regulation of one allele relative to the other. However, resolving ambiguity between these possible underlying causes currently requires follow-up experiments.
There is abundant copy number variation (CNV) in mammalian populations
To compare the CNCs observed between normal and tumor tissue to changes in gene expression, we calculated the differential expression of each genomic segment (
(A) A strong correlation (ρ = 0.73) is observed between changes in copy number and changes in gene expression for patient 8. The correlation is stronger (ρ = 0.84) if only copy number changes greater than 1.4-fold are considered. (B) The most strongly amplified region (9-fold more copies in the tumor than normal; chr11:68,503,204-69,987,273) contains several differentially expressed (red and orange tracks) genes, highlighted in the text. (C) One region (chr9:21,973,361-22,061,522) that is likely to have been deleted in the tumor contains two genes of interest: cyclin-dependent kinase inhibitor 2B (CDKN2B) and cyclin-dependent kinase inhibitor 2A (CDKN2A). (D) Given that gene dosage changes are strongly associated with gene expression changes, it is expected that heterozygous amplifications and deletions will be associated with the allelic imbalance of transcripts. Shown are the distributions of allelic imbalance for genomic regions that fall into one of three categories of log-transformed copy number change (CNC): low or no CNC (blue; |CNC|<1.2), moderate CNC (red; 1.2<|CNC|<1.8), and large CNC (yellow; |CNC|>1.8). The moderate and large CNC distributions are shifted to significantly higher values of allelic imbalance compared to the no CNC distribution (Mann-Whitney P-values of 10^−10 and 10^−3, respectively).
One particularly remarkable example of concordant change in CN and gene expression involves the 9x amplification of a 1.5 Mb segment of chromosome 11 (
Another striking example of concordant copy number and gene expression change occurs on chromosome 9, where a 90 kb genomic segment, containing two genes, has apparently been deleted in the tumor (
It is also worth noting the amplification of Wnt inhibitory factor 1 (WIF1), which encodes an extracellular component of the Wnt pathway that plays a critical role in regulating cell adhesion, proliferation and differentiation
Finally, we examined the relationship between allelic imbalance and copy number change. Given that gene dosage changes are strongly associated with gene expression changes, it is expected that heterozygous amplifications and deletions will be associated with the allelic imbalance of transcripts. For example, if one allele of a genomic region is amplified 10-fold relative to the other allele, then we might expect to see a 10-fold imbalance of heterozygous SNPs that fall within the region (depending on the exact mechanism of amplification). We grouped each copy number segment into one of three log2-transformed CNC categories, disregarding the direction of change: low or no CNC (|CNC|<1.2), moderate CNC (1.2<|CNC|<1.8), and large CNC (|CNC|>1.8). We then considered the distributions of AI for genomic regions in each of those CNC categories (
Taken together, these results show a strong relationship between structural mutations and changes in gene expression in this patient's tumor. Increased gene dosage is associated with increased gene expression and decreased gene dosage is typically associated with decreased gene expression. Furthermore, gene dosage changes are linked with changes in the relative expression of alleles. The simplest interpretation is that the allele-specific structural mutations in this patient's tumor have driven the observed changes in gene expression.
By sequencing the tumor and normal transcriptomes of three individuals with OSCC, we have characterized, in depth, the changes in gene expression associated with development of this cancer. We have demonstrated that our RNA-Seq method can be successfully applied to profile gene expression in tumor and matched normal tissues, in much the same way that microarrays have been applied to this task in the past. The gene expression profiling results we produced by sequencing are largely concordant with those we produced from hybridizing the same samples to microarrays (
Not only have we demonstrated the overall similarity of results obtained by deep sequencing and microarray hybridization of the same samples, but we have also shown that both sets of results are in strong agreement with existing knowledge of this cancer. Development of OSCC involves perturbed regulation of genes functioning in interaction with the external environment, such as those functioning in cell adhesion and encoding components of the extra-cellular matrix (
Although deep sequencing, like microarray hybridization, is well suited to the task of profiling gene expression across tissues and individuals, it is also capable of interrogating aspects of gene expression that have typically eluded microarrays (e.g., detecting the presence of novel, alternative splice forms
Finally, the combination of transcriptome sequencing and genome sequencing affords the opportunity to characterize the genomic mutations underlying alterations in gene expression in a tumor. We sequenced the tumor and normal genomes of one patient and used these data to determine copy number changes (CNCs) between the two samples. Highly concordant results were obtained by other methodologies (qPCR and aCGH;
This study was conducted according to the principles expressed in the Declaration of Helsinki. The study was approved by the Institutional Review Board of the Mayo Clinic. All patients provided written informed consent for the collection of samples and subsequent analysis.
All tissues used in this study were collected from patients undergoing treatment at Mayo Clinic, Rochester, MN. Tumor samples were obtained from patients undergoing surgical treatment for oral squamous cell carcinoma (OSCC). Normal samples were collected from the negative surgical margins. Following surgical excision, a portion of the tissue was immediately processed and snap-frozen in liquid nitrogen for storage and future use. The remainder was processed for clinical examination and long-term storage in the tissue archives of Mayo Clinic, Rochester, MN, according to standard clinical protocols. All patients consented to the use of their tissue for research and this study was approved by the Mayo Clinic Institutional Review Board. Each collected sample was frozen-sectioned, with the blade changed between samples, and mounted on positively charged glass slides that were stained with haematoxylin and eosin (H&E) by the Mayo Tissue and Cell Molecular Analysis Core facility. These slides were then evaluated by a qualified Mayo pathologist (J. Lewis) to confirm the presence or absence of tumor in each sample. Appropriate tumor was circled and extreme care was taken to obtain only tumor-containing sections for subsequent isolation of DNA or RNA.
Frozen tissues were compared to corresponding H&E slides following evaluation by pathology to verify classification of tumor or normal tissue status. Portions of tissue were removed for nucleic acid extraction, using disposable scalpels, in quantities <30 mg. DNA was extracted from frozen tissue using the Invitrogen PureLink Genomic DNA Mini kit (Carlsbad, CA) according to the manufacturer's protocol. Total RNA was extracted from portions of the frozen tissue samples using the Qiagen RNeasy Plus Kit (Valencia, CA) according to the manufacturer's protocol. Isolated DNA and RNA were quantified by NanoDrop ND1000 (ThermoFisher Scientific, Waltham, MA). RNA samples were further assessed for quality using the Agilent 2100 Bioanalyzer (Santa Clara, CA) prior to library construction.
To construct libraries suitable for SOLiD™ System sequencing, 5 µg of total RNA was depleted of 18S and 28S rRNA using GLOBINclear™ (Ambion) buffers and reagents supplemented with biotinylated capture probes designed against these rRNAs and following the given protocol. The rRNA-depleted samples (∼1 µg) were then fragmented by incubation with 1 unit of RNase III (Ambion) for 10 minutes in a 10 µl reaction volume containing 1X RNase III buffer supplied with the enzyme. The samples were then mixed with formamide gel loading dye, denatured for 10 min at 95°C, and separated on a flashPAGE™ gel apparatus using a modified procedure. The flashPAGE™ gel was first run for 15 minutes as per the given procedure and conditions. The lower running buffer was then removed and the lower chamber rinsed twice with nuclease-free water. The lower chamber was then replenished with fresh buffer and the gel was run for an additional 45 minutes. The lower running buffer was then removed and the RNA was purified using the flashPAGE™ clean up kit, producing RNA fragments ranging from ∼50–150 nt in size. This RNA was then used with the SOLiD Small RNA Expression Kit (Ambion) as per the given protocol, except that the size range of products purified from the 6% native PAGE step was ∼140–200 bp. The final purified products were quantitated using a NanoDrop and the size range of the products was confirmed by Bioanalyzer analysis. The samples were then diluted and used for emulsion PCR.
A total of 10 µg of each DNA sample was used to generate mate-paired libraries with a 2.5 kb insert size using standard manufacturer protocols. Briefly, DNA was sheared to a target size of 2.5 kb using a HydroShear® (Genomic Solutions). The resulting fragments were end repaired using the End-It™ (Epicentre) kit, methylated to protect EcoP15I sites, and ligated to CAP adapters. The DNA was then size selected by electrophoresis on a 1% agarose gel, and a band 2 kb to 3 kb was excised from the gel using a scalpel blade. DNA was recovered from the gel using QIAquick Gel Extraction Kit (Qiagen). The resulting DNA fragments were circularized by ligation to an internal adapter, and 25–27 bp ‘mates’ created by digestion with the type III restriction enzyme EcoP15I. Double stranded P1 and P2 sequencing adapters were then ligated to the library and nick translated before final amplification using 14–15 cycles of PCR. Libraries were again purified by electrophoresis on a 3% agarose gel, excising the appropriate library band and recovering the DNA using the QIAquick Gel Extraction Kit (Qiagen). The size, quantity and quality of the resulting libraries were confirmed by analysis on a 2100 Bioanalyzer (Agilent) using a DNA 1000 chip before the library was diluted and used for emulsion PCR.
Templated beads were generated for sequencing using standard manufacturers' protocols. Briefly, an aqueous phase was prepared from the SOLiD ePCR kit containing AmpliTaq Gold DNA Polymerase UP, buffer, MgCl2, dNTPs, amplification primers and library template. The aqueous phase was then introduced to a whirling oil phase in an ULTRA-TURRAX® Turbo Drive (IKA) to create a water-in-oil emulsion. The emulsion was then transferred to a 96 well plate and thermocycled using the recommended PCR conditions. After PCR amplification, emulsions were broken using butanol, and the beads were washed, enriched, and treated with terminal transferase before quantification and deposition onto a slide for sequencing.
Templated beads were deposited onto one full slide per sample. Massively parallel ligation sequencing was carried out to 50 bases using Applied Biosystems SOLiD System (V3 chemistry) and following the manufacturer's instructions.
Templated beads for the normal and tumor samples were deposited across two sequencing slides, three quadrants per sample. Both forward and reverse tags from the mate-paired libraries were sequenced to 25 bases, using Applied Biosystems SOLiD System (V3 chemistry) and following the manufacturer's instructions.
Whole transcriptome reads were aligned using AB's SOLiD Whole Transcriptome Pipeline
Mapping and pairing were performed with Applied Biosystems' alignment and pairing package (corona_lite,
For each of the 18,095 RefSeq transcripts, reads uniquely aligned within its genome-mapped exons were summed. One pseudo-count was added to this sum and the resulting modified raw transcript count was divided by the total number of uniquely aligned reads for the sample, yielding normalized transcript counts for each RefSeq transcript in each sample. Normalized transcript counts and TvN fold-changes can be found in
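The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names, the transcript identifiers and the read totals are hypothetical, and computing the TvN fold-change as the simple ratio of tumor to normal normalized counts is an assumption.

```python
def normalized_counts(raw_counts, total_unique_reads, pseudo_count=1):
    """Add a pseudo-count to each transcript's summed exon reads, then
    divide by the sample's total number of uniquely aligned reads."""
    return {tx: (n + pseudo_count) / total_unique_reads
            for tx, n in raw_counts.items()}


def tvn_fold_change(tumor_norm, normal_norm):
    """Tumor/normal ratio of normalized counts (assumed definition)."""
    return {tx: tumor_norm[tx] / normal_norm[tx]
            for tx in tumor_norm if tx in normal_norm}


# Toy example with made-up read sums for two hypothetical transcripts.
tumor_raw = {"TX_A": 199, "TX_B": 0}
normal_raw = {"TX_A": 49, "TX_B": 9}

tumor = normalized_counts(tumor_raw, total_unique_reads=1_000_000)
normal = normalized_counts(normal_raw, total_unique_reads=500_000)
fold = tvn_fold_change(tumor, normal)
```

The pseudo-count keeps the fold-change defined for transcripts with no aligned reads in one of the two samples (TX_B in the toy tumor sample).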
To isolate the set of genes commonly mis-regulated in the development of OSCC, we rank ordered the TvN fold-changes for each patient. We then ranked transcripts by their median TvN rank across patients, considering further only the three hundred highest and three hundred lowest ranking genes as the sets of genes commonly up-regulated and down-regulated in OSCC development, respectively. Gene sets can be found in
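A minimal Python sketch of this rank-order selection follows. It assumes one fold-change dictionary per patient; the function names, the tie-breaking by transcript name, and the restriction to transcripts present in every patient are illustrative assumptions rather than details given in the text.

```python
from statistics import median


def rank_transcripts(fold_changes):
    """Rank transcripts by TvN fold-change; rank 1 is the lowest value.
    Ties are broken by transcript name (an arbitrary choice)."""
    ordered = sorted(fold_changes, key=lambda tx: (fold_changes[tx], tx))
    return {tx: i + 1 for i, tx in enumerate(ordered)}


def commonly_misregulated(per_patient_fc, n=300):
    """Return (up, down): the n transcripts with the highest and the n
    with the lowest median rank across patients."""
    ranks = [rank_transcripts(fc) for fc in per_patient_fc]
    shared = set.intersection(*(set(r) for r in ranks))
    median_rank = {tx: median(r[tx] for r in ranks) for tx in shared}
    ordered = sorted(shared, key=lambda tx: (median_rank[tx], tx))
    n = min(n, len(ordered) // 2)  # keep the two sets disjoint
    if n == 0:
        return [], []
    return ordered[-n:], ordered[:n]


# Toy fold-changes for four hypothetical transcripts in three patients.
p8 = {"A": 8.0, "B": 0.1, "C": 1.1, "D": 0.9}
p33 = {"A": 5.0, "B": 0.2, "C": 0.8, "D": 1.2}
p51 = {"A": 9.0, "B": 0.3, "C": 1.0, "D": 1.5}
up, down = commonly_misregulated([p8, p33, p51], n=1)
```

Working on ranks rather than raw fold-changes makes the selection insensitive to differences in the dynamic range of fold-changes between patients.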
Although rank ordering was our original approach, we have also employed a recently proposed likelihood ratio test combined with a fold-change cutoff to define sets of mis-regulated genes
mRNA samples were reverse transcribed by priming off the polyA tail, producing cDNA pools. Each cDNA pool was amplified by
cDNA was produced using 2 µg of isolated RNA and the High Capacity cDNA Reverse Transcription Kit with RNase Inhibitor (P/N 4374966, Applied Biosystems, Foster City, CA). A total of 250 ng of the cDNA product was preamplified using the Applied Biosystems TaqMan® PreAmp Master Mix Kit according to standard protocol (P/N 4391128). Briefly, 250 ng of cDNA was amplified for 14 cycles with a pool of 20X TaqMan Gene Expression Assays specific to the target genes (P/Ns 4331182 and 4351372, Applied Biosystems). Following preamplification, the product was diluted 1:20 in 1X TE buffer. The diluted preamplified cDNA was used in individual quantitative PCR reactions including TaqMan gene expression assays to measure the expression levels of target genes. To normalize the expression levels of each target gene, the cycle threshold (CT) of an abundantly expressed control gene (GAPDH, GUSB, PGK1, TBP; P/Ns 4333764F, 4333767F, 4333765F and 4333769F) was subtracted from the CT for each target gene of interest. This value was then used to calculate log gene expression changes across samples/conditions (ddCT).
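As a worked illustration of the normalization just described, the following Python sketch computes dCT and ddCT for one target gene. The CT values are invented, the function names are hypothetical, and interpreting −ddCT as a log2 (Tumor/Normal) expression change assumes that product doubles with every PCR cycle, which is a common convention rather than something stated here.

```python
def delta_ct(ct_target, ct_control):
    """Control-normalized CT: target CT minus control-gene CT."""
    return ct_target - ct_control


def delta_delta_ct(ct_target_tumor, ct_control_tumor,
                   ct_target_normal, ct_control_normal):
    """Difference of control-normalized CT values, tumor minus normal."""
    return (delta_ct(ct_target_tumor, ct_control_tumor)
            - delta_ct(ct_target_normal, ct_control_normal))


# Made-up CT values: the target crosses threshold earlier in the tumor.
ddct = delta_delta_ct(ct_target_tumor=22.0, ct_control_tumor=18.0,
                      ct_target_normal=25.0, ct_control_normal=18.5)

# Assuming perfect doubling per cycle, a lower CT means more template,
# so the log2 expression change is the negative of ddCT.
log2_fold_change = -ddct
```

In this toy case dCT is 4.0 in the tumor and 6.5 in the normal tissue, so ddCT is −2.5 and the implied log2 change is +2.5.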
Allelic imbalance (AI) is a difference in the nucleotide frequencies at a given genomic position between two RNA samples. To quantify AI at each position in the genome, our method first tallies di-colors from reads aligned to that position that represent either a reference nucleotide or a single nucleotide substitution, filtering from further analysis invalid di-colors, which likely represent sequencing errors or more complex mutations. The nucleotide frequencies are then compared between samples and the significance of the AI is determined by applying a χ2 test of independence (on a 2×4 contingency table).
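The χ2 test of independence on nucleotide counts can be sketched as follows, using only the Python standard library. This is an illustrative reimplementation under stated assumptions, not the code used in the study: it starts from per-nucleotide tallies (the di-color filtering step is not modeled), and it drops nucleotides observed in neither sample, so the degrees of freedom can be fewer than the three implied by a full 2×4 table. All names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (df 1 to 3),
    using the closed forms available for these small df."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("df must be 1, 2 or 3")


def allelic_imbalance_test(normal_counts, tumor_counts):
    """Chi-square test of independence on a 2 x k table of nucleotide
    counts (k <= 4). Returns (statistic, p_value)."""
    # Assumption: nucleotides absent from both samples are dropped,
    # because their expected counts would be zero.
    cols = [n for n in NUCLEOTIDES
            if normal_counts.get(n, 0) + tumor_counts.get(n, 0) > 0]
    rows = [[c.get(n, 0) for n in cols] for c in (normal_counts, tumor_counts)]
    row_tot = [sum(r) for r in rows]
    if len(cols) < 2 or min(row_tot) == 0:
        return 0.0, 1.0  # no test possible; report no evidence of AI
    col_tot = [rows[0][j] + rows[1][j] for j in range(len(cols))]
    total = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(len(cols)):
            expected = row_tot[i] * col_tot[j] / total
            stat += (rows[i][j] - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(cols) - 1)


# Toy position: balanced G/T in normal tissue, skewed toward G in tumor.
stat, p = allelic_imbalance_test({"G": 20, "T": 20}, {"G": 36, "T": 4})
```

Because the test compares nucleotide frequencies between the two samples, a position that is equally skewed in both tumor and normal tissue yields no signal, which matches the relative definition of AI used here.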
In our first attempt to investigate AI we only required that each genomic position have: (1) at least 15x coverage in the tumor and normal samples, (2) significant AI with χ2
cDNA was produced by reverse transcription using the High Capacity Reverse Transcription Kit (P/N 4322171, Applied Biosystems) following the manufacturer's instructions. as-qPCR and as-RT-qPCR were performed on an Applied Biosystems 7900HT Sequence Detection System. The 10 µl PCR mixture contained diluted RT products (for as-RT-qPCR) or 3 ng genomic DNA (for as-qPCR), 1x TaqMan® Genotyping Master Mix (P/N 4371357, Applied Biosystems), 0.3 µM allele-specific forward primer, 0.2 µM TaqMan® probe, and 0.9 µM reverse primer. The reactions were incubated in a 384-well plate at 95°C for 10 minutes, followed by 2 cycles of 95°C for 15 seconds and 58°C for 1 minute, and 48 cycles of 95°C for 15 seconds and 60°C for 1 minute. All allele-specific forward primers, TaqMan® probes and reverse primers were manufactured at Applied Biosystems. For each genomic coordinate of interest, each allele was quantified (in each gDNA and cDNA sample) by obtaining CT values from at least two (and typically three) replicate reactions. The mean across reactions was computed and a dCT value calculated by subtracting the mean CT value for the second allele from the mean CT value for the first allele. Genomic coordinates with gDNA assays yielding absolute dCT values greater than 4.0 were deemed to be homozygous for the allele with lower CT value, while all others were assigned a heterozygous genotype (
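The genotype-calling rule applied to the gDNA assays can be written as a short Python sketch. The function name, return labels and CT values are hypothetical; only the dCT definition (first allele minus second allele), the replicate minimum and the 4.0 threshold come from the description above.

```python
from statistics import mean


def call_genotype(ct_allele1, ct_allele2, threshold=4.0):
    """Call a genotype from replicate gDNA CT values for two alleles.
    Returns (genotype_label, dCT), where dCT is the mean CT of allele 1
    minus the mean CT of allele 2."""
    if len(ct_allele1) < 2 or len(ct_allele2) < 2:
        raise ValueError("at least two replicate reactions per allele")
    dct = mean(ct_allele1) - mean(ct_allele2)
    if abs(dct) > threshold:
        # The allele with the lower CT amplified earlier, so it is
        # taken to be the only allele present.
        label = "homozygous_allele1" if dct < 0 else "homozygous_allele2"
        return label, dct
    return "heterozygous", dct


# Made-up replicate CT values for two hypothetical positions.
hom_call, hom_dct = call_genotype([24.1, 24.3, 24.2], [31.0, 30.8, 31.2])
het_call, het_dct = call_genotype([25.0, 25.2], [25.9, 26.1])
```

In the first toy position the second allele amplifies almost seven cycles later than the first, so the position is called homozygous for the first allele; in the second the two alleles amplify within one cycle of each other and the position is called heterozygous.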
The ratio of the number of uniquely aligned reads in paired tumor and normal samples for any given genomic sequence window is an estimate of the copy number change in the window. We modified Segseq
To compare the CN changes observed between normal and tumor tissue to changes in gene expression, we calculated the differential expression of each genomic segment simply by summing the number of reads uniquely aligned to this region in the tumor sample and dividing it by the number of uniquely aligned reads to this region in the normal tissue sample. The resulting ratio was normalized by the ratio of the total reads uniquely aligned for each sample.
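The segment-level calculation amounts to a ratio of ratios, sketched below in Python. The function name and the read counts are hypothetical, and raising an error for segments with no reads in the normal sample is an assumption, since the handling of such segments is not described.

```python
def segment_differential_expression(tumor_reads, normal_reads,
                                    tumor_total, normal_total):
    """Tumor/normal ratio of uniquely aligned reads in a segment,
    normalized by the tumor/normal ratio of total uniquely aligned reads."""
    if normal_reads == 0 or normal_total == 0 or tumor_total == 0:
        raise ValueError("ratio undefined when a denominator is zero")
    return (tumor_reads / normal_reads) / (tumor_total / normal_total)


# Toy segment: 3x more reads in the tumor, but the tumor library is
# 1.5x deeper, so the normalized change is 2-fold.
ratio = segment_differential_expression(
    tumor_reads=300, normal_reads=100,
    tumor_total=30_000_000, normal_total=20_000_000)
```

Normalizing by the ratio of library sizes prevents a deeper tumor library from being mistaken for a genome-wide increase in expression.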
TaqMan Copy Number (CN) Assays were performed according to the manufacturer's instructions (P/Ns 4400293 and 4400296, Applied Biosystems, Foster City, CA). In all, the copy numbers of 23 genes of interest were measured across 23 samples, derived from six normal tissues, 14 tumors, and two Coriell Institute for Medical Research (Camden, NJ) gDNA controls (see
aCGH was performed using the Agilent Human Genome Microarray Kit 244K (Agilent Technologies, Santa Clara, CA) which contains ∼244,000 60-mer oligonucleotide probes spanning coding and non-coding genomic sequences with median spacing of 7.4 and 16.5 kb respectively. Arrays were analyzed using the Genepix 4200A scanner (Axon Instruments, Union City, CA) and the Agilent Feature Extraction software (v9.1). Copy number segments were obtained with the Agilent CGH Analytics software (v.3.4), using the ADM-1 algorithm and default settings
Whole transcriptome (WT) experimental protocol. The protocol used to prepare total RNA for SOLiD sequencing is diagrammed above. This approach achieves strand-specificity by employing end-specific ligation of sequencing adapters to RNA, prior to the cDNA synthesis step. The P1 sequencing adapter is an RNA/DNA complex that contains a 6 bp 3′ single-strand DNA overhang allowing it to hybridize only to the 5′ end of an RNA fragment/molecule and, likewise, the P2 adapter will hybridize only to the 3′ end. The ligase used is engineered specifically to prefer the types of double-stranded substrates produced by these hybridizations, effectively making proper hybridization a prerequisite for efficient ligation. So, when cDNA is sequenced off the P1 adapter we expect the read sequence to represent the underlying RNA in the 5′->3′ orientation and thus, after alignment, we can work out the genomic strand from which the RNA originated. Also, because RNA is fragmented prior to cDNA synthesis, the protocol is less biased with respect to the positional origin of inserts within transcripts.
(0.98 MB TIF)
Whole transcriptome (WT) alignment strategy. WT sequencing reads were analyzed using Applied Biosystems whole transcriptome software tools (
(7.97 MB TIF)
RNA degradation and rRNA removal. An aliquot (1 µl; ranging from ∼15–100 ng) of each of the indicated RNA samples was processed on an Agilent Bioanalyzer using a standard RNA nano chip. A good quality RNA sample should primarily show two distinct products representing the 18S and 28S rRNAs and produce RIN values of ∼9 using the standard Bioanalyzer conditions. While these two distinct products are visible in these samples, a large number of additional products are observed migrating at various sizes, indicating that these samples are compromised by degradation to varying degrees. The N8, T8 and N33 samples showed the greatest amount of degradation (RIN values 3.2, 4.4 and 3, respectively) while T33, N51 and T51 demonstrated less degraded RNA (RIN values 5.9, 6 and 6.1, respectively). The degree of fragmentation has a negative impact on the level of rRNA that can be removed from the sample using biotinylated capture probes. Any RNA fragments that lie outside the regions covered by the capture probes will not be effectively removed and can be captured and sequenced. Therefore, degraded RNA samples are expected to produce a higher number of tags representing rRNA than high quality intact RNA samples.
(7.99 MB TIF)
Validation of SOLiD whole transcriptome analysis with other gene expression measurement platforms. (A) Comparison of log2 (Tumor/Normal) values measured by the BeadArray microarray and SOLiD sequencing platforms. Pearson correlations are shown between the platforms, both within and between patients. (B–D) For each patient, a scatterplot of log2 (Tumor/Normal) values as measured by the BeadArray microarray and SOLiD sequencing platforms is shown. Points are colored by transcript abundance (blue indicating low and red indicating high abundance; there are roughly 5000 genes in each bin), revealing greater discordance for genes with low expression. (E–F) Eight down-regulated and eight up-regulated genes with expression measurements that were discordant between the SOLiD and BeadArray platforms were chosen for validation with TaqMan gene expression assays. Displayed are scatterplot comparisons of log2 (Tumor/Normal) expression as measured by (E) SOLiD and TaqMan (ρ = 0.84), and (F) BeadArray and TaqMan (ρ = 0.71).
(9.79 MB TIF)
Overview of allelic imbalance analysis. The top row contains histograms of allelic ratios for genomic positions in patients 8, 33, and 51 (as labeled). For a given genomic coordinate, the “allelic ratio” is the log2 of the number of reads aligned across that position that indicate the reference nucleotide divided by the number of reads that indicate the first non-reference nucleotide in dbSNP. Thus we only concern ourselves here with the subset of genomic positions for which an allele is listed in dbSNP and to which at least 15 reads are aligned. As expected if alleles tend to be expressed at equal levels, we see a trimodal distribution of allelic ratios, representing the three possible genotypes: homozygous reference, heterozygous and homozygous non-reference. Allelic ratio distributions for normal tissue samples and tumors are shown in red and blue, respectively. The bottom row contains histograms of χ2 allelic imbalance P-values for genomic positions in patients 8, 33, and 51 (as labeled). The P-values indicate the extent of allelic imbalance at a transcriptomic position and are calculated in the following manner: First di-colors from reads aligned to that position that represent either a reference nucleotide or a single nucleotide substitution are tallied. Invalid di-colors, which likely represent sequencing errors or more complex mutations, are filtered from further analysis. The nucleotide frequencies are then compared between samples (normal and tumor) by applying a χ2 test of independence (on a 2×4 contingency table). The true and simulated χ2 allelic imbalance P-value distributions are shown in red and blue, respectively. The true distribution is shifted to the right relative to the simulated distribution, signifying that many transcriptomic positions differ in nucleotide frequencies between normal and tumor samples more than expected by sampling alone.
(1.65 MB TIF)
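The two quantities described in the allelic imbalance overview can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names are hypothetical, the 15-read threshold is applied to the sum of reference and non-reference reads as a simplification, nucleotide columns with no reads in either sample are dropped before testing (so the degrees of freedom may be fewer than three), and di-color decoding and filtering are assumed to have been done upstream.

```python
import math

MIN_READS = 15  # coverage threshold stated in the caption


def allelic_ratio(ref_reads, alt_reads, min_reads=MIN_READS):
    """Return log2(ref/alt), or None when coverage is below the threshold.

    Assumption: coverage is approximated by ref_reads + alt_reads.
    """
    if ref_reads + alt_reads < min_reads:
        return None
    if ref_reads == 0:
        return float("-inf")  # apparently homozygous non-reference
    if alt_reads == 0:
        return float("inf")   # apparently homozygous reference
    return math.log2(ref_reads / alt_reads)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even degrees of freedom.
        return math.exp(-half) * sum(
            half ** k / math.factorial(k) for k in range(df // 2)
        )
    # Closed form for odd degrees of freedom.
    tail = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return tail


def allelic_imbalance_p(normal_counts, tumor_counts):
    """Chi-square test of independence on a 2x4 (sample x nucleotide) table.

    Each argument maps a nucleotide ("A", "C", "G", "T") to a read count.
    """
    bases = "ACGT"
    rows = [
        [normal_counts.get(b, 0) for b in bases],
        [tumor_counts.get(b, 0) for b in bases],
    ]
    # Keep only nucleotides observed in at least one sample.
    cols = [j for j in range(4) if rows[0][j] + rows[1][j] > 0]
    row_totals = [sum(r[j] for j in cols) for r in rows]
    total = sum(row_totals)
    if len(cols) < 2 or min(row_totals) == 0:
        return 1.0  # nothing to compare
    stat = 0.0
    for i in range(2):
        for j in cols:
            expected = row_totals[i] * (rows[0][j] + rows[1][j]) / total
            stat += (rows[i][j] - expected) ** 2 / expected
    return chi2_sf(stat, len(cols) - 1)


# Hypothetical position: balanced in normal tissue, skewed in the tumor.
normal = {"A": 20, "G": 20}
tumor = {"A": 35, "G": 5}
print(allelic_ratio(20, 20), allelic_ratio(35, 5))
print(allelic_imbalance_p(normal, tumor))
```

In this toy example the normal sample has an allelic ratio of 0 (balanced expression) while the tumor favors the reference allele, and the test returns a small P-value. The read counts are invented for illustration only.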
Validation of allelic imbalance. In our first round of validation, 27 transcriptomic positions identified by SOLiD sequencing to have AI were selected for genotyping and validation by allele-specific PCR (as-RT-qPCR). (A) The 27 positions and various associated statistics are listed, one position per row. The “p-value” indicates the significance of relative AI between conditions and is calculated from a χ2 test of independence on the read counts (
(3.96 MB TIF)
Validation of SOLiD CNV analysis with other CNV measurement platforms and across a panel of tumor samples. Comparison of SOLiD results to results from (A) TaqMan CNV assays and (B) Agilent 244K CNV microarrays. (A) There is strong concordance of copy number changes as measured by TaqMan and SOLiD across 23 assayed genes (ρ = 0.99). (B) Only high confidence results from the microarray platform (colored red) compare favorably with the SOLiD results (ρ = 0.97). Most of the low confidence microarray results (colored blue) are measured by a single probe, rather than multiple probes, on the array. (C) In addition to the tumor and matched normal samples of patient 8 (labeled “8_1”), the 23 TaqMan CNV assays (interrogating 23 genes of interest) were applied to a panel of 13 other tumor samples and a second section of normal/tumor tissue from patient 8 (labeled “8_2”). Two negative control gDNAs from Coriell were also assayed. Shaded in blue and yellow are genes with significant copy number decreases and increases, respectively (t-test; p-value < 0.05 for two separate assays, using either the RNAseP or TERT loci as controls). Values listed are log2 (Tumor/Normal_median). Matched normal tissues were only available for 5 of the 13 tumors; thus the median of the normal samples was used as the “normal sample” for each tumor/normal comparison and the variance for each t-test was estimated from the pool of normals. In general, the variance among normal samples is low; for patients where matched normal tissue is available, using the median normal rather than the true matched normal provides very similar measurements of copy number change.
(9.41 MB TIF)
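The copy number values in panel (C) are expressed as the log2 of each tumor measurement over the median of the normal samples. A minimal sketch of that calculation follows, assuming that each measurement is a positive, linear-scale relative copy number already normalized to a control locus. The function name and the example values are hypothetical, and the t-test with pooled normal variance is not reproduced here.

```python
import math
from statistics import median


def log2_vs_median_normal(tumor_value, normal_values):
    """Return log2(tumor / median(normals)) for one gene.

    Assumption: all values are positive linear-scale copy number estimates.
    """
    reference = median(normal_values)
    return math.log2(tumor_value / reference)


# Hypothetical gene: about two copies in the normals, about four in the tumor.
normals = [1.9, 2.0, 2.1]
print(log2_vs_median_normal(4.0, normals))  # copy number gain, positive value
print(log2_vs_median_normal(1.0, normals))  # copy number loss, negative value
```

Using the median rather than the mean keeps the reference robust to a single outlying normal sample, which matters when only a few normals are available.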
Coverage of annotated exons, transcripts and genes.
(0.03 MB DOC)
Normalized transcript counts and differential (tumor versus normal) transcript expression values measured by RNA-Seq.
(7.66 MB XLS)
Differential (tumor versus normal) gene expression values measured by microarray hybridization and RNA-Seq.
(5.96 MB XLS)
The three hundred most up-regulated and down-regulated genes in a comparison of tumor to normal tissue across the three patients.
(0.04 MB XLS)
Genes that are allelically imbalanced in tumor versus normal comparisons of at least one patient.
(0.05 MB XLS)
Genes that are allelically imbalanced in tumor versus normal comparisons of at least two patients. Genomic positions examined and other details provided.
(0.09 MB PDF)
Copy number changes of genomic segments in patient 8.
(0.05 MB XLS)
Copy number assay details.
(0.02 MB XLS)
We would like to thank J. Lewis, Department of Laboratory Medicine and Pathology, for examining all tissue slides, and N. Tombers, study coordinator for the Department of Otorhinolaryngology, Mayo Clinic. We would also like to acknowledge contributions from C. Barbacioru, C. Chen, F. Hyland, E. Langit, K. Li, H. Peckham, V. Simon, Y. Wu, and Z. Zhang, laboratory assistance from J. Chan, and helpful suggestions from K. McKernan, Q. Mitrovich, R. Nutter and R. Wicki.