Conceived and designed the experiments: DMR DCR SLS EOP LK. Performed the experiments: DMR DCR SLS EOP LK. Analyzed the data: DMR DCR SLS EOP LK. Contributed reagents/materials/analysis tools: DMR DCR SLS EOP LK. Wrote the paper: DMR DCR SLS EOP LK.
The authors have declared that no competing interests exist.
Personalized, or genomic, medicine entails tailoring pharmacological therapies according to individual genetic variation at genomic loci encoding proteins in drug-response pathways. It has been previously shown that steady-state mRNA expression can be used to predict the drug response (i.e., sensitivity or resistance) of non-genotyped mammalian cancer cell lines to chemotherapeutic agents. In a real-world setting, clinicians would have access to both steady-state expression levels of patient tissue(s) and a patient's genotypic profile, and yet the predictive power of transcripts versus markers is not well understood. We have previously shown that a collection of genotyped and expression-profiled yeast strains can provide a model for personalized medicine. Here we compare the predictive power of 6,229 steady-state mRNA transcript levels and 2,894 genotyped markers using a pattern recognition algorithm. We were able to predict with over 70% accuracy the drug sensitivity of 104 individual genotyped yeast strains derived from a cross between a laboratory strain and a wild isolate. We observe that, independently of drug mechanism of action, both transcripts and markers can accurately predict drug response. Marker-based prediction is usually more accurate than transcript-based prediction, likely reflecting the genetic determination of gene expression in this cross.
Realizing the promise of personalized medicine – a rational approach to tailoring pharmacological therapy to individual patients – is an area of intense research
A complementary diagnostic indicator of drug response is mRNA expression. This approach is less biased than candidate-gene approaches because it uses global transcriptional signatures of cells in the untreated state to predict drug response. Specifically, previous work, exemplified by Staunton
We sought to predict the response of each segregant to each small-molecule perturbagen (drug), or SMP, from patterns of gene expression measured in a neutral (i.e., SMP-free) medium. We classified each segregant as sensitive, resistant or partially resistant to a given SMP according to its final yield in that SMP; 226 SMP responses were tested (this represents 89 SMPs, with multiple responses to some SMPs measured at different time points and concentrations). The gene expression levels of the segregants classified as sensitive or resistant were used to train a support vector machine (SVM)
Box plots representing the distribution of prediction accuracies for all SMPs plotted against the number of features selected for prediction. (A) Results of marker-based prediction. (B) Results of transcript-based prediction.
After demonstrating our ability to predict SMP response using steady-state expression alone, we sought to compare these results to prediction based on genotypes. Linkage analysis is dependent on an association between a genotyped marker and a phenotype, in our case sensitivity or resistance to an SMP. Any response to an SMP that significantly links to a marker should therefore be well predicted by that same marker. We first used a much simpler algorithm than the one described above, wherein the genotype at the single most correlated marker was used to predict sensitivity or resistance. We repeated this process in a leave-one-out fashion for all classified segregants. Because we are using the most correlated marker, the response to SMPs exhibiting strong linkage should be easier to predict than response to SMPs exhibiting weak linkage or no linkage. On average we correctly predicted SMP response with 69% accuracy, but, as expected, prediction accuracy was good (75%) when a strong linkage signal was present (lod ≥4) and poor (55%) otherwise. When no strong linkage signal was present, the prediction accuracy was worse than the performance of the mode classifier, highlighting that in these instances the single most correlated marker offered almost no information to perform classification.
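The single-marker procedure described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code. It assumes that genotypes are coded 0/1 at each marker, that responses are coded 0 (sensitive) or 1 (resistant), that "most correlated" means the largest absolute Pearson correlation, and that the marker is re-selected from the training segregants in every leave-one-out round. The function names `pearson` and `loo_single_marker_accuracy` are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def loo_single_marker_accuracy(genotypes, labels):
    """Leave-one-out accuracy of a predictor based on the single best marker.

    genotypes: one row per segregant, one 0/1 entry per marker (assumed coding)
    labels:    0 = sensitive, 1 = resistant (assumed coding)
    """
    n = len(labels)
    n_markers = len(genotypes[0])
    correct = 0
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        y_train = [labels[i] for i in train]

        # Choose the marker most correlated with response, using training data only.
        best_marker, best_r = 0, 0.0
        for m in range(n_markers):
            r = pearson([genotypes[i][m] for i in train], y_train)
            if abs(r) > abs(best_r):
                best_marker, best_r = m, r

        # A positive correlation maps allele 1 to label 1; a negative one flips it.
        allele = genotypes[held_out][best_marker]
        predicted = allele if best_r >= 0 else 1 - allele
        correct += predicted == labels[held_out]
    return correct / n
```

When one marker tracks the response closely, as with a strong linkage signal, this predictor is accurate. When no marker correlates well, the selected marker carries little information and the prediction is close to a guess.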
We further sought to examine our ability to predict more complex SMP responses (those without strong linkage results). We trained support vector classifiers using the 1, 10, 50, 100, 200, 500 and 1000 highest-ranked markers. We found the support vector classifier trained on the 500 highest-ranked markers to have the greatest predictive power overall, correctly predicting the SMP response of 71.7% of the segregants on average for the SMPs considered (
Scatter plot of prediction accuracy (in percent) of (A) transcript-based prediction or (B) marker-based prediction versus SMP lod score when the 200 best features are selected.
A direct comparison of transcript-based prediction and marker-based prediction is presented in
Plotted are maximum predictive accuracies (in percent) of transcript-based prediction (y-axis) versus marker-based prediction (x-axis). The regression line is solid black; the diagonal (x = y) is dashed black; red points denote SMPs described in the main text as being well predicted by genotype but poorly predicted by expression.
Twenty-two SMPs are better predicted (>15% improvement) by markers than by transcripts, while no SMPs are better predicted by transcripts than by markers by the same margin. In fact, only 6 SMPs are better predicted by transcripts than by markers by 10%, and of these none by greater than 12.2% (
Next we considered cases where transcript-based prediction outperforms marker-based prediction. Expression outperforms genotype for 80 SMP response predictions above the diagonal in
We next asked whether combining both transcripts and markers into a single prediction algorithm would improve our ability to predict SMP response. First, we looked at the best prediction accuracy across all feature sets of both marker- and transcript-based prediction. In 80 out of 226 SMP responses tested, the best transcript-based prediction outperformed the best marker-based prediction, with an average improvement in accuracy of 4.8%. Interestingly, there are no distinguishing mechanistic characteristics of this group of 80 SMP responses (which, in some cases, includes the same compound tested at multiple concentrations or at multiple time points); in other words, they are structurally diverse and target a wide array of cellular processes. This suggests that transcript data can provide predictive information beyond that available from genotype data alone.
As a second test, we created a combined set of features that included all transcripts and markers, totaling over 9,000 features. We repeated the above-described process of selecting the best 1, 10, 50, 100, 200, 500 and 1000 features, and then used them to train a support vector classifier. The set of 500 features performed best on average with an accuracy of 72%, essentially the same as the marker-based prediction using the same number of features (71.6%). Interestingly, genotyped markers comprised over 95% of all selected features, with many (60) SMP response predictions based solely on marker features. These results are consistent with the observation that genotyped markers provide most of the information used in SMP response prediction. However, when transcripts are selected, they often encode gene products involved in biological processes affected by the SMP. For example, carbonylcyanide p-trifluoromethoxyphenylhydrazone (FCCP) is a proton ionophore that dissipates the mitochondrial membrane potential
We and others have previously shown that naturally recombinant yeast strains provide a model for the study of therapeutically relevant complex traits (i.e., small-molecule drug response)
Transcript-based prediction performs similarly to marker-based prediction for most SMP responses, and thus expression profiles provide a useful proxy when genotypes are not available. Expression may sometimes be a better predictor of compound response than genotype because expression can integrate many genetic changes, and may therefore reflect the overall physiological state of the cell rather than just the effect of one locus. On the other hand, expression may be a poorer predictor of compound response than genotype in cases when transcript levels of untreated cells are uncorrelated with transcript levels of drug-treated cells. Gene expression may capture the same information as genotypes for several reasons. First, a polymorphism may affect both gene expression and compound response independently (pleiotropy), with the expression levels providing a read-out of the inheritance at the locus. Second, a polymorphism that affects compound response may be linked to a different polymorphism that affects gene expression. The third and most interesting case involves polymorphisms that affect the expression of drug targets or other genes that function in the pathways that are involved in SMP response; in this case the expression changes provide direct functional information. Further functional studies are needed to distinguish these possibilities and quantify the prevalence of each.
We observed that expression provided little predictive power over genotype alone. In our system, genotype largely determines both expression levels and drug response; environmental conditions were kept constant during expression experiments, and only differed on the basis of SMP treatment in drug response experiments. We expect that gene expression will provide considerable additional predictive power when environmental variation is present, for example in human patients who will differ in diet, drugs taken, and other factors. This study demonstrates the benefit of having multiple sources of data in understanding complex pharmacogenomic traits.
Segregants were classified as sensitive or resistant based on the standard deviation from zero of each segregant's six replicate growth values. If a segregant's average growth rate in the presence of a given SMP was at least one standard deviation below zero, it was considered sensitive to that SMP; if the average growth rate was a standard deviation or more above zero, it was considered resistant. Segregants with standard-deviation ranges overlapping zero were not classified, and were removed from the analysis. Segregant growth rates at various time points were available for some of the 92 SMPs surveyed, providing a total of 333 sets of segregant growth rates. To detect a pattern between gene expression and sensitivity or resistance to a given SMP, a sufficient number of “sensitive” and “resistant” segregants is needed. Therefore only growth rate sets with at least 10 sensitive and 10 resistant segregants were analyzed, eliminating 107 growth rate sets and leaving 226.
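The classification rule above can be illustrated with a short sketch. It is a hedged reading of the text rather than the authors' implementation: it assumes that the six replicate growth values are already expressed relative to zero, that "standard deviation" means the sample standard deviation of those replicates, and that a segregant whose mean lies within one standard deviation of zero is left unclassified. The names `classify_segregant` and `usable_growth_set` are hypothetical.

```python
from statistics import mean, stdev


def classify_segregant(replicates):
    """Return 'sensitive', 'resistant', or None (unclassified) for one segregant.

    replicates: replicate growth values for one SMP, assumed to be expressed
    relative to zero as described in the text.
    """
    m, s = mean(replicates), stdev(replicates)
    if m + s <= 0:  # mean is at least one standard deviation below zero
        return "sensitive"
    if m - s >= 0:  # mean is at least one standard deviation above zero
        return "resistant"
    return None     # the one-standard-deviation range overlaps zero


def usable_growth_set(replicates_by_segregant, min_per_class=10):
    """Classify every segregant and keep the set only if both classes are large enough.

    Returns a dict of segregant -> label with unclassified segregants removed,
    or None when fewer than min_per_class segregants fall in either class.
    """
    labels = {}
    for name, replicates in replicates_by_segregant.items():
        call = classify_segregant(replicates)
        if call is not None:
            labels[name] = call
    n_sensitive = sum(1 for v in labels.values() if v == "sensitive")
    n_resistant = sum(1 for v in labels.values() if v == "resistant")
    if n_sensitive >= min_per_class and n_resistant >= min_per_class:
        return labels
    return None
```

Applying `usable_growth_set` to each growth rate set reproduces the filtering step in spirit: sets with too few sensitive or resistant segregants are dropped before any classifier is trained.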
Before applying the prediction algorithm, we reduced noise in the data by ranking genes according to their association with drug sensitivity. For each of the 226 sets considered, a stratified 10-fold cross-validation scheme was used to select features, train support vector classifiers, and test the classifiers. This involves random division of the data into ten similarly sized parts, each with a classification profile (in this case, the ratio of sensitive to resistant segregants) approximately representative of the full data set; one part is kept aside for testing the classifier, and the remaining nine parts are used for feature selection and then training; the full selection/training/testing process is carried out ten times using a different part of the data for testing each time. Feature selection (in our case, selecting the genes most relevant to segregant response to an SMP) was performed to reduce noise in the data and to make any pattern more readily identifiable. The feature selection algorithm, performed within each fold of the cross-validation scheme, used a support vector machine (SVM)
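The key point of the scheme described above is that features are re-selected inside every fold, so the held-out part never influences which features are used. The sketch below illustrates that structure only. It is not the authors' pipeline: as a loudly labeled stand-in it ranks features by the absolute difference of class means instead of an SVM-based ranking, and it classifies with a nearest-centroid rule instead of a support vector classifier. All function names are hypothetical, and exactly two classes are assumed.

```python
import random
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that each keep roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    return folds


def centroids(X, y, features):
    """Per-class mean of the chosen features."""
    out = {}
    for cls in set(y):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        out[cls] = [mean(r[f] for r in rows) for f in features]
    return out


def rank_features(X, y, n_keep):
    """Stand-in ranking: largest absolute difference between the two class means."""
    first, second = sorted(set(y))
    c = centroids(X, y, range(len(X[0])))
    order = sorted(range(len(X[0])),
                   key=lambda f: abs(c[first][f] - c[second][f]),
                   reverse=True)
    return order[:n_keep]


def cross_validated_accuracy(X, y, n_keep=500, k=10, seed=0):
    """Stratified k-fold accuracy with feature selection repeated inside each fold."""
    correct = 0
    for test_idx in stratified_folds(y, k, seed):
        held = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]

        # Select features and fit the classifier on the training part only.
        feats = rank_features(X_train, y_train, n_keep)
        cents = centroids(X_train, y_train, feats)

        # Predict each held-out sample by its nearest class centroid.
        for i in test_idx:
            point = [X[i][f] for f in feats]
            pred = min(cents, key=lambda cls: sum(
                (p - q) ** 2 for p, q in zip(point, cents[cls])))
            correct += pred == y[i]
    return correct / len(y)
```

Selecting features on the full data set before splitting would leak information from the test part into training and inflate the reported accuracy, which is why the selection step sits inside the loop.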
DCR gratefully acknowledges stimulating discussions with Ingo Steinwart.