Analyzed the data: P-RL GT BB. Wrote the paper: P-RL GT BB.
The authors have declared that no competing interests exist.
A major goal of large-scale genomics projects is to enable the use of data from high-throughput experimental methods to predict complex phenotypes such as disease susceptibility. The DREAM5 Systems Genetics B Challenge solicited algorithms to predict soybean plant resistance to the pathogen
Predicting complex phenotypes from genotype or gene expression data is a key step toward personalized medicine: the use of genomic data to improve the health of individuals, for instance by predicting susceptibility to disease or response to treatment
It is difficult to objectively measure progress on algorithmic challenges without standard benchmarks; within this context, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative
As a top performer on the second part of the challenge, we were invited to present our results at the DREAM5 conference and contribute to the DREAM5 collection in PLoS ONE; this paper describes our approach. We provide a comparison of several regularized regression models and find comparable performance of elastic net, lasso, and best subset selection. We also carefully analyze the level of noise in the data and consequent variability in performance and offer practical suggestions for similar data analysis and data pre-processing.
The data for this challenge were collected from a systems genetics experiment conducted at the Virginia Bioinformatics Institute
After infection with
The training data, from 200 RILs, thus consisted of a
We began our analysis for this challenge by computing correlation coefficients of the genotype and gene expression training features against the two phenotype variables. The magnitudes of these correlations guided our choice of modeling technique; we also later used correlation-sorted rank lists to limit the scope of computationally intense calculations to those features most likely to be relevant.
At first glance the highest correlations, above 0.3 for the expression data (
| Top correlations (absolute values) | Genotype: training | Genotype: random | Expression: training | Expression: random |
|---|---|---|---|---|
| Phenotype 1 | 0.2155 | 0.2404 | 0.3034 | 0.2835 |
| | 0.2122 | 0.2116 | 0.2976 | 0.2781 |
| | 0.2061 | 0.1862 | 0.2975 | 0.2749 |
| | 0.2054 | 0.1857 | 0.2963 | 0.2689 |
| | 0.2041 | 0.1851 | 0.2909 | 0.2611 |
| Phenotype 2 | 0.2433 | 0.2127 | 0.3441 | 0.2777 |
| | 0.2261 | 0.2104 | 0.3084 | 0.2684 |
| | 0.2198 | 0.2053 | 0.2990 | 0.2679 |
| | 0.2181 | 0.1928 | 0.2824 | 0.2642 |
| | 0.2180 | 0.1926 | 0.2754 | 0.2619 |
The top five correlations found in the training data are shown, as are the top five correlations against a random 0–1 matrix with the same dimensions as the genotype data and a random normal matrix replacing the gene expression data.
These observations suggest that most features have little or no predictive power, and hence proper regularization is crucial for modeling this dataset. Additionally, the small difference between training correlations and the random background distribution indicates that the prediction task at hand is difficult; the amount of signal in the data is likely quite small.
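The comparison against a random background can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names, the column-wise data layout, and the seeded generator are assumptions made for the example.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def top_abs_correlations(features, response, k=5):
    """Return the k largest |correlation| values; `features` is a list of columns."""
    scores = (abs(pearson(column, response)) for column in features)
    return sorted(scores, reverse=True)[:k]


def random_binary_features(n_features, n_samples, seed=0):
    """Assumed null model: independent 0/1 entries with the shape of the genotype data."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
```

Calling `top_abs_correlations` once on the real feature columns and once on the output of `random_binary_features` (with matching dimensions) gives the two lists to compare; if the lists are close, the strongest observed correlations are about what chance alone would produce among that many features.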
In light of the above considerations, we sought to keep our modeling simple and chose regularized regression as our general approach. Before fitting the data, however, we needed to ensure that the relation between predictor and response variables was as linear as possible, and so we considered data transformations and basis expansions.
Upon plotting the phenotype training data, we discovered that the variance in the distribution of phenotype 1 is dominated by outliers. Among the 200 measurements of phenotype 1, the largest outlier is 5.83 sample standard deviations from the mean. Moreover, the seven most deviant samples account for more than half of the total variance. For phenotype 2, the largest outlier is a substantial 3.77 standard deviations above the mean but overall the distribution does not have unusually long tails compared to a normal distribution. A plot of the fractions of variance explained by increasing subsets of largest outliers in phenotype 1, phenotype 2, and random data illustrates this behavior (
The largest seven outliers in phenotype 1 account for the bulk of the variance in the data; in contrast, the outlier distribution for phenotype 2 is similar to that of a random normal variable.
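One way to compute such an outlier profile is sketched below. The definition used here (the share of the total squared deviation from the sample mean carried by the k most deviant samples) is our assumption about how the fractions were obtained, and the function names are illustrative.

```python
from statistics import mean, stdev


def outlier_variance_fractions(values):
    """Entry k-1 is the fraction of total squared deviation from the mean
    contributed by the k samples farthest from the mean."""
    m = mean(values)
    squared_devs = sorted(((v - m) ** 2 for v in values), reverse=True)
    total = sum(squared_devs)
    if total == 0:
        raise ValueError("values are constant; variance fractions are undefined")
    fractions, running = [], 0.0
    for s in squared_devs:
        running += s
        fractions.append(running / total)
    return fractions


def largest_deviation_in_sd(values):
    """Distance of the most extreme sample from the mean, in sample standard deviations."""
    m = mean(values)
    return max(abs(v - m) for v in values) / stdev(values)
```

For a heavy-tailed variable the first few entries of `outlier_variance_fractions` rise steeply, whereas for normally distributed data the curve climbs gradually.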
Motivated by the Spearman correlation-based scoring scheme used in this challenge, which judges predictions based on ordering rather than absolute accuracy, we applied a rank transformation to phenotype 1 to remove the impact of outliers on regression models. More precisely, we replaced the numerical values of phenotype 1 measurements with their ranks among the 200 sorted samples. Because the approaches we applied minimized squared error (along with regularization terms), asking our models to predict ranks rather than actual values removed the heavy weight that outlier values would otherwise have received. Absolute predictions could of course be recovered by interpolation if desired.
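The rank transformation and the interpolation back to an absolute scale might look like the following sketch. The tie handling (average ranks) and the clamping of out-of-range predictions are assumptions made for the example, not details stated in the text.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_to_value(predicted_rank, train_values):
    """Map a predicted (possibly fractional) rank back to the original scale by
    linear interpolation between sorted training values."""
    s = sorted(train_values)
    n = len(s)
    r = min(max(predicted_rank, 1.0), float(n))  # clamp to the observed rank range
    lower = int(r) - 1
    upper = min(lower + 1, n - 1)
    return s[lower] + (r - int(r)) * (s[upper] - s[lower])
```

Because ranks are evenly spaced, an extreme measurement contributes no more to the squared-error loss than its neighbor in the ordering, which is the effect the transformation is meant to achieve.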
With only binary genotype data available for prediction in subchallenge B1, we hypothesized that the true phenotypic response for a genotyped sample would be far from linear. The simplest possible example of a nonlinear effect is interaction between genotype markers: for instance, if two genes act as substitutes for one another, their function is only suppressed if both are turned off. Similarly, if two genes are critical to different parts of a pathway, turning off either one would impair its function.
With these examples in mind, we considered applying logic regression
To gauge the efficacy of these combined features, we compared the largest fractions of variance explained by single boolean combination features (using single-variable least-squares regression) to the best fits obtained by two-variable regression on pairs of the original genotype features. Looking at the 20 best-performing regressions from each group (
The plot compares the best least squares fits attainable under three model types: single-variable regression using each genotype feature independently (blue), two-variable regression using pairs of features at once (green), and single-variable regression using pairs of features combined through a binary boolean relation (red). The best single-variable fits using boolean combination features outperform the best two-variable regressions.
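A sketch of how boolean combination features can be scored is given below. The particular relations used (AND, OR, XOR) are illustrative assumptions, since the set of boolean relations considered is not restated here; the fraction of variance explained by a single-variable least-squares fit equals the squared Pearson correlation, which is what `r_squared` computes.

```python
from itertools import combinations
from statistics import mean

# Illustrative choice of binary boolean relations (an assumption for this sketch).
BOOLEAN_OPS = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}


def r_squared(x, y):
    """Fraction of variance in y explained by least-squares regression on x alone."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_boolean_pairs(markers, response, top=20):
    """Score every marker pair under every relation; return the `top` best as
    (r_squared, first_marker_index, second_marker_index, relation_name)."""
    results = []
    for (i, a), (j, b) in combinations(enumerate(markers), 2):
        for name, op in BOOLEAN_OPS.items():
            combined = [op(u, v) for u, v in zip(a, b)]
            results.append((r_squared(combined, response), i, j, name))
    results.sort(reverse=True)
    return results[:top]
```

The number of candidates grows quadratically with the number of markers, so the best scores from this search are subject to the same selection effect as the top correlations discussed earlier.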
An important caveat to keep in mind when interpreting these measurements is that the number of feature combinations considered is very large (nearly 2 million), thus allowing random chance to inflate best performances as in the case of correlations examined above. Nonetheless, we expect that the relative trends are still informative.
Upon closer inspection of the best boolean combination markers, we discovered that some were near-trivial due to linkage disequilibrium (
The heat map shows Pearson correlations between pairs of genotype markers; most pairs have only slightly positive or negative correlations attributable to chance, but groups of nearby markers exhibit distinctly positive correlations.
Having taken steps to linearize the predictor-response relationship, we applied regularized regression to model the data. Classical linear regression on a predictor matrix
Our main approach of choice was elastic net regression
For the purpose of comparison, we also tried fitting the data with a simple best subset selection approach, which seeks to minimize squared error using only a limited number of regressors. (In the language of our above discussion, this constraint can equivalently be viewed as imposing an
Implementation details are as follows. For elastic net regression, we ran glmnet with
For best subset selection, we first filtered to the top 30 features with strongest correlations to phenotype (recomputed for each cross-validation training set). We then used simulated annealing to compute subsets of size 1–20 features obtaining approximately optimal linear fits to each training fold. The annealing procedure consisted of 5 runs of initialization with a random feature subset of the required size followed by 5000 iterations of attempted swaps, using a linear cooling schedule. Explicitly, the acceptance probability of a swap was
We evaluated our regression methods using 7-fold cross-validation on the 200-sample training set, measuring goodness of fit with Spearman correlation to match the DREAM evaluation criterion. We chose to use 7 folds so that our cross-validation test sets during development would have approximately the same size as the 30-sample gold standard validation set, allowing us to also estimate the performance variance to be expected on the validation set. We applied each regression technique–elastic net, lasso, and approximate best subset selection with simulated annealing–to fit phenotype 1 (rank-transformed) and phenotype 2 individually, using sets of regressors corresponding to the three subchallenges of DREAM5 Systems Genetics B: genotype only (B1), gene expression only (B2), and both genotype and expression (B3). Within subchallenge B1, we ran two sets of model fits, one using only raw genotype markers as regressors and the other using the boolean basis expansion described in Methods.
Because of the relatively small number of samples and large number of predictors, the random assignment of samples to cross-validation folds caused substantial fluctuation in performance, even when averaging across folds. We overcame this difficulty by running multiple cross-validation tests for each model fit using different fold assignments in each run (20 replicates for elastic net and lasso and 5 replicates for best subset selection), thus obtaining both mean performances and estimates of uncertainty in each mean. We chose regularization parameters for each method in each situation to optimize mean performance;
We tested elastic net, lasso, and approximate best subset selection on phenotypes 1 and 2 using regressor sets derived from the DREAM5 subchallenges B1, B2, and B3. In each case the regularization parameter(s) were chosen to optimize average Spearman correlation. We ran multiple cross-validation tests with different random fold splits to reduce uncertainty in mean performance and enable comparison between methods; error bars show one standard deviation.
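The repeated cross-validation protocol can be sketched as follows. The `fit` and `predict` callables are hypothetical placeholders for any of the regression methods, and reporting the standard deviation of the per-replicate means is our assumption about how the spread was summarized.

```python
import random
from statistics import mean, stdev


def _ranks(values):
    """1-based ranks with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (0.0 if either is constant)."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def repeated_cv_spearman(X, y, fit, predict, n_folds=7, n_repeats=20, seed=0):
    """Run k-fold cross-validation n_repeats times (n_repeats >= 2) with fresh random
    fold assignments; return (mean, standard deviation) of the per-replicate mean score."""
    rng = random.Random(seed)
    replicate_means = []
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            predictions = predict(model, [X[i] for i in test_idx])
            scores.append(spearman(predictions, [y[i] for i in test_idx]))
        replicate_means.append(mean(scores))
    return mean(replicate_means), stdev(replicate_means)
```

With 200 samples and 7 folds, each held-out fold contains 28 or 29 samples, close to the 30-sample gold standard set. Any feature filtering or parameter tuning belongs inside `fit` so that it is recomputed on each training fold.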
Overall, the three regularized regression techniques perform quite comparably. Note that elastic net regression necessarily always performs at least as well as lasso (because lasso corresponds to the elastic net with parameter choice
Comparing the different regressor sets, subchallenge B1 with genotype data only is clearly the most difficult. The availability of gene expression data in subchallenges B2 and B3 dramatically boosts average Spearman correlations to the 0.25–0.3 range for phenotype 1 (though performance for phenotype 2 is largely unchanged in the 0.15–0.2 range typical for all other cases). Unfortunately, our regression models did not attain a performance increase from B2 to B3 with the inclusion of genotype data along with expression data, nor did boolean basis expansion appear to help with performance on B1.
Surprisingly, the rank transformation we applied to phenotype 1 turned out to have the greatest impact of the pre-regression data transformations we attempted. For the purpose of comparison, we performed the same model-fitting as above using raw (untransformed) values of phenotype 1. In all cases the rank transformation increases average Spearman correlations considerably (
Each scatter plot shows predictions from one cross-validation run on the training data (blue points) as well as predictions of the fitted model for the gold standard test set (red points). For the elastic net modeling on rank-transformed data (right plot), predictions of phenotype 1 values on an absolute scale were obtained by interpolation. The reported values of
Spearman corr. before and after transformation

| Subchallenge (regressors) | Elastic net: before | Elastic net: after | Lasso: before | Lasso: after | Best subset: before | Best subset: after |
|---|---|---|---|---|---|---|
| B1 (genotype) | 0.058 | 0.107 | 0.054 | 0.095 | 0.092 | 0.167 |
| B1 (genotype with basis expansion) | 0.042 | 0.085 | 0.011 | 0.048 | 0.025 | 0.102 |
| B2 (expression) | 0.099 | 0.257 | 0.094 | 0.237 | 0.111 | 0.285 |
| B3 (genotype and expression) | 0.090 | 0.243 | 0.077 | 0.230 | 0.092 | 0.272 |
Applying the rank transform to phenotype 1 increases average cross-validated Spearman correlations for all regression approaches and regressor sets we tested. The performance improvement is especially large for subchallenges B2 and B3, where gene expression data is available.
Taking a closer look at the optimal regularization parameters for elastic net, lasso, and approximate best subset selection, we discovered strikingly low model complexity prescribed by cross-validation in each case. As an example, the blue curves of
Each plot follows the performance of a regression model as complexity increases. For lasso (top plots), model complexity is determined by a regularization parameter
With lasso, we likewise see that performance drops off quickly as model complexity increases; here, the complexity parameter
To better understand the strong regularization, we provide heat maps displaying the feature weight distributions chosen by the elastic net to predict phenotype 1 (rank-transformed) and phenotype 2 for a set of cross-validation runs on subchallenge B2 (
The heat maps show regression coefficients chosen by the best-fit elastic net models as each cross-validation fold is in turn held out of the training set. The features shown on the vertical axis are those having a nonzero coefficient in at least one of the seven runs; they are indexed by their rank in
As mentioned earlier, our cross-validation analysis also allows us to estimate the accuracy to which algorithm performance can be measured using a 30-sample test set. Unfortunately, we find that this test size is insufficient for accurate evaluation: whereas the greatest-weight features selected by our models are relatively stable from fold to fold (
The red curves of
Notwithstanding the caveat just discussed regarding uncertainty in results on a small test size, we include the final results from the DREAM5 Systems Genetics B challenge for completeness (
All teams had difficulty even achieving consistently positive correlations; we suspect the main obstacles were the large amount of noise in the data and the small 30-sample gold standard evaluation sets. We achieved the best performance on the test set used for subchallenge B2 (prediction using gene expression data only).
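To give a sense of scale for the sampling noise on a 30-sample test set, the sketch below estimates the standard deviation of the Spearman correlation when predictions are unrelated to the truth. It is an illustrative simulation under the assumption of no ties, not part of the original analysis; for comparison, the exact null variance in that setting is 1/(n - 1), or a standard deviation of about 0.186 for n = 30.

```python
import random
from statistics import pstdev


def null_spearman_sd(n, n_draws=2000, seed=0):
    """Monte Carlo estimate of the standard deviation of Spearman's rho between a
    fixed ordering and a uniformly random permutation of n untied samples."""
    rng = random.Random(seed)
    base = list(range(n))
    rhos = []
    for _ in range(n_draws):
        perm = base[:]
        rng.shuffle(perm)
        d2 = sum((a - b) ** 2 for a, b in zip(base, perm))
        # Closed form for Spearman's rho when there are no ties.
        rhos.append(1 - 6 * d2 / (n * (n * n - 1)))
    return pstdev(rhos)
```

A chance-level spread of this size is comparable to the cross-validated correlations of roughly 0.15 to 0.3 reported above, which is consistent with the difficulty of ranking methods on a single small test set.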
While the performance achieved by our methods–indeed, by every team's methods–is modest, our work does highlight a few important lessons in statistical learning and in the setup of algorithmic benchmarking challenges such as DREAM. Regarding the first, our analysis did not lead us to a radically new and complex model for the genotype-phenotype relationship in
One might hope that the transparency of such simple models can shed light on the underlying biological mechanism at work; while this may be possible, we also should caution against trying to glean more from the models than the data allow. Simplicity may be due to the involvement of only relatively few genes or just to the fact that heavy regularization makes models less prone to overfitting. In light of the noisiness of the dataset, we suspect the latter may be true. As a case in point, while we were disappointed that modeling pairwise interactions through boolean basis expansion did not improve fitting using the genotype data, we still find it quite plausible that such effects are at work and may aid modeling in situations when more data is available. With this dataset, our techniques were likely unable to discern these effects because the limited data size could not support the increased complexity that modeling interactions would entail.
Overall, while this contest was perhaps too ambitious for the data available, we feel it succeeded in stimulating research and discussion in the field. The original motivation of developing methodology for combining genotype and gene expression data to improve phenotype prediction remains a worthy goal and interesting open question.
We thank Michael Yu and Alberto de la Fuente for helpful discussions. We thank the DREAM5 organizers for designing and developing the challenge. We thank Alberto de la Fuente, Ina Hoeschele and Brett Tyler for contributing the soybean data and developing the challenge. We thank the two anonymous reviewers for suggestions that improved the clarity of the manuscript.