Reader Comments


an unpublished review

Posted by AndrewPeek on 04 Nov 2010 at 00:18 GMT


I keep an updated list of papers on machine learning methods that have been applied to siRNAs, and I updated that list on the PLoS website yesterday.

The following is an outline of a review that I may never get around to finishing.

Until I get some spare time to finish it, the outline might as well go here:



What have we learned from siRNA efficacy predictor modeling?

Gene expression knockdown by synthetic short interfering RNAs (siRNAs) has become a standard protocol in the short time the methodology has been available. During this time a heterogeneous set of methods has been used to computationally model siRNAs in order to 1) understand which features contribute to siRNA effectiveness and 2) develop learning systems to predict siRNA efficacy. The 40 works listed in Table 1 each provide greater depth on the features and learning methods being used to develop predictive models for siRNA effectiveness. However, despite these efforts to better understand the features and models that best describe effective siRNAs, many unaddressed questions and topics remain that could help to guide computational research in this area.

Several recent reviews provide an overview of the conclusions that can be drawn from siRNA modeling, and the intention here is not to duplicate those efforts but to offer some observations and suggestions for moving forward.

Generalization of the Model

Since it is often possible to find a learning method that predicts the observed data perfectly, the more relevant question is how the chosen features and learning algorithm perform on data not seen during model construction. The resubstitution rate, that is, the model’s performance on the data used for training, is therefore an overly optimistic measure of the model’s performance. Several methods exist to estimate the error bounds of a learning method or its generalization error rate. When proposing a method or a feature set selection process, it is important to understand the generalization error of that method or selection process.
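
As a rough illustration, here is a minimal sketch using scikit-learn with synthetic data standing in for siRNA features (not any published dataset), showing the gap between resubstitution accuracy and a cross-validated estimate for the same model:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an siRNA feature matrix and efficacy classes.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Resubstitution: score on the same data used for fitting (optimistic).
resubstitution = model.fit(X, y).score(X, y)

# Generalization estimate: 10-fold cross-validation.
cross_validated = cross_val_score(model, X, y, cv=10).mean()

print("resubstitution accuracy:", round(resubstitution, 2))    # typically ~1.0
print("cross-validated accuracy:", round(cross_validated, 2))  # noticeably lower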

The Rashomon Effect

How many unique or minimally overlapping feature sets can result in nearly identically performing models? Similarly, how many learning methods can be used to find nearly identically performing models? The instability, or non-uniqueness, of specific features appears to be common, particularly when model performance depends strongly on both the feature set and the learning method. Additionally, depending on the feature set under investigation, several learning methods may have similar performance. The observation that many nearly independent and distinct conclusions can be drawn from the same dataset suggests that care should be taken in interpreting any single feature set or learning method. The name “Rashomon effect” is due to Leo Breiman (2001, Statistical Science Vol. 16, No. 3, 199–231), after Akira Kurosawa’s film Rashomon, which retells a crime from four perspectives and leaves the viewer with no certainty about which perspective, if any, was correct.
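
A minimal sketch of the effect, again with scikit-learn and synthetic data rather than real siRNA features: two completely disjoint feature subsets can support nearly identical cross-validated performance.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Columns 0-5 are informative; columns 6-11 are linear combinations of them.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_redundant=6, shuffle=False, random_state=0)

model = LogisticRegression(max_iter=1000)
set_a = cross_val_score(model, X[:, :6], y, cv=10).mean()   # feature set A
set_b = cross_val_score(model, X[:, 6:], y, cv=10).mean()   # disjoint feature set B

print("feature set A:", round(set_a, 2))
print("feature set B:", round(set_b, 2))   # typically within a few percent of A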

TANSTAAFL

While not exactly the original meaning of There Ain’t No Such Thing As A Free Lunch, the No Free Lunch theorem states that there is no inherently superior learning system independent of the learning task. Claims of inherent superiority for neural networks, support vector machines, decision trees, or linear and logistic regression methods across all problem domains should therefore be met with warranted skepticism. All model-building systems have strengths and weaknesses in their assumptions, and finding that one method is superior or inferior to another is more a statement about how well the system being investigated suits the assumptions of that method.
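
A minimal sketch of the idea, using scikit-learn on two synthetic tasks (neither is siRNA data): the same pair of learners ranks differently depending on the problem.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Task A: an oblique linear boundary in 20 dimensions.
X_lin = rng.normal(size=(400, 20))
y_lin = (X_lin @ rng.normal(size=20) + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Task B: concentric circles, a strongly non-linear boundary.
X_circ, y_circ = make_circles(n_samples=400, noise=0.1, factor=0.5, random_state=0)

for name, X, y in [("linear task", X_lin, y_lin), ("circles task", X_circ, y_circ)]:
    lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()
    rf = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         X, y, cv=10).mean()
    print(name, "- logistic regression:", round(lr, 2), "random forest:", round(rf, 2))
# The linear model tends to win on the first task, the forest on the second.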

The Ugly Duckling

Similar to the No Free Lunch theorem, but for the features rather than the learning system: the Ugly Duckling theorem states that there is no problem-independent best set of features. Again, the similarity between patterns rests on implicit assumptions about the problem domain.
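
A minimal sketch with toy sequences (not real siRNAs): which candidate counts as the query’s nearest neighbour depends entirely on the representation chosen.

from collections import Counter

query = "AAAAUUUU"
candidates = {"X": "UUUUAAAA",   # same base composition, different order
              "Y": "AAACUUUU"}   # one substitution, slightly different composition

def composition_distance(a, b):
    # Distance between per-base count vectors (order-blind features).
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[n] - cb[n]) for n in "ACGU")

def hamming_distance(a, b):
    # Distance between position-specific features (order-aware).
    return sum(x != y for x, y in zip(a, b))

for name, seq in candidates.items():
    print(name, "composition:", composition_distance(query, seq),
          "hamming:", hamming_distance(query, seq))
# Under composition features X is nearest; under position-specific features Y is.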

The nice thing about standards is that there are so many of them to choose from.

There are a large number of measures of model performance. A short list might include: accuracy, average precision, lift, squared error, ROC area, F-score, cross-entropy, precision/recall break-even point, calibration, et cetera. I’m certain that everyone has a favorite measure and will defend it to the exclusion of the others, but if methods are to be comparable, a small number of commonly used measures should be agreed upon. Using a non-standard measure of model performance is fine, but not at the expense of omitting a more commonly reported one. Not all models are universal solutions, so it’s good to know both a model’s strengths and weaknesses, ideally across several measures of performance.
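
For illustration, a minimal sketch with scikit-learn on synthetic data, scoring the same fitted classifier under several of the measures listed above:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, average_precision_score,
                             brier_score_loss, f1_score, log_loss, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)              # hard class labels
prob = clf.predict_proba(X_te)[:, 1]  # class-1 probabilities

print("accuracy:           ", accuracy_score(y_te, pred))
print("F-score:            ", f1_score(y_te, pred))
print("ROC area:           ", roc_auc_score(y_te, prob))
print("average precision:  ", average_precision_score(y_te, prob))
print("cross-entropy:      ", log_loss(y_te, prob))
print("calibration (Brier):", brier_score_loss(y_te, prob))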

If it’s reproducible, it’s science.

Some data are proprietary; some software is closed source. These are facts of everyday life for informatics research. However, I would argue, along the same line of reasoning as Dudoit et al. (2003, BioTechniques 34:S45-S51, page 46), that both source code and suitable data need to be open and available to the scientific community.

A quote from Dudoit, Gentleman, and Quackenbush:

“Two obvious questions that arise are why anyone would want to release their software code and why others would want to add new utilities and functionality to someone else’s software. Aside from the obvious benefits of creating a community resource that can advance the field, there are several advantages to an open source approach to software development in a scientific environment, including:

• full access to the algorithms and their implementation, which allows users to understand what they are doing when they run a particular analysis

• the ability to fix bugs and extend and improve the supplied software

• encouraging good scientific computing and statistical practice by providing appropriate tools, instruction, and documentation

• providing a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data

• ensuring that the international scientific community is the owner of the software tools needed to carry out research

• promoting reproducible research by providing open and accessible tools with which to carry out that research (reproducible research as distinct from independent verification).”



These points are valid for both software and data. It is probably worth pointing out that the initial release of the Novartis dataset of 2431 siRNAs (Huesken et al. 2005, Nature Biotechnology) had the predicted knockdown values associated with each siRNA rather than the empirically observed knockdown values. One might see this in two ways: first, that the initial error caused problems, with predictive models being developed to predict predicted values; or second, more positively, that because the data were open to outside scrutiny, the original oversight was corrected in a rather short time. Had the data not been available, any subsequent debate about comparisons between model systems would have been moot for this dataset.

Claims of model performance that are not reproducible (reproducible in the same sense that reproducible research is distinct from independent verification) are simply weak assertions of an observation without the necessary documentation. If a molecular biologist is expected to document the set of experimental procedures and results, as gel images or other pieces of raw or reduced data, then computational biologists should be expected to show their procedures, namely the code and the data that generate the results.


Model Interpretation versus Accuracy

A large amount of work in the machine learning area has suggested for some time that these are somewhat polar opposites. If you want your model to be interpretable, it will probably be less accurate; conversely, boosted and random forest models tend to be highly accurate on many kinds of data, and yet their interpretation is difficult. Breiman refers to this as Occam’s dilemma, and perhaps these need to be distinct goals for learning methods. As biologists or chemists we want to point to a part of a molecule and say something like “this carbon is responsible for 30% of the activity of the drug in this model.” So perhaps we need distinct goals: one for developing highly accurate or precise models and a second for developing highly interpretable models. These are not necessarily at odds with one another, but complementary goals with different intended outcomes.
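
As a minimal sketch (scikit-learn, toy data containing an interaction effect; nothing here is siRNA-specific): a linear model hands back one coefficient per feature that one can point at, while the more accurate ensemble is a hundred small trees.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
# Class label depends on a feature interaction plus one linear effect.
y = (2 * X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=600) > 0).astype(int)

linear = LogisticRegression(max_iter=1000)
ensemble = GradientBoostingClassifier(random_state=0)

print("logistic regression:", round(cross_val_score(linear, X, y, cv=10).mean(), 2))
print("gradient boosting:  ", round(cross_val_score(ensemble, X, y, cv=10).mean(), 2))

# The linear model gives one coefficient per feature, something a biologist
# can point at; the fitted ensemble is one hundred shallow trees.
print("coefficients:", linear.fit(X, y).coef_.round(2))
print("trees in the ensemble:", ensemble.fit(X, y).n_estimators)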




Some Open Questions

Is classification or regression a better (or even a more reasonable) approach for siRNA efficacy modeling?

If performing classification, does choosing training data with the widest margin between active and inactive siRNAs lead to better predictors than simply dividing the dataset into two parts, which risks conflating the classes? (The two labeling schemes are sketched after these questions.)

Is there a siRNA dataset or subset that should be used as a gold standard?

Which model assessment metrics are appropriate when comparing among models?
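
A minimal sketch of the two labeling schemes in the second question above, using a hypothetical array of measured knockdown values; the 30%/70% cutoffs are arbitrary placeholders, not recommendations.

import numpy as np

rng = np.random.default_rng(0)
knockdown = rng.uniform(0, 100, size=1000)   # stand-in for measured % knockdown

# Scheme 1: split at the median -- every siRNA is labeled, but examples near
# the boundary may conflate the two classes.
median_labels = (knockdown >= np.median(knockdown)).astype(int)

# Scheme 2: wide margin -- keep only clearly active / clearly inactive siRNAs
# for training and discard the ambiguous middle band (cutoffs are arbitrary).
active = knockdown >= 70
inactive = knockdown <= 30
keep = active | inactive
margin_labels = active[keep].astype(int)

print("median split:", median_labels.size, "training examples")
print("wide margin: ", margin_labels.size, "training examples (middle band dropped)")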


Competing interests declared: author