Conceived and designed the experiments: SK NN EK. Performed the experiments: NN. Analyzed the data: SK NN. Contributed reagents/materials/analysis tools: NN. Wrote the paper: SK NN EK.
The authors have declared that no competing interests exist.
Dramatic improvements in high-throughput sequencing technologies have led to staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function than proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of
Functional annotation of genes is a fundamental problem in computational and experimental biology. The problem can be addressed at various levels of resolution, ranging from identifying the high-level processes with which a given protein might be associated to discovering the cell-specific protein-ligand interaction targets of a protein under different biological conditions. The most established and reliable methods for protein function prediction are based on sequence similarity using BLAST
Bayesian network methodologies for data integration have been explored
In this paper, our contribution is twofold. First, we propose a simple and relatively transparent probabilistic model for protein function prediction that allows us to efficiently calculate the posterior probability that each gene has a particular function, given various types of genome-wide data. Second, we analyze the effect of combining the heterogeneous data sources in a substantially more comprehensive manner than has been done to date, with the goal of better understanding just which types of genes benefit most from the integration of which types of data sources. In particular, we develop a relatively simple yet useful method to integrate functional linkage graphs with categorical information. The functional linkage graphs are constructed from PPI data and gene expression data. As usual the assumption here is that physically interacting proteins or co-expressed genes are more likely to share protein functions than a randomly selected pair of proteins
By combining five types of data, the number of correctly recovered known gene-term associations is increased by 18% at the same precision (50%), compared to using PPI data alone. In our analysis, we specifically focused on points on the ROC curve at which we believe experimental follow-up of the predictions is potentially feasible. We show that adding each type of genome-wide data newly recovers GO terms that are specific to that type of information. Also, by conducting a robustness analysis of the integration model to PPI edge removal, we provide a novel perspective on the amount of PPI data necessary for the integration model to obtain high prediction accuracy. In that analysis, we find some conditions under which integration actually hurts performance rather than improving it. Plausible functions are assigned to 463 currently unannotated proteins by our method, and we discuss some of these novel assignments.
From the GRID database
Four types of gene expression data, the Rosetta compendium data
From the MIPS
From the MIPS
From the MIPS
From the 06/03/2006 version of the Yeast SGD database
From PPI data and gene expression data, two different functional linkage graphs are obtained. Here, an edge in each functional linkage graph indicates that the two nodes (proteins) form one of the pairs constructed from the corresponding data set. For each GO label
Proteins can be associated with categorical features according to different types of categorical information. The categorical features that are used in our predictive methodology are defined below.
Protein motif (domain): Random variable
Phenotype: Random variable
Protein localization: Random variable
Naturally, a protein can have several features at the same time. Our aim is to integrate these sources of evidence in a smooth fashion to improve the accuracy and coverage of the functional predictors, based on the assumption that the presence of specific features in a protein can increase the probability that it has specific functions.
For each protein
We want to calculate the posterior probability given functional linkage graphs and category features of a protein
Applying Bayes' theorem, the posterior probability that gene
Here, we assume that the probability distribution of
Hence the neighborhood function (1) becomes:
The integration algorithm described in the
First, we attempted to predict known protein-term associations by 5-fold cross validation. For each gene
The ROC curve of recall experiment by 5-fold cross validation. Sensitivity is defined as #TP/(#TP+#FN), and specificity is defined as #TN/(#FP+#TN).
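The definitions in this caption can be made concrete with a minimal Python sketch. The function names are illustrative and not taken from the authors' Perl program; precision, which is used in later figures, is given here with its standard definition #TP/(#TP+#FP), which is an assumption since the text shown does not define it.

```python
def confusion_counts(predicted, actual):
    """Count (TP, FP, TN, FN) for two equal-length sequences of booleans.

    predicted[i] is True if the gene-term association i was predicted,
    actual[i] is True if the association is a known annotation.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """#TP / (#TP + #FN), as defined in the caption."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """#TN / (#FP + #TN), as defined in the caption."""
    return tn / (fp + tn) if (fp + tn) else 0.0


def precision(tp, fp):
    """#TP / (#TP + #FP); standard definition, assumed here."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

Sweeping a threshold over the posterior probabilities and plotting sensitivity against 1 - specificity at each threshold would trace out a ROC curve of this kind.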
#TP at 50% precision (Upper) and #TP at 80% precision (Lower). Here,
These two levels of precision, i.e., 50% and 80%, were chosen as being reasonably representative of the range of possible improvements observed in our study. In addition to the performance characteristics just described, we also examined the issue of falsely predicted proteins, as a function of the threshold applied to posterior probabilities. Using the method of
Next, we analyzed whether prediction accuracy depends upon the functional category to be predicted. It is expected that the prediction performance of specific GO terms depends on what kinds of data sources one uses.
Here is an example of how combining different types of data helps to predict protein function more specifically. Genes YKR055W, YIL118W and YJL128C are annotated with the GO term “intracellular signaling cascade”, but neither the PPI data nor the protein motif information alone can predict this term for these proteins. When PPI data alone is used, the GO term “signal transduction”, which is a parent of “intracellular signaling cascade” in the GO hierarchy and hence a broader term, can be predicted. However, when both PPI data and protein motif information are used, the more specific term is predicted correctly. In this case, the information that the proteins have the protein motif “protein kinases signatures and profile” or “prenyl group binding site” helps to predict the more specific term “intracellular signaling cascade”.
In the recall experiment in Section 3.1, we showed that PPI data is the strongest source of evidence for protein function prediction in our model, compared to other data sources. Here, we want to know whether our integration model works well or not when the amount of PPI data is limited. In this experiment, a certain fraction of the PPI edges are randomly removed from the original PPI network, and then protein function is predicted using our integration model.
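The edge-removal step described above can be sketched as follows. This is a minimal illustration, assuming the PPI network is held as a list of protein pairs; the function name, the `seed` parameter, and the rounding rule are hypothetical and not taken from the authors' implementation.

```python
import random


def remove_random_edges(edges, fraction_removed, seed=0):
    """Return a random subset of `edges` with the given fraction removed.

    edges: iterable of (protein_a, protein_b) pairs.
    fraction_removed: value in [0, 1]; 0.3 removes about 30% of the edges.
    seed: fixed seed so that a run can be reproduced (an assumption).
    """
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be between 0 and 1")
    edge_list = list(edges)
    n_removed = round(len(edge_list) * fraction_removed)
    rng = random.Random(seed)
    # Sampling without replacement keeps each surviving edge at most once.
    return rng.sample(edge_list, len(edge_list) - n_removed)
```

Repeating the prediction on the reduced network for several values of `fraction_removed`, and ideally for several seeds, would give one point per setting on a robustness curve.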
#TP at 50% precision (Left) and #TP at 80% precision (Right) with varying amounts of PPI edges. At 50% precision, the integration model always wins over the PPI model. However, at 80% precision, the integration model wins only when more than 50% of the original PPI edges are present.
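One way to read "#TP at a given precision" is as the largest number of true positives obtainable at any score cutoff whose precision is at least the target. The sketch below implements that reading; it is an assumption about how such a count could be computed, not a description of the authors' procedure, and the function name is illustrative.

```python
from itertools import groupby


def true_positives_at_precision(scored, target_precision):
    """Largest #TP over all score cutoffs with precision >= target.

    scored: iterable of (posterior_score, is_correct) pairs, one per
            predicted gene-term association.
    Predictions with equal scores are accepted or rejected together.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    tp = fp = best = 0
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        for _, is_correct in group:
            if is_correct:
                tp += 1
            else:
                fp += 1
        # Precision if the cutoff is placed just below this score.
        if tp / (tp + fp) >= target_precision:
            best = max(best, tp)
    return best
```

Comparing this count between two models at the same target precision (for example 0.5 and 0.8) mirrors the comparison made in the figure.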
By integrating five types of data, we assign plausible GO terms to 463 proteins among 1481 currently unannotated yeast
Among the predicted function of unannotated proteins, recent literature reported
We confirmed that 20 of the 463 function predictions for unannotated proteins are consistent with the conclusions of recent publications. We expect that many of our predictions will turn out to be true after validation experiments.
All the biological data and a Perl program used in this analysis are available at:
In this paper, we propose a probabilistic method to predict protein function from multiple types of genome-wide data. Pair-wise information between proteins, such as PPI data or co-expression information, is converted into a functional linkage graph, in which an edge between nodes represents evidence for protein function similarity. Category information, such as protein motif information, mutant phenotype data, and protein localization data, is combined with the functional linkage graphs using a unified probabilistic framework. We showed in our 5-fold cross validation experiment that our method successfully improved prediction accuracy and coverage by integrating five types of genome-wide data. Also, by conducting a robustness analysis of the integration model to PPI edge removal, we showed that a certain amount of PPI data is necessary for the integration model to obtain high prediction accuracy. We proposed functional predictions for 463 currently unannotated proteins. One subjective aspect of our method is the choice of 0.85 as the threshold on the correlation coefficients used to construct our co-expression functional linkage graph. However, we have found our results to be quite robust to this choice; for example, even much higher thresholds yield qualitatively similar results. In principle, a more objective choice of threshold could be made through the use of cross-validation, but this would come at the cost of an increased computational burden. Another limitation is that we assume conditional independence between the different types of functional linkage graphs and each informational category. Of course, this assumption might not always be correct in a biological sense. For example, some physically interacting protein pairs are also co-expressed.
However, previous literature has reported that Naive Bayes frequently tends to work well, and frequently better than more sophisticated classifiers, when the data are sparse compared to the dimensionality of the problem, even when the features (e.g., in our case, the functional linkage graphs and category feature vectors) are not truly conditionally independent
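To illustrate what the conditional independence assumption buys computationally, here is a generic Naive Bayes sketch in Python. It is not the authors' model: the function name and the representation of each evidence source as a single likelihood ratio are assumptions made for illustration only.

```python
import math


def naive_bayes_posterior(prior, likelihood_ratios):
    """Posterior P(function | evidence) under conditional independence.

    prior: P(protein has the function) before seeing any evidence.
    likelihood_ratios: one value per evidence source,
        P(evidence_i | has function) / P(evidence_i | lacks function).
    Because the sources are assumed independent given the label,
    their log likelihood ratios simply add to the prior log-odds.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, with a prior of 0.1 and two sources with likelihood ratios 3 and 2, the prior odds of 1/9 become 6/9, giving a posterior of 0.4. If two sources are in fact correlated, as with interacting pairs that are also co-expressed, their shared evidence is counted twice under this scheme.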
Although the result presented here is a case study for yeast
Improved GO terms by adding gene expression data at 50% precision.
(0.02 MB XLS)
Improved GO terms by adding protein motif data at 50% precision.
(0.02 MB XLS)
Improved GO terms by adding phenotype data at 50% precision.
(0.02 MB XLS)
Improved GO terms by adding localization data at 50% precision.
(0.02 MB XLS)
Prediction result of unannotated genes. N1 is the number of neighbors in PPI network, k1 is the number of t-labeled (t is the predicted GO term) neighbors in PPI network, N2 is the number of neighbors in co-expression network, and k2 is the number of t-labeled neighbors in co-expression network.
(0.23 MB XLS)
We thank Dr. Lingang Zhang, Dr. John Rachlin and Dr. T. M. Murali for helpful discussions, and the anonymous reviewers for helpful comments on the manuscript.