Conceived and designed the experiments: SR DM TL MWW. Performed the experiments: SR DM HP. Analyzed the data: SR DM HP TL MWW. Contributed reagents/materials/analysis tools: DM HP. Wrote the paper: SR TL MWW.
The authors have declared that no competing interests exist.
Computational prediction of protein interactions typically uses protein domains as classifier features because they capture conserved information about interaction surfaces. However, approaches relying on domains as features cannot be applied to proteins without any domain information. In this paper, we explore the contribution of pure amino acid composition (AAC) to protein interaction prediction. This simple feature, which is based on normalized counts of single amino acids or pairs of amino acids, is applicable to proteins from any sequenced organism and can be used to compensate for the lack of domain information.
AAC performed on par with domain-based protein interaction prediction on three yeast protein interaction datasets. Similar behavior was obtained using different classifiers, indicating that our results are a function of features and not of classifiers. In addition to yeast datasets, AAC performed comparably on worm and fly datasets. Prediction of interactions for the entire yeast proteome identified a large number of novel interactions, the majority of which co-localized or participated in the same processes. Our high confidence interaction network included both well-studied and uncharacterized proteins. Proteins with known function were involved in actin assembly and cell budding. Uncharacterized proteins interacted with proteins involved in reproduction and cell budding, thus providing putative biological roles for the uncharacterized proteins.
AAC is a simple yet powerful feature for predicting protein interactions, and can be used alone or in conjunction with protein domains to predict new interactions and validate existing ones. More importantly, AAC alone performs on par with existing, more complex features, indicating the presence of sequence-level information that is predictive of interaction but is not necessarily restricted to domains.
Protein interaction networks are networks of physical interactions among proteins and constitute an important component of the bio-molecular network in cells. Capturing the complete set of protein interactions is crucial for understanding the programs for cellular response to different environmental stresses. Although high-throughput technology has advanced our knowledge of proteomes of many organisms
Computational prediction of protein interactions is becoming increasingly popular because it provides an inexpensive way of predicting the most likely set of interactions at the entire proteome scale
Protein domains are the most commonly used static features for classification of protein interactions. Although protein domains yield high accuracy classifiers by incorporating evolutionarily-conserved information, these classifiers can only predict interactions between proteins with known domain information. In this paper, we ask whether we can predict interactions among proteins without relying on domain information and, if so, how complex our features need to be to perform as well as classifiers using domains as features. In particular, we focus on evaluating classifiers that use simple amino acid composition (AAC) features, which are based purely on normalized counts of single or pairs of amino acids for predicting protein interactions. Approaches that do not rely on domains typically use more complex sequence features comprising
We performed a systematic analysis of the contribution of AAC to the prediction of protein interactions focussing on different types of datasets (Co-complex, Two-hybrid, Protein Complementation Assay) and classifiers (Maxent, Support vector machines, Naive Bayes). This allowed us to assess the predictive power of AAC over a range of datasets and classifier types.
Interactions predicted in yeast
Finally, we combined predictions from classifiers trained on the three yeast datasets to generate a high confidence yeast interactome. Our predicted interactions had significantly higher tendency to co-express, co-localize, and participate in the same process as compared to the predicted non-interactions, providing expression and gene ontology-based support of our interactions. Our predicted interactions also included several uncharacterized proteins, including a highly connected hub, YJR151W-A, to which we assigned putative functions based on their interaction partners
Overall, AAC has these benefits: (a) AAC is a simple, yet powerful feature which performs surprisingly well given its simplicity, (b) AAC can be used to predict protein interactions irrespective of domain information availability, allowing interaction predictions among uncharacterized proteins for which domain information is scarce, (c) good performance of AAC is independent of the classification framework, and, (d) extraction of AAC features is computationally much more tractable than other non-domain features, making them easily applicable to higher organisms with lengthy protein sequences.
We first compared AAC against the evolutionarily-rich protein domain features for predicting interactions in the three yeast interaction datasets. We then compared AAC against the tuple and signature product features, which, like AAC, do not require protein domain information, on yeast, worm and fly datasets. We then performed a post-hoc feature analysis to identify the AAC features that were most beneficial for predicting interactions. Finally, we used classifiers combining AAC and domains to predict the complete yeast interactome and validated novel interactions using Gene Ontology.
The goals of the comparative analysis were: (a) to determine how well a simple feature like AAC performed against well-known features such as domains, (b) to assess if AAC features can improve performance when used in combination with domains, (c) to compare AAC to other non-domain sequence features such as the tuple feature
We trained and tested classifiers on the three yeast datasets (TWOHYB, AFFMS, PCA), selecting only protein pairs for which domain information was available for both proteins. We selected only protein pairs with domains to have a fair and direct comparison against a classifier that relies only on domains for interaction prediction (
Results are for three classifiers (MAXENT, SVM, NAIVE BAYES) over three yeast datasets (TWOHYB, AFFMS, PCA). The error bars are obtained from five-fold cross validation.
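The five-fold cross-validation behind such error bars can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function names, the shuffling seed, and the choice of the sample standard deviation as the error bar are all assumptions.

```python
import random
import statistics
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once and dealt into k folds; each fold serves
    as the test set exactly once.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def summarize(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-fold scores.

    Assumption: the error bar is the standard deviation across folds.
    """
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)
```

Each fold's score (for example an AUC) would be computed on the held-out indices and passed to `summarize` to obtain the bar height and its error bar.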
To assess the value of combining AAC with evolutionarily-rich domain features, we combined domains with AAC monomer and AAC dimer features and compared the performance of classifiers using the combined set of features against classifiers using either of these features alone. We estimated performance on the protein pairs for which we had domains to allow comparison against a classifier which used only domains as features (
Results are for SVM classifier.
We compared the performance of AAC with other non-domain features, which can also predict interactions between proteins lacking domain information. The two features that we evaluated were the tuple feature from Gomez
We compared AAC against the tuple and Sigprod features on three yeast datasets (TWOHYB, AFFMS, PCA). The three datasets were each split into two parts: protein pairs with domains and protein pairs without domains. We report the performance on protein pairs with domains (With domains), on protein pairs without domains (No domains) and on the complete dataset (All protein pairs). The AUC scores on protein pairs without domains evaluated how well non-domain features, including AAC, are able to predict interactions (or non-interactions) among proteins for which no domain information is available. The AUC scores on the complete datasets evaluated the overall performance of different features on protein pairs irrespective of domain information availability. These results are for the SVM classifier, because it provides performance numbers for all features (Sigprod is specific to an SVM classifier). Results for the Maximum entropy classifier are similar (Supporting
On protein pairs without domains (
Non-domain features (Tuple, Sigprod) were compared on protein pairs with domains (With domains), pairs without domains (No domains) and on the entire dataset (All protein pairs).
On protein pairs with domains (
In addition to comparing AAC features on the three yeast datasets described above, we also compared AAC features on two-hybrid datasets from worm and fly (
The fact that AAC monomer and dimer features can perform on par with more complex features such as tuples, domains or signature products is very surprising considering the simplicity of these features. To investigate what makes AAC a good feature for protein interaction prediction, we considered a classifier using both domains and AAC features and obtained the AAC features that occurred among the
We considered the true positives among the top
The top
We found that several of the AAC monomer and dimers were statistically over-represented in regions representing protein-protein interaction domains (
To visually illustrate that the dimers were capturing meaningful information of interactions we considered two proteins, EFT2 from the high-confidence interacting pairs, and RNR1, from the high-confidence non-interacting pairs (
Only dimers important for prediction are shown with the rest of the protein structure as
We examined the overlap between the statistically over-represented AAC monomers and dimers from the three datasets using
Dataset combination | AAC Features |
AFFMS, TWOHYB, PCA | A I W |
AFFMS, TWOHYB | V A |
AFFMS, PCA | L G |
TWOHYB, PCA | FL G GL LA LG WA W |
AFFMS | A |
TWOHYB | FG FI FV GM I |
PCA | AA AF AI AV CF CM F FA FF F |
AAC features enriched in domains in different combinations of the three datasets. Each row represents the features that were exclusive to the dataset combination in the first column. A: Alanine, C: Cysteine, D: Aspartic acid, E: Glutamic acid, F: Phenylalanine, G: Glycine, H: Histidine, I: Isoleucine, K: Lysine, M: Methionine, Q: Glutamine, R: Arginine, T: Threonine, V: Valine, W: Tryptophan, S: Serine, Y: Tyrosine. Bold indicates polar, and underline indicates charged. Non-bold indicates non-polar.
Features that were exclusive to AFFMS included all the charged amino acids (D, E, K, R, H), one polar amino acid (Q), and the remaining non-polar amino acids (A, G, M, V, I). In contrast, PCA and TWOHYB had very few charged amino acids: only Aspartic acid (D) in TWOHYB, and Aspartic acid (D) and Arginine (R) in PCA. The presence of all charged amino acids in AFFMS suggests that charge may be important for forming large protein complexes. Features exclusive to TWOHYB had only one polar amino acid (Glutamine, Q); the remaining were all non-polar (F, G, I, V, M). Finally, features exclusive to PCA included polar (Q, T, Y), charged (D, R) and non-polar amino acids (A, F, C, M, V, W). Overall, PCA had the widest range of amino acids, even though it was the smallest dataset.
Our post-hoc analysis of important features led us to conclude that several AAC monomers and dimers were significantly enriched in domains involved in protein interactions, but the specific features that were deemed important depended on the dataset: features involving charged amino acids in AFFMS, and non-polar amino acids in TWOHYB and a mixture of polar and non-polar amino acids in PCA.
To predict interactions in the entire yeast genome, we trained three classifiers on the AFFMS, PCA and TWOHYB datasets. The predicted interactome was created from the intersection of the interaction sets predicted by each classifier. We considered intersections at different confidence levels, ranging from 80% to 95%, and identified the number of known interactions at each confidence level (
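The intersection step can be illustrated with a short sketch. It assumes that each classifier outputs a probability-like confidence per protein pair and that a pair is retained when all three classifiers meet the chosen threshold; the function names and the data layout are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set

# An unordered protein pair; a self-interaction is frozenset({"A"}).
Pair = FrozenSet[str]


def confident_pairs(scores: Dict[Pair, float], threshold: float) -> Set[Pair]:
    """Pairs that one classifier predicts to interact at or above the threshold."""
    return {pair for pair, p in scores.items() if p >= threshold}


def consensus_interactome(
    per_classifier_scores: Iterable[Dict[Pair, float]], threshold: float
) -> Set[Pair]:
    """Intersect the confident predictions of every classifier."""
    sets = [confident_pairs(s, threshold) for s in per_classifier_scores]
    if not sets:
        return set()
    return set.intersection(*sets)
```

Raising the threshold can only shrink each per-classifier set, and therefore the intersection, which matches the monotone decrease in predicted interactions from the 0.80 level to the 0.95 level.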
Confidence level | Predicted | Known | Predicted + self loops | Known + self loops |
0.95 | 1144 | 86 | 1412 | 197 |
0.90 | 4352 | 194 | 4862 | 373 |
0.85 | 9769 | 313 | 10495 | 532 |
0.80 | 17084 | 449 | 18030 | 708 |
Number of predicted (using AAC and domains) and known interactions, where known interactions are those present in either AFFMS, TWOHYB or PCA.
Because many of our interactions were novel, we carried out preliminary validation using expression data and gene ontology categories
Co-expression is measured by Pearson's correlation coefficient.
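For reference, Pearson's correlation coefficient between two expression profiles can be computed as in the sketch below. The function name and the handling of constant profiles are assumptions, and the expression dataset itself is not modeled here.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant profile.
        raise ValueError("constant profile")
    return cov / math.sqrt(var_x * var_y)
```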
We further analyzed these interactions for co-localization, co-function, and co-process using GO Slim terms and found that proteins predicted to interact tended to co-localize, or participate in the same processes more than the proteins predicted to not interact (
Predicted interactions and non-interactions at different confidence levels were analyzed for different types of co-annotation: co-process, co-location and co-function.
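One simple way to quantify co-annotation is the fraction of protein pairs that share at least one GO Slim term. The sketch below assumes that definition; the function name, the data layout, and the decision to skip pairs in which either protein is unannotated are illustrative choices, not details taken from the analysis.

```python
from typing import Dict, Iterable, Set, Tuple


def co_annotation_rate(
    pairs: Iterable[Tuple[str, str]], annotations: Dict[str, Set[str]]
) -> float:
    """Fraction of annotated pairs whose proteins share at least one term.

    Assumption: pairs with an unannotated protein are excluded from the
    denominator rather than counted as non-sharing.
    """
    shared = 0
    total = 0
    for a, b in pairs:
        terms_a = annotations.get(a)
        terms_b = annotations.get(b)
        if not terms_a or not terms_b:
            continue
        total += 1
        if terms_a & terms_b:
            shared += 1
    return shared / total if total else 0.0
```

The same function would be applied separately to process, location and function annotations, and separately to predicted interactions and predicted non-interactions, so that the two rates can be compared.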
We identified 1412 high confidence (95%) interactions, including 197 existing interactions. We examined more closely the most highly connected nodes (hub nodes) of this high confidence network, where a hub was a node with
Gene ontology enrichment of the hubs identified cell budding, cytokinesis and mRNA stability and catabolism as additional enriched processes. Other protein hubs were also involved in a variety of processes including nuclear transport (KAP95, SRP1), transcription (NOT3, NAB3) and telomere maintenance (GAL11, STO1). Because hubs captured the majority of the interactions, we concluded that interactions in the high confidence network were involved in cell-budding, actin assembly, nuclear pore transport and mRNA stability.
One of our goals, using sequence-based interaction classifiers, was to capture and analyze interactions among proteins that cannot be analyzed using domain-based methods. This is especially useful for
The uncharacterized ORFs are in magenta and the characterized are in yellow.
ORF | Degree | Putative function |
YJR151W-A | 37 | protein-RNA complex assembly, mRNA processing, asexual reproduction |
YGR174W-A | 9 | cell budding, asexual reproduction |
YAR035C-A | 8 | cell budding, asexual reproduction |
YGL007C-A | 8 | cell budding, asexual reproduction |
YMR124W | 5 | organelle organization and biogenesis |
Degree specifies the number of interaction partners of a protein.
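Degree, as defined above, can be computed directly from a list of predicted interactions. In this sketch the hub cutoff is left as a parameter (`min_degree`, a hypothetical name), and self-interactions are not counted as partners, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct interaction partners per protein.

    Assumption: self-loops are ignored, and duplicate or reversed edges
    are counted once.
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        partners[a].add(b)
        partners[b].add(a)
    return {node: len(p) for node, p in partners.items()}


def hubs(edges: Iterable[Tuple[str, str]], min_degree: int) -> List[str]:
    """Proteins whose degree is at least min_degree, sorted by name."""
    return sorted(n for n, d in degrees(edges).items() if d >= min_degree)
```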
We have described a novel sequence-based feature, amino acid composition (AAC), that can be used to predict protein interactions in different organisms. Compared to other sequence-based features, AAC is much simpler because it models very few sequential dependencies (unlike domains and tuples) and no explicit pairwise information (unlike Sigprod). Surprisingly, despite its simplicity, AAC performs on par with domains on protein pairs for which domain information is available. The good performance of AAC, in spite of its strong independence assumptions, may be due to its similarity to the
Compared to tuple features, AAC gave better performance, which was surprising because tuples incorporate ordering information of sequential amino acids. A possible explanation is that grouping amino acids into six categories may be too coarse, so that the tuple features exclude information specific to individual amino acids that is crucial for characterizing protein interactions. Comparison to Sigprod indicated that AAC performed on par on protein pairs without domains, and also on the complete set of protein pairs including those without domains.
On protein pairs with domains, Sigprod features are the best, outperforming all other features (including domains) on at least one dataset. The fact that Sigprod outperforms AAC features is not surprising because it captures more sequential dependency by looking at trimers rather than dimers or monomers. A natural extension of the AAC features would be to look at trimers. However, Sigprod outperforms even domains, which is very surprising because domains represent much longer portions of the amino acid sequence. It is possible that protein interactions do not involve the complete domain, but specific contact points within the domains. Dimers and trimers (Sigprod) are able to capture this crucial contact-point information, thus providing good performance. Sigprod also uses a specialized string kernel, which gives it additional benefits and therefore improved performance. In contrast, we use AAC features with the general-purpose Gaussian kernel. Developing a specialized string kernel for AAC features is a direction for future research.
The value of AAC is evident for protein pairs in which one or both proteins have no domain information. Using a classifier with amino acid composition we were able to predict interactions of several uncharacterized proteins and were able to predict novel function for some of these proteins based on the known annotation of their interacting partners.
The post-hoc analysis of why a simple feature like AAC works so well by itself showed that several of the AAC features were significantly over-represented in domains involved in protein-protein interactions. This indicated that AAC features are likely capturing crucial contact points of the protein domains, and therefore helping in prediction. Although amino acids have been previously shown to have differential concentration in different interaction surfaces
We found that AAC features constitute a non-trivial fraction (26–40%) of the 100 most important features in a classifier using both AAC and domains as features. If protein domains were capturing all the properties of interacting proteins, we would not expect AAC features to be important when used with domains. This is further supported by the observation that only a subset of the AAC features important for interaction prediction were enriched in known interacting domains. This suggests the possibility of certain properties of interacting surfaces that are not fully captured in algorithms that search only for protein domains. AAC features can provide a cue for detecting novel types of protein domains encoding meta-level information important, possibly for docking of a protein partner or presentation of the interaction domain. Such meta-interaction surfaces identified with high confidence can be experimentally verified, leading to the identification of new types of protein domains, that may not be necessarily linear.
Our prediction results using simple amino acid composition have been quite encouraging, and have opened a plethora of questions regarding the information that can be captured at the level of single amino acids and pairs of amino acids. Extending this work to recognize higher-order signals in the proteome, including identification of meta-domains, can provide insight into the causal and mechanistic details of protein interactions.
Prior to prediction of protein interactions, we represent every protein pair in our datasets using binary feature sets corresponding to attributes of protein pairs. These features correspond to AAC features and domain features.
We use two types of features for representing AAC: monomer and dimer features. Monomer features capture composition of individual amino acids, whereas dimer features capture composition of pairs of consecutive amino acids. To generate the monomer features, we first obtain a
To generate the dimer features, we obtain a
The domains are represented as binary features, with each feature identified by the domain name. For yeast proteins, we use domains that are available for download from the Saccharomyces genome database. For fly and worm proteins, we used interproscan domains
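The monomer and dimer features described above can be sketched as follows. The normalized counts follow the description in the text; the discretization into `k` equal-width bins, the feature naming, and the dropping of non-standard residues are assumptions made for illustration, since the exact binning procedure is not reproduced here.

```python
from collections import Counter
from typing import Dict, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def monomer_composition(seq: str) -> Dict[str, float]:
    """Normalized count of each standard amino acid in the sequence."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    counts = Counter(residues)
    n = len(residues)
    return {aa: (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS}


def dimer_composition(seq: str) -> Dict[str, float]:
    """Normalized count of each pair of consecutive amino acids (sparse)."""
    s = seq.upper()
    dimers = [
        s[i : i + 2]
        for i in range(len(s) - 1)
        if s[i] in AMINO_ACIDS and s[i + 1] in AMINO_ACIDS
    ]
    counts = Counter(dimers)
    n = len(dimers)
    return {d: c / n for d, c in counts.items()}


def binarize(composition: Dict[str, float], k: int) -> Set[str]:
    """Turn frequencies into binary features via k equal-width bins on [0, 1].

    Assumption: one feature per (amino acid or dimer, bin) combination;
    zero frequencies produce no feature.
    """
    features = set()
    for name, freq in composition.items():
        if freq > 0:
            bin_index = min(int(freq * k), k - 1)
            features.add(f"{name}_bin{bin_index}")
    return features
```

Domain features would simply be added to the same set as strings identified by domain name, giving one binary feature set per protein.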
We have compared the AAC features against other non-domain, sequence-based features. These features are the tuple features
The signature products are used directly within a support vector machine (SVM) framework where protein pairs are represented using a specialized signature product kernel. This approach first extracts signatures of length
Prediction of protein interactions via binary classifiers is a well-known approach, briefly outlined here. In this approach, each data point corresponds to a protein pair
Protein pairs that represent interacting pairs have
A maximum entropy classifier is a probabilistic classifier for binary classification
Here,
A naive Bayes classifier is similar to a maximum entropy classifier in that it uses the class conditional distribution to assign a protein pair to the interacting (INTR) or non-interacting (NON-INTR) class. However, the form of conditional distribution is given by
The proportionality term,
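As a point of reference, a textbook Bernoulli naive Bayes classifier over binary features can be written in a few lines. This sketch is not necessarily the exact formulation used in this work: the Laplace smoothing parameter `alpha`, the class and method names, and the label encoding (1 for INTR, 0 for NON-INTR) are assumptions.

```python
import math
from typing import Dict, List, Set, Tuple


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha

    def fit(self, examples: List[Tuple[Set[str], int]]) -> "BernoulliNaiveBayes":
        self.vocab: Set[str] = set().union(*(f for f, _ in examples))
        self.n: Dict[int, int] = {0: 0, 1: 0}
        self.count: Dict[int, Dict[str, int]] = {0: {}, 1: {}}
        for feats, label in examples:
            self.n[label] += 1
            for f in feats:
                self.count[label][f] = self.count[label].get(f, 0) + 1
        if self.n[0] == 0 or self.n[1] == 0:
            raise ValueError("both classes must be present in the training set")
        return self

    def _log_joint(self, feats: Set[str], label: int) -> float:
        # log P(class) + sum over features of log P(feature value | class)
        lp = math.log(self.n[label] / (self.n[0] + self.n[1]))
        for f in self.vocab:
            p = (self.count[label].get(f, 0) + self.alpha) / (
                self.n[label] + 2 * self.alpha
            )
            lp += math.log(p if f in feats else 1.0 - p)
        return lp

    def prob_interacting(self, feats: Set[str]) -> float:
        """Posterior probability of the interacting class (label 1)."""
        a = self._log_joint(feats, 1)
        b = self._log_joint(feats, 0)
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        return ea / (ea + eb)
```

The normalization in `prob_interacting` plays the role of the proportionality term: the joint scores of the two classes are rescaled so that they sum to one.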
A support vector machine (SVM) classifier does not estimate class conditional distributions, but rather a maximum margin hyper-plane between the positive and negative examples of the class. The hyper-plane is defined by a set of
We analyzed several protein interaction datasets from yeast, worm and fly (
Dataset | Interactions with domains | Interactions without domains |
AFFMS | 22183 | 2409 |
TWOHYB | 6038 | 1156 |
PCA | 2288 | 292 |
WORM | 14233 | 972 |
FLY | 66259 | 4052 |
All datasets other than WORM were obtained from
We obtained the amino acid sequence for yeast proteins from the Saccharomyces genome database (SGD,
We pre-processed each dataset to ensure that the numbers of positive and negative examples in a dataset were equal. We retained self-interacting pairs in the positive set to allow the prediction of such interactions in the test set. We empirically verified that the self-interacting pairs did not influence classifier performance (
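Balancing the two classes can be done by randomly downsampling the larger one, as in the sketch below. Whether balancing was achieved by downsampling or by sampling additional negatives is not stated here, so the strategy, the function name, and the seed are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def balance(
    positives: Sequence[T], negatives: Sequence[T], seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```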
We evaluated the classifiers using receiver operating characteristic (ROC) curves, which plot the sensitivity of the classifier as a function of the false positive rate
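The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way; the function name and the label encoding (1 for interacting, 0 for non-interacting) are assumptions, and the quadratic loop is meant for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a correct ranking
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of interacting from non-interacting pairs.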
To assess which features were most beneficial for our predictions, we computed a feature importance score,
This score assigns a high positive value to a feature
We obtained the Interproscan IDs of domains predicted to be involved in protein interactions with high confidence from the Domine database (
Supplementary file describing the discretization of amino acids, other methods of enrichment analysis, additional results of the MAXENT classifier, and performance of classifiers with and without self-interacting proteins in the positive set.
(0.05 MB PDF)
SVM AUC means as a function of increasing number of bins (k) for obtaining the AAC monomer features. The standard deviations varied in the range [0.002–0.07]
(0.33 MB TIF)
AUC means of the three classifiers (SVM, Maximum Entropy, NaiveBayes) as a function of increasing number of bins (k). The standard deviations varied in the range [0.02–0.06] for SVM, [0.02–0.07] for Naive Bayes, and [0.02–0.05] for Maximum entropy classifiers.
(0.30 MB TIF)
AUC mean of SVM classifier as a function of increasing number of bins (k) for obtaining the AAC dimer features.
(0.24 MB TIF)
Maximum Entropy classifier performance using AAC or tuple features on protein pairs with and without domains, and the complete dataset.
(0.69 MB TIF)
Performance comparison of the SVM classifier with or without the self-interacting proteins. Classifiers used either AAC monomer or domains as features.
(0.43 MB TIF)
We thank George Davidson for useful discussions, and Shawn Martin for making his signature product algorithm available to us.