Conceived and designed the experiments: SR MRS. Performed the experiments: TM JCM. Analyzed the data: SR TM JCM PG MRS. Contributed reagents/materials/analysis tools: AO PG. Wrote the paper: SR TM MRS.
The authors have declared that no competing interests exist.
Minimotifs are short contiguous peptide sequences in proteins that are known to have a function in at least one other protein. One of the principal limitations in minimotif prediction is that false positives limit the usefulness of this approach. As a step toward resolving this problem we have built, implemented, and tested a new data-driven algorithm that reduces false-positive predictions.
Certain domains and minimotifs are strongly associated with a known cellular process or molecular function. We therefore hypothesized that restricting minimotif predictions to those in which the minimotif-containing protein and the target protein have a related cellular or molecular function makes a prediction more likely to be accurate. This filter was implemented in Minimotif Miner using function annotations from the Gene Ontology. We have also combined two filters that are based on entirely different principles, and this combined filter has better predictive power than its individual components.
Testing these functional filters on known and random minimotifs has revealed that they are capable of separating true motifs from false positives. In particular, for the cellular function filter, the percentage of known minimotifs that are not removed by the filter is ∼4.6 times that of random minimotifs. For the molecular function filter this ratio is ∼2.9. These results, together with the comparison with the published frequency score filter, strongly suggest that the new filters differentiate true motifs from random background with good confidence. A combination of the function filters and the frequency score filter performs better than these two individual filters.
Minimotifs are short contiguous peptide pieces of proteins that have a known biological function. These functions can be categorized into binding, posttranslational modification of the minimotif, and protein trafficking. While there are many known functional minimotifs, predicting a minimotif in a new protein based on a consensus sequence, position-specific scoring matrix, or other algorithms produces many false-positive predictions. This limits the usefulness of minimotif prediction programs such as Minimotif Miner (MnM)
To reduce false positive minimotif predictions, three approaches have been used in MnM
Despite these inter-related approaches, false positives remain a concern; thus, new types of filters are needed. In considering new strategies that might be used to refine minimotif predictions, we noticed that proteins containing a particular domain are often thought to have functions related to similar cellular processes. For example, proteins that contain a PTB or SH2 domain, which binds phosphotyrosine-containing proteins, are typically involved in signaling
The Gene Ontology (GO) database
We have built a syntactical and semantic structure for minimotifs that enables easy integration of minimotif and GO data
For predictions of new minimotifs, the source protein query contains multiple minimotif short sequences that may encode new activities with all of the target proteins. In this paper, the source protein and the set of putative target proteins can be mapped to cellular and molecular functions derived from the GO database to determine whether the source and target proteins share at least one common or similar cellular function and/or molecular function. This approach was first tested on the Minimotif Miner database of experimentally verified minimotifs. Analysis of several variations of the algorithm demonstrates that this approach can reduce false positive minimotif predictions in the test dataset and eliminate many predictions in a set of randomly selected query proteins.
Because minimotif prediction programs produce many false positives, any means of selecting minimotifs with a higher probability of being true is desirable. The molecular function filter provides this advantage. Another important aspect of the filters presented in this paper is that they segregate minimotifs into groups for users. With the cellular function algorithm, users can choose minimotifs for target proteins that are involved in the same cellular process as the query protein or in a different one. For example, if the query protein is involved in cell division, one user may want to look only for minimotif predictions for other proteins involved in cell division, whereas another may want to identify predictions for proteins involved in other cellular processes. We have implemented the same option for molecular functions.
Another important contribution of this paper is the conclusion that more than one filter can be combined into a new filter whose performance is better than that of the individual filters. In particular, we have devised two combinations. The first combines the molecular function filter and the frequency score filter, and the second combines the cellular function filter and the frequency score filter. The new combination filters have much better p-values than their component filters.
To reduce the false positives in the minimotif predictions by MnM using cellular/molecular function information, we obtained functional data from the GO database. We selected the GO database for this purpose because it has the largest ontologies for these functions and records relationships between functions. GO (4/09 release) has 16,698 terms and 32,719 edges for biological processes/cellular functions and 9,309 terms and 9,924 edges for molecular functions. For two neighboring terms, the edge representing their functional relationship is directed from the child term to the broader parent term. Because identical proteins in the MnM and GO databases may have different accession numbers, we used an alias table to map these accession numbers to the cellular/molecular functions of each protein.
To test the effectiveness of the filter algorithms, we ideally needed to compare a dataset of verified minimotifs to known negatives. For experimentally verified minimotifs we used the Minimotif Miner 2 database (MnM 2), for which the total number of minimotifs is ∼5300
We did not have access to known negative minimotifs, so we generated a dataset of “negative” interactions composed of protein pairs that most likely do not interact. There are ∼500,000 known protein-protein interactions for >5,000 total proteins, but if all possible pairwise combinations are considered, the number of true minimotifs is a very small fraction of the total number of possible protein pairs. For example, if there are ∼30,000 proteins for the ∼500,000 interactions, then the total number of possible pairings is 30,000 × 29,999 / 2 = 449,985,000. Thus, it is reasonable to assume that randomly generated protein pairings represent “negative” minimotifs. Therefore, 20,000 random pairs of source proteins and target proteins were sampled. Of these pairs, 3,153 had at least one cellular function and 3,706 had at least one molecular function in the GO dataset. These entries were used as the “negative” datasets.
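The pair-counting argument above can be checked with a short sketch. The function names, the fixed seed, and the choice to draw each pair as two distinct proteins are illustrative assumptions; the text does not specify how the random pairs were drawn.

```python
import math
import random


def count_unordered_pairs(n_proteins: int) -> int:
    """Number of distinct unordered protein pairs: n * (n - 1) / 2."""
    return math.comb(n_proteins, 2)


def sample_random_pairs(protein_ids, n_pairs, seed=0):
    """Draw (source, target) pairs of two distinct proteins.

    Assumption: pairs are drawn independently, so the same pair
    could in principle be drawn more than once.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        source, target = rng.sample(protein_ids, 2)
        pairs.append((source, target))
    return pairs


total_pairs = count_unordered_pairs(30_000)
# Fraction of all possible pairs covered by ~500,000 known interactions.
known_fraction = 500_000 / total_pairs
```

Under these assumptions the known interactions cover only about 0.1% of all possible pairs, which is why a random pair is very unlikely to be a true interaction.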
The basic function filter algorithms test whether at least one common or similar cellular/molecular function is shared by the given minimotif source protein and target protein. Given the minimotif source protein
This algorithm was applied to the above datasets for molecular and cellular functions. To evaluate the efficacy of the algorithms we used two metrics. The percentage for the experimentally verified minimotifs not removed by the filter is the
Sensitivity and selectivity both range from 0% to 100%, with 100% indicating complete recovery of the experimental minimotifs or of the negative minimotifs, respectively. A
The cellular function algorithm had a sensitivity of ∼11% and a selectivity of ∼3%, with a DR of 3.9, indicating that many motifs were recovered and there was a ∼4-fold preference for retaining a verified instance over a randomly selected negative instance. The molecular function algorithm had a sensitivity of ∼29% and a selectivity of ∼13%, with a
While the cellular and molecular function algorithms have value in reducing false positives, the structure of the GO database provided us with an opportunity to vary the stringency of function assignment and optimize these algorithms. GO contains neighborhood information for each cellular/molecular function term. It can be viewed as a graph in which the nodes are cellular function terms or molecular function terms and the edges go from child nodes to parent nodes. The criterion “at least one common function” thus becomes “at least one similar enough function”. That is, the predicted target proteins are restricted to those for which the distance between at least one cellular/molecular function of the target protein and one of the source protein is small enough. We therefore introduced a distance threshold into the basic algorithm.
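A minimal sketch of such a distance-thresholded function filter is given below. It assumes that the distance between two terms is the length, in edges, of the shortest path between them when edge direction is ignored; the text does not spell out the exact distance definition, and all function names and the toy term identifiers are hypothetical.

```python
from collections import deque


def build_undirected(child_to_parents):
    """Turn child -> parents edges into an undirected neighbor map."""
    neighbors = {}
    for child, parents in child_to_parents.items():
        for parent in parents:
            neighbors.setdefault(child, set()).add(parent)
            neighbors.setdefault(parent, set()).add(child)
    return neighbors


def term_distance(neighbors, start, goal, max_distance):
    """Shortest path length between two terms, or None if it exceeds max_distance."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        term, dist = queue.popleft()
        if dist == max_distance:
            continue  # do not expand beyond the threshold
        for nxt in neighbors.get(term, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def passes_function_filter(source_terms, target_terms, neighbors, threshold=0):
    """Retain a prediction if any source/target term pair is within the threshold."""
    return any(
        term_distance(neighbors, s, t, threshold) is not None
        for s in source_terms
        for t in target_terms
    )


# Toy ontology (hypothetical terms): a and c are children of b; b is a child of root.
toy_go = build_undirected({"a": ["b"], "c": ["b"], "b": ["root"]})
```

With a threshold of 0 this reduces to the basic "at least one common function" test; larger thresholds admit progressively more distant terms.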
The extended algorithm works as follows: given the distance threshold
The basic algorithms used a distance threshold of 0; here we tested 5 additional distance thresholds of 1, 2, 3, 4, and 5. Results from the evaluation of the cellular function filter are shown in
Distance | sensitivity | selectivity | DR |
0 | 11% | 3% | 3.8 |
1 | 26% | 6% | 4.6 |
2 | 48% | 14% | 3.4 |
3 | 65% | 32% | 2.0 |
4 | 82% | 58% | 1.4 |
5 | 90% | 79% | 1.2 |
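The quantities tabulated above can be sketched as follows. Sensitivity is taken as the percentage of verified minimotifs retained by a filter and selectivity as the percentage of negative entries retained; treating DR as the ratio of the two is an assumption inferred from the statement that the retained percentage of known minimotifs is ∼4.6 times that of random minimotifs. The counts in the example are invented for illustration and are not taken from the datasets.

```python
def retained_percentage(n_retained: int, n_total: int) -> float:
    """Percentage of entries that the filter does not remove."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_retained / n_total


def filter_metrics(pos_retained, pos_total, neg_retained, neg_total):
    """Return (sensitivity %, selectivity %, DR) for one filter setting.

    Assumption: DR is sensitivity divided by selectivity.
    """
    sensitivity = retained_percentage(pos_retained, pos_total)
    selectivity = retained_percentage(neg_retained, neg_total)
    dr = sensitivity / selectivity if selectivity > 0 else float("inf")
    return sensitivity, selectivity, dr


# Hypothetical counts: 50 of 200 verified and 10 of 200 negative entries retained.
sensitivity, selectivity, dr = filter_metrics(50, 200, 10, 200)
```

Because the percentages in the table are rounded, recomputing DR from them will not necessarily reproduce the tabulated DR values exactly.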
To test the statistical significance of the filters we have used ROC curves and p-values. We have employed the programs of the R project
ROC curves for the molecular (A) and cellular (B) function filters, as well as the frequency score filter are shown. Analysis was with the minimotifs in the MnM 2 database that have known molecular and cellular functions in the GO database (A,B).
Results from the evaluation of the molecular function filter are shown in
distance | sensitivity | selectivity | DR |
0 | 29% | 12% | 2.3 |
1 | 59% | 21% | 2.9 |
2 | 82% | 35% | 2.3 |
3 | 91% | 50% | 1.8 |
4 | 94% | 61% | 1.6 |
5 | 96% | 72% | 1.3 |
Both filters have value in reducing false positives in the test datasets, and the stringency of predictions can be controlled by selecting distances between 0 and 3; the performance of the algorithms degrades at distance values above 3. These results indicate that the filters differentiate verified data from negative data with good confidence and strongly suggest that, when predicting novel minimotifs, these filters would help to decrease the number of false-positive predictions.
We wanted to compare the performance of the new filters with one of the already existing MnM filters, namely, the frequency score filter. To begin with we have plotted the ROC curve for the frequency score filter. This ROC curve is shown in
Cellular Function | Molecular Function | Frequency Score | MF-FS Combination | CF-FS Combination | |
0.72 | 0.83 | 0.72 | 0.89 | 0.87 | |
0.12 | 0.03 | 0.08 | 0.002 | 0.0002 |
A novel contribution of this paper is the conclusion that a combination of several filters can yield a better predictability than the individual filters. In particular, we have devised two combination filters. The first combination filter employs the molecular function and the frequency score filters. Note that these two filters are based on two different principles. The frequency score is based on the number of occurrences of the predicted motif whereas the molecular function filter is based on whether the source and target proteins share a common molecular function. Our tests of the combined filter indicate that the combined filter has a better p-value than the two individual filters.
We employed an either-or combination of the molecular function filter and the frequency score filter. Because the two filters focus on different aspects of a prediction, we expected them to complement each other, so that the combined filter may outperform either one alone. Given a motif in a source protein, together with its target protein, the combined filter examines whether the source and target proteins are retained by the molecular function filter and whether the motif and source are retained by the frequency score filter. If either filter retains the prediction, the combined filter retains it.
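The either-or rule can be written as a small sketch. The `Prediction` record and the two toy predicates are hypothetical stand-ins; the actual function and frequency score filters are not reproduced here.

```python
from typing import Callable, NamedTuple


class Prediction(NamedTuple):
    motif: str
    source: str
    target: str


Filter = Callable[[Prediction], bool]


def combine_either_or(function_filter: Filter, frequency_filter: Filter) -> Filter:
    """Build a filter that retains a prediction if either component retains it."""
    def combined(prediction: Prediction) -> bool:
        return function_filter(prediction) or frequency_filter(prediction)
    return combined


# Toy stand-ins for the two component filters (illustration only).
def toy_function_filter(p: Prediction) -> bool:
    return p.source[0] == p.target[0]


def toy_frequency_filter(p: Prediction) -> bool:
    return len(p.motif) >= 4


combined_filter = combine_either_or(toy_function_filter, toy_frequency_filter)
```

An either-or rule can only retain more predictions than each component alone, so it raises sensitivity at the possible cost of retaining more negatives.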
This idea was tested on the same positive and negative datasets. The positive dataset already consists of experimentally verified entries, each comprising a motif, its source protein, and the associated target protein. For the negative dataset, which consists of 20,000 random protein pairs, we submitted one protein of each pair to Minimotif Miner (MnM)
Combination of molecular function and frequency score filters (A) and combination of cellular function and frequency score filters (B) are shown. These ROC curves have been obtained by combining the two pairs of filters on an either-or basis.
thresholds | ||||
0.02 | 0.03 | 0.04 | ||
28% | 28% | 28% | ||
63% | 63% | 63% | ||
88% | 88% | 88% | ||
19% | 16% | 15% | ||
27% | 24% | 23% | ||
41% | 39% | 38% |
The second combination filter employs the cellular function filter and the frequency score filter in the same way. Because the cellular function filter is more stringent, five distances (0, 1, 2, 3, 4) were used, together with the same three thresholds (0.02, 0.03, 0.04) for the frequency score filter. As a result, fifteen parameter settings were formed for this combination of the cellular function and frequency score filters. To smooth the ROC curve, very small noise values, no larger than 6.743894e−11, were also added. The prediction of this combination is shown in
thresholds | ||||
0.02 | 0.03 | 0.04 | ||
17% | 17% | 17% | ||
44% | 44% | 44% | ||
75% | 75% | 75% | ||
88% | 88% | 88% | ||
95% | 95% | 95% | ||
9% | 6% | 5% | ||
12% | 9% | 8% | ||
20% | 17% | 16% | ||
37% | 34% | 34% | ||
62% | 60% | 60% |
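The fifteen settings used for the combination of the cellular function and frequency score filters are simply every pairing of a distance with a frequency score threshold. The sketch below enumerates them; the `evaluate` callback is a hypothetical placeholder for whatever procedure scores one setting.

```python
from itertools import product

DISTANCES = (0, 1, 2, 3, 4)
FREQUENCY_THRESHOLDS = (0.02, 0.03, 0.04)


def parameter_grid(distances=DISTANCES, thresholds=FREQUENCY_THRESHOLDS):
    """All (distance, frequency threshold) settings, distance-major order."""
    return list(product(distances, thresholds))


def evaluate_grid(evaluate, grid):
    """Apply a user-supplied evaluate(distance, threshold) to every setting."""
    return {setting: evaluate(*setting) for setting in grid}


grid = parameter_grid()  # 5 distances x 3 thresholds
```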
We have implemented these new filters with the other filters on the MnM 2 website (
All filters in this paper are now included as part of the MnM website. The option to select minimotifs that have similar or dissimilar functions is implemented.
We wanted to examine how many predicted minimotifs were filtered by the algorithms. We ran the filters on p53, Cyclin A, and MSH2, which have different molecular and cellular functions (22 more proteins were tested and are shown in
Protein | RefSeq | Threshold | Cellular function: Total* | Cellular function: Retained | Molecular function: Total* | Molecular function: Retained |
p53 | NP_035770 | 0 | 64 | 10 | 67 | 46 |
p53 | NP_035770 | 1 | 64 | 33 | 67 | 53 |
p53 | NP_035770 | 2 | 64 | 52 | 67 | 63 |
p53 | NP_035770 | 3 | 64 | 61 | 67 | 64 |
p53 | NP_035770 | 4 | 64 | 64 | 67 | 65 |
p53 | NP_035770 | 5 | 64 | 64 | 67 | 65 |
Cyclin A | NP_003905 | 0 | 81 | 3 | 82 | 38 |
Cyclin A | NP_003905 | 1 | 81 | 6 | 82 | 51 |
Cyclin A | NP_003905 | 2 | 81 | 23 | 82 | 65 |
Cyclin A | NP_003905 | 3 | 81 | 40 | 82 | 69 |
Cyclin A | NP_003905 | 4 | 81 | 64 | 82 | 72 |
Cyclin A | NP_003905 | 5 | 81 | 77 | 82 | 75 |
MSH2 | NP_000242 | 0 | 76 | 8 | 80 | 25 |
MSH2 | NP_000242 | 1 | 76 | 15 | 80 | 52 |
MSH2 | NP_000242 | 2 | 76 | 34 | 80 | 66 |
MSH2 | NP_000242 | 3 | 76 | 62 | 80 | 74 |
MSH2 | NP_000242 | 4 | 76 | 73 | 80 | 76 |
MSH2 | NP_000242 | 5 | 76 | 75 | 80 | 77 |
*Totals do not include minimotifs for which no GO terms are assigned to the proteins.
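The proportion of predictions removed at a given threshold follows directly from the table. The sketch below uses the threshold-0 rows and assumes that the number preceding each "Retained" value is the corresponding total, as the footnote suggests; the function and variable names are illustrative.

```python
# (protein, cellular total, cellular retained, molecular total, molecular retained)
# Values copied from the threshold-0 rows of the table above.
THRESHOLD_0_ROWS = [
    ("p53", 64, 10, 67, 46),
    ("Cyclin A", 81, 3, 82, 38),
    ("MSH2", 76, 8, 80, 25),
]


def percent_removed(total: int, retained: int) -> float:
    """Percentage of predictions eliminated by a filter."""
    return 100.0 * (total - retained) / total


removed = {
    protein: {
        "cellular": percent_removed(c_total, c_retained),
        "molecular": percent_removed(m_total, m_retained),
    }
    for protein, c_total, c_retained, m_total, m_retained in THRESHOLD_0_ROWS
}

for protein, values in removed.items():
    print(f"{protein}: cellular {values['cellular']:.1f}% removed, "
          f"molecular {values['molecular']:.1f}% removed")
```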
It is important to increase the efficiency and specificity of minimotif prediction. Many minimotif filters increase the specificity of minimotif predictions. Over time the collective use of a set of well-developed filters such as the ones we present here will lead to accurate computational tools. This is not just true for minimotifs, but for transcription factor binding sites as well. Incremental development of algorithms is a standard in computational biology.
We have reported two new filters for the elimination of false positives in minimotif predictions. Our testing results reveal that these filters are indeed effective. The use of these filters seems to be a logical approach for reducing false positives. If two proteins are involved in the same cellular or molecular function, they may be in the same or redundant pathways. However, if one contains a minimotif that is the target of another protein in the pathway, then this provides a second piece of data suggesting a functional relationship between the two proteins.
The cellular function filter eliminated 84–96% of the predictions for the three proteins we tested at a distance threshold of 0. This is the most stringent of the filters designed for MnM so far. The frequency filter, surface prediction filter, and evolutionary conservation filter all showed a preference for filtering false positives, but not to the extent seen for the cellular function filter. The molecular function filter, while not as stringent as the cellular function filter, also performed better than previous filters implemented in MnM. This suggests that other data-driven minimotif filters, used by themselves or in combination, may provide a good approach for reducing false positives. This does come at a cost, as a percentage of true minimotifs may also be filtered out.
We have been running Minimotif Miner for 4 years. One of the major difficulties for users is that, when a list of potential target names is presented to them, most scientists do not have the knowledge base to understand all of the different functions of the potential targets, which makes it difficult to select minimotifs for experimental testing. The new functional filters help to alleviate this problem by restricting the predictions to those functions that are related to the query protein. When a user wants to discover new functions of their query, they can use the “exclude” filter to identify only those functions that were not previously related to the query. In conclusion, the functional filters provide a valuable tool for reducing false-positive predictions of minimotifs.
Supporting information document.
(0.23 MB DOC)
We would like to thank the Minimotif Miner team for daily input in preparation of the data for this paper.