The authors have declared that no competing interests exist.
Conceived and designed the experiments: CB CD OD. Performed the experiments: CB. Analyzed the data: CB CD OD. Contributed reagents/materials/analysis tools: CB CD OD. Wrote the paper: CB CD OD. Developed the semantic particularity method: CB OD.
Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing the genes' molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity.
We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content, based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure.
Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies.
With the continued advance of high-throughput technologies, genetic and genomic data analyses are outputting large sets of genes. The amount of data involved requires automated comparison methods
Within a given gene set, the genes sharing identical or similar GO annotations can be grouped into clusters using two approaches
Semantic similarity measures rely on ontologies to systematically quantify the weight of the shared elements. They exploit the formal representation of the meaning of the terms by considering the relations between the terms (e.g. for inferring new annotations that were implicit as each term inherits all the properties of its ancestors) and by attributing different weights to each term depending on how much information they convey. When working with annotation databases, it should be routine practice to use the ontology hierarchy to infer implicit annotation
Node-based semantic similarity measures rely on how informative the terms are. Typically, they consider that two terms sharing an informative lowest common ancestor are more similar than two terms with a less informative lowest common ancestor. Historically, Information Content (IC) value was used to quantify how informative a term is, with the least frequent terms having the highest IC value. This concept, borrowed from Shannon's Information Theory
Edge-based semantic similarity measures use the directed graph topology to compute distances between the terms to compare. Rada distance is based on the shortest path between the two terms
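As an illustration of the edge-based idea, the following minimal Python sketch counts the edges on the shortest path between two terms with a breadth-first search. The toy hierarchy and the function name `shortest_path_length` are hypothetical, and ignoring edge direction is an assumption of this sketch rather than a statement of Rada's exact definition.

```python
from collections import deque

# Toy is_a hierarchy (hypothetical, much simpler than the real GO graph):
# each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "RNA binding": ["binding"],
    "protein binding": ["binding"],
    "tRNA binding": ["RNA binding"],
}


def shortest_path_length(parents, a, b):
    """Number of edges on the shortest path between a and b (direction ignored)."""
    # Build an undirected adjacency view of the directed graph.
    neighbours = {term: set(ps) for term, ps in parents.items()}
    for child, ps in parents.items():
        for parent in ps:
            neighbours.setdefault(parent, set()).add(child)

    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected


print(shortest_path_length(PARENTS, "tRNA binding", "protein binding"))
```

In this toy graph the two terms meet at "binding", so the path has three edges; a shorter path means the terms are considered closer.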
Pesquita et al. also identified “hybrid” methods that combine different aspects of node-based and edge-based methods. In Wang's method
Pesquita et al. do not single out any particular semantic similarity measure as the best one, as the optimal measure depends on the data to compare and the level of detail expected in the results. The main advantage of Wang's method compared to purely node-based methods is that the semantic value is not GOA-dependent, unlike information content. It is thus well-suited to cross-species comparisons. As cross-species comparison is one of the key challenges in biology, further development in the domain of semantic comparison should support such comparisons.
All the semantic similarity measures appear appropriate for identifying and quantifying common features. However, as these measures focus on common features, they may lead to an incomplete analysis when comparing genes having particular features alongside similar ones
Common terms between species are displayed in blue. The terms annotating only the human ortholog are displayed in red. Part A of this figure displays the MF annotations of the human and rat orthologs of Exportin-5. Part B displays the MF annotations of the human and drosophila orthologs of Exportin-5. In this example, there is neither a rat-specific nor a drosophila-specific term. The semantic similarity values obtained in these cases do not reflect the difference of human particularity between each part.
We assume that considering only similarity measures is not enough to compare sets of annotations. This analysis is valid for any set of annotations that refer to an ontology. We hypothesize that gene set analysis can be improved by considering gene particularities in addition to gene similarities. We propose a general definition and some associated formal properties. We also propose a new approach based on the notion of GO term informativeness to compute gene set particularities.
The semantic particularity of a set compared to another is the value that reflects the importance of the features that belong to the first set but not the second. To compare two genes, we rely on the similarity and the respective particularities of their sets of annotations. The particularity of a gene g1 annotated by the set Sg1 compared to a gene g2 annotated by the set Sg2 depends on the annotations of Sg1 that are not related to any annotation of Sg2.
As for semantic similarity, we compute a value bounded by 0 (least particular) and 1 (most particular). Four important properties arise from the semantic particularity definition:
The semantic particularity is non-symmetric:
Par(Sg1, Sg2) = x ⇏ Par(Sg2, Sg1) = x
Compared to itself, a set of annotations has no semantic particularity:
Par(Sg1, Sg1) = 0
If Sg1 =
The semantic particularity of a set of annotations Sg1 (
Par(Sg1,
And conversely:
Par(
The particularity of a set Sg1 of annotations compared to a set Sg2 does not depend on the elements of Sg2 that do not belong to Sg1:
Sg3
In order to compute the particularity of Sg1 compared to Sg2, we focus on the terms of Sg1 that are not members of Sg2. This requires addressing two problems: the terms are not independent, and they do not convey the same amount of information.
Some of the terms of Sg1 that are not members of Sg2 may be linked in the graph. Taking several linked terms into account would result in considering them several times. For example, in
Using set theory, we could define Par(Sg1, Sg2) as the proportion of elements of Sg1 that belong to MPT(Sg1, Sg2). When computing card(MPT(Sg1, Sg2)), all the elements have the same weight. However, considering the semantics underlying these elements, some of them may be more informative than others and should ideally be emphasized. Different strategies, similar to those already proposed for the computation of the semantic similarity, can be applied.
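The purely set-based variant described above can be sketched as follows. This is only an illustration under stated assumptions: the toy hierarchy is hypothetical, Sg* denotes the annotation set expanded with all inferred ancestors, and MPT(Sg1, Sg2) is taken here to be the terms of Sg1* that are absent from Sg2* and have no descendant in Sg1*, so that linked terms are counted only once.

```python
# Toy is_a hierarchy (hypothetical, not the real GO graph): term -> parents.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "nucleic acid binding": ["binding"],
    "RNA binding": ["nucleic acid binding"],
    "tRNA binding": ["RNA binding"],
}


def ancestors(term):
    """All strict ancestors of a term."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS[t])
    return found


def expand(terms):
    """Sg*: the annotation set together with every inferred ancestor."""
    expanded = set(terms)
    for t in terms:
        expanded |= ancestors(t)
    return expanded


def most_particular_terms(sg1, sg2):
    """Assumed MPT: terms of Sg1* absent from Sg2* with no descendant in Sg1*."""
    s1, s2 = expand(sg1), expand(sg2)
    particular = s1 - s2
    return {t for t in particular
            if not any(t in ancestors(other) for other in s1)}


def set_based_particularity(sg1, sg2):
    """Proportion of the elements of Sg1 that belong to MPT(Sg1, Sg2)."""
    return len(most_particular_terms(sg1, sg2)) / len(sg1)


human = {"tRNA binding", "protein binding"}
other = {"protein binding"}
print(most_particular_terms(human, other))    # only the deepest particular term
print(set_based_particularity(human, other))  # 0.5
print(set_based_particularity(other, human))  # 0.0, the measure is not symmetric
```

Every particular term weighs the same here, whatever its depth, which is the limitation that motivates weighting by informativeness.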
We then define PI(Sg1, Sg2), the particular informativeness of a set of GO terms Sg1 compared to another set of GO terms Sg2, as the sum of the differences between the informativeness (I) of each term
In the
Finally, we normalize PI to compute Par(Sg1, Sg2), the semantic particularity of the set of GO terms Sg1 compared to the set of GO terms Sg2. We define MCT(Sg1, Sg2), the set of the most informative common terms of Sg1 and Sg2, as the set of the terms belonging to the intersection of Sg1* and Sg2* that do not have any descendant either in Sg1* or in Sg2*. In the
For the example of the
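To make the informativeness-weighted computation concrete, here is a minimal Python sketch. Several points are assumptions of this sketch and not formulas confirmed by the surviving text: each most particular term contributes I(t) minus the informativeness of its most informative ancestor found in Sg2*, and PI is normalized as PI / (PI + sum of I over MCT), which keeps the value between 0 and 1 and gives 0 for a set compared to itself. The toy hierarchy and the values in `INFO` are hypothetical.

```python
# Toy is_a hierarchy (hypothetical, not the real GO graph): term -> parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein binding": ["binding"],
    "RNA binding": ["binding"],
    "tRNA binding": ["RNA binding"],
}
# Hypothetical informativeness values I(t).
INFO = {"root": 0.0, "binding": 1.0, "protein binding": 2.0,
        "RNA binding": 2.0, "tRNA binding": 3.0}


def ancestors(term):
    """All strict ancestors of a term."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS[t])
    return found


def expand(terms):
    """Sg*: the annotation set plus every inferred ancestor."""
    expanded = set(terms)
    for t in terms:
        expanded |= ancestors(t)
    return expanded


def leaves_of(candidates, *expanded_sets):
    """Candidates that have no descendant in any of the given expanded sets."""
    pool = set().union(*expanded_sets)
    return {t for t in candidates
            if not any(t in ancestors(other) for other in pool)}


def particular_informativeness(sg1, sg2):
    """PI(Sg1, Sg2) under the assumptions stated in the lead-in."""
    s1, s2 = expand(sg1), expand(sg2)
    mpt = leaves_of(s1 - s2, s1)  # assumed definition of MPT
    total = 0.0
    for t in mpt:
        common = ancestors(t) & s2  # ancestors of t shared with Sg2*
        total += INFO[t] - max((INFO[c] for c in common), default=0.0)
    return total


def particularity(sg1, sg2):
    """Par(Sg1, Sg2) = PI / (PI + sum of I over MCT); assumed normalization."""
    s1, s2 = expand(sg1), expand(sg2)
    pi = particular_informativeness(sg1, sg2)
    mct = leaves_of(s1 & s2, s1, s2)  # MCT as worded in the text
    denominator = pi + sum(INFO[t] for t in mct)
    return pi / denominator if denominator > 0 else 0.0


human = {"tRNA binding", "protein binding"}
other = {"protein binding"}
print(particularity(human, other))  # 0.5
print(particularity(other, human))  # 0.0
```

With these toy values, "tRNA binding" contributes 3.0 - 1.0 = 2.0 and the only most informative common term is "protein binding" (2.0), hence 2.0 / (2.0 + 2.0).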
Several measures of informativeness have been proposed. The widely used Information Content (IC) family depends on an annotation corpus (e.g. GOA). The IC of a term t is its negative log probability
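A minimal Python sketch of a corpus-based IC follows. It assumes the usual convention that the probability of a term is the share of annotations made to the term or to any of its descendants, and it uses the natural logarithm; the annotation counts and the toy hierarchy are hypothetical and do not come from GOA.

```python
import math

# Toy is_a hierarchy (hypothetical): term -> parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein binding": ["binding"],
    "RNA binding": ["binding"],
}
# Hypothetical corpus: number of annotations made directly to each term.
DIRECT_COUNTS = {"root": 0, "binding": 2, "protein binding": 5, "RNA binding": 1}


def ancestors(term):
    """All strict ancestors of a term."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS[t])
    return found


def cumulative_count(term):
    """Annotations made to the term itself or to any of its descendants."""
    return sum(n for t, n in DIRECT_COUNTS.items()
               if t == term or term in ancestors(t))


def information_content(term, root="root"):
    """IC(t) = -log p(t), with p(t) estimated from the corpus."""
    p = cumulative_count(term) / cumulative_count(root)
    return -math.log(p)  # raises ValueError if the term never occurs (p == 0)


for term in PARENTS:
    print(term, information_content(term))
```

The rarely used "RNA binding" gets a higher IC than the frequent "protein binding", and the root, which subsumes every annotation, gets an IC of zero.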
In the context of GO terms comparison, the probability of occurrence of a term
The alternative approach is corpus-independent. A term's informativeness is a function of its distance to the root. It is typically used when a relevant corpus cannot be computed (for comparing elements from several species) or does not exist (for poorly studied species). Wang's Semantic Value (SV) computes this type of informativeness. The relevance of the results obtained by this approach has previously been demonstrated
Then, for each target term to compare, the semantic value is the sum of the semantic contributions of all its ancestors:
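The sketch below illustrates this sum on a toy hierarchy. It follows Wang's scheme as commonly described: the target term contributes 1, and each ancestor contributes the largest product of edge weights along a path from the target. The weights 0.8 (is_a) and 0.6 (part_of) are the values usually quoted for Wang's method and should be read as assumptions here, as should the hypothetical hierarchy and function names.

```python
# Toy hierarchy (hypothetical): term -> list of (parent, relation).
PARENTS = {
    "root": [],
    "binding": [("root", "is_a")],
    "RNA binding": [("binding", "is_a")],
    "tRNA binding": [("RNA binding", "is_a")],
}
# Assumed edge weights (values usually quoted for Wang's method).
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(target):
    """Semantic contribution of the target and of each of its ancestors."""
    contribution = {target: 1.0}
    frontier = [target]
    while frontier:
        term = frontier.pop()
        for parent, relation in PARENTS[term]:
            candidate = WEIGHTS[relation] * contribution[term]
            # Keep the largest contribution when several paths reach an ancestor.
            if candidate > contribution.get(parent, 0.0):
                contribution[parent] = candidate
                frontier.append(parent)
    return contribution


def semantic_value(target):
    """SV(target): sum of the semantic contributions of the target and its ancestors."""
    return sum(s_values(target).values())


for term in PARENTS:
    print(term, round(semantic_value(term), 3))
```

For "tRNA binding" the contributions are 1, 0.8, 0.64 and 0.512, so deeper terms accumulate a larger semantic value without any reference to an annotation corpus.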
As shown in the
To study the benefits of our approach over an analysis based only on similarity, we considered three biological cases. In order to determine if we could extend Wang's initial results, our first use case was
In all these cases, we used the GOSemSim R package to compute Lin's similarity and to provide IC tables used in the computation of the IC-based particularity
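For readers unfamiliar with Lin's measure, the term-level formula is sim(t1, t2) = 2 x IC(MICA) / (IC(t1) + IC(t2)), where MICA is the most informative common ancestor. The Python sketch below illustrates it; it is not GOSemSim code, the IC table and hierarchy are hypothetical, and the aggregation from term-level to gene-level scores is not shown.

```python
# Toy is_a hierarchy (hypothetical): term -> parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein binding": ["binding"],
    "RNA binding": ["binding"],
    "tRNA binding": ["RNA binding"],
}
# Hypothetical IC table (a real one would be derived from an annotation corpus).
IC = {"root": 0.0, "binding": 0.5, "protein binding": 1.2,
      "RNA binding": 2.0, "tRNA binding": 3.0}


def ancestors_or_self(term):
    """The term together with all of its ancestors."""
    found = {term}
    stack = list(PARENTS[term])
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS[t])
    return found


def lin_similarity(t1, t2):
    """Lin's term-level similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    common = ancestors_or_self(t1) & ancestors_or_self(t2)
    ic_mica = max(IC[t] for t in common)
    denominator = IC[t1] + IC[t2]
    # The formula is undefined when both ICs are zero; 0.0 is an arbitrary choice.
    return 2 * ic_mica / denominator if denominator > 0 else 0.0


print(lin_similarity("tRNA binding", "RNA binding"))      # shares an informative ancestor
print(lin_similarity("tRNA binding", "protein binding"))  # only shares "binding"
```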
We first tested our approach on the example chosen by Wang
Wang's conclusions remained true: we can still distinguish the three groups of genes involved in the three main steps of tryptophan degradation. Similarity values for the group [ARO8, ARO9] involved in the first step were 0.92. Similar results were observed for the group [ARO10, PDC6, PDC5, PDC1] involved in the second step and for the group [SFA1, ADH5, ADH4, ADH3, ADH2, ADH1] involved in the last step. The similarities measured between genes of two different groups (“inter-group measures”) were greater than in Wang's original study but remained lower than the intra-group comparison measures. We found the same three groups as Wang. These groups are biologically relevant because they are involved in the three steps of
We completed the previous results with the measures of semantic particularity, using Wang's Semantic Value as informativeness (
Our approach also identified a characteristic of the compared genes that the similarity ignored. Indeed, some of the genes belonging to the same group also have some particular functions (i.e. high similarity and relatively high particularity). For example, all the genes of the third group are similar. However,
The particularity of 0.388 for SFA1 compared to ADH4 is explained notably by the term “nucleotide binding”, whose closest ancestor among the ADH4 annotations is three edges away. The other red terms also contribute to this particularity.
The similarity values show that Wang's results are still valid. We also identified a benefit of using a particularity measure in addition to a similarity measure for identifying particular functions between similar genes.
In the previous case, we found an example of a relatively high particularity value between similar genes. In this second case, we aim to study a larger dataset in order to determine the frequency and the importance of this situation. We used a dataset composed of 51 well-annotated human genes involved in the aquaporin-mediated transport pathway for
BP | S-value-based particularity | | | | IC-based particularity | | | |
Similarity | Average | Std dev. | Min | Max | Average | Std dev. | Min | Max |
 | 0.401 | 0.2 | 0.013 | 0.844 | 0.562 | 0.223 | 0 | 0.904 |
 | 0.386 | 0.174 | 0 | 0.794 | 0.532 | 0.284 | 0 | 0.89 |
 | 0.347 | 0.199 | 0 | 0.707 | 0.497 | 0.244 | 0 | 0.886 |
 | 0.352 | 0.198 | 0 | 0.798 | 0.502 | 0.241 | 0 | 0.895 |
 | 0.315 | 0.203 | 0 | 0.671 | 0.495 | 0.208 | 0 | 0.794 |
 | 0.292 | 0.145 | 0 | 0.629 | 0.437 | 0.25 | 0 | 0.882 |
 | 0.299 | 0.162 | 0 | 0.615 | 0.439 | 0.258 | 0 | 0.876 |
 | 0.229 | 0.15 | 0 | 0.529 | 0.451 | 0.216 | 0.039 | 0.839 |
 | 0.228 | 0.166 | 0 | 0.631 | 0.403 | 0.239 | 0 | 0.859 |
 | 0.22 | 0.145 | 0 | 0.501 | 0.35 | 0.233 | 0 | 0.727 |
 | 0.202 | 0.108 | 0 | 0.482 | 0.403 | 0.207 | 0 | 0.775 |
 | 0.178 | 0.118 | 0 | 0.563 | 0.319 | 0.222 | 0 | 0.671 |
 | 0.177 | 0.106 | 0 | 0.418 | 0.31 | 0.209 | 0.043 | 0.646 |
 | 0.125 | 0.071 | 0 | 0.327 | 0.258 | 0.184 | 0 | 0.589 |
 | 0.105 | 0.131 | 0 | 0.418 | 0.201 | 0.136 | 0 | 0.625 |
 | 0.061 | 0.066 | 0 | 0.248 | 0.179 | 0.123 | 0 | 0.651 |
 | 0.039 | 0.061 | 0 | 0.211 | 0.207 | 0.156 | 0 | 0.614 |
 | 0.041 | 0.067 | 0 | 0.248 | 0.193 | 0.181 | 0 | 0.572 |
 | 0.032 | 0.041 | 0 | 0.111 | 0.099 | 0.076 | 0 | 0.196 |
 | 0.005 | 0.006 | 0 | 0.015 | 0.077 | 0.152 | 0 | 0.519 |
This table gives the average, standard deviation, minimum and maximum particularity values for the BP comparisons of case 2. The 20 categories contain all the results that range from a similarity of 0.5 to 0.999 with steps of 0.025.
MF | S-value-based particularity | | | | IC-based particularity | | | |
Similarity | Average | Std dev. | Min | Max | Average | Std dev. | Min | Max |
 | 0.341 | 0.26 | 0 | 0.798 | 0.494 | 0.162 | 0.296 | 0.701 |
 | 0.35 | 0.219 | 0 | 0.818 | 0.429 | 0.212 | 0 | 0.703 |
 | 0.364 | 0.32 | 0 | 0.731 | 0.422 | 0.265 | 0 | 0.849 |
 | 0.382 | 0.265 | 0 | 0.694 | 0.378 | 0.148 | 0.125 | 0.591 |
 | 0.242 | 0.079 | 0.132 | 0.47 | 0.397 | 0.205 | 0 | 0.81 |
 | 0.207 | 0.113 | 0 | 0.531 | 0.302 | 0.145 | 0.158 | 0.475 |
 | 0.281 | 0.106 | 0.117 | 0.482 | 0.609 | 0.137 | 0.13 | 0.806 |
 | 0.223 | 0.181 | 0 | 0.562 | 0.453 | 0.249 | 0 | 0.763 |
 | 0.26 | 0.267 | 0 | 0.564 | 0.389 | 0.248 | 0 | 0.806 |
 | 0.179 | 0.176 | 0 | 0.482 | 0.419 | 0.211 | 0 | 0.763 |
 | 0.171 | 0.177 | 0 | 0.371 | 0.315 | 0.216 | 0 | 0.643 |
 | 0.125 | 0.167 | 0 | 0.482 | 0.33 | 0.241 | 0 | 0.777 |
 | 0.063 | 0.056 | 0 | 0.137 | 0.239 | 0.218 | 0 | 0.574 |
 | 0.119 | 0.13 | 0 | 0.415 | 0.316 | 0.222 | 0 | 0.574 |
 | 0.041 | 0.036 | 0 | 0.116 | 0.266 | 0.175 | 0 | 0.531 |
 | 0.045 | 0.05 | 0 | 0.126 | 0.179 | 0.093 | 0.086 | 0.272 |
 | 0.024 | 0.025 | 0 | 0.055 | 0.163 | 0.153 | 0 | 0.388 |
 | 0.02 | 0.026 | 0 | 0.086 | 0.09 | 0.107 | 0 | 0.272 |
 | 0.005 | 0.007 | 0 | 0.023 | - | - | - | - |
 | - | - | - | - | - | - | - | - |
This table gives the average, standard deviation, minimum and maximum particularity values for the MF comparisons of case 2. The 20 categories contain all the results that range from a similarity of 0.5 to 0.999 with steps of 0.025. A “-” value denotes an empty category.
CC | S-value-based particularity | | | | IC-based particularity | | | |
Similarity | Average | Std dev. | Min | Max | Average | Std dev. | Min | Max |
 | 0.353 | 0.233 | 0 | 0.846 | 0.621 | 0.244 | 0 | 0.911 |
 | 0.36 | 0.214 | 0 | 0.819 | 0.707 | 0.15 | 0.185 | 0.977 |
 | 0.33 | 0.187 | 0 | 0.799 | 0.64 | 0.202 | 0 | 0.897 |
 | 0.341 | 0.185 | 0 | 0.752 | 0.613 | 0.194 | 0 | 0.896 |
 | 0.317 | 0.183 | 0 | 0.754 | 0.621 | 0.165 | 0 | 0.888 |
 | 0.268 | 0.18 | 0 | 0.706 | 0.592 | 0.207 | 0 | 0.852 |
 | 0.28 | 0.177 | 0 | 0.656 | 0.553 | 0.227 | 0 | 0.888 |
 | 0.24 | 0.177 | 0 | 0.583 | 0.495 | 0.241 | 0 | 0.845 |
 | 0.13 | 0.159 | 0 | 0.543 | 0.466 | 0.24 | 0 | 0.825 |
 | 0.196 | 0.151 | 0 | 0.579 | 0.428 | 0.268 | 0 | 0.82 |
 | 0.134 | 0.122 | 0 | 0.484 | 0.383 | 0.246 | 0 | 0.819 |
 | 0.15 | 0.127 | 0 | 0.489 | 0.391 | 0.267 | 0 | 0.768 |
 | 0.144 | 0.093 | 0 | 0.269 | 0.19 | 0.187 | 0 | 0.625 |
 | 0.133 | 0.123 | 0 | 0.421 | 0.352 | 0.231 | 0 | 0.73 |
 | 0.146 | 0.152 | 0 | 0.373 | 0.255 | 0.216 | 0 | 0.624 |
 | 0.051 | 0.051 | 0 | 0.11 | 0.145 | 0.152 | 0 | 0.381 |
 | 0.067 | 0.085 | 0 | 0.269 | 0.095 | 0.095 | 0 | 0.189 |
 | - | - | - | - | - | - | - | - |
 | - | - | - | - | 0.131 | 0.131 | 0 | 0.262 |
 | 0.012 | 0.012 | 0 | 0.024 | 0.049 | 0.049 | 0 | 0.098 |
This table gives the average, standard deviation, minimum and maximum particularity values for the CC comparisons of case 2. The 20 categories contain all the results that range from a similarity of 0.5 to 0.999 with steps of 0.025. A “-” value denotes an empty category.
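The per-category statistics reported in these tables can be produced along the following lines. This is a sketch, not the code used for the study: the function name, the handling of category boundaries and the inclusion of a similarity of exactly 1 in the last category are assumptions. The population standard deviation is used because rows built from two values (for instance 0.012 | 0.012 | 0 | 0.024 in the CC table) are consistent with it.

```python
import math
import statistics


def summarize_by_similarity(pairs, low=0.5, step=0.025, n_bins=20):
    """Group (similarity, particularity) pairs into similarity categories.

    Returns one entry per category: None for an empty category, otherwise
    (average, population std dev, min, max) of the particularity values.
    """
    bins = [[] for _ in range(n_bins)]
    for sim, par in pairs:
        if sim < low:
            continue  # only similar pairs are summarized
        # Rounding guards against floating-point noise at category boundaries.
        index = math.floor(round((sim - low) / step, 9))
        bins[min(index, n_bins - 1)].append(par)

    summary = []
    for values in bins:
        if not values:
            summary.append(None)
        else:
            summary.append((statistics.mean(values), statistics.pstdev(values),
                            min(values), max(values)))
    return summary


# Hypothetical input: three comparisons, two of them in the highest category.
example = [(0.99, 0.0), (0.98, 0.024), (0.51, 0.4)]
for category, stats in enumerate(summarize_by_similarity(example)):
    if stats is not None:
        print(category, stats)
```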
The relatively high particularity between similar genes that we observed in case 1 is confirmed in case 2. In each of the 20 categories in the human aquaporin-mediated transport pathway, some of the genes have a high particularity compared to the others. Again, these genes cannot be identified using only a similarity measure.
Part A: AQP8 and AQP5 share most of their annotations. Part B: AQP6 and AQP3 share numerous molecular functions, but each gene also has particular functions.
Measure | Gene | SV-based: AQP6 | SV-based: AQP3 | IC-based: AQP6 | IC-based: AQP3 |
Sim | AQP6 | 1 | 0.696 | 1 | 0.81 |
Sim | AQP3 | | 1 | | 1 |
Par | AQP6 | 0 | 0.247 | 0 | 0.531 |
Par | AQP3 | 0.415 | 0 | 0.388 | 0 |
Measure | Gene | SV-based: AQP8 | SV-based: AQP5 | IC-based: AQP8 | IC-based: AQP5 |
Sim | AQP8 | 1 | 0.704 | 1 | 0.8 |
Sim | AQP5 | | 1 | | 1 |
Par | AQP8 | 0 | 0 | 0 | 0 |
Par | AQP5 | 0.19 | 0 | 0.13 | 0 |
The similarity between AQP6 and AQP3 is very close to the similarity between AQP8 and AQP5 regardless of the method used (SV-based or IC-based). However, the particularity profiles obtained for the two pairs are very different. Again, the SV-based and IC-based methods led to the same conclusion.
These results confirm that among similar genes, some also have some particular functions, and show that this situation can be observed throughout the full range of similarity values. Therefore, the situation described in the first use case was not an isolated case.
The previous cases focused on the similarity and particularity of different genes in the same pathway. In this third case, we compared homolog genes across different species. IC-based methods cannot be used in this cross-species context. To investigate the frequency of similar homolog genes and the frequency of homolog genes having particular functions, we computed Wang's semantic similarity and SV-based particularities for each group of the HomoloGene database. The August 2013 version of this database contained 43,074 groups of homolog genes. Each group contained from 2 to 839 genes (average: 6.02, standard deviation: 7.46). We computed all 5,531,994 intra-group similarity and particularity measures.
Branch of GO | BP | MF | CC | All |
Number of comparisons | 1,843,998 | 1,843,998 | 1,843,998 | 5,531,994 |
Only one gene is annotated | 511,899 | 574,815 | 581,819 | 1,668,533 |
No annotated gene | 939,010 | 823,444 | 887,419 | 2,649,873 |
Two genes annotated | 393,089 | 445,739 | 374,760 | 1,213,588 |
Sim ≥ 0.5, both particularities < 0.5 | 287,288 | 396,412 | 314,572 | 998,272 |
Sim ≥ 0.5, one particularity ≥ 0.5 | 39,312 | 20,754 | 32,531 | 92,597 |
Sim ≥ 0.5, both particularities ≥ 0.5 | 410 | 91 | 54 | 555 |
Sim<0.5 | 66,079 | 28,482 | 27,603 | 122,164 |
The last five rows refer to valid comparisons where the two genes were annotated.
To be valid, a comparison has to involve two annotated genes. Overall, 21.94% of the comparisons were valid. For BP, CC and MF, we used the number of valid comparisons as the baseline to analyze the different configurations of similarity and particularity. We focused on these valid comparisons and found that 89.93% of them had a similarity greater than or equal to 0.5. In 82.26% of them, the genes were similar and had particularities lower than 0.5. Although there were differences between BP, MF and CC, over the whole HomoloGene database the particularity values allowed us to identify 7.63% of the valid comparisons as involving similar genes, one of which had a particularity greater than 0.5.
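The percentages above can be recomputed from the counts in the "All" column of the table, as in the following Python sketch. The assignment of the three "similar" counts to particularity configurations follows the percentages quoted in the text; the constant names, the labels returned by `configuration` and the use of 0.5 as an inclusive threshold are choices made for this illustration.

```python
# Counts taken from the "All" column of the table.
TOTAL_COMPARISONS = 5_531_994
VALID = 1_213_588                 # comparisons where both genes are annotated
SIMILAR_LOW_PAR = 998_272         # Sim >= 0.5, both particularities < 0.5
SIMILAR_ONE_HIGH_PAR = 92_597     # Sim >= 0.5, one high particularity
SIMILAR_BOTH_HIGH_PAR = 555       # Sim >= 0.5, two high particularities
DISSIMILAR = 122_164              # Sim < 0.5


def percent(part, whole):
    """Percentage rounded to two decimals, as in the text."""
    return round(100 * part / whole, 2)


def configuration(sim, par12, par21, threshold=0.5):
    """Classify one valid comparison by similarity and particularities."""
    if sim < threshold:
        return "dissimilar"
    high = (par12 >= threshold) + (par21 >= threshold)
    return ("similar, low particularities",
            "similar, one high particularity",
            "similar, two high particularities")[high]


similar = SIMILAR_LOW_PAR + SIMILAR_ONE_HIGH_PAR + SIMILAR_BOTH_HIGH_PAR
print(percent(VALID, TOTAL_COMPARISONS))     # share of valid comparisons
print(percent(similar, VALID))               # similar pairs among valid ones
print(percent(SIMILAR_LOW_PAR, VALID))       # similar with low particularities
print(percent(SIMILAR_ONE_HIGH_PAR, VALID))  # similar with one high particularity
print(configuration(0.81, 0.531, 0.388))     # IC-based AQP6 / AQP3 values of case 2
```

The four configurations partition the valid comparisons: their counts add up to 1,213,588.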
As an example illustrating the results, we analyzed the comparisons of the GO molecular functions associated with Exportin-5 orthologs for 9 species (
Altogether, the case 3 results showed that ortholog genes were, as expected, mostly similar. We have also demonstrated that some of them may have high particularity values that denote particular functions. Finally, some orthologs may have diverged to present a low similarity and high particularities.
Semantic similarity measures have been extensively used for comparing genes and gene sets
We based our semantic particularity measure on the concept of informativeness of GO terms. This informativeness can either be an Information Content (IC)
The interpretation of the similarity and particularity values depends on the number and quality of the annotations. If at least one of two genes has few annotations, the similarity and particularity values will suffer from a lack of precision (the values are sensitive to the addition of new annotations) regardless of their accuracy.
Furthermore, annotations are associated with different Evidence Codes (EC), ranging from automatic inference to experimental validation. The biological interpretation of similarity and particularity values is more convincing when their computation refers to experimentally-confirmed annotations. However, electronically-inferred annotations may still yield valid similarity and particularity values. As the GO consortium recommends against using EC as a measure of quality of the annotation
Particularity refined the similarity-based analysis by identifying some pairs of similar genes with high particularities. All three use cases illustrated this point, in intra-species or cross-species settings.
In the first case study on the
We went further in case 2, comparing 51 genes that belong to the same human pathway. With this case, we wanted to see three things. First, we wanted to know whether the observations made in the first case remained true on a bigger example. They did. Then, we wanted to assess the effect of the kind of informativeness used. Semantic value and information content gave different semantic similarity and particularity values, but they led to the same conclusions. Consequently, the choice of this method only depends on the data we want to compare. IC can be used as an informativeness measure if the data are relative to one single species and if this species is sufficiently annotated to offer a meaningful corpus. Otherwise, the best informativeness measure may be the semantic value. Finally, we wanted to assess our conclusions on the three branches of Gene Ontology. Concerning this point, we obtained high particularity values between similar genes in every branch of GO.
The third case showed comparisons of ortholog genes that also resulted in interesting sub-cases with high-similarity profiles. As expected, the results confirmed that ortholog genes are mostly similar. Moreover, particularity measures made it possible to observe that among the pairs of similar genes, some include at least one gene that also has a high particularity. Indeed, among the 1,213,588 valid comparisons across the whole HomoloGene database, we identified 93,152 (7.68%) comparisons for which the genes were similar but at least one of them had a high particularity, denoting some particular function(s). This confirms the observations made in cases 1 and 2. These 7.68% of valid comparisons, which identify genes that have some particular features yet share enough GO annotations to remain similar, are biologically very interesting. This demonstrates the benefits of using the semantic particularity measure in addition to semantic similarity.
In the third case, we developed the Exportin-5 example to illustrate the limitations of the semantic similarity measures. The results of a similarity measure did not reflect the fact that the human gene has more particular functions when compared to the drosophila ortholog (“tRNA binding” and four of its ancestors are human-specific) than when compared to the rat ortholog (only “protein binding” is human-specific). The particularity measure showed that the human and drosophila Exportin-5 orthologs are not only similar, but that some quantifiable features are in reality very specific to the human gene. Furthermore, the high particularity of these orthologs is consistent with the results of Shibata et al., who demonstrated that Exportin-5 orthologs are functionally divergent among species
The case studies showed that combining similarity and particularity makes it possible to identify some genes' particular functions that cannot be distinguished using similarity only. These particular functions may be the result of a real biological difference, an annotation gap, or a combination of both. If we suspect an annotation gap, the results should be interpreted carefully until the annotations are improved.
In case 3, the number of annotations varies between the compared orthologs. On the one hand, the results can reflect a real particularity of function for some genes. On the other hand, the high particularity of a gene can be the result of a lack of annotations of the other gene. For example, when comparing MF annotations for hsa and ath orthologs of Exportin-5, we observed very high particularities for both species (respectively 0.641 and 0.871). We consider these results to be relevant, as the genes of both species are well annotated (11 annotations in the expanded set of hsa, 18 annotations in the expanded set of ath). Conversely, care is warranted when interpreting the particularity of hsa over Canis canis (cca). For these species, Sim(hsa, cca) = 0.428, Par(hsa, cca) = 0.611 and Par(cca, hsa) = 0. However, the expanded set of annotations for the cca ortholog had only 4 terms compared to 11 for hsa. In this case, the high particularity of hsa could be attributed to the lack of cca annotations.
We showed that gene set analysis can be improved by considering gene-set particularities in addition to their similarity. We proposed a set of formal properties and a new GO semantic measure to compute gene-set particularity. We first showed that particularity is a useful complement to similarity for comparing gene sets, making it possible to detect similar gene sets for which one of the sets also had some particular functions, and to identify these functions. We also showed that using particularity improves gene clustering. Our particularity measure relies on the informativeness of GO terms. This informativeness of a term can be its Information Content or its Semantic Value. In this paper, we combined our particularity measure with a similarity measure to compare genes annotated with GO terms, but this same principle can be generalized to other ontologies.