System-Level Insights into the Cellular Interactome of a Non-Model Organism: Inferring, Modelling and Analysing Functional Gene Network of Soybean (Glycine max)

Yungang Xu; Maozu Guo; Quan Zou; Xiaoyan Liu; Chunyu Wang; Yang Liu

doi:10.1371/journal.pone.0113907

Abstract

The cellular interactome, in which genes and/or their products interact on several levels, forming transcriptional regulatory, protein interaction, metabolic and signal transduction networks, among others, has been a focus of research for decades. However, any one specific type of network alone can hardly explain the various interactive activities among genes. These networks characterize different interaction relationships, implying their unique intrinsic properties and defects, and covering different slices of biological information. Functional gene networks (FGNs), consolidated interaction networks that model a fuzzy and more generalized notion of gene-gene relations, have been proposed to combine heterogeneous networks with the goal of identifying functional modules supported by multiple interaction types. There are as yet no successful precedents of FGNs for sparsely studied non-model organisms, such as soybean (Glycine max), owing to the absence of sufficient heterogeneous interaction data. We present an alternative solution for inferring the FGNs of soybean (SoyFGNs), in a pioneering study on the soybean interactome, which is also applicable to other organisms. SoyFGNs exhibit the typical characteristics of biological networks: scale-free, small-world architecture and modularization. Verified by co-expression and KEGG pathways, SoyFGNs are more extensive and accurate than an orthology network derived from Arabidopsis. As a case study, network-guided disease-resistance gene discovery indicates that SoyFGNs can support system-level studies on gene functions and interactions. This work suggests that inferring and modelling the interactome of a non-model plant are feasible. It will speed up the discovery and definition of the functions and interactions of other genes that control important functions, such as nitrogen fixation and protein or lipid synthesis.
The efforts of the study are the basis of our further comprehensive studies on the soybean functional interactome at the genome and microRNome levels. Additionally, a web tool for information retrieval and analysis of SoyFGNs can be accessed at SoyFN: http://nclab.hit.edu.cn/SoyFN.

Citation: Xu Y, Guo M, Zou Q, Liu X, Wang C, Liu Y (2014) System-Level Insights into the Cellular Interactome of a Non-Model Organism: Inferring, Modelling and Analysing Functional Gene Network of Soybean (Glycine max). PLoS ONE 9(11): e113907. https://doi.org/10.1371/journal.pone.0113907

Editor: Henry T. Nguyen, University of Missouri, United States of America

Received: June 23, 2014; Accepted: October 24, 2014; Published: November 25, 2014

Copyright: © 2014 Xu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files, as well as Dryad (doi: 10.5061/dryad.0rv1m).

Funding: This work is supported by the Natural Science Foundation of China (91335112, 61370010, 61271346, and 61172098); and the Specialized Research Fund for the Doctoral Program of Higher Education of China (20112302110040). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The living body is a complex system of storing and processing information. Full understanding of this system means characterising the function of its components and their interactions. The cell, as the most basic system of life, is a system of hierarchical organisation from individual molecules (such as genes, mRNAs, proteins, and metabolites) to complex molecular pathways (such as gluconeogenesis and tricarboxylic acid cycle), in which molecular interactions play an important role. Interacting molecules form functional modules (such as groups of molecules involved in the same biological process), which in turn interact with each other to drive larger scale biological processes. Comprehensive maps of the interactions among biomolecules provide an overall view of the cell. The past decade has witnessed significant effort aimed at modelling, identifying, organising, and analysing cellular interactomes. Such effort, grounded in significant advances in our understanding of molecular biology, is supported by the omic-level high-throughput data collections and acquisition techniques, which are used to interrogate the states and interactions of biomolecules at multiple levels, and to further map the structure of the genome-wide interaction networks.

If the complex system of a cell is regarded as a gene society, although it is in fact composed of a variety of biological molecules, the heterogeneous interactions between biological molecules are, essentially, interactions between genes. The same gene society may be modelled by various networks, of which the most popular are the protein-protein interaction network (PPIN), the gene regulatory network (GRN, or transcriptional regulatory network, TRN) and the metabolic network (MN). In addition, there exist various other types of connections upon which to model gene interactions, such as signal transduction pathways, co-expression networks, genetic interactions, and so forth (Figure 1). However, these models characterise different interactive relationships between genes, implying their unique intrinsic properties and defects, and covering different slices of biological information. In other words, one specific type of connection alone cannot explain the various interactions among genes. Integrating them would contribute to a comprehensive view of the cellular system. Therefore, a challenging problem of network integration arises.

Figure 1.

Various types of interactions between genes and a schematic view of the workflow for constructing the probabilistic functional gene networks (PFGNs).

https://doi.org/10.1371/journal.pone.0113907.g001

Some pioneering approaches have arisen to combine networks of different interaction types defined on the same sets of nodes, with the goal of identifying functional modules supported by multiple types of interactions. The functional gene network is such a consolidated interaction network, one that models a fuzzy and more generalised notion of gene-gene relations. Further, the strength of interaction between any two genes indicates the level of confidence in the functional coupling between the two genes. Insuk Lee and Edward M. Marcotte along with their colleagues [1],[2] first proposed a complete description and construction of the FGNs. They represented the specific types of interactions between genes by a more inclusive type of relation, functional interactions. The consolidation of various types of interactions through the more inclusive functional interactions results in more extended coverage of the genome by the gene network (Figure 1). Such consolidated interaction networks are modelled in the form of weighted graphs, where edge weights represent the likelihood of interaction between genes, estimated on the basis of various statistical models and techniques. Such a network is referred to as a probabilistic functional gene network (PFGN) [3]. So far, PFGNs have been successfully constructed for the unicellular organism yeast (S. cerevisiae) [2],[4], the invertebrate nematode (C. elegans) [5],[6], the model plants Arabidopsis (A. thaliana) [7],[8] and rice (O. sativa) [9], the mammal mouse (M. musculus) [10]–[12] and even humans (H. sapiens) [13].

Although reconstruction of FGNs, depending on a variety of function-associated data (Figure 1), has been successful in many model plant species, especially, for example, the dicot Arabidopsis [7] and the monocot rice [9], integrating diverse genomic data into network models for many other plants, such as soybean, is still problematic. First, the genomic data are heterogeneous in their sensitivity and specificity for relationships between genes. For example, experimental methods such as mass spectrometry preferentially observe abundant proteins, whereas comparative genomics methods apply only to evolutionarily conserved genes. Second, genomic data sets vary widely in their utility for reconstructing gene networks. Thus, we need robust benchmarking methods that can be used to evaluate each data set and allow comparison of their relative merits. Third, data sets are often correlated, but the correlations are always difficult to measure because of data incompleteness (a common problem) and sampling biases [4]. For most species, the richness and accuracy of these various function-associated data are quite inconsistent. For example, for model organisms, such as Arabidopsis, a wealth of data resources is available owing to extensive research, but for other non-model organisms, such as soybean, there are not enough data to construct such networks. We therefore need a cross-species and minimally data-dependent approach to construct the FGNs of non-model organisms.

The Gene Ontology (GO) project [14] has integrated information from multiple data sources to annotate genes to specific biological process (BP), molecular function (MF) or cellular component (CC), which are three sub-ontologies (or aspects). GO annotation (GOA) itself can be regarded as a de facto way to integrate diverse unstructured data into a single structured data source. Therefore, GOA is important for inferring FGNs based on the fact that the strength of functional interaction between genes is proportional to their functional similarity (FS). Thus we can calculate the FS among all the genes of an organism based on GOA and further construct a genome-wide network, referred to as an FGN. As a weighted network model, edge weights in the FGN represent the functional similarity rather than the likelihood of interaction between genes in a PFGN.

In comparison to the PFGN, the FGN based on GOA seems to be much easier to construct. However, construction of such a genome-wide FGN for soybean is challenging for several reasons. First, whereas A. thaliana has ≈27,000 protein-coding genes (The Arabidopsis Information Resource, release 9) [15], soybean was predicted to have 46,430 protein-coding genes, 70% more than Arabidopsis [16], and in fact has 54174 protein-coding genes annotated by EnsemblPlants as of May 2013 (v1.0, JGI-Glyma-1.1). This increased genome complexity results in a combinatorial explosion in the number of pairwise relations between genes (theoretically ≈1.5 billion pairs in total, although we actually computed more than 2.7 billion pairs because of the three aspects of GO), complicating the discovery of true functional associations. Second, the current reference knowledge and raw omic data available for modelling gene interactions are much sparser for soybean than for Arabidopsis, reducing the predictive power of the resulting networks and increasing the difficulty of evaluating this power. Despite these hurdles, we constructed the first version of the soybean FGNs, called SoyFGNs, using the three aspects of GOA published by UniprotKB in September 2012 (version 111), which cover ≈70% of the 54174 soybean genes (Ensembls) recorded by EnsemblPlants. The construction of the second version of SoyFGNs, covering all 54174 genes, is under way. The entire construction process described below includes the following steps: 1) measuring the pairwise functional similarities of genes annotated by GO; 2) setting a threshold to determine how similar in function the gene pairs should be to be connected in the network; 3) dissecting the validity of SoyFGNs by topology analysis, comparative analysis and functional verification.

Material and Methods

Datasets

Gene ontology (GO).

The GO data were downloaded from the Gene Ontology website [17] (data version: 1.1.3499), excluding cross-products, inter-ontology and “has-part” relationships. This dataset contains 38137 terms, including 1692 obsolete terms. The total valid terms in BP, MF and CC number 23928, 9467 and 3050, respectively. The “is-a” and “part-of” relationships number 56718 and 6127, respectively.

GO annotations (GOA).

The GOAs of soybean (Glycine max) were downloaded from UniProt-GOA (http://www.ebi.ac.uk/GOA/, version 111). A total of 165040 annotations annotate 37827 (∼70%) of the 54174 soybean genes (recorded by EnsemblPlants, release 18 April 2013). The entries annotated in BP, MF and CC number 47452, 92374 and 25214, respectively. The genes annotated in BP, MF and CC number 27594, 33189 and 14150, respectively. Here we use UniprotKB AC/IDs or Ensembl Genome IDs to represent corresponding genes.

Functional similarities of pairwise genes

We previously proposed a shortest semantic differentiation distance (SSDD) method to calculate the semantic similarity between GO terms from a novel perspective [18]. An overlapping directed acyclic graph (DAG, a sub-graph of GO) was generated to represent two given terms. Such a DAG was then viewed as a semantic genealogy wherein a term inherits the semantics of its ancestors and distributes it to its descendants. We introduced the concept of semantic differentiation to represent the transition of a term from one pattern of semantic integration to another and the concept of semantic totipotency to represent the capacity of this differentiation. Taking into account all paths linking a term and its ancestors, the semantic totipotency of a given term $t$ is quantified as a T-value ($T(t)$) as follows:

$$T(t) = \begin{cases} 1, & t = root \\ \dfrac{1}{|P(t)|} \sum_{p \in P(t)} \omega(t,p)\, T(p), & \text{otherwise} \end{cases} \tag{1}$$

where $root$ represents a root term and $P(t)$ is the set of parents of $t$. The semantic totipotency of the three root terms is given as 1. The variable $\omega(t,p)$ is the semantic differentiation factor for the edge linking term $t$ with its parent $p$. The T-values of any other terms are derived as the average of all of its parents' T-values multiplied by the semantic differentiation factor ($\omega$). The differentiation capacity ($\omega$) should decrease moving down the hierarchy and be positively proportional to the number of descendants, or local density. Thus, the $\omega$ between a term $t$ and its parent $p$ should be greater than 0 and less than 1, and can be calculated as

$$\omega(t,p) = \frac{D(t)}{D(p)} \tag{2}$$

where $D(t)$ is the number of descendants of the term $t$, including itself.
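The T-value recursion described above can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: the DAG and its term names are hypothetical, the differentiation factor is assumed to be the ratio of descendant counts (child over parent, each count including the term itself), and the factor is assumed to be applied per parent edge before averaging.

```python
from functools import lru_cache

# Hypothetical toy DAG (child -> list of parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["c"],
}

def descendants_including_self(term):
    """Return the set holding `term` and every term reachable below it."""
    found = {term}
    for child, parents in PARENTS.items():
        if term in parents:
            found |= descendants_including_self(child)
    return found

def omega(term, parent):
    """Assumed differentiation factor of the edge term -> parent: D(term) / D(parent)."""
    return len(descendants_including_self(term)) / len(descendants_including_self(parent))

@lru_cache(maxsize=None)
def t_value(term):
    """T-value: 1 for a root, otherwise the mean of omega * T over all parents."""
    parents = PARENTS[term]
    if not parents:
        return 1.0
    return sum(omega(term, p) * t_value(p) for p in parents) / len(parents)

for name in PARENTS:
    print(name, round(t_value(name), 4))
```

In this toy graph the T-values shrink with depth (1 at the root, then 0.6, 0.4 and 0.2), which matches the stated requirement that the differentiation capacity decreases moving down the hierarchy.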

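Based on T-values, we proposed the SSDD to measure the semantic similarity in the GO. Given two terms $t_a$ and $t_b$, the normalised distance between them is defined as

$$dis(t_a, t_b) = \frac{\arctan\left(\sum_{t \in Path(t_a, t_b)} T(t)\right)}{\pi / 2} \tag{3}$$

where $Path(t_a, t_b)$ represents the set of terms on the shortest path connecting the terms $t_a$ and $t_b$ via their lowest common ancestors (LCAs), and $T(t)$ is the T-value of term $t$. The arctan function is used to normalise the distance to (0, 1). Apparently, $dis$ is symmetric, i.e. $dis(t_a, t_b) = dis(t_b, t_a)$. After normalisation, the semantic similarity is defined as:

$$sim(t_a, t_b) = 1 - dis(t_a, t_b) \tag{4}$$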

SSDD was shown to be effective for measuring the semantic similarity of pairwise GO terms. We also need a method for integrating pairwise semantic similarities into a single FS of genes because a gene is often annotated by more than one term in GOA. Three distinct approaches have been proposed for this integration: Lord et al. [19],[20] used an arithmetic average (Avg) of pairwise similarities between all terms of the first protein set and the second one; Sevilla et al. [21] used only the maximum (Max) similarity between all term pairs; Couto et al. [22], Schlicker et al. [23] and Azuaje et al. [24] developed the best-match average (BMA) method, in which each term of the first protein is paired only with the most similar term of the second one and vice versa. We take the BMA approach to compare gene similarities, as it was found to be most effective [25]. Given two genes, $g_1$ and $g_2$, BMA is defined as

$$FS(g_1, g_2) = \frac{\sum_{i=1}^{m} \max_{1 \le j \le n} sim(t_{1i}, t_{2j}) + \sum_{j=1}^{n} \max_{1 \le i \le m} sim(t_{1i}, t_{2j})}{m + n} \tag{5}$$

where $t_{1i}$ ($t_{2j}$) denotes a term that belongs to the term set with a size of $m$ ($n$) that annotates $g_1$ ($g_2$). Thus, each gene pair is assigned three FSs based on the three orthogonal aspects of GO. We also need a single integrated FS for each gene pair (denoted by $FS_{INT}$). Thus, we calculate the weighted average of the three FSs as their integration (hereinafter denoted by INT), which can be formulated as

$$FS_{INT} = \frac{w_{BP} FS_{BP} + w_{MF} FS_{MF} + w_{CC} FS_{CC}}{w_{BP} + w_{MF} + w_{CC}} \tag{6}$$

where $FS_{BP}$, $FS_{MF}$ and $FS_{CC}$ are the three FSs for each gene pair, and $w_{BP}$, $w_{MF}$ and $w_{CC}$ are the corresponding weights of the three GO aspects. In the absence of a criterion to quantify the weights of the different GO aspects on a gene's function, we let each weight be equal to the corresponding FS, based mainly on two considerations. First, because genes function unequally in the three GO aspects, the one yielding greater similarity should have a greater weight. Second, a great reduction in the integrated FS can be avoided even when the gene pair receives a zero FS in some aspect. The final formula for the integrated FS is

$$FS_{INT} = \frac{FS_{BP}^2 + FS_{MF}^2 + FS_{CC}^2}{FS_{BP} + FS_{MF} + FS_{CC}} \tag{7}$$

where $FS_{INT}$ also ranges between 0 and 1.
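The two aggregation steps just described, best-match averaging of term similarities and the self-weighted integration of the three aspect scores, can be sketched as follows. This is an illustrative sketch only: the term names and similarity values are made up, the BMA normalisation by the combined size of the two term sets is an assumption, and returning 0 when all three aspect scores are 0 is a choice made here because the text does not cover that case.

```python
def best_match_average(terms_a, terms_b, sim):
    """Best-match average of term-term similarities between two annotation sets."""
    if not terms_a or not terms_b:
        return 0.0
    # Each term is matched with its most similar term in the other set.
    best_a = sum(max(sim(a, b) for b in terms_b) for a in terms_a)
    best_b = sum(max(sim(a, b) for a in terms_a) for b in terms_b)
    return (best_a + best_b) / (len(terms_a) + len(terms_b))

def integrated_fs(fs_bp, fs_mf, fs_cc):
    """Weighted average of the three aspect scores, each weighted by itself."""
    scores = (fs_bp, fs_mf, fs_cc)
    total = sum(scores)
    if total == 0:
        return 0.0  # assumption: the all-zero case is not specified in the text
    return sum(s * s for s in scores) / total

# Hypothetical term similarities for two genes annotated with {t1, t2} and {t3}.
TOY_SIM = {("t1", "t3"): 0.8, ("t2", "t3"): 0.4}

def toy_sim(a, b):
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

print(best_match_average(["t1", "t2"], ["t3"], toy_sim))
print(integrated_fs(0.9, 0.0, 0.3))
```

With aspect scores of 0.9, 0.0 and 0.3, the self-weighted integration gives 0.75, whereas a plain arithmetic mean would give 0.4. This illustrates the second consideration in the text: a zero score in one aspect does not drag the integrated score down sharply.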

SoyFGNs construction

As shown in our previous work [18], our method yields more reliable gene FSs for species with shallow gene annotations, such as soybean, partly resolving a critical problem in functional network construction. We can therefore calculate all pairwise FSs for a list of genes $G = \{g_1, g_2, \ldots, g_n\}$ and obtain an $n \times n$ similarity matrix $M$, in which the element $m_{ij}$ represents the functional similarity of genes $g_i$ and $g_j$. The next step is to filter the matrix $M$ to derive an adjacency matrix $A$ representing the functional gene network. The key is to determine how similar in function two genes must be to be linked in the network, i.e. an appropriate threshold is needed so that gene pairs with FSs greater than or equal to the threshold value are connected by edges ($a_{ij} = 1$); otherwise, they are not connected directly ($a_{ij} = 0$).

In this study, we adopted clustering coefficient-based threshold selection. The clustering coefficient ($C_i$) of a node ($i$) in a network is defined as $C_i = 2 n_i / (k_i (k_i - 1))$, where $k_i$ is the degree of node $i$ and $n_i$ represents the number of edges between the first neighbours of gene $i$; if $k_i < 2$, we define $C_i = 0$. The clustering coefficient of a network is defined as the average clustering coefficient of all of its nodes,

$$C = \frac{1}{N} \sum_{i=1}^{N} C_i \tag{8}$$

where $N$ is the number of nodes in the network. If $N = 0$, we define $C = 0$.

The construction of a gene network can be viewed as a process in which links are removed from the initially complete graph by gradually increasing the FS threshold. Because all FSs range between 0 and 1, we set a series of incremental thresholds ($\tau_1 = 0, \tau_2 = 0.01, \ldots, 1$) with an increment of 0.01. For each threshold $\tau$, we construct a network by setting the adjacency entry $a_{ij} = 1$ if the similarity $m_{ij} \geq \tau$. In systems biology, a genuine biological network should be scale-free and highly modular; its clustering coefficient, denoted by $C(\tau)$, should be significantly higher than that of the corresponding random network, denoted by $C_r(\tau)$. Here, we denote the difference between $C(\tau)$ and $C_r(\tau)$ by $\Delta C(\tau)$, i.e. $\Delta C(\tau) = C(\tau) - C_r(\tau)$. We conjectured that the most appropriate threshold should be the maximum $T$ that produces a monotonically increasing $\Delta C(\tau)$ when the links are removed gradually as the threshold increases from 0 to $T$. More specifically, we formulated this as a discrete optimisation problem, where the critical cut-off threshold was determined by finding the first $\tau_i$ for which $\Delta C(\tau_i) > \Delta C(\tau_{i+1})$ over a set of $\tau_i$ gradually increasing from 0 to 1. Note that calculating $C_r$ of the randomised networks by formula (8) is nontrivial because it is not clear which random network model should be used for this purpose. Hence, we adopted a statistical method proposed by Elo et al. [26]. If $N$ denotes the total number of nodes and $k_i$ denotes the degree of node $i$ in the original network, then $C_r$ is calculated as the expected value of the clustering coefficient as follows:

$$C_r = \frac{\left(\overline{k^2} - \overline{k}\right)^2}{\overline{k}^3 N} \tag{9}$$

where $\overline{k} = \frac{1}{N} \sum_{i=1}^{N} k_i$, and $\overline{k^2} = \frac{1}{N} \sum_{i=1}^{N} k_i^2$.
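The threshold selection described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' pipeline: the similarity dictionary is a made-up four-gene example, the expected clustering coefficient of the random network is taken to be the degree-moment expression commonly attributed to this approach, and the stopping rule is read as "the first threshold whose successor gives a smaller difference". All function names are hypothetical.

```python
from itertools import combinations

def build_network(sim, n, tau):
    """Adjacency sets linking gene pairs whose similarity is >= tau."""
    adj = {i: set() for i in range(n)}
    for (i, j), s in sim.items():
        if s >= tau:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def clustering(adj):
    """Average clustering coefficient; nodes with degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def random_clustering(adj):
    """Expected clustering coefficient of a random network with the same degrees."""
    n = len(adj)
    degrees = [len(nbrs) for nbrs in adj.values()]
    k1 = sum(degrees) / n
    k2 = sum(d * d for d in degrees) / n
    if k1 == 0:
        return 0.0
    return (k2 - k1) ** 2 / (k1 ** 3 * n)

def select_threshold(sim, n):
    """First threshold (step 0.01) after which C - C_r stops increasing."""
    taus = [i / 100 for i in range(101)]
    deltas = []
    for tau in taus:
        adj = build_network(sim, n, tau)
        deltas.append(clustering(adj) - random_clustering(adj))
    for i in range(len(taus) - 1):
        if deltas[i] > deltas[i + 1]:
            return taus[i]
    return taus[-1]

# Hypothetical similarities: genes 0-2 are tightly related, gene 3 only weakly.
toy_sim = {(0, 1): 0.9, (0, 2): 0.9, (1, 2): 0.9,
           (0, 3): 0.2, (1, 3): 0.2, (2, 3): 0.2}
print(select_threshold(toy_sim, 4))
```

Recomputing the network from scratch at each of the 101 thresholds is acceptable for a toy input. For hundreds of millions of gene pairs, an incremental scheme that sorts the pairs once and removes edges as the threshold rises would be a more realistic design.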

Finally, an FGN can be constructed and represented as $G = (V, E, W)$, where $V$ represents the genes involved in the network, $E$ represents the edges between gene pairs with FSs greater than or equal to the threshold $T$, and $W$ represents the weights of the edges, which are the FSs of pairwise genes.

Using the pairwise FS of all soybean genes and the clustering coefficient-based threshold selection, we construct four soybean functional gene networks (SoyFGNs) in BP, MF, CC and INT, respectively.

Topologic characterisation of SoyFGNs

One way to characterise biological networks is to study their topologic properties. Using Cytoscape 2.8.2 [27], we investigated the global properties of the resulting SoyFGNs. In addition, we conducted an in-depth analysis of the degree distribution and degree correlation, as described in the next two subsections.

Degree distribution.

Many early studies observed that biological networks are generally scale free and their degree distribution follows the power law [28],[29]. A number of later studies have argued that there are other distributions, such as the log-normal distribution, which explain the degree distribution better than power law [30],[31]. We used three models to investigate the distributions of the four resultant FGNs: lognormal, power law and exponential. All model fittings and visualisations are completed with the use of Origin 9 (http://www.originlab.com).

Degree correlation.

Degree correlation is a basic structural metric for calculating the likelihood that nodes link to nodes of similar or dissimilar nodal degree. The former case is called positive degree correlation, and the latter is called negative degree correlation. In the social sciences, a network with positive degree correlation is referred to as an assortative network, whereas a network with negative degree correlation is referred to as disassortative network [32]. Three ways of characterising the amount of degree correlation are used, each involving less detail and expressing the result in more compact terms. They are the joint degree distribution (JDD), the k-nearest neighbours (knn) and the Pearson degree correlation (PDC).

The JDD is defined as the distribution in which each entry (i, j) is the number of edges whose endpoint nodes have degrees i and j, respectively. JDD is thus a two-dimensional distribution of the number of edges with respect to the degrees of the nodes they connect.

Instead of recording every pair of nodes, as JDD does, knn simply averages the degrees of the neighbours of each node of a given degree and plots the results as linear, semi-log, and log-log plots. If a degree is missing, it is skipped in the graph. A rise in knn along with a rise in nodal degree indicates that nodes of similar degree tend to be linked, whereas a fall in knn with a rise in degree indicates the opposite.

PDC is the most condensed way to characterise the degree-link structure of a network. It consists of the conventional Pearson correlation calculation applied to each pair of linked nodes. The result always lies in the range [−1, 1], with a negative value indicating that nodes of dissimilar degree tend to be linked and a positive value indicating that nodes of similar degree tend to be linked.
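Two of the three degree-correlation summaries described above, knn and PDC, can be computed directly from an edge list. The sketch below is illustrative only: the function name is hypothetical, the input is a toy star graph, and PDC is computed as the Pearson correlation over both orientations of every edge, which is one common convention and is assumed here.

```python
from collections import defaultdict

def degree_correlation(edges):
    """Return (knn by degree, Pearson degree correlation) for an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    # knn: mean neighbour degree of each node, averaged over nodes of equal degree.
    nbr_degree_sum = defaultdict(int)
    for u, v in edges:
        nbr_degree_sum[u] += deg[v]
        nbr_degree_sum[v] += deg[u]
    by_degree = defaultdict(list)
    for node, k in deg.items():
        by_degree[k].append(nbr_degree_sum[node] / k)
    knn = {k: sum(vals) / len(vals) for k, vals in sorted(by_degree.items())}

    # PDC: Pearson correlation of the degrees at the two ends of each edge,
    # counting every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    pdc = cov / var if var else 0.0
    return knn, pdc

# Toy star graph: a hub (node 0) linked to three leaves.
star = [(0, 1), (0, 2), (0, 3)]
knn, pdc = degree_correlation(star)
print(knn, pdc)
```

For the star graph, knn falls as degree rises (leaves see a neighbour of degree 3, the hub sees neighbours of degree 1) and PDC is -1, the extreme disassortative case.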

Evaluating SoyFGNs through comparison to a network generated by orthology from Arabidopsis

A Soybean network generated by orthology from Arabidopsis.

An alternative approach to constructing a soybean gene network might be simply to transfer linkages from orthologous gene pairs of an existing gene network. This approach does not require modelling using soybean annotations or any of the soybean-derived experimental data. The value of this approach has been shown in the reconstruction of gene networks for C. elegans [5] and Arabidopsis [7]. To assess the accuracy of the SoyFGNs in comparison to such an orthology-derived network, we first identified the orthologs between soybean and Arabidopsis using BLASTN; the results are shown in Table S1. We then downloaded the gene network of Arabidopsis from BioGRID (3.2.96) and inferred soybean gene linkages based on the linkages of this network, generating an orthology-derived soybean gene network, which consists of 16566 nodes (genes) and 146562 edges (linkages).

Inferring functional linkages from KEGG pathways and validating a query network.

To validate SoyFGNs using independent annotations, we employed the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [33]. KEGG is based on manual curation and is thus considered generally accurate and largely independent from both SoyFGNs and the orthology-derived network. We downloaded equivalent link information for soybean genes from LinkDB (http://www.genome.jp/linkdb/) using UniprotKB AC/ID in March 2013. All links were also mapped to Ensembl Genomes IDs. As a result, 3145 genes were mapped to 238 pathways, which can be retrieved from our web database (http://nclab.hit.edu.cn/SoyFN/tar_pathway.php). As a benchmark network, the KEGG-derived network was constructed by generating linkages between genes sharing KEGG annotation terms, i.e. sharing the same KO IDs. The validation of a query network by the KEGG-derived network is mainly based on the gene coverage and the linkage accuracy. The gene coverage is defined as the number of genes shared by the query network and the KEGG-derived network divided by the number of genes involved in the KEGG-derived network. The linkage accuracy is defined as a ratio of two linkage counts: the number of linkages between genes in the query network and the number of linkages between genes in the KEGG-derived network.

Inferring functional linkages from co-expression data and validating a query network.

Another major source of functional associations is mRNA co-expression data. We therefore additionally inferred functional associations from mRNA co-expression profiles to evaluate SoyFGNs. Eleven datasets for Glycine max genes were downloaded from the Gene Expression Omnibus (GEO) [34] in March 2013 (Table 1). To reduce the false positive rate, 4 datasets with fewer than 20 samples each were discarded. The remaining 7 datasets were then filtered by removing uninformative sets, based on testing for a significant correlation between the Pearson correlation coefficients (PCCs) between pairs of genes' expression vectors, and by removing genes not sharing a specific Ensembl Genomes ID from further analysis. For each dataset, the PCC between pairs of genes' expression profiles was used as the measure for inferring the co-expression linkages. Pairs of genes for which the absolute value of the PCC is more than 0.8 were linked. Finally, all linkages derived from the 7 expression datasets were merged into a final co-expression network. The inclusiveness of a network versus the co-expression network is also measured by the gene coverage and the linkage accuracy. The gene coverage is defined as the number of genes shared by the query network and the co-expression network divided by the number of genes involved in the co-expression network. The linkage accuracy is defined as a ratio of two linkage counts: the number of linkages between genes in the query network and the number of linkages between genes in the co-expression network.
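The linking rule for co-expression (absolute PCC above 0.8 within a dataset, then a union across datasets) can be sketched as follows. This is an illustrative sketch with made-up gene labels and expression values; the function names are hypothetical, and the dataset filtering steps described in the text are not reproduced.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)

def coexpression_links(profiles, cutoff=0.8):
    """Link gene pairs whose |PCC| exceeds the cutoff within one dataset."""
    links = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[g1], profiles[g2])) > cutoff:
            links.add((g1, g2))
    return links

def merge_datasets(datasets, cutoff=0.8):
    """Union of the linkages inferred from each dataset."""
    merged = set()
    for profiles in datasets:
        merged |= coexpression_links(profiles, cutoff)
    return merged

# Hypothetical expression profiles (gene label -> values across samples).
dataset_1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1], "D": [1, 3, 2, 3]}
dataset_2 = {"A": [1, 2, 3], "D": [2, 4, 6]}
print(sorted(merge_datasets([dataset_1, dataset_2])))
```

Because the rule uses the absolute value, the strongly anti-correlated pair A and C is linked just like the positively correlated pair A and B, while D is linked to A only through the second dataset.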

Table 1. Soybean mRNA expression datasets and the inferred functional linkages.

https://doi.org/10.1371/journal.pone.0113907.t001

The co-expression network is generated by an approach different from the one used to generate GO, whereas the KEGG network is generated by the same approach as GO. Validation against the co-expression network would therefore yield a very low linkage accuracy. To evaluate whether these low linkage accuracies are nonetheless statistically significantly higher than the background accuracy, we performed an additional statistical analysis comparing the linkage accuracies of the orthology-derived network and SoyFGNs with those of their corresponding randomized networks. A randomized network is generated by randomly perturbing the edges while maintaining the same nodes and their degree distribution. As our preliminary experiments showed that more than 400 perturbations yield randomized networks with stable properties, we used the average over 400 randomized networks to estimate the background linkage accuracy. The p-values (by ANOVA) are given to indicate the significance of the differences.
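A randomized network that keeps every node's degree can be produced by repeated edge swaps. The text does not state which perturbation procedure was used, so the double-edge swap below is an assumption, offered only as a common way to meet the stated constraint; the function names and the toy edge list are hypothetical.

```python
import random
from collections import Counter

def degree_sequence(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def randomize_network(edges, n_swaps, seed=0):
    """Degree-preserving randomization by double-edge swaps (assumed procedure)."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on very constrained graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible rewirings at random
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list

# Toy network: a ring of eight nodes with two chords.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0), (0, 4), (2, 6)]
shuffled = randomize_network(toy_edges, n_swaps=20, seed=1)
print(degree_sequence(shuffled) == degree_sequence(toy_edges))
```

Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), so every node keeps its degree while the wiring changes. Repeating this with different seeds gives an ensemble of randomized networks over which a background statistic can be averaged.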

Network-guided disease-resistant gene discovery

The aforementioned pathway and co-expression analysis showed that genes for similar biological processes or with similar expression profiles can be successfully associated in SoyFGNs. We next, as a case study, specifically tested the feasibility of predicting the genes governing plant disease resistance by using SoyFGN-INT in two steps: network-guided discovery and in silico verification.

Plant disease resistance protects plants from pathogens. Resistance genes (R-genes) are genes in plant genomes that convey plant disease resistance against pathogens by producing R-proteins. The main class of R-genes consists of a nucleotide binding (NB) domain and one or more leucine rich repeat (LRR) domains, and these are often referred to as NB-LRR R-genes. NB-LRR R-genes can be further subdivided into toll interleukin 1 receptor (TIR-NB-LRR) and coiled-coil (CC-NB-LRR) types [35]. To implement this study, 24 randomly selected genes were used as query genes to predict R-genes using the Gaussian smoothing guilt-by-association method [36]. To evaluate the stability of prediction, 6 (1/4) of the 24 query genes were putative R-genes, whereas the others were experimentally verified R-genes (see Table S2). Note that we predicted candidate genes using only the direct network neighbors via guilt-by-association. Of the 737 candidate genes, 225, which are highly connected with 14 query R-genes and constitute the biggest disease-resistance module, were used for further analysis (see Table S3). For these 225 disease-resistance candidates, we first defined their functions through extensive database and literature searches. Second, we assigned a weighted rating (WR) score to each candidate according to its connected known R-genes, to prioritize its likelihood of being an R-gene. The WR of a gene should be positively proportional to both the number of its neighbor function-known R-genes and the average weight of the edges linking it to these neighbors, which are represented as functional similarity (FS). We used the so-called 'true Bayesian estimate' to compute the WR, a weighting mechanism used by the Internet Movie Database (IMDb) to adjust a movie's rating score based on the number of votes it has received.
The formula is defined as:

$$WR = \frac{v}{v + m} F + \frac{m}{v + m} C$$

where $F$ is the average FS of each gene; $C$ is the total average FS of all genes; $v$ is the number of neighbor function-known R-genes; and $m$ is a minimum number of neighbors to an R-gene, which was set to the first quartile of the neighbor number distribution of all 225 genes. The resulting WR scores of the 225 genes are also provided in Table S3 (xls).
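The weighted rating can be illustrated numerically. The sketch below assumes the usual form of the IMDb-style 'true Bayesian estimate', a blend of the gene's own average FS and the global average weighted by v and m; the input values are invented for illustration and the function name is hypothetical.

```python
def weighted_rating(F, v, m, C):
    """Bayesian-style weighted rating.

    F: average FS between a candidate and its neighbouring known R-genes
    v: number of neighbouring known R-genes
    m: minimum neighbour count (e.g. the first quartile over all candidates)
    C: average FS over all candidates
    """
    if v + m == 0:
        return C  # assumption: with no evidence and no prior weight, fall back to C
    return (v / (v + m)) * F + (m / (v + m)) * C

# Two hypothetical candidates with the same average FS but different support.
print(weighted_rating(F=0.9, v=1, m=3, C=0.5))
print(weighted_rating(F=0.9, v=9, m=3, C=0.5))
```

The candidate supported by a single known R-gene is pulled towards the global average (0.6), whereas the one supported by nine neighbours keeps most of its own score (0.8). This is the behaviour the text asks for: the rating grows with both the number of neighbouring R-genes and the average edge weight.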

Results

Functional similarities of pairwise genes

Measuring the pairwise FSs of soybean genes is the first step of SoyFGNs construction. UniProt-GOA (http://www.ebi.ac.uk/GOA/), published in September 2012 (version 111), deposits 165,040 annotations, annotating 37,827 (∼70%) of the 54174 soybean genes in total (recorded by EnsemblPlants, release 18 April 2013). The numbers of genes annotated by BP, MF and CC are 27594, 33189 and 14150, respectively. Using our previously proposed SSDD [18], we obtained 380700621 (27594*27593/2), 550738266 (33189*33188/2), and 100104175 (14150*14149/2) pairwise FSs in BP, MF, and CC, respectively. We then assigned each gene pair an integrated FS using the weighted average of the three FSs (see METHODS for details), producing 715422051 (37827*37826/2) pairwise FSs of 37827 genes, referred to as “Integration (INT)”. We excluded the FSs of the genes with themselves because these are not used in the subsequent construction of the no-loop networks. The distribution of these four types of pairwise FSs is shown in Figure 2. The complete data are provided on our website (http://nclab.hit.edu.cn/SoyFN) because their sizes exceed the upper limit of supplementary files (each larger than 8 GB). All genes can be retrieved by the UniprotKB AC/ID or the Ensembl Genome ID on our website (e.g., K7MVA4 and GLYMA18G52145). Hereafter in this paper, we use the UniprotKB AC/ID to refer to the corresponding gene.
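The pair counts quoted above follow from the number of unordered pairs among n genes, n(n-1)/2. A short check, using only the gene counts stated in the text (the variable and function names are illustrative):

```python
def n_pairs(n):
    """Number of unordered pairs of distinct items among n."""
    return n * (n - 1) // 2

annotated_genes = {"BP": 27594, "MF": 33189, "CC": 14150, "INT": 37827}
pair_counts = {aspect: n_pairs(n) for aspect, n in annotated_genes.items()}

for aspect, count in pair_counts.items():
    print(aspect, count)

# Upper bound if all 54174 soybean genes were annotated.
print("all genes", n_pairs(54174))
```

The last line gives 1,467,384,051, which corresponds to the "≈1.5 billion pairs" quoted in the Introduction.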

Figure 2. The distribution of pairwise functional similarities of soybean genes (dashed lines with marks) and the cumulative probabilities of distributions (solid lines with marks).

https://doi.org/10.1371/journal.pone.0113907.g002