Conceived and designed the experiments: HJA PCB. Performed the experiments: HJA. Analyzed the data: HJA. Contributed reagents/materials/analysis tools: JHM TF. Wrote the paper: HJA PCB.
The authors have declared that no competing interests exist.
The dramatic increase in heterogeneous types of biological data—in particular, the abundance of new protein sequences—requires fast and user-friendly methods for organizing this information in a way that enables functional inference. The most widely used strategy to link sequence or structure to function, homology-based function prediction, relies on the fundamental assumption that sequence or structural similarity implies functional similarity. New tools that extend this approach are still urgently needed to associate sequence data with biological information in ways that accommodate the real complexity of the problem, while being accessible to experimental as well as computational biologists. To address this, we have examined the application of sequence similarity networks for visualizing functional trends across protein superfamilies from the context of sequence similarity. Using three large groups of homologous proteins of varying types of structural and functional diversity—GPCRs and kinases from humans, and the crotonase superfamily of enzymes—we show that overlaying networks with orthogonal information is a powerful approach for observing functional themes and revealing outliers. In comparison to other primary methods, networks provide both a good representation of group-wise sequence similarity relationships and a strong visual and quantitative correlation with phylogenetic trees, while enabling analysis and visualization of much larger sets of sequences than trees or multiple sequence alignments can easily accommodate. We also define important limitations and caveats in the application of these networks. As a broadly accessible and effective tool for the exploration of protein superfamilies, sequence similarity networks show great potential for generating testable hypotheses about protein structure-function relationships.
Over the past two decades, there has been a disorderly explosion of biological data, exponentially increasing in volume with time. To keep pace with the broad classes of new sequence, structural, and functional data arising from compilations of genomic and proteomic data in particular, many powerful approaches have been developed for unearthing meaningful themes and hypotheses from within the jumble. Yet there is still a critical need for improved techniques enabling fast and comprehensive analysis of large sequence data sets, especially to access the biologically useful context that can be extracted from this information. There is a particular demand for easy-to-use techniques to aid experimental biologists in finding useful starting points for analyzing diverse superfamilies of proteins. Here we address one of these techniques, sequence similarity networks (
A. Thresholded sequence similarity networks represent sequences as nodes (circles) and all pairwise sequence relationships (alignments) better than a threshold as edges (lines). The same network, depicting three simulated protein classes, is shown here at four different thresholds. At stringent thresholds, the sequences break up into disconnected groups; within each group the sequences are highly similar. The relative positioning of disconnected groups has no meaning, while the lengths of connecting edges tend to correlate with the relative dissimilarities of each pair of sequences. As the threshold is relaxed and edges associated with less significant relationships are added to the network, groups merge together and eventually become completely interconnected. B. Simulated dendrogram for a sequence set that might give rise to the network in A.
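The thresholding behaviour described in this caption can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this work: the node names, the E-values, and the helper `connected_groups` are all invented for the example, and groups are found with a simple union-find rather than any particular network tool.

```python
from typing import Dict, List, Set, Tuple

# Toy data, invented for illustration: five sequences and four pairwise E-values.
NODES: List[str] = ["a1", "a2", "b1", "b2", "c1"]
EVALUES: Dict[Tuple[str, str], float] = {
    ("a1", "a2"): 1e-60,
    ("b1", "b2"): 1e-55,
    ("a2", "b1"): 1e-30,
    ("b2", "c1"): 1e-12,
}


def connected_groups(
    nodes: List[str],
    evalues: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[Set[str]]:
    """Return groups of nodes joined by edges more significant than `threshold`."""
    parent = {n: n for n in nodes}

    def find(n: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), evalue in evalues.items():
        if evalue < threshold:  # keep only edges better than the threshold
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Relaxing the threshold merges groups: three groups, then two, then one.
for threshold in (1e-50, 1e-20, 1e-10):
    print(threshold, len(connected_groups(NODES, EVALUES, threshold)))
```

With these toy values the stringent threshold leaves three disconnected groups, and each relaxation adds an edge that merges two of them.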
There has already been a great deal of interest in generating sequence similarity networks. Enright and colleagues recognized that visualizing a network of protein similarity information
Before sequence similarity networks can be adopted for broad use, however, it is important to understand their strengths and weaknesses. In particular, these types of networks need to be validated in comparison to better-understood approaches. A primary motivation of this work is to address whether there is a compelling quantitative argument that sequence similarity networks can competently depict sequence similarity relationships, allowing them to be used as a framework to guide hypotheses about functional relationships. Although it has long been recognized that sequence similarity is an imperfect proxy for functional similarity, a fundamental dogma of structural biology (that sequence conservation implies structural conservation, which in turn implies functional conservation) has been extensively and effectively applied to infer functional properties on every scale. Consistent with this view, our results demonstrate that visualized sequence similarity networks perform well in representing sequence similarity information, and indeed the visualized relationships correlate well with known functional relationships. In contrast to the formal network representations of sequence similarity presented in previous studies describing algorithms for network generation, we have shown how well the displayed relationships reflect various measures of sequence and evolutionary distance, using relevant examples and quantitative assessments. Additionally, we introduce a concept: the most valuable feature of sequence similarity networks is not the optimal or most accurate display of sequence similarity, but rather the flexible visualization of many alternate protein attributes for all or nearly all sequences in a superfamily. To illustrate the results, we have used three well-studied superfamilies with nuanced functional annotations.
This work is especially applicable to the study of individual superfamilies, and is complementary to previous work in this area that typically shows that networks can group all known proteins in agreement with broad definitions of functional similarity (e.g.
Here we demonstrate, using example data sets of G-protein coupled receptors (GPCRs), kinases, and the crotonase superfamily of enzymes, that sequence similarity networks recapitulate much of the information present in phylogenetic trees, that the relationships implied by networks are in agreement with known sequence and structural relationships, that networks incorporate a number of practical benefits that improve on current techniques for relating sequences, and that visualization of similarity networks enables the perception of trends from the context of sequence similarity, initiating fruitful hypotheses. Finally, we report a new result relevant to the evolution of domain variation in the crotonase superfamily of enzymes that was obtained from analysis of sequence similarity networks.
Our results provide validation of sequence similarity networks for establishing family or superfamily context and for illustrating important applications. The first two sections provide quantitative evidence to support our claim that two-dimensional distances in visualized networks correlate well with the underlying distances in high-dimensional space and with distances depicted by phylogenetic trees, indicating that the depictions are mathematically reasonable and comparable to an accepted standard. The next sections address the practical benefits we have found for sequence similarity networks in capturing known (and novel) sequence and structural relationships, and in providing different and new information compared to conventional methods for relating sequences. We also describe some of the important advantages this view of sequence similarity context provides for hypothesis generation about structure-function relationships. This latter application is most powerful when nodes in the network are painted with structural or functional information that is orthogonal to homology-based information. An example is provided by mapping sequence length and taxonomic information onto the crotonase superfamily network, leading to the discovery that there are three major groups within the superfamily that are differentiated by domain organization and that track with primary branching across the tree of life. Each section is accompanied by a brief discussion of the controls and caveats we have found to be important for effective use of this method.
Graph layout algorithms project the N-1 dimensional data structure into two (or three) dimensions for visualization, with the aim being to preserve, as well as possible, the actual pairwise distances between nodes in high dimensional space. In this case, the graphs are made up of nodes (sequences) connected by edges (pairwise similarity relationships). The layout used in this work, the Organic layout
Additionally, we found high correlations between a Class A GPCR network composed of 605 sequences and networks from this set where 20% of the sequences were removed at random. To address the impact of missing data on network topology, we compared the laid-out distances between sequences present in the full network and these 80% networks (
We examined the similarity relationships implied by phylogenetic trees and networks of two small protein families (amine-binding GPCRs, and the STE and WNK kinases) and the kinase superfamily. Both sequence families are simple to align—highly conserved transmembrane helix domains anchor the amine-binding GPCRs, while the STE and WNK kinases have an average percent identity of 36% across the alignment. The distances between sequences in a neighbor-joining tree of the 42 human amine-binding GPCRs and the corresponding sequence similarity networks are well correlated (R = 0.712; see
A. Neighbor-Joining tree describing the interrelationships of 42 amine-binding human GPCR domains. Sequences are labeled according to the common name for their class (e.g., the sequence labeled α1D is adrenoceptor α1D; see additional data file 5 for all sequence database identifiers). B. Sequence similarity network including the same 42 sequences as in (A). This network was thresholded at a BLAST E-value of 1×10^−33: only edges associated with E-values more significant than 1×10^−33 are included in the network. This network contains 324 edges; the worst edges displayed correspond to a median of 30% identity over an alignment length of 280 amino acids. See Table I for a quantitative comparison of the two representations. The sequences labeled (a) and (b) are discussed in the text.
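The "worst edges" summary quoted in this caption (median percent identity and alignment length at the threshold) can be approximated as follows. This is only a sketch under stated assumptions: the text does not define exactly which edges count as "worst", so the example takes them to be the retained edges within one log10 unit of the threshold. The `Edge` record, the `window` parameter, and the toy values are hypothetical.

```python
import math
from statistics import median
from typing import List, NamedTuple, Optional, Tuple


class Edge(NamedTuple):
    evalue: float            # BLAST E-value of the pairwise alignment
    percent_identity: float  # percent identity over the alignment
    alignment_length: int    # alignment length in residues


def neg_log10(evalue: float) -> float:
    """-log10(E); an E-value reported as 0 is treated as infinitely significant."""
    return math.inf if evalue <= 0.0 else -math.log10(evalue)


def weakest_edge_summary(
    edges: List[Edge], threshold: float, window: float = 1.0
) -> Tuple[int, Optional[float], Optional[float]]:
    """Return (edges kept, median identity, median length) for the weakest kept edges.

    ASSUMPTION: "weakest" means within `window` log10 units of the threshold.
    """
    kept = [e for e in edges if e.evalue < threshold]
    cutoff = neg_log10(threshold) + window
    weakest = [e for e in kept if neg_log10(e.evalue) <= cutoff]
    if not weakest:
        return len(kept), None, None
    return (
        len(kept),
        median(e.percent_identity for e in weakest),
        median(e.alignment_length for e in weakest),
    )


# Toy edges, invented for illustration.
TOY_EDGES = [
    Edge(1e-80, 62.0, 300),
    Edge(1e-40, 35.0, 285),
    Edge(5e-34, 30.0, 280),
    Edge(2e-34, 31.0, 276),
    Edge(8e-34, 29.0, 282),
    Edge(1e-20, 24.0, 150),  # less significant than the threshold, so dropped
]
SUMMARY = weakest_edge_summary(TOY_EDGES, threshold=1e-33)
print(SUMMARY)
```

Reporting these medians alongside a network gives a reader a quick sense of how permissive the chosen threshold is.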
| | A. BLAST E-values (from pairwise alignments) | B. Organic layout | C. Neighbor-Joining (NJ) tree |
|---|---|---|---|
| B. Organic layout | R: 0.906±0.034, Z: 11.87, P: 8.04×10^−33 | | |
| C. Neighbor-Joining (NJ) tree | R: 0.758±0.034, Z: 9.91, P: 1.95×10^−23 | R: 0.712±0.034, Z: 9.43, P: 2.14×10^−21 | |
| D. Distances from multiple sequence alignment | R: 0.715±0.034, Z: 9.11, P: 4.14×10^−20 | R: 0.645±0.034, Z: 8.24, P: 8.47×10^−17 | R: 0.944±0.034, Z: 13.07, P: 2.29×10^−39 |
Pearson's correlations (R) and associated Z-scores (Z) and P-values (P) describing the similarity between the relative pairwise distances between 42 amine-binding GPCR domain sequences as assessed by (A) all shortest paths between −log10(BLAST E-values), (B) the shortest paths between sequences as displayed by a two-dimensional graph layout algorithm, (C) the shortest paths between sequences in a Neighbor-Joining tree, and (D) the relative pairwise distances calculated from a multiple sequence alignment. Additionally, pairwise BLAST E-values and the graph layout algorithm correspond to a network thresholded at an E-value of 1×10^−33. Note that the network layout (B) is a visual representation of the underlying distances in (A), while the tree (C) is a visual representation of the underlying distances in (D). A and D cannot be visualized exactly in fewer than N-1 dimensions.
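The kind of comparison summarized in this table, correlating two sets of pairwise distances over the same sequences, can be sketched as below. This is an illustration, not the authors' implementation. The text does not say how similarity scores are converted to edge lengths, so the toy edge lengths are made up. Floyd-Warshall is used here simply as one standard way to obtain all shortest paths, and the Z-scores and P-values in the table are not reproduced. All names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def all_shortest_paths(nodes: List[str], edge_len: Dict[Pair, float]) -> Dict[Pair, float]:
    """All-pairs shortest path lengths over an undirected graph (Floyd-Warshall)."""
    dist = {(a, b): (0.0 if a == b else math.inf) for a in nodes for b in nodes}
    for (a, b), w in edge_len.items():
        dist[a, b] = dist[b, a] = min(dist[a, b], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via = dist[i, k] + dist[k, j]
                if via < dist[i, j]:
                    dist[i, j] = via
    return dist


def pearson_between(nodes: List[str], d1: Dict[Pair, float], d2: Dict[Pair, float]) -> float:
    """Pearson's R over every unordered pair that has a finite distance in both sets."""
    pairs = [p for p in combinations(nodes, 2)
             if math.isfinite(d1[p]) and math.isfinite(d2[p])]
    xs = [d1[p] for p in pairs]
    ys = [d2[p] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy example: a four-node chain with invented edge lengths, and an invented
# two-dimensional layout of the same four nodes.
NODES = ["a", "b", "c", "d"]
GRAPH_DIST = all_shortest_paths(NODES, {("a", "b"): 1.0, ("b", "c"): 2.0, ("c", "d"): 1.0})
COORDS = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.5, 1.0), "d": (3.5, 1.0)}
LAYOUT_DIST = {(p, q): math.dist(COORDS[p], COORDS[q]) for p in NODES for q in NODES}
R = pearson_between(NODES, GRAPH_DIST, LAYOUT_DIST)
print(round(R, 3))
```

Only pairs with finite distances in both representations are compared, which matters for thresholded networks where some sequences may be disconnected.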
In order to assess the correspondence between a very large phylogenetic tree and sequence similarity networks, we used a dendrogram of the human kinome
Two ways of coloring the same network of 513 human kinase domains are shown. The network is thresholded at a BLAST E-value of 1×10^−25. The worst edges displayed correspond to a median of 29% identity over alignments of 260 residues. A. Network colored by kinase class. B. Network colored by the presence of a catalytic Lys in the “VAIK” motif: Each of the 513 sequences was aligned to a sequence model of the kinase domain, and the identity of the residue at the catalytic Lys position is mapped to the network. *Note that MAP2K1 and MAP2K2 registered a Lys to Arg substitution due to a sequence alignment error. The other labeled kinases truly do not contain a homologous catalytic K, but only the WNK kinases have been shown to have kinase activity. See Table II for statistics.
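Mapping the residue at a model position onto each node, as in panel B, amounts to a lookup through the alignment. The sketch below assumes each sequence is already available as a pair of gapped rows (model row and sequence row) of equal length. The function names, the toy rows, and the model position are invented for illustration and are not taken from the alignments used in this work.

```python
def residue_at_model_position(model_row: str, seq_row: str, model_pos: int) -> str:
    """Return the sequence character aligned to the 1-based model position.

    A '-' in `model_row` is an insertion in the sequence (not a model position);
    a returned '-' means the sequence has a deletion at that model position.
    """
    if len(model_row) != len(seq_row):
        raise ValueError("aligned rows must have the same length")
    seen = 0
    for model_char, seq_char in zip(model_row, seq_row):
        if model_char != "-":
            seen += 1
            if seen == model_pos:
                return seq_char
    raise IndexError("model position lies beyond the end of the alignment")


def color_label(residue: str) -> str:
    """Turn the looked-up residue into a node attribute for coloring."""
    if residue == "K":
        return "Lys"
    return "gap" if residue == "-" else "other:" + residue


# Toy alignments against a four-column "VAIK" model; position 4 is the Lys.
print(color_label(residue_at_model_position("VA-IK", "VAGIK", 4)))
print(color_label(residue_at_model_position("VAIK", "VAIR", 4)))
print(color_label(residue_at_model_position("VAIK", "VA--", 4)))
```

As the caption's note on MAP2K1 and MAP2K2 shows, a label produced this way is only as reliable as the underlying alignment, so unexpected substitutions deserve a manual check.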
| | A. BLAST E-values | B. Organic layout |
|---|---|---|
| B. Organic layout | R: 0.934±0.003, Z: 41.2, P: 0.0 | |
| C. Manning et al. 2002 human kinome tree | R: 0.683±0.003, Z: 39.5, P: 0.0 | R: 0.628±0.003, Z: 40.0, P: 0.0 |
Pearson's correlations (R) and associated Z-scores (Z) and P-values (P) describing the similarity between the relative pairwise distances between 419 human kinase domain sequences in common as assessed by (A) all shortest paths between −log10(BLAST E-values), (B) the shortest paths between sequences as displayed by a two-dimensional graph layout algorithm, and (C) the shortest paths between sequences in the phylogenetic tree published in Manning et al. 2002
Note that while there are many similarities between the interpretations that can be made from the information provided in a network and a tree, phylogenetic trees are based on an explicit evolutionary model that is missing from sequence similarity networks. Thus, networks are not an adequate alternative to a tree, as the interrelationships they depict cannot be used as a basis for inferring evolutionary history. Indeed, there is a fundamental difference between the network composed of nodes representing contemporary protein sequences that may be connected with cycles, and the acyclic Steiner tree with introduced ancestral nodes that can be used to describe a phylogenetic tree. Despite this, and particularly in the case of large networks with many edges, we have found anecdotally that the composition of many independent alignments as a graph projected into two dimensions enables a visual estimate of confidence in a displayed group-wise similarity relationship—a single edge representing a pairwise alignment at 22% identity may look like noise, but a large number of edges representing slightly different 22% identity alignments between different members of the same two discrete groups can be more convincing, particularly when there are known structural and functional relationships between the groups, as in the GPCR networks depicted in
A. 605 human Class A: Rhodopsin-like GPCR domains. This sequence set includes the 42 amine-binding sequences from Table II and
The structural relationships between different functional classes of GPCRs can be extremely distant. At the low stringency threshold at which inter-group relationships can be visualized using networks, many of the displayed edges represent poor alignments. In
One important application of sequence similarity networks is using them to form general functional hypotheses for sequences whose molecular functions are unknown. A typical protein superfamily sequence set contains a number of well-known families or characterized groups, as well as other groups that can be confidently classified to the superfamily but which are uncharacterized or for which the evidence for annotation with a more specific family label does not exist. In
Another feature accessible from the network representation is so basic that it is easy to overlook—networks enable the conversion of lists of labeled protein sequences to a visually intuitive display of the entire data set. Thus, even given the caveats, the network shown in
Not only do sequence similarity networks retain the basic clustering and topology information present in phylogenetic trees, but they may also be a better representation—for the purposes of developing hypotheses about protein family sequence and structural interrelationships—than phylogenetic trees. Whereas a phylogenetic tree requires the complexity of all of the pairwise relationships in a multiple sequence alignment to be projected down into one dimension, a sequence similarity network can show multiple neighbors for a given sequence. In so doing, the network can reveal sequences that may have sequence characteristics useful for linking divergent clusters in multiple alignments.
Additionally, it is not necessarily appropriate to include a sequence in a multiple sequence alignment that is firmly in the twilight zone of sequence similarity relative to most of the other sequences in the alignment
The context provided by the similarity network can be exploited in many ways. For example, the kinase networks shown in
In the course of considering the effect on network topology from using full-length sequences or only single domains, new groupings for the enoyl-CoA hydratase family were revealed, based on changes in domain architecture. (The enoyl-CoA hydratase family (ECH) is the constituent family for which the larger ECH superfamily was named.) Most proteins are composed of two or more domains, and the combination of multiple domains may modify the function of a multidomain protein relative to its single domain homologue
The displayed networks all describe the pairwise relationships between 1,170 sequences from the crotonase superfamily. A. Network colored by family annotation, involving full-length sequences, thresholded at an E-value of 1×10^−30. The worst edges displayed correspond to a median of 33% identity over alignments of 250 residues. B. The full-length network from A with nodes colored by sequence length and edges colored by alignment length. The same bifunctional enoyl-CoA hydratases (bECH) are marked with a dashed oval in B and C. C. Network colored by family annotation, involving just the crotonase domain, thresholded at 1×10^−29. The worst edges displayed correspond to a median of 38% identity over alignments of 180 residues. D. 17 selected edges from the network in A and B. In the left panel, for each pair of sequences participating in an alignment, the log E-value versus the HMM used to define the crotonase domain is shown for each sequence (the single domain ECH (sECH) is on the bottom, and the second member of the pair is on the top), and the log BLAST E-value for the alignment between the two is in the middle. Two example bECH and sECH sequences (not alignments) are shown at the bottom of the left and middle panels. In the middle panel, each amino acid in each sequence from the 17 alignments is colored according to whether it was aligned to the crotonase domain defined by the HMM, and/or was paired to the other sequence in the BLAST alignment used to define an edge. Locations of six of these edges are marked in the enlarged view of the network in A in the right panel. The locations of the example bECH and sECH sequences are marked in the right panel using stars. See
While network topology is not strongly affected by sequence similarity outside the domain of interest in the ECH and GPCR superfamilies, this may not be the case with all superfamilies. In practice, we have found that better resolution can be achieved using networks of full-length sequences, as the greater variation in lengths of alignment and corresponding similarity scores allows a more nuanced discrimination between different groups of proteins. Yet this comes at a risk of including relationships that can be mistakenly attributed to the domain of interest. If an additional domain in common happens to be more conserved than the domain of interest, unexpected edges will link groups that the investigator would expect to find distant from one another. Investigators should weigh these issues and consider their familiarity with the superfamily before interpreting a full-length sequence network in the absence of a comparable single domain network. A useful control we use is to generate networks of each domain in a multidomain set and contrast the results with the network for the full-length proteins. Here, mapping lengths onto the network visualization clearly indicates the existence of domain differences in the ECH family (
Exploring the domain differences in the enoyl-CoA hydratases by mapping species categories onto the network leads to observations that have not previously been reported. We discern three major groups of ECHs: bifunctional two-domain proteins (including an ECH domain) found in bacteria, metazoans, and plants; these are variously known as multifunctional enzyme MFE-1, peroxisomal bifunctional enzyme, and the alpha subunit of mitochondrial trifunctional protein
The displayed networks contain all 410 enoyl-CoA hydratases from the crotonase superfamily network in
We expect that the use of sequence similarity networks may soon become as common in laboratories as the use of multiple sequence alignments. As shown here, these networks can be used to display distances that are accurate from a mathematical perspective and that compare favorably with an accepted method for establishing molecular similarity, the phylogenetic tree. Sequence similarity networks reiterate known structural and functional relationships, and can be used to analyze very large data sets in a timely manner, allowing many different networks to be explored in the time required to generate a single phylogenetic tree of reasonable quality. However, we see the real promise of this technique as allowing a knowledgeable scientist to observe basic connections and clustering in a protein superfamily of interest in the context of orthogonal information. Thus, a good framework for visualizing networks performs well in recapitulating known group-wise connections and clustering. More critically, it should provide a clear view of all of the proteins in the dataset, and flexibility in mapping different features to the visual display so that large-scale and group-wise trends as well as outlier status can be discerned. The particular network layout algorithm used is not important as long as it adequately represents similarity; there are many ways a layout algorithm can be optimized to correspond more closely to some numerical ideal. Networks can be generated from protein distance data derived from many types of analyses, but for simplicity and because of the advantages of speed and the ability to use very large sets of proteins, we have used BLAST in this paper. Moreover, clustering of proteins can also be obtained in many ways. In this paper, we have used a simple method to underline the value of protein similarity networks when tagged with functional information.
While we argue that coming to a final conclusion based on a pairwise BLAST alignment is generally not supportable, visualization of sequence similarity networks provides—using even such a simple metric as BLAST—an environment for exploring complex protein data sets and the straightforward generation of hypotheses to be tested using more rigorous methods. The developers of Cytoscape are actively working on extending the application to facilitate analysis of sequence similarity networks
The human GPCR sequences and ligand-based annotations were extracted from the GPCR NaVa Database
The kinase sequences and annotations were drawn from the base set of 621 human kinase domains in Kinbase (available at
The crotonase superfamily sequences and annotations came from the 1,330 publicly available sequences in this superfamily in the Structure-Function Linkage Database
GPCRs: To remove duplicate and highly similar sequences, the 773 GPCR sequences were winnowed to 766 by filtering to a maximum of 99% identity using cd-hit
Kinases: Beginning with the 621 human kinase domains, all sequences labeled as pseudogenes were removed, leaving 517 domains. The 517 domains were then filtered to a maximum of 99% identity as described above, leaving 513 sequences.
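The two filtering steps described here (dropping annotated pseudogenes, then removing near-duplicates above 99% identity) can be illustrated with a simplified stand-in. This sketch is not cd-hit: it uses a greedy longest-first pass and a deliberately naive, ungapped identity measure. The sequence names, annotations, and helper functions are invented for the example.

```python
from typing import Callable, Dict, List


def naive_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (ungapped, toy measure)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / shorter


def filter_redundant(
    seqs: Dict[str, str],
    max_identity: float = 0.99,
    identity: Callable[[str, str], float] = naive_identity,
) -> List[str]:
    """Greedy filter: visit longest sequences first, keep one if no kept
    representative exceeds `max_identity` to it."""
    kept: List[str] = []
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        if all(identity(seqs[name], seqs[rep]) <= max_identity for rep in kept):
            kept.append(name)
    return kept


# Toy input, invented for illustration.
BASE = "ACDEFGHIKL" * 10
SEQS = {
    "k1": BASE,
    "k2": BASE,                 # exact duplicate of k1, removed by the filter
    "k3": "WWWWW" + BASE[5:],   # 95% identical to k1, kept
    "k4": "MNPQRSTVWY" * 10,    # annotated as a pseudogene below
}
ANNOTATION = {"k4": "pseudogene"}

# Step 1: drop pseudogenes. Step 2: filter to at most 99% identity.
NON_PSEUDO = {n: s for n, s in SEQS.items() if ANNOTATION.get(n) != "pseudogene"}
KEPT = filter_redundant(NON_PSEUDO, max_identity=0.99)
print(KEPT)
```

A real run would use an alignment-based identity; the `identity` parameter is left pluggable for that reason.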
Crotonases: The initial 1,330 crotonase superfamily sequences were filtered to a maximum of 99% identity as described above, leaving 1,170 sequences. In order to define a general crotonase domain, the best-resolution structure from each applicable SFLD crotonase family
The sequence similarity networks consist of a collection of edges corresponding to pairwise relationships that are better than a defined threshold. For this work, pairwise relationships correspond to BLAST alignments associated with an E-value
Additionally, BLAST E-values and scores are not symmetric—for a given comparison between two sequences, the alignment, score, and E-value can vary depending on which sequence is used as the query. In tests we performed to adjudicate this issue, we found that 74% of the comparisons in a large network have “backward” and “forward” E-values within 5 log units—regarding the other 26%, the median average log E-values begin at −46.5 and decrease as the score asymmetry increases; for our data set, alignments corresponding to log E-values of −46.5 had a median percent identity of 35% over 290 amino acids (see
To aid in evaluating the networks, we create quartile plots of alignment percent identity, alignment length, and edge count versus edges binned by associated E-value (see
Sequence similarity networks in this work are visualized using the Organic layout
The amine-binding GPCR tree was constructed from all 42 sequences in the “Amine” class (a subclass within the Class A GPCRs, which are themselves a subclass within the 766 human GPCR domain data set). The 51-sequence kinase tree included each sequence from the 513 human kinase domain sequence set that was annotated as an STE or WNK class kinase. Both trees were constructed using the same protocol: The sequences were aligned with MUSCLE
The ECH trees (
All trees were visualized in Dendroscope
The central quantitative analysis in this work is the direct comparison of pairwise distance matrices between N-1 dimensional BLAST networks, two-dimensional displayed distances calculated by the Cytoscape 2.6
The approach for comparing the above distance matrices and calculating the significance of their correlations is taken directly from Goh et al. 2000
To evaluate how much sequence similarity networks change when some sequences are left out of the network, we removed 20% of the sequences at random from the Class A GPCR sequence set, and calculated Pearson's correlation between corresponding displayed distances based on the full 605-sequence set versus the 80% (484 sequences) set, as well as the underlying BLAST E-values. (The same Class A GPCR sequences are featured in Results Section III.) We used an E-value threshold of 1×10^−11 to define the network. Derived statistics are based on ten replicates.
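The resampling procedure above can be outlined as follows. This is a sketch, not the analysis code. The re-layout of each 80% subset is represented by a caller-supplied function (`layout_subset`, hypothetical), because reproducing a graph layout is outside the scope of the example. The toy "layout" below simply rescales the full distances and adds small jitter.

```python
import math
import random
from itertools import combinations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pair = Tuple[int, int]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def subsample(ids: Sequence[int], fraction: float = 0.8,
              rng: Optional[random.Random] = None) -> List[int]:
    """Keep a random `fraction` of the identifiers (sorted for stable pair keys)."""
    rng = rng if rng is not None else random.Random(0)
    return sorted(rng.sample(list(ids), round(len(ids) * fraction)))


def missing_data_correlations(
    ids: Sequence[int],
    full_dist: Dict[Pair, float],
    layout_subset: Callable[[List[int]], Dict[Pair, float]],
    replicates: int = 10,
    seed: int = 0,
) -> List[float]:
    """Correlate full-network distances with subset distances over shared pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        kept = subsample(ids, 0.8, rng)
        sub_dist = layout_subset(kept)
        pairs = list(combinations(kept, 2))
        results.append(pearson([full_dist[p] for p in pairs],
                               [sub_dist[p] for p in pairs]))
    return results


# Toy stand-in for a laid-out network: 30 random points in the unit square.
_rng = random.Random(42)
IDS = list(range(30))
COORDS = {i: (_rng.random(), _rng.random()) for i in IDS}
FULL = {(a, b): math.dist(COORDS[a], COORDS[b]) for a, b in combinations(IDS, 2)}


def toy_layout(kept: List[int]) -> Dict[Pair, float]:
    # HYPOTHETICAL re-layout: same geometry at a different scale, plus jitter.
    jitter = random.Random(1)
    return {(a, b): 2.0 * FULL[a, b] + jitter.uniform(-0.01, 0.01)
            for a, b in combinations(kept, 2)}


CORRS = missing_data_correlations(IDS, FULL, toy_layout)
print(len(subsample(range(605))), round(min(CORRS), 3))
```

Because Pearson's correlation ignores overall scale, the comparison is insensitive to a subset layout being drawn larger or smaller than the full one.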
Each of the 513 human kinase domain sequences was aligned to either the PFAM Pkinase or Pkinase_Tyr family HMM
NCBI maintains a hierarchical taxonomy database
All data files generated in the analysis, including sequence files, HMMs, and network files, are available online at
Summary of network statistics: Correlation between Organic laid-out network distances and the mathematically ideal BLAST E-value distances
(0.04 MB DOC)
Comparison of mathematically ideal and displayed pairwise network distances between 51 human STE and WNK kinases
(0.05 MB DOC)
Comparison of mathematically ideal and displayed pairwise distances between networks of the crotonase superfamily, using either full-length sequences or just the crotonase domain
(0.03 MB DOC)
Network distances are similar between the Organic and Cytoscape force-directed layout weighted by E-value
(2.13 MB PDF)
Comparison of network layout and clustering with BLASTCLUST
(3.85 MB PDF)
Graphic showing how network topology is affected by missing data. The correlation is high between the topology of the Class A GPCR network and networks with 20% of the sequences removed at random.
(0.75 MB PDF)
Comparison of trees and networks: STE and WNK kinases
(0.38 MB PDF)
The archaeal bECH ECH domain is more similar to the sECH domain than the non-archaeal bECH ECH domain
(1.11 MB PDF)
Asymmetry in BLAST E-values: How large is the difference between the E-values calculated between sequence pair A,B when A is used as query, or B is used as query?
(0.43 MB PDF)
Example percent identity and length of alignment quartile plots
(0.22 MB PDF)
We thank Shoshana Brown for assistance with the SFLD crotonase dataset, and Lindsay Reynolds for inspiring some of this analysis.