Stylus: A System for Evolutionary Experimentation Based on a Protein/Proteome Model with Non-Arbitrary Functional Constraints
Biologic Institute, Redmond, Washington, United States of America
Abstract
The study of protein evolution is complicated by the vast size of protein sequence space, the huge number of possible protein folds, and the extraordinary complexity of the causal relationships between protein sequence, structure, and function. Much simpler model constructs may therefore provide an attractive complement to experimental studies in this area. Lattice models, which have long been useful in studies of protein folding, have found increasing use here. However, while these models incorporate actual sequences and structures (albeit non-biological ones), they incorporate no actual functions—relying instead on largely arbitrary structural criteria as a proxy for function. In view of the central importance of function to evolution, and the impossibility of incorporating real functional constraints without real function, it is important that protein-like models be developed around real structure–function relationships. Here we describe such a model and introduce open-source software that implements it. The model is based on the structure–function relationship in written language, where structures are two-dimensional ink paths and functions are the meanings that result when these paths form legible characters. To capture something like the hierarchical complexity of protein structure, we use the traditional characters of Chinese origin. Twenty coplanar vectors, encoded by base triplets, act like amino acids in building the character forms. This vector-world model captures many aspects of real proteins, including life-size sequences, a life-size structural repertoire, a realistic genetic code, secondary, tertiary, and quaternary structure, structural domains and motifs, operon-like genetic structures, and layered functional complexity up to a level resembling bacterial genomes and proteomes. Stylus is a full-featured implementation of the vector world for Unix systems. 
To demonstrate the utility of Stylus, we generated a sample set of homologous vector proteins by evolving successive lines from a single starting gene. These homologues show sequence and structure divergence resembling those of natural homologues in many respects, suggesting that the system may be sufficiently life-like for informative comparison to biology.
Citation: Axe DD, Dixon BW, Lu P (2008) Stylus: A System for Evolutionary Experimentation Based on a Protein/Proteome Model with Non-Arbitrary Functional Constraints. PLoS ONE 3(6): e2246. doi:10.1371/journal.pone.0002246
Editor: Konrad Scheffler, University of Cape Town, South Africa
Received: January 9, 2008; Accepted: April 15, 2008; Published: June 4, 2008
Copyright: © 2008 Axe et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The work was paid for by general research funds of Biologic Institute.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: daxe@biologicinstitute.org
Introduction
Because of their simplicity, lattice polymer models (where structures consist of chains of connected beads occupying neighboring positions on a two- or three-dimensional lattice) have become attractive artificial systems for studying certain general properties of structure-forming polymers. The study of protein evolution, particularly the origin of protein folds, is one challenging area where lattice models have been employed [1]–[3]. Although these model constructs are unrealistic in many respects, they do provide computationally tractable sequence spaces that can be mapped onto structure spaces with specified mapping rules. As such, they form a class of systems that can be studied in their own right, providing insights that (with due care) will continue to advance our understanding of real biological problems [1].
One such insight is that protein-like models (in contrast to RNA models) tend to show sparse connectivity between regions of sequence space that encode different structures [1]. In other words, stepwise paths through sequence space that accomplish a structural transformation without passing through unstructured intermediates appear to be rare. This clearly fits expectations for real proteins, where reorganization of core structure would seem to require complete loss of structure (and therefore function) along the way [4]. It also fits experimental observations, which show that the expected deterioration is common not only for transitions between different folds [5] but also, more surprisingly, for transitions between different sequences encoding the same fold [6].
What it fails to fit well, at first glance anyway, is the pattern of structural similarities evident in natural proteins. If there is a substantial probabilistic barrier to structural innovation in the protein world, then we might expect the evolutionary process to make do without it. By this view, the protein world ought to consist of one structural archetype put to many different uses, each involving modest alteration of peripheral structure but no major reorganization of the fold. Subsets of the natural proteins show precisely this, but the whole picture is strikingly different. Here we find a surprising preponderance of “orphan” folds—folds that each occupy their own patch of structure space, well removed from everything else [7]. Although models have so far failed to explain how orphan folds can be so common [3], they have offered explanations for substantial structural radiation.
However, the models purporting to explain structural radiation generally use simplistic representations of selectable function. As Zeldovich et al. point out, many evolutionary models lack any causal connection at all between sequence and function [8]. But even when causal models are used, they tend to be simplistic. Hirst has discussed the various aspects of structural soundness (e.g., folding stability or speed) that are singled out as proxies for selectable function [9]. Recognizing the distinction between structural soundness and functional utility, he required lattice structures to form a pocket (analogous to an active-site cleft) in order to be deemed functional [9]. This was certainly a step in the right direction, but the underlying problem remains: While these properties are all necessary for the function of real proteins, they are not sufficient. If they were, one good structure would suffice, whereas in reality we see not only a great variety of structures but also a strong connection between this variety and the great variety of specific functions they perform.
Oversimplification of function tends to obscure this fundamental connection. As an example, consider the recent lattice study of Zeldovich et al., which ties a genome's fitness to the lowest stability of its encoded proteins [8]. Their model enables a population carrying the gene for a single lattice structure to diversify to the point where evolved structures span the entire space of possibilities. But it achieves this not only by using stability as a proxy for function, but also by dispensing with the notion of a stability threshold—a minimal stability, below which structures are deemed non-functional [8]. In the end, structure space is freely explored here because it is entropically favorable for it to be explored, making structural variety an entropic artifact rather than a functional necessity. Because one good structure really does suffice in such a world, it seems unlike the real world, where “the great functional capacity and importance of proteins largely stems from the remarkable ability of these polymers to adopt distinct 3-dimensional structures” [3].
Can a new model be framed so as to capture this fundamental aspect of biology? A key step in this direction may be to base it on real function rather than a definitional substitute for function. Because real functions involve both specificity and real constraints, this would guarantee a level of functional realism that is not otherwise easily achieved. This principle is demonstrated by artificial-life simulations, like Avida [10], where computational tasks must be performed in order to gain a selective advantage. But because these tasks are performed by instructions rather than structures, Avida does not readily lend itself to protein studies.
Despite their limitations, though, all of the models discussed have strengths to offer. Furthermore, these strengths suggest a way to overcome the primary limitations. In particular, a model that ties real functions to polymer-like structures would have the potential to achieve a new level of biological realism. By incorporating real, specific functions it would be grounded in real functional constraints, and by basing these functions on polymer-like structures it would have a clear connection to real proteins. Here we describe such a model and introduce an open-source computational system that implements it, providing a complete environment for evolutionary experimentation on model genes that resemble bacterial genes.
Results
Model
Core Analogy.
Human language shares several interesting properties with biology. Both use complex structures to perform complex functions, the complexity in both cases being hierarchical—high-level functions and structures being built from those on a succession of lower levels. And while they clearly operate within functional constraints, neither has the highly rigid structure of formal systems like computing languages. Rather, they are characterized by an abundance of rules, virtually none of which is absolute. Add to this the fact that both biology and language have been shaped by real populations with real complex histories, and the similarities are seen to be quite extensive.
A more specific analogy between proteins and written language has long been recognized [11]–[13]. A common approach here is to compare alphabetic strings to amino-acid chains, the first having the capacity for linguistic meaning and the second for biological function. But despite the obvious appeal of this comparison, important dissimilarities exist. Perhaps the most striking of these is seen in the very different effects of cumulative sequence change. When protein-coding genes are subjected to occasional mutations over long time periods, they manage to undergo substantial sequence change while maintaining their original function. Alphabetic sequences, on the other hand, are rapidly degraded by typographic substitutions, leading eventually to complete loss of function.
This dissimilarity appears to stem from a difference in the underlying causal relationships. In the protein world, functions are a direct consequence of physical structures. This, in combination with a highly many-to-one mapping of protein sequences to structures, allows sequences to change continually while meeting the structural constraints imposed by the original function (a phenomenon known as neutral drift [14]). In contrast, alphabetic strings function as raw sequences, with no physical structure mediating between them and their function (they are, of course, recorded and conveyed through physical media, but the only requirement for achieving this is accurate representation of sequence). Although alphabetic sequences show a many-to-one mapping to function, it is “many” in a sparse and highly discontinuous sense. Because these sequences are directly constrained by the rules of linguistic function (grammar, vocabulary and spelling) they cannot withstand the continual step-wise change seen in proteins.
The importance of structure in the protein world suggests that a structure-based system of writing would provide a better analogy. Many of the Asian languages use non-alphabetic writing based on the Chinese characters that became standardized during the Han Dynasty (206 BC–220 AD). Like alphabetic letters, the Han characters are recognized by their distinct structural forms. But unlike letters, the characters have word-like meanings as stand alone entities. In these written languages, then, basic linguistic meaning is rooted in structure rather than sequence. This suggests a new way of framing the linguistic analogy to proteins. Instead of viewing the letters in alphabetic strings as being analogous to the amino-acid residues in a protein chain, the new approach views the Han characters as being analogous to whole protein folds (Figure 1).
Figure 1. Structural analogy between Han characters and protein folds.
This two-part character (identified by its hexadecimal Unicode number, U+8C58) is reminiscent of two-part protein folds like the one shown (PDB 1CQD).
doi:10.1371/journal.pone.0002246.g001
This structural connection carries a number of other similarities with it. Some of these will be mentioned briefly here, with more detailed discussion to follow. First, while the previous analogy provided approximate correspondence between the number of alphabetic letters and the number of amino acids, the new one provides approximate correspondence between the number of Han characters and the number of distinct protein folds or functions in the biosphere. The standard enzyme classification scheme, for example, covers just over four thousand known enzyme functions (http://www.enzyme-database.org/stats.php), which depend upon a few thousand family-level structures (http://scop.mrc-lmb.cam.ac.uk/scop/count.html#scop-1.71). By way of comparison, the Unihan database (http://www.unicode.org/charts/unihan.html) indicates that roughly five thousand Han characters find use in a single language (based on the number of characters with kFrequency tags, indicating use in traditional Chinese USENET postings; http://www.unicode.org/Public/UNIDATA/Unihan.html). Visual discrimination of so many characters requires structural complexity beyond that of alphabetic characters, approaching the complexity of protein folds in some respects. Figure 1 illustrates the rough similarity in the number of parts (the line or curve segments that form strokes compared to the elements of secondary structure) that compose whole characters and whole proteins. Finally, both worlds exhibit hierarchical structure, meaning that complex forms are built from successively simpler forms (see Figure 2), most of which find extensive reuse in a variety of combinations for a variety of functional ends.
Figure 2. Hierarchical structure of Han characters.
Single strokes, like that shown at the bottom, are combined to form successively more complex structures (shown as ascending layers). Characters range in complexity from a single stroke to dozens of strokes.
doi:10.1371/journal.pone.0002246.g002
Building on the Analogy.
The new model is based on the real relationship between structure and function exhibited by the Han characters. These characters are not intrinsically polymer-like, but since they are written by moving a pen tip along a path, the extension to a polymer chain model is straightforward.
Genetic Code. Although the process of writing involves three-dimensional paths, written forms are more like two-dimensional paths (part inked and part invisible). Because of this, we have restricted our model to two dimensions. The geometric simplicity of two-dimensional paths calls for a very simple suite of monomeric building blocks. For this purpose we use twenty coplanar vectors of three possible lengths, aligning with the eight compass directions (Figure 3A). The numerical equivalence to the set of protein-forming amino acids allows a genetic code to be defined for the vector world, whereby vector sequences are encoded by base triplets in much the same way that genes encode amino-acid sequences (Figure 3B).
Figure 3. Monomers and genetic code for construction of model proteins.
A) The set of vector monomers, named according to compass direction and length (e.g., Nem indicating a northeast vector of medium length). To ensure that vector addition produces different results with different vector combinations, small vectors are of length 1, medium vectors of length e^(1/2) (≈1.65), and long vectors of length e (≈2.72). B) A standard genetic code for specifying the monomers with nucleotide triplets. Like the natural code [15], this code incorporates several features that reduce the impact of point mutations. These include extensive use of third-position degeneracy, strong correlation of second position with a key physical property (direction), and underrepresentation of vectors that would be most disruptive as substitutes (long vectors).
doi:10.1371/journal.pone.0002246.g003
Genes. The artificial genes used in the vector world look just like textual representations of bacterial genes: they begin with an ATG start codon, proceed through any number of vector-encoding codons (the same 61 sense codons used in biology), and terminate with one of the three biological termination codons (TAA, TAG, or TGA).
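The gene layout just described can be checked mechanically. The following is a minimal sketch, not the Stylus implementation: the function name `is_well_formed_gene` is hypothetical, and it assumes that a gene with no codons between the start and stop codons is acceptable ("any number" taken to include zero) and that an in-frame stop codon before the end makes the gene malformed.

```python
# Illustrative check of the gene layout described in the text.
# Names and edge-case choices here are assumptions, not part of Stylus.

STOP_CODONS = {"TAA", "TAG", "TGA"}
BASES = set("ACGT")


def split_codons(gene):
    """Split a sequence into consecutive base triplets."""
    return [gene[i:i + 3] for i in range(0, len(gene), 3)]


def is_well_formed_gene(gene):
    """Return True if gene is ATG + sense codons + exactly one terminal stop codon."""
    gene = gene.upper()
    # Need at least a start and a stop codon, and a whole number of triplets.
    if len(gene) < 6 or len(gene) % 3 != 0:
        return False
    if any(base not in BASES for base in gene):
        return False
    codons = split_codons(gene)
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Every codon before the last must be a sense codon (no early stop).
    return all(codon not in STOP_CODONS for codon in codons[1:-1])
```

For example, `is_well_formed_gene("ATGGCTTGA")` accepts a gene with one internal codon, while a sequence with an in-frame TAA before its end is rejected.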
Primary structure: Vector sequences analogous to amino-acid sequences. Just as real protein chains are built by addition of amino acids at the C-terminus, so vector proteins are built by joining the tail of the newest vector to the head of the previous one. In both worlds the gene product is a chain of linked monomers, each internal monomer having one point where it was added to the growing chain and another where the next addition was made (Figure 4A,B).
Figure 4. Parallels between vector-world and real-world protein synthesis.
Steps are illustrated for a vector protein (U+8C58) on the left, with analogous aspects of a real protein (PDB 1CQD) on the right. A) Codons in an open reading frame specify monomers (vectors or amino acids) that may form regular local structure (green) or irregular local structure (grey). In the vector world a simple rule determines which is the case: A vector becomes part of regular structure if and only if it forms a coherent vector triplet (indicated by green tiles below the sequence; see text). B) Vectors are joined to form paths with head and tail termini, just as amino acids are joined to form chains with amino and carboxyl termini (right panel derived from public domain images by Yassine Mrabet). C) Vector proteins consist of strokes (formed by runs of coherent vectors) joined by moves (formed by runs of incoherent vectors), in much the same way that real proteins consist of units of secondary structure joined by turns or loops. D) Final working forms, highlighting the segments shown above.
doi:10.1371/journal.pone.0002246.g004
Secondary Structure: Coherent path segments analogous to regular structure. Folded protein chains consist of segments with regular backbone structure (primarily α helix or β strand conformations) connected either by turns or by segments with irregular structure (loops). For a vector protein to form a written character, it must likewise consist of segments of two types: those forming strokes, and those forming moves between strokes. The rule used to differentiate these resembles the distinction between regular and irregular backbone structure in proteins, in that both depend only on local chain conformation. In the protein world, secondary structure is indicated by a succession of residues with dihedral angles characteristic of either α helices or β strands. As shown in Figure 4, whenever three consecutive vectors in a vector protein have directions spanning an angle of 90 degrees or less (meaning the compass directions lie within a quarter of the circle) they are shown as visible line segments in the working form of the protein (i.e., the ready-to-read representation, as in Figure 4D–left). Portions of a vector protein that do not meet this condition are not shown in the working form, thereby allowing drawn strokes to be joined by undrawn moves between strokes. Because the 90-degree condition amounts to a test of local directional coherence, the terms coherent and incoherent are applied to vectors that pass or fail this condition, respectively.
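The 90-degree coherence rule can be sketched in a few lines. This is one reading of the rule rather than the Stylus code: it assumes a vector counts as coherent if it belongs to at least one window of three consecutive vectors whose compass directions fit within a quarter circle, and it ignores vector length. The names `COMPASS`, `span_steps`, and `coherent_flags` are hypothetical.

```python
# Sketch of the local coherence test, using compass directions only.
# Each step between neighbouring compass points is 45 degrees.

COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]  # clockwise order


def span_steps(directions):
    """Width, in 45-degree steps, of the smallest arc containing all directions."""
    idx = [COMPASS.index(d) for d in directions]
    # Try each direction as the arc's starting edge and measure clockwise.
    return min(max((i - start) % 8 for i in idx) for start in idx)


def coherent_flags(directions):
    """Flag each vector that lies in some coherent triplet (span <= 90 degrees)."""
    flags = [False] * len(directions)
    for i in range(len(directions) - 2):
        if span_steps(directions[i:i + 3]) <= 2:  # 2 steps = 90 degrees
            flags[i] = flags[i + 1] = flags[i + 2] = True
    return flags
```

Under this reading, runs of `True` flags correspond to strokes and runs of `False` flags to the undrawn moves between them.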
Dimensionality: Layered 2D analogous to 3D. Real protein structures are three dimensional, whereas the vectors used to build vector proteins lie within a single plane. Still, clear visualization of the constituent vectors in a vector protein typically calls for enhanced representation in three dimensions. A useful way to produce pseudo-3D representations is to preserve the planar character of each stroke while expanding moves by adding a constant lift to every incoherent vector. As shown in Figure 5, this effectively maximizes visibility by stacking the strokes on layered planes spanned by moves. This approach will be used to visualize complete vector chains with the understanding that the working form (2D with incoherent vectors invisible) provides the connection to function.
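A minimal sketch of the layered representation follows, assuming each vector is given as a planar (dx, dy) step together with a flag saying whether it is coherent. The function name `layered_path` and the default lift value are illustrative choices, not taken from Stylus.

```python
# Sketch of the pseudo-3D layering: strokes stay planar, moves climb.
# The lift value is arbitrary and chosen only for illustration.

def layered_path(steps, coherent, lift=1.0):
    """Return 3D points along the chain, raising z by `lift` for each incoherent vector."""
    x = y = z = 0.0
    points = [(x, y, z)]
    for (dx, dy), is_coherent in zip(steps, coherent):
        x += dx
        y += dy
        if not is_coherent:
            z += lift  # moves carry the path up to the next layer
        points.append((x, y, z))
    return points
```

Because coherent vectors leave z unchanged, every stroke lies in a single horizontal plane, and successive strokes are separated vertically by the moves between them.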
Figure 5. Layered 2D representation of vector proteins.
Strokes (green) are placed on successively higher planes by rendering moves (blue) with a vertical component added to each vector.
doi:10.1371/journal.pone.0002246.g005
Tertiary structure: Vector paths analogous to backbone structures. Protein tertiary structure is characterized not only by the spatial arrangement of secondary structure elements but also by topology, that is, how these elements are connected. Figure 6A illustrates this with two four-strand β sheets. Although the two sheets differ in geometric details like strand length and curvature, the color patterns highlight a more fundamental topological difference: the strands are ordered differently along the protein chains. Another key aspect of tertiary structure pertaining to sheets is strand direction, which may be parallel (i.e., uniform, as in this example), antiparallel, or mixed. All of these topological aspects of tertiary structure (arrangement, direction, and connectivity) have direct parallels in the vector world. For example, Figure 6B shows two vector proteins that both arrange their strokes in the form of the character U+5DDE, but they do so by means of different stroke directions and orders. Like the alternative sheet structures of Figure 6A, these vector proteins have fundamentally different tertiary structures.
Figure 6. Chain topology in real and vector proteins.
A) Sheet regions of 1VHR (left) and 1D1Q (right) with color running from blue to red in the amino-to-carboxyl direction. B) Vector proteins that perform the function of U+5DDE by means of different topologies, colored blue to red in the tail-to-head direction.
How many distinct tertiary structures are possible? No clear way of answering this exists for real proteins, though it has been suggested that far more are possible than have been put to use in the biosphere [16]. Of all possible structures, some fraction would be capable of performing the biological functions of the natural proteins. Even if this fraction is small, it may include a great many more folds than the natural ones.
The relative simplicity of the vector world enables some of these numbers to be calculated. The number of fundamentally different ways for a vector protein to perform the function of U+5DDE, for example, is 46,080 (the number of ways to order the 6 strokes, multiplied by the number of ways to vary the direction, up or down, through a specified stroke order). For a character with n strokes, the number of alternatives is 2^n·n!, which grows very rapidly as n increases: 3840 alternatives for 5 strokes, a million-fold more for 10 strokes, and ten-million-fold more again for 15. Considering that 9 or 10 strokes is a rough average for the set of characters in common use (9-stroke median, 10-stroke mean, based on the set of characters assigned USENET frequencies of 1, 2, or 3 in the Unihan database; http://www.unicode.org/Public/UNIDATA/Unihan.html), it is clear that the number of distinct vector folds that perform the function of any Han character vastly exceeds the number of characters.
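The counts quoted above follow from multiplying the n! stroke orders by the 2^n direction choices. A short sketch (the function name `stroke_topologies` is ours) reproduces the figures and the approximate fold increases:

```python
# Number of distinct stroke orderings and directions for an n-stroke character:
# n! orders times 2**n direction choices (each stroke drawn one way or the other).

from math import factorial


def stroke_topologies(n):
    """Return 2**n * n! for a character with n strokes."""
    return 2 ** n * factorial(n)


if __name__ == "__main__":
    for n in (5, 6, 10, 15):
        print(n, stroke_topologies(n))
    # Fold increases between the stroke counts discussed in the text.
    print(stroke_topologies(10) // stroke_topologies(5))
    print(stroke_topologies(15) // stroke_topologies(10))
```

The two ratios evaluate to 967,680 and 11,531,520, matching the "million-fold" and "ten-million-fold" descriptions as order-of-magnitude statements.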
Fold organization: Vector-protein domains analogous to protein domains. Real proteins with more than about 150 amino-acid residues tend to fold with secondary structure grouped into two or more regions. In some cases it appears that these regions correspond to folding domains—portions of the protein chain that fold as independent units [17]. Sequence and structure comparisons across diverse protein families likewise suggest that proteins are composed of multiple parts. A domain-sized part of one protein is often found to have counterparts in other structural contexts, suggesting that structural and functional modularity have enabled evolutionary recombination of parts [18], [19]. An example of this is the NAD-binding domain, shown in two of its structural contexts in Figure 7A.
Figure 7. Domains as sub-structures with sub-functions.
A) Two proteins that use similar NAD-binding domains (orange). Left: α-glucosidase monomer from Thermotoga maritima (PDB 1OBB). Right: L-lactate dehydrogenase monomer from Bacillus stearothermophilus (PDB 1LDN). B) Two vector proteins that use similar domains (purple) as described in text.
Because the Han characters have their own evolutionary history, with structural and functional modularity playing a major role, the vector world described here inherits these features. To retain these historical characteristics, the vector world is based on the traditional character forms used in Hong Kong and Taiwan (simplified versions of many of these characters being used in China and elsewhere). Figure 7B shows two vector proteins (functioning as U+8C58 and U+8C5D) that share a group of strokes. Like numerous other groups, this one performs a sub-function that appears in many structural combinations, making it akin to a protein domain. In this case the sub-function is that of U+8C55, which means pig. As is often the case for proteins, the composite functions show similarity that derives from the shared structural component: U+8C58 means small pig, and U+8C5D means sow.
Both worlds show considerable variation in how domains fit together to form multi-domain structures. For example, one of the NAD-binding domains (Figure 7A, left) is considerably more entangled with its complementing domain than the other, implying a more complex interface between the domains. Although both depicted vector proteins have simple left–right domain partitioning, Han characters often show more complex arrangements. Examples of this involving U+8C55 include U+4747, U+8C61, and U+8C73.
Quaternary structure: multi-character words analogous to multimeric proteins. Most proteins perform their biological functions as part of protein complexes, which involve either identical protein molecules or different kinds bound together in specific and often symmetrical arrangements (http://www.3Dcomplex.org). Written Chinese provides an analogy here as well. Although the Han characters all originally functioned as stand-alone words, the number of concepts needing words has increased dramatically since the character set became effectively fixed. Instead of inventing new characters, the solution was to combine existing characters to form multi-character words, which are now common. These words are like multi-protein complexes in that their function requires correct arrangement of two or more parts. However, while protein complexes are compound structures, multi-character words are separate structures arranged sequentially. The next section explains how this is implemented in the vector world and considers the implications for functional constraints.
High-level functions: From sentences to texts, and operons to proteomes. In both biology and language, the jump from elementary function to useful function brings with it a new level of complexity. Words are elementary semantic units, in that meanings are attached to symbols starting at the word level. But language only becomes useful for communication when word-level meanings are combined to convey more complex meanings. Similarly, although proteins and protein complexes perform low-level functions of biological relevance, organismal capabilities—from survival-enhancing phenotypes all the way up to survival itself—require the coordinated combination of many such functions. Ultimately whole proteomes are coordinated in this way.
In bacterial genomes, the first level of coordination is often achieved by arranging genes in co-regulated blocks called operons (Figure 8A). While there is obvious similarity between genes arranged to produce operon-level functions and words arranged to produce sentence-level functions, gene order appears to be less critical to genome function than word order (syntax) is to linguistic function. If imitation of the protein world were the main objective, the model could be altered to resolve this dissimilarity. But because incorporation of real function is the priority, our approach is instead to allow the vector world to have the properties it naturally inherits from its real linguistic basis. High-level functions in this world are therefore encoded by arranging genes according to the rules of syntax (Figure 8B).
Figure 8. The operon-like structure of vector-world genes encoding a sentence.
Gene names shown in white, with functional notation above or below. A) The genetic structure of the histidine operon of Escherichia coli (adapted from EcoCyc, http://ecocyc.org). B) The genetic structure of a vector-world gene suite encoding a sentence-level function (see text). Genes are named according to the Unicode number of their function.
doi:10.1371/journal.pone.0002246.g008
Although these rules result in tighter gene-order constraints in the vector world, this appears to be offset by the absence of two substantial protein constraints. One of these is the requirement, discussed above, for many proteins to form specific multi-protein complexes. Some idea of the difficulty of achieving one specific pair-wise association in these complexes can be had by estimating the number of non-productive alternatives that must be avoided. Because that number is quite large (being the number of different surfaces of any kind that compete for interaction), it can easily exceed the number of genes in a bacterial genome. Consequently, even highly stringent constraints on gene order in the vector world are apt to be less restrictive than the constraints of quaternary structure in the protein world.
The second missing constraint in the vector world has to do with specificity of function. Figure 9 shows two proteins that are considered to have the same structure for the purposes of structural classification (e.g., the SCOP classification places them in the same family: http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.d.jc.b.f.html). But structural classification necessarily ignores details of structure, focusing instead on secondary structure content and arrangement, and overall chain topology. At the atomic level of active-site structure, where function is determined, these proteins differ decisively. Neither will substitute for the other, and no simple change of just a few amino acids appears to be capable of converting one function to the other (A. Gauger and D. Axe—manuscript in preparation).
Figure 9. Functional specificity of real proteins depends upon atomic-level details.
The products of the bioF and kbl genes of E. coli are virtually indistinguishable at the fold level, but the structural differences produce different functions. Left: BioF monomer (PDB 1DJE), which functions as a dimer in biotin biosynthesis. Right: Kbl monomer (1FC4), which functions as a dimer in threonine degradation.
doi:10.1371/journal.pone.0002246.g009
Written language behaves differently, as Figure 8B illustrates. The depicted group of genes encodes vector proteins that mean: My fish has eaten your fish. Notice that the final two gene functions are identical to the second and third. These character pairs are a possessive suffix followed by the symbol for fish, indicating in both instances that the fish referred to belongs to the person just mentioned. The symbols are of course completely interchangeable, but the things they refer to (the fish) are not. In other words, the interpretations of the two instances of the fish symbol in this sentence differ, even though the symbols are identical. The different meanings result not from structural differences but from syntax, that is, from the different contexts in which the symbols appear. So, what atomic-level structure does for real proteins (provide specific function), syntax does for characters and therefore for vector proteins.
What this facilitates in the vector world is gene recruitment, the process of duplication and functional conversion thought to explain paralogous proteins [20]. In the vector world, genes serving existing functions can produce new high-level functions (phrases, sentences, etc.) simply by appropriate side-by-side arrangement of duplicates. Although the protein world does not always require arrangement of this kind for a new high-level function (metabolic pathway, molecular machine, etc.) to be formed, the structural reconfiguration of binding surfaces and active sites that it does require appears to be more demanding.
Functional proficiency and fitness in the vector world.
As indicated, many aspects of the relationship between structure and function in Chinese writing are real-world facts (often complicated ones), with no need for special treatment in the vector world. One aspect that does call for special treatment, though, is legibility—how well written characters conform to the expectations of readers. This has both a qualitative aspect—What character does this resemble?—and a quantitative aspect—How close is the resemblance? The conventions of Chinese writing provide a qualitative framework for answering the first question, but quantitative answers for the second will be needed for calculating the functional proficiency of vector proteins. This will require both precise standardization of character forms and a precise measure of resemblance, which in turn requires a precise definition of resemblance.
Since human reading cannot be characterized with the required precision, what is needed is a mathematical treatment of resemblance that shows reasonable correspondence with human perception. This will necessarily be much simpler than human character recognition, but to the extent that the human process amounts to an assessment of geometric likeness, we can expect a mathematical assessment of geometric likeness to provide a plausible mapping of structure to functional proficiency. If this is achieved, we will have a mathematical model that defines functional proficiency in the vector world in a way that ties it to something intelligible: legibility. (As an aside, existing character-recognition algorithms were found unsuitable because they rely on features that correlate with intended character forms rather than define them. This works when it can be assumed that all forms analyzed are legitimate, but not when structural legitimacy is itself in question.)
Han archetypes. Asian fonts provide a starting point for specifying ideal forms for Han characters, which we refer to as Han archetypes. However, because fonts show considerable variation in stroke styles and, in extreme cases, even in stroke composition (Figure 10), it is necessary to designate one font as the standard. The primary considerations here are geometric simplicity, widespread availability, and coverage of the traditional character forms. Arial Unicode is most suitable in these respects and has therefore been adopted as the standard.
Figure 10. Building archetypes for Han characters.
Left: U+8FF4 shown in fonts STFangSong, LiSong Pro, and MS Mincho (top to bottom). Arial Unicode (center) is the chosen standard for archetypes, which are scalable geometric specifications (right; see text).
doi:10.1371/journal.pone.0002246.g010
In terms of structure, characters are simply strokes of particular shapes arranged in a particular way. Han archetypes reflect this by restricting specifications to these structural fundamentals. In particular, conventions of writing technique (the order and direction of stroke formation) are not included. Archetypes are based on line representations of the Arial Unicode forms with individuated strokes (Figure 10). The shape specification for a stroke consists of two or more points designating the ends of the line or curve segments that form the stroke, along with one Bézier control point for each curve segment.
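The stroke shape specification described above can be illustrated with a minimal sketch. It assumes that "one Bézier control point for each curve segment" means a quadratic Bézier curve. The names `Segment`, `point_at`, and `sample_stroke` are hypothetical and do not reflect the actual archetype format, which is detailed in Text S1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Segment:
    """One line or curve segment of a stroke (hypothetical layout)."""
    start: Point
    end: Point
    # None means a straight line; otherwise the single control point
    # of a quadratic Bezier curve (an assumption of this sketch).
    control: Optional[Point] = None

    def point_at(self, t: float) -> Point:
        """Return the point at parameter t in [0, 1] along the segment."""
        (x0, y0), (x1, y1) = self.start, self.end
        if self.control is None:
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        cx, cy = self.control
        u = 1.0 - t
        return (u * u * x0 + 2 * u * t * cx + t * t * x1,
                u * u * y0 + 2 * u * t * cy + t * t * y1)


def sample_stroke(segments: List[Segment], per_segment: int = 8) -> List[Point]:
    """Sample a polyline along the consecutive segments of one stroke."""
    points = [segments[0].start]
    for seg in segments:
        for k in range(1, per_segment + 1):
            points.append(seg.point_at(k / per_segment))
    return points


# A stroke made of a horizontal line followed by a downward curve.
stroke = [
    Segment((0.0, 0.0), (1.0, 0.0)),
    Segment((1.0, 0.0), (1.0, -1.0), control=(1.5, -0.5)),
]
outline = sample_stroke(stroke, per_segment=2)
```

Because only endpoints and control points are stored, such a specification carries no stroke order or direction beyond the listing order chosen here for convenience, and it can be rescaled by transforming the stored points.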
As discussed above (Fold organization), many of the Han characters are built from significant components which may themselves function as stand-alone characters. Because component recognition is an important part of human character recognition (and this mirrors the component-like structure of many real proteins) we include component definitions in archetype specifications. This is done by grouping strokes according to components (if any). Because the aspect ratio of character components is commonly altered in the formation of compound characters (see Figure 2, top), the vector world allows arbitrary rescaling of archetypes with variable aspect ratio (Figure 10). In addition to stroke groupings and shape specifications, a complete archetype specifies stroke placement and any constraints on contacts between strokes (see Text S1 for details).
Mathematical Model. If the geometric likeness of a vector protein (working form) to a specified Han archetype can be characterized by a set of separable error metrics ε1, ε2, …, εn having uniformly multiplicative effects, the combined effect on functional proficiency would be described by a decay function of the form: (1)
where the constants ε̃i are set to values that reflect the relative influence of each εi on legibility. Since the objective here is to define a proficiency function that captures the key aspects of structural correspondence rather than to model the actual process of human perception, we have chosen this simple form as our basis. We use twelve error metrics to characterize geometric likeness in terms of the shape, size, placement, and connectivity of strokes, the size and placement of domain-like groups of strokes, and the presence of any overall flaws like extraneous marks or gaps within strokes (see Text S1 for details). As shown below, this way of mapping structure to functional proficiency does provide reasonable correspondence both with human perception and with the protein world.
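To make the idea of a multiplicative decay function concrete, the following sketch assumes an exponential form, in which each error metric εi contributes a factor exp(−εi/ε̃i). This is one common form that satisfies the description (separable, uniformly multiplicative, decaying with error) and is an assumption of this sketch. It is not necessarily the exact expression of Equation 1. The function name `proficiency` is likewise hypothetical.

```python
import math
from typing import Sequence


def proficiency(errors: Sequence[float], scales: Sequence[float]) -> float:
    """Combine error metrics multiplicatively (assumed exponential decay).

    errors: the error metrics (epsilon_i), each >= 0.
    scales: the constants (epsilon-tilde_i) weighting each metric's influence.
    """
    if len(errors) != len(scales):
        raise ValueError("each error metric needs exactly one scale constant")
    # Each metric contributes an independent factor in (0, 1].
    return math.prod(math.exp(-e / s) for e, s in zip(errors, scales))


# No error gives a proficiency of 1; any error lowers it.
perfect = proficiency([0.0, 0.0], [1.0, 2.0])
flawed = proficiency([0.5, 0.2], [1.0, 2.0])
```

Under this assumed form, a smaller constant ε̃i makes the corresponding error more damaging, which is how relative influence on legibility would be expressed.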
Calculation of a proficiency score begins by scaling vector strokes individually such that their widths and heights match those of the corresponding archetype strokes (Figure 11). Scaled vector strokes and their archetypes are then overlaid in order to quantify shape distortion. By experimenting with root mean square deviation (RMSD) as a shape distortion metric, we found that archetype strokes consisting exclusively of horizontal and vertical lines allowed much less conformational freedom in vector proteins than strokes with curves did. Maximum deviation was then tested and found to provide more uniform conformational freedom along with comparably good representation of readability. We therefore chose this metric for quantifying shape distortion (see Text S1 for details of calculation).
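The scaling and overlay step, and the two candidate distortion metrics, can be sketched as follows. The sketch assumes that a vector stroke and its archetype stroke are available as equal-length lists of corresponding sample points, and that an axis with zero extent is left unscaled. Both are simplifications made here, and the actual calculation is given in Text S1. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def bounds(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def fit_to(stroke: List[Point], archetype: List[Point]) -> List[Point]:
    """Scale and shift a stroke so its bounding box matches the archetype's."""
    sx0, sy0, sx1, sy1 = bounds(stroke)
    ax0, ay0, ax1, ay1 = bounds(archetype)

    def factor(src: float, dst: float) -> float:
        # Assumption: leave an axis unscaled if either extent is zero.
        return dst / src if src > 0 and dst > 0 else 1.0

    fx = factor(sx1 - sx0, ax1 - ax0)
    fy = factor(sy1 - sy0, ay1 - ay0)
    return [(ax0 + (x - sx0) * fx, ay0 + (y - sy0) * fy) for x, y in stroke]


def deviations(stroke: List[Point], archetype: List[Point]) -> List[float]:
    """Point-to-point distances after overlaying the scaled stroke."""
    if len(stroke) != len(archetype):
        raise ValueError("strokes must have corresponding sample points")
    return [math.dist(p, q) for p, q in zip(fit_to(stroke, archetype), archetype)]


def max_deviation(stroke: List[Point], archetype: List[Point]) -> float:
    """Largest single deviation between stroke and archetype."""
    return max(deviations(stroke, archetype))


def rmsd(stroke: List[Point], archetype: List[Point]) -> float:
    """Root mean square deviation between stroke and archetype."""
    d = deviations(stroke, archetype)
    return math.sqrt(sum(x * x for x in d) / len(d))


archetype = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
bent = [(0.0, 0.0), (1.0, 2.0), (4.0, 4.0)]
worst = max_deviation(bent, archetype)
average = rmsd(bent, archetype)
```

RMSD averages over all sample points, whereas maximum deviation reports only the worst excursion, so the two metrics respond differently to the same distortion.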
Figure 11. Assessing shape distortion of vector strokes by comparing with ideal forms.
Colors differentiate the three strokes forming a component of 張 (U+5F35). Dots show vector boundaries. Left: strokes from a vector protein with a proficiency score of 0.4 (shown in Figure 12). Middle: the ideal structure specified by the archetype. Right: scaled vector strokes laid over their archetype forms, with bounding rectangles shaded. Shape distortion is assessed for each stroke individually, the top stroke in this example having no distortion.
To reflect the importance of components within compound characters, group proficiencies are calculated separately for each grouping of vector strokes defined by the archetype. Group proficiencies reflect not only the average maximum deviation of the contained vector strokes, but also any inconsistency in their scaling or placement. Equation 1 is applied with an εi metric representing each of these errors. The proficiency for the whole character is then calculated from the group-level proficiencies, taking further account of any errors in the structural arrangement of groups within the character. Equation 1 applies again at this level, but instead of including errors internal to all groups, only those pertaining to the least proficient group are included. This “weakest-link” approach reflects the fact that the whole function results from different components performing their own sub-functions, such that overall proficiency is most readily achieved by comparably proficient sub-components. Similar reasoning applies in the case of a multi-character message. Since each character performs a separate sub-task, and the overall task amounts to adequate performance of each of these sub-tasks, the functional proficiency of a message is simply the lowest proficiency of its constituent characters (Zeldovich and co-workers likewise used the weakest-link approach, as described in reference 8).
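The weakest-link scheme can be sketched as below. As an assumption of this sketch, the decay function is taken to be a product of exponential factors, which may differ from the exact form of Equation 1. The names `decay`, `character_proficiency`, and `message_proficiency` are hypothetical.

```python
import math
from typing import Sequence


def decay(errors: Sequence[float], scales: Sequence[float]) -> float:
    """Assumed multiplicative decay: product of exp(-error / scale)."""
    return math.prod(math.exp(-e / s) for e, s in zip(errors, scales))


def character_proficiency(
    group_errors: Sequence[Sequence[float]],
    group_scales: Sequence[float],
    arrangement_errors: Sequence[float],
    arrangement_scales: Sequence[float],
) -> float:
    """Score a character from its groups, keeping only the weakest group."""
    group_scores = [decay(errors, group_scales) for errors in group_errors]
    # Weakest link: only the least proficient group's internal errors count.
    weakest = min(group_scores)
    # Errors in the arrangement of groups within the character apply on top.
    return weakest * decay(arrangement_errors, arrangement_scales)


def message_proficiency(character_scores: Sequence[float]) -> float:
    """A message is only as proficient as its least proficient character."""
    return min(character_scores)


# Two groups: one flawless, one with a single unit of error.
char_score = character_proficiency(
    group_errors=[[0.0, 0.0], [1.0, 0.0]],
    group_scales=[1.0, 1.0],
    arrangement_errors=[0.0],
    arrangement_scales=[1.0],
)
msg_score = message_proficiency([0.9, 0.4, 0.7])
```

With a multiplicative decay, including only the weakest group's errors is the same as multiplying that group's score by the arrangement-level factor, which is what `character_proficiency` does.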
A simple way to see whether proficiency scores computed in this way show reasonable correspondence with legibility is to subject highly proficient genes to random point mutations, accepting only those that leave the proficiency above a specified threshold. Because mutations tend to be disruptive, propagated lines evolved in this way tend to hover just above the threshold. So, by lowering the threshold in small steps, we can produce a long line of descent showing gradual decline in proficiency. The software and methods for doing this kind of experiment will be introduced next. Here we aim merely to verify the intended qualitative connection between proficiency and legibility. Figure 12 shows snapshots at various stages of decline in three unrelated lines. Legibility shows a similar decline with decreasing proficiency in all cases, indicating that calculated proficiencies correlate reasonably well with actual function.
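The threshold-lowering procedure can be sketched with a toy stand-in for the proficiency calculation. Here `toy_score` (fraction of bases matching a fixed target) is purely illustrative and is not the Stylus scoring function. The names `point_mutate` and `decline` are hypothetical, and the sketch does not use the Stylus software.

```python
import random
from typing import Callable, List, Sequence, Tuple

BASES = "ACGT"


def point_mutate(gene: str, rng: random.Random) -> str:
    """Replace one randomly chosen base with a different base."""
    i = rng.randrange(len(gene))
    new_base = rng.choice([b for b in BASES if b != gene[i]])
    return gene[:i] + new_base + gene[i + 1:]


def decline(
    gene: str,
    score: Callable[[str], float],
    thresholds: Sequence[float],
    attempts_per_step: int,
    rng: random.Random,
) -> List[Tuple[float, float]]:
    """Propagate a line while lowering the acceptance threshold stepwise.

    Returns (threshold, proficiency) snapshots, one per threshold step.
    """
    snapshots = []
    for threshold in thresholds:
        for _ in range(attempts_per_step):
            candidate = point_mutate(gene, rng)
            # Accept only mutations that leave proficiency above threshold.
            if score(candidate) > threshold:
                gene = candidate
        snapshots.append((threshold, score(gene)))
    return snapshots


TARGET = "ACGT" * 10


def toy_score(gene: str) -> float:
    """Illustrative stand-in: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(gene, TARGET)) / len(TARGET)


snapshots = decline(TARGET, toy_score, [0.9, 0.8, 0.7, 0.6], 200, random.Random(0))
```

Each snapshot records the threshold in force and the proficiency of the surviving line at the end of that step, giving a series comparable in spirit to the stages of decline described above.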
Figure 12. Qualitative correlation between functional proficiency of vector proteins and their legibility.
Initial genes encoding 享 (U+4EAB), 張 (U+5F35), and 桌 (U+684C) were generated with Inscribe and processed with Stylus (see Softwa