An Algorithm for Protein Helix Assignment Using Helix Geometry

Chen Cao; Shutan Xu; Lincong Wang

doi:10.1371/journal.pone.0129674

Abstract

Helices are among the most common secondary structure elements in proteins and were among the earliest recognized. The assignment of helices in a protein underlies the analysis of its structure and function. Though the mathematical expression for a helical curve is simple, no previous assignment program has used a genuine helical curve as a model for helix assignment. In this paper we present a two-step assignment algorithm. The first step searches for a series of bona fide helical curves, each of which best fits the coordinates of four successive backbone C_α atoms. The second step uses the best-fit helical curves as input to make the helix assignment. The application to the protein structures in the PDB (Protein Data Bank) shows that the algorithm is able to assign accurately not only regular α-helices but also 3₁₀ and π helices as well as their left-handed versions. One salient feature of the algorithm is that the assigned helices are structurally more uniform than those assigned by the previous programs. The structural uniformity should be useful for protein structure classification and prediction, while the accurate assignment of a helix to a particular type underlies the study of structure-function relationships in proteins.

Citation: Cao C, Xu S, Wang L (2015) An Algorithm for Protein Helix Assignment Using Helix Geometry. PLoS ONE 10(7): e0129674. https://doi.org/10.1371/journal.pone.0129674

Editor: Freddie Salsbury, Jr, Wake Forest University, UNITED STATES

Received: February 23, 2015; Accepted: May 11, 2015; Published: July 1, 2015

Copyright: © 2015 Cao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All relevant data are within the paper.

Funding: The authors have no support or funding to report.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Historically, helices were proposed as the main secondary structure elements for proteins in 1951 [1] through model building using low-resolution X-ray diffraction data, well before atomic coordinates could be determined from high-resolution data [2, 3]. As is evident from the helix model, the hydrogen bonding interaction between an amino (NH) group and a carbonyl (CO) group plays a decisive role in helix stability. The early recognition of the importance of hydrogen bonding has greatly affected our understanding of helices in proteins. In fact, the de facto definitions of the three types of helices (α, 3₁₀ and π) are based on their distinct hydrogen bonding patterns: the hydrogen bonds between the CO of residue i and the NHs of residues i + 3, i + 4 and i + 5 are used respectively for the definitions of 3₁₀, α and π helices. Largely as a consequence of the characteristic hydrogen bonding pattern and van der Waals repulsion, the backbone ϕ and ψ angles of a helix residue lie in two well-separated regions, with the larger one corresponding to the right-handed helices and the much smaller one to the left-handed ones. For the same reasons there exist no large variations in the derived geometrical restraints such as the virtual bond length between two successive C_α atoms, the virtual bond angle formed by a triple of successive C_α atoms and the dihedral angle formed by a quadruple of successive C_α atoms. The unique hydrogen bonding pattern and the small variation in the derived geometrical restraints provide the foundation for the previous assignment algorithms that use either hydrogen bonds (dssp (Dictionary of Secondary Structure of Proteins) [4], stride [5], secstr [6] and sst [7]) or geometrical restraints as inputs. The latter include an early method developed by Levitt and Greer [8], define-s [9], p-sea [10], p-curve [11], palsse [12], stick [13], xtlsstr [14], kaksi [15] and the most recent program disicl [16].
The program stride [5] uses both hydrogen bonds and geometrical restraints. At present, the hydrogen bond-based program dssp [4] is arguably the most popular helix assignment program. However, it became clear quite early on that the hydrogen bonding pattern, though unique for each helix type, is not a sufficient condition for helix assignment, as is evidenced by the continuous development of restraint-based programs. Though more than a dozen assignment programs exist at present, the accurate assignment of a protein helix remains a challenging problem [17, 18], as illustrated by the following comparative study. It has been shown that the percentage of agreement between dssp, define-s and p-curve was only 63% on a residue basis [19]. The discrepancies and inconsistencies among the previous programs may well originate from their imprecise problem definitions: instead of rigorously following the helix geometry, they formulate the assignment problem as a restraint satisfaction problem in terms of restraints that either cannot be computed accurately (e.g. hydrogen bonds), have no precise range (e.g. ϕ/ψ angles) or are not sufficient for defining a general helical curve (e.g. virtual C_α bond length and angles).

In this paper we present a two-step algorithm that follows the division of the assignment problem into two separate problems: a minimization problem and a restraint satisfaction problem. The minimization problem is solved in the first step by a curve fitting algorithm that searches for a series of bona fide helical curves, each of which best fits the coordinates of a quadruple of successive C_α atoms. From the best-fit helical curves we calculate, for each residue, three helix scores (one for each helix type), a helix axis angle and a C_α RMSD (root-mean-square deviation), which are in turn used in the second step as input to make the helix assignment. A helix score for a residue quantifies the deviation from a standard protein helix (see section 2.2 for a precise definition of the term) of the best-fit helical curve that starts with the residue. The accurate assignment of a helix to a particular type is made possible by the following two observations: (1) each helix type has its own distinct helix score distribution and unique standard helix, and (2) a helix residue has a smaller helix score than a non-helix residue. The algorithm makes no use of hydrogen bonds, ϕ/ψ angles, backbone NH or CO coordinates, or virtual bond lengths or angles.

We have applied the assignment algorithm to identify the helices in the protein structures in the current version of PDB [20] and compared our assignment with those by the nine previous programs: dssp, stride, p-sea, kaksi, palsse, stick, xtlsstr, sst and disicl. The results demonstrate that the algorithm is able to assign accurately not only α but also 3₁₀ and π-helices as well as left-handed helices [21]. To compare the structural distribution of the helices assigned by our algorithm and by the previous programs, we have used a geometric clustering algorithm [22] to classify several sets of helices, with the helices in each set all having the same length. The clustering analysis shows that the helices assigned by our algorithm are structurally more uniform than those assigned by the previous ones. The accurately assigned helices and the helix clusters, as well as the common structural features shared by all the helices in a cluster, should be particularly useful for protein structure classification and prediction as well as secondary structure prediction, while the accurate assignment of a helix to a particular type should provide a basis for the discovery of structure-function relationships in proteins.

The rest of the paper is organized as follows. In section 2 we describe the data set and present our algorithm. The application to the two sets of protein structures in the current PDB is detailed in section 3. In section 4 we compare the algorithm with the nine previous programs and discuss its key advantages. Finally we conclude the paper in section 5.

2 The data set and the algorithm

In this section we first describe the data set. We then present the helical curve fitting algorithm and the definitions and computations of helix score, helix axis angle and C_α RMSD as well as the definitions of three standard protein helices. Finally we detail the assignment algorithm itself.

2.1 The data set

To evaluate the performance of our algorithm, we have downloaded from the current version of the PDB a non-redundant set of x-ray structures (29,093 in total) with at most 70% sequence identity, each of which has at least one helix according to the PDB. Out of them we have selected a set of 3,287 high-resolution structures (set 𝕊), each of which has at least three helices, a resolution ≤ 2.0Å and an R-factor ≤ 25.0%, to obtain the statistics for four helical parameters and two RMSDs. The set of the remaining 25,806 structures (set ℕ) is used to evaluate the performance of our algorithm and two previous programs, dssp and stride. To compare our algorithm with the five programs (kaksi, palsse, stick, xtlsstr and sst) for which we were not able to obtain a local copy, we have uploaded to a web server [23] a set of 100 x-ray protein structures (set 𝕋), with the first 50 structures having resolutions between 1.0Å–2.0Å and the rest having resolutions ≥ 2.5Å (1AKG, 1BGF, 1EZW, 1GSU, 1I1N, 1K1B, 1NTE, 1O98, 1PVM, 1SJD, 1UKF, 1VZY, 1XGW, 1YQD, 2ASC, 2BJI, 2CWH, 2FD5, 2GB2, 2GG6, 2H1V, 2I2C, 2NSF, 2POK, 2VBA, 2W6K, 2WRA, 2X7H, 2YSK, 2ZJ3, 3C9U, 3EDF, 3GG7, 3HG7, 3IDV, 3LCC, 3LFJ, 3P4H, 3PUA, 3TOU, 3V7Q, 3ZOO; 1AZ2, 1BTN, 1CAX, 1F1F, 1FXA, 1HMY, 1KX8, 1MJ9, 1MSC, 1NJ1, 1PYP, 1RIN, 2CND, 2GAE, 2HAF, 2HXB, 2QQV, 2RJQ, 2SPT, 2WO7, 2X2B, 2Y5Q, 2ZIY, 3AT0, 3G0A, 3IIA, 3L1G, 3M3T, 3NPE, 3O5K, 3PGK, 3PMQ, 3QNT, 3QYB, 3T2Y, 3UUF, 3VGE, 3WMF, 4AYH, 4BOY, 4EWS, 4GB0, 4JVC, 4LIF, 4MBR, 4N3G, 4O1S, 4PUT, 4QGL, 4TW8), to obtain their helix assignments. Set 𝕋 is also used for a comparison with the program disicl.

2.2 The curve fitting algorithm and helix assignment

Our solution to the helix assignment problem consists of two steps. The first step is to solve a minimization problem by a helical curve fitting algorithm that searches for a series of genuine helical curves, each of which best fits the coordinates of four successive C_α atoms. A helix model composed of a series of helical curves has been previously called a polyhelix [24]. We then define a standard protein helix for each helix type. The algorithmic solution to the minimization problem makes it possible to compute a helix axis angle a_i, three helix scores h_i, g_i and π_i, and two C_α RMSDs for residue i. A helix score quantifies the deviation from a standard protein helix of the helical curve that begins with this particular residue. The scores, axis angle and C_α RMSDs are used in the second step as input to make the helix assignment.

2.2.1 The helical curve fitting algorithm.

A general helical curve in three-dimensional space could be represented as:

$$\mathbf{r} = R \begin{pmatrix} r \sin t \\ r \cos t \\ p\,t \end{pmatrix} + \mathbf{r}_0 \qquad (1)$$

where $\mathbf{r}$ = {x, y, z} is a point on the curve, $\mathbf{r}_0$ = {x₀, y₀, z₀} its origin, and R the rotation matrix that specifies its helical axis n with respect to a coordinate system. The first three helical parameters, radius (r), pitch (p) and turn angle (t), define a standard helical curve, x = r sin t, y = r cos t, z = pt with its origin at {1.0,0.0,0.0} and its axis along the +Z axis. Together with n and r₀ these five parameters completely define a general helical curve. Though r, p, t could be computed directly from the virtual bond length, bond angle and dihedral angle of a quadruple of C_αs [25], no simple analytic expression has been derived for the computation of a helical curve that best fits the coordinates of a quadruple of C_αs, that is, a helical curve that has the minimum RMSD (Δ_i) between the four C_αs and their closest points on the curve. In fact, this minimization (or curve fitting) problem is equivalent to finding the roots of a high-degree polynomial. In the following we describe briefly an algorithmic solution to this minimization problem.

We begin with the computations of r, p and t using previously-derived analytic expressions [25], and denote their values as r_m, p_m and t_m. Then we proceed as follows to search discretely and exhaustively over two intervals, [r_m − d_r, r_m + d_r] and [p_m − d_p, p_m + d_p], for the r and p values of a helical curve such that it best fits the coordinates of a quadruple of C_αs of residue i, i + 1, i + 2 and i + 3. Both d_r and d_p are user-specified constants.

1. Δ_i = ∞   {the initial RMSD}
2. For each r in [r_m − d_r, r_m + d_r]
       For each p in [p_m − d_p, p_m + d_p]
           Compute t   {the turn angle}
           Generate a helical curve   {by Eq 1}
           Best-fit the curve to the four C_α coordinates using singular-value decomposition (SVD) to compute R
           If R_ms < Δ_i
               Δ_i = R_ms
               r_i = r, p_i = p, t_i = t, R_i = R

where R_ms is the RMSD between the quadruple of C_αs and their closest points on the helical curve; r_i, p_i, t_i and R_i are, respectively, the helical parameters and rotation matrix for the helical curve that best fits the quadruple. Given r, p and the distance d_{i,i+1} between two consecutive C_αs, the turn angle t can be computed from these three quantities. Singular-value decomposition (SVD) is applied to compute Δ_i and the rotation matrix R_i, and from R_i the helical axis n_i for this helical curve can be calculated. In fact, the SVD step guarantees that the computed helical curve best fits the coordinates of the quadruple of C_αs. A set of six helical parameters (r, p, t, Δ, R and n) is computed for each quadruple of a protein chain by sliding a window of four C_α atoms over its sequence.
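As an illustration of the starting point of the search, the following minimal Python sketch estimates a radius, a rise per residue, a turn angle and an axis direction from four consecutive points. It is neither the analytic expressions of [25] nor our C++ implementation: it swaps in the classical angle-bisector construction, the function name `estimate_helix_params` and the helper names are hypothetical, and the returned rise per residue may be normalized differently from the pitch p used in the text. The grid refinement and the SVD superposition are not reproduced.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def estimate_helix_params(p0, p1, p2, p3):
    """Return (radius, rise_per_residue, turn_angle_rad, axis) for four points.

    Hypothetical helper: bisector construction, not the expressions of [25].
    """
    # Bisectors at the two interior points; on an ideal helix each one points
    # from the point toward the axis and is perpendicular to the axis.
    v1 = _add(_sub(p0, p1), _sub(p2, p1))
    v2 = _add(_sub(p1, p2), _sub(p3, p2))
    len1, len2 = math.sqrt(_dot(v1, v1)), math.sqrt(_dot(v2, v2))

    # The axis is perpendicular to both bisectors.
    n = _cross(v1, v2)
    n_len = math.sqrt(_dot(n, n))
    if n_len < 1e-12:
        raise ValueError("degenerate quadruple: no unique helical axis")
    axis = tuple(c / n_len for c in n)

    # The angle between the bisectors is the turn angle per residue.
    cos_t = max(-1.0, min(1.0, _dot(v1, v2) / (len1 * len2)))
    turn = math.acos(cos_t)

    # Each bisector has length 2 r (1 - cos t) on an ideal helix.
    radius = math.sqrt(len1 * len2) / (2.0 * (1.0 - cos_t))

    # Projection of one virtual bond onto the axis gives the rise per residue.
    rise = _dot(_sub(p2, p1), axis)
    return radius, rise, turn, axis
```

For points placed exactly on a helix the estimates reproduce the generating parameters. With this construction the sign of the returned rise is positive for a right-handed arrangement of the four points and negative for a left-handed one.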

2.2.2 The computation of three helix scores and helix axis angle and C_α RMSD.

Except for the last three residues at the C-terminus of a protein chain, three helix scores, h_i, g_i and π_i, are computed for each residue i. They are used respectively for the assignment of α, 3₁₀ and π helices. (2) where r_i, p_i, t_i, Δ_i are computed as above using a quadruple of C_αs of residues i, i + 1, i + 2 and i + 3. The constants μ_r, σ_r; μ_p, σ_p; μ_t, σ_t and σ_Δ are respectively the normal distribution parameters for r, p, t, Δ, determined as follows over the respective data sets for r, p, t, Δ calculated on the non-redundant set 𝕊. We first apply the program dssp [4] to assign 31,383 α-helices, 11,926 3₁₀-helices and 1,156 π-helices for the structures in 𝕊, and then for each helix type we calculate its r, p, t and Δ values. The r, p, t data sets for π-helices are calculated differently from those for either α or 3₁₀-helices. If dssp assigns a π-helix composed of, say, residues i, i + 1, i + 2, i + 3, i + 4, then the final value for each of r, p, t is the average over the three values for the first three residues, that is, $\bar{r} = (r_i + r_{i+1} + r_{i+2})/3$, $\bar{p} = (p_i + p_{i+1} + p_{i+2})/3$, $\bar{t} = (t_i + t_{i+1} + t_{i+2})/3$. Each triple of parameters μ_r, μ_p and μ_t defines a standard helical curve for a helix type that represents an average over all the helices of that particular type in 𝕊. For ease of reference we call them respectively the standard protein helix for α, 3₁₀ and π helices. The helix scores h, g and π are computed with respect to the respective α, 3₁₀ and π standard protein helices. The score measures the local deviation of the helical curve from the standard protein helix for that particular helix type: the higher the score, the larger the deviation from the standard protein helix. The Δ term quantifies the spatial difference between a C_α atom and its closest point on the helical curve. In fact, Δ together with the r, p, t terms in Eq 2 and the helical axis n provide a purely geometrical definition for a helix in a protein, that is, as long as a segment of C_α coordinates conforms to a genuine helical curve, it is assigned as a helix.
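The role of the seven constants can be sketched in a few lines of Python. Because the explicit form of Eq 2 does not appear in the text above, the sum of squared standardized deviations used here is an assumption, chosen only because it is zero for the standard helix and grows with the deviation. The names `StandardHelix`, `helix_score` and `mean_of_first_three` are hypothetical, and the numerical constants are placeholders rather than the values of Table 1 (only σ_Δ = 0.031 for α-helices is taken from the text).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardHelix:
    """The seven normal-distribution constants of one helix type."""
    mu_r: float
    sigma_r: float
    mu_p: float
    sigma_p: float
    mu_t: float
    sigma_t: float
    sigma_delta: float


def helix_score(r, p, t, delta, std):
    """Deviation of a fitted curve (r, p, t, delta) from a standard helix.

    ASSUMPTION: sum of squared standardized deviations; the actual form of
    Eq 2 may differ.
    """
    return (((r - std.mu_r) / std.sigma_r) ** 2
            + ((p - std.mu_p) / std.sigma_p) ** 2
            + ((t - std.mu_t) / std.sigma_t) ** 2
            + (delta / std.sigma_delta) ** 2)


def mean_of_first_three(windows):
    """Average (r, p, t) over the first three windows of a pi-helix."""
    first = windows[:3]
    if len(first) < 3:
        raise ValueError("need at least three windows")
    return tuple(sum(w[k] for w in first) / 3.0 for k in range(3))


# Placeholder constants for illustration only (NOT the values of Table 1),
# except sigma_delta = 0.031, which the text reports for alpha-helices.
ALPHA = StandardHelix(mu_r=2.3, sigma_r=0.05, mu_p=1.5, sigma_p=0.05,
                      mu_t=100.0, sigma_t=3.0, sigma_delta=0.031)
```

Under this assumed form a curve that coincides with the standard helix and passes exactly through the four C_α atoms scores zero, and each term contributes the square of a deviation measured in units of its σ.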

The minimum RMSD Δ_i for residue i is computed over the quadruple of residues i, i + 1, i + 2 and i + 3 and thus is useful for the determination of the N-terminus of a helix. For the determination of the C-terminus of a helix, we have computed a C_α RMSD δ_i for residue i using up to four helical curves best fit to four successive quadruples of C_αs starting with the quadruple of residues i − 2, i − 1, i and i + 1. The RMSD δ measures the goodness of fit of up to four consecutive helical curves to seven successive C_α atoms. In the current implementation δ is used for the extension of the C-terminus of a 3₁₀-helix and the possible merge of two adjacent α-helices. In addition to the helix scores and δ, we have also calculated the angle between two successive helical axes n_{i−1} and n_i and use it as input to the assignment algorithm. This angle measures the bending of the current helical curve starting at residue i relative to the previous helical curve starting at residue i − 1. For ease of reference we call this angle the helix axis angle a_i and call the four variables r_i, p_i, t_i, a_i the four helical parameters for residue i. For the first residue of a chain, which has no preceding helical curve, we set a_i = 0.0°; for a genuine helical curve all of the a_i are zero. The set of four helical parameters for all the quadruples of consecutive C_αs in a protein chain is computed by sliding a window of four C_α atoms over its sequence.
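The helix axis angle is simply the angle between two direction vectors, which the short Python sketch below makes concrete. The function names `axis_angle_deg` and `axis_angles` are hypothetical, and the axes are assumed to be given as 3-tuples, one per sliding window.

```python
import math


def axis_angle_deg(n_prev, n_cur):
    """Angle in degrees between two helical axis vectors."""
    dot = sum(a * b for a, b in zip(n_prev, n_cur))
    norms = (math.sqrt(sum(a * a for a in n_prev))
             * math.sqrt(sum(b * b for b in n_cur)))
    if norms == 0.0:
        raise ValueError("axis vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norms))
    return math.degrees(math.acos(cos_angle))


def axis_angles(axes):
    """a_i for every window; the first window has no predecessor, so it is 0."""
    if not axes:
        return []
    return [0.0] + [axis_angle_deg(axes[i - 1], axes[i])
                    for i in range(1, len(axes))]
```

Two windows whose axes are parallel give an angle of zero, as expected for consecutive windows on one genuine helical curve, while a kink between two windows shows up as a large value at the second window.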

Four thresholds h_T, h_max, g_T and π_T for the three helix scores, four thresholds a_T, a_max, a_G and a_I for the helix axis angle and two thresholds, δ_G and δ_max, for δ are required by our assignment algorithm. These ten thresholds are determined, as detailed later, by the analyses of the statistics for both helical parameters and RMSDs obtained on all the helices in 𝕊 assigned by the program dssp.

The five parameters, r, p, t, r₀, n, completely define a right-handed helical curve. By inverting just one component of every C_α coordinate, say from {x, y, z} to {−x, y, z}, the same five parameters together with the axis angle and two RMSDs could be computed similarly and used for the assignment of the left-handed helices.
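The effect of inverting one coordinate component can be checked with a small Python sketch. The handedness test used here (the sign of the scalar triple product of three consecutive virtual bonds, which is also the sign of the virtual C_α dihedral angle) is an illustrative choice and not part of our algorithm; the names `mirror_x` and `handedness` are hypothetical.

```python
def mirror_x(coords):
    """Invert the x component of every coordinate: {x, y, z} -> {-x, y, z}."""
    return [(-x, y, z) for (x, y, z) in coords]


def handedness(p0, p1, p2, p3):
    """Return +1 for a right-handed quadruple, -1 for left-handed, 0 if planar."""
    b1 = tuple(b - a for a, b in zip(p0, p1))
    b2 = tuple(b - a for a, b in zip(p1, p2))
    b3 = tuple(b - a for a, b in zip(p2, p3))
    cross = (b2[1] * b3[2] - b2[2] * b3[1],
             b2[2] * b3[0] - b2[0] * b3[2],
             b2[0] * b3[1] - b2[1] * b3[0])
    # b1 . (b2 x b3) has the same sign as the dihedral angle of the quadruple.
    triple = sum(a * c for a, c in zip(b1, cross))
    return (triple > 0) - (triple < 0)
```

Mirroring a right-handed quadruple yields a left-handed one and vice versa, so a fitting routine written for right-handed curves can be reused unchanged on the mirrored coordinates.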

2.2.3 The helix assignment algorithm.

The assignment proceeds as follows using helix score, axis angle and C_α RMSD as well as the ten thresholds as input. The assignment for each protein chain starts with π-helix.

let b_i = 0, i = 0, …, n
while i < n AND b_i == 0   {residue i has NOT been assigned}
    If π_i < π_T AND a_i < a_I AND a_i+1 < a_I AND a_i+2 < a_I   {the N-terminus of a π helix}
        i ∈ π-helix   {assign residue i to π helix}
        b_i = 1
        i++
        while π_i < π_T AND i < (n − 3) AND b_i == 0
            i ∈ π-helix
            b_i = 1
            i++
        for j = i, i + 1, i + 2, i + 3
            j ∈ π-helix
            j++
            b_j = 1
        i = j
    i++

where n is the number of residues in a protein chain, i residue index, and π_T a threshold for helix score π and a_I a threshold for helix axis angle used only in π-helix assignment.

Next the algorithm assigns 3₁₀-helices for the remaining residues of the same protein chain.

while i < n AND b_i == 0   {residue i has NOT been assigned}
    If g_i < g_T AND a_i < a_G   {the N-terminus of a 3₁₀ helix}
        i ∈ 3₁₀-helix   {assign residue i to 3₁₀ helix}
        b_i = 1
        i++
        while g_i < g_T AND i < n AND b_i == 0
            i ∈ 3₁₀-helix
            b_i = 1
            i++
        if δ_i < δ_G AND δ_i+1 < δ_G   {assign two more residues to 3₁₀ helix}
            i ∈ 3₁₀-helix
            b_i = 1
            i++
            i ∈ 3₁₀-helix
            b_i = 1
    i++

where g_T, a_G and δ_G are respectively the thresholds for helix score g_i, axis angle and RMSD δ used only in 3₁₀-helix assignment.

Finally the algorithm assigns α-helices for the remaining residues of the same protein chain.

while i < n AND b_i == 0   {residue i has NOT been assigned}
    If h_i < h_T AND a_i < a_T AND h_i+1 < h_T   {the N-terminus of a helix}
        i ∈ α-helix   {assign residue i to a helix}
        b_i = 1
        i++
        while h_i < h_T AND i < n AND b_i == 0
            i ∈ α-helix
            b_i = 1
            i++
    i++

where h_T is a threshold for helix score h_i. For the assignment of α-helices only, additional steps may be needed to merge two α-helices adjacent in the protein sequence and to extend their C-termini. The merge of two adjacent α-helices proceeds as follows. For every pair of adjacent helices less than four residues apart, if the axis angle a_i ≤ a_max, helix score h_i < h_max and δ_i < δ_max for every intermediate residue i, then the two helices are merged into a single helix. The thresholds h_max, a_max and δ_max are the respective thresholds for helix score, axis angle and δ. The threshold δ_max is used only in the merge step.
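The α-helix scan and the merge step can be sketched in Python as follows. This is one reading of the pseudocode, not our C++ implementation: the outer loop is taken to skip residues that are already assigned, "less than four residues apart" is taken to mean fewer than four intermediate residues, and the function names `assign_alpha` and `merge_adjacent` are hypothetical. The default thresholds are the values given in section 3.1.

```python
def assign_alpha(h, a, assigned, h_T=20.0, a_T=20.0):
    """Scan per-residue scores h and axis angles a; return (start, end) ranges.

    `assigned` is a list of booleans (the b_i flags) and is updated in place.
    """
    n = len(h)
    helices = []
    i = 0
    while i < n:
        starts = (not assigned[i] and h[i] < h_T and a[i] < a_T
                  and i + 1 < n and h[i + 1] < h_T)
        if not starts:
            i += 1
            continue
        start = i
        # Extend while the score stays below the threshold.
        while i < n and not assigned[i] and h[i] < h_T:
            assigned[i] = True
            i += 1
        helices.append((start, i - 1))
    return helices


def merge_adjacent(helices, h, a, delta, h_max=160.0, a_max=40.0, delta_max=0.3):
    """Merge neighbouring helices when every intermediate residue passes."""
    merged = []
    for start, end in helices:
        if merged:
            prev_start, prev_end = merged[-1]
            gap = range(prev_end + 1, start)
            if len(gap) < 4 and all(a[k] <= a_max and h[k] < h_max
                                    and delta[k] < delta_max for k in gap):
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged
```

In this reading two helices separated by a short stretch of moderately distorted residues are joined, whereas a single intermediate residue whose score exceeds h_max is enough to keep them apart.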

After the merge step the C-terminus of an α-helix may be extended as follows.

let j − 1 be the last residue of an α-helix assigned above
if j < n AND b_j == 0   {residue j has NOT been assigned}
    If h_j < h_max AND a_j < a_max
        b_j = 1
        j ∈ α-helix
        j++
        b_j = 1
        j ∈ α-helix

where h_max and a_max are the same thresholds as in the merge step.

Though the three types of helices are assigned similarly at the individual helix level, no merge step is necessary for either 3₁₀ or π helices since 3₁₀ or π helices with more than eight residues are rare. In our current implementation, no extension is made for π helices.

A left-handed helix is assigned similarly except that the helical parameters and RMSDs used in Eq 2 are those computed from a quadruple of inverted C_α coordinates as described above.

2.3 Helix classification

To better compare the helices assigned by our algorithm and by the previous programs, and to characterize their structures, we have classified them using our geometric clustering algorithm [22]. The clustering is performed on sets of helices that have the same length. The RMSD threshold for clustering is 1.5Å.

3 Results

The inputs to our algorithm are the helix scores h_i, g_i, π_i, the axis angle a_i and the RMSD δ_i for every residue i and their corresponding thresholds h_T, h_max, g_T, π_T, a_T, a_max, a_G, a_I, δ_G and δ_max. The computations of h_i, g_i, π_i by Eq (2) require the seven parameters μ_r, σ_r; μ_p, σ_p; μ_t, σ_t and σ_Δ to be known in advance. In this section we begin with the determination of these seven parameters and the ten thresholds through the statistical analyses of the data sets for r, p, t, Δ, a and δ. These data sets are computed using the helical curves fitted to the dssp-assigned helices in set 𝕊. We then present and compare the assignments made for both ℕ and 𝕋 by our algorithm and nine previous programs, including three hydrogen bond-based ones (dssp (version 2.2.1), stride and sst) and six geometrical restraint-based ones (p-sea, palsse, stick, xtlsstr, kaksi and disicl). Finally we describe briefly the algorithmic implementation.

3.1 The statistical analyses of helical parameters and RMSDs

The premises of our assignment algorithm are (1) each helix type has its own distinct helix score distribution and (2) a helix residue has a smaller helix score than a non-helix residue. To test how well the premises hold we first use the program dssp to divide all the protein residues in 𝕊 into five groups: α-helix (H), 3₁₀-helix (G), π-helix (I), β-sheet or extended configuration (E) and the other (R), and then for each group obtain separate statistics for the four helical parameters and two RMSDs. The statistics for different helix types are computed somewhat differently. The α-helix statistics include the data for every residue of an α-helix except the last three residues of its C-terminus. The dssp-assigned 3₁₀-helices (group G) have ≥ 3 residues per helix with an average length of 3.3 residues. The statistics for the 3₁₀-helices are computed over 1–3 sliding windows starting with the residues i − 1, i, i + 1, i + 2 where i is the first residue of a 3₁₀-helix. The four helical parameters and two RMSDs for the dssp-assigned π-helices (group I) have larger variations than those for either α or 3₁₀ helices. However, we found that the r, p, t values averaged over the first three residues of every dssp-assigned π-helix have well-defined distributions. Consequently the π-helix statistics are obtained over such averages.

The statistics for the four helical parameters and two RMSDs are fitted to a normal distribution to compute their normal distribution parameters μ and σ. As shown in Table 1 and Fig 1, for every helix type each r, p, t statistics (Fig 1a–1i) could be fitted very well to a normal probability function. In addition, the σs are rather small and the μs well separated from each other (δ_p in Table 1 and Fig 1j, 1k and 1l). More importantly the r, p, t statistics for a helix group (H or G or I) are well separated from the statistics for group E (Fig 1m, 1n and 1o). Furthermore, the largest peaks in the r, p, t distributions for a helix group are well separated from the largest peaks in the r, p, t distributions for group R (Fig 1p, 1q and 1r) though the largest peaks in the former do overlap with the minor peaks in the latter. The statistics for a, Δ and δ for each helix type could also be fitted to a normal probability function reasonably well (data not shown). The μ and σs for Δ are respectively (0.02,0.031), (0.045,0.038) and (0.129,0.056) for α, 3₁₀ and π helices. The smallness of the σs, the large separation between the μs for different helix types, and the large difference between the μs for a helix group and the μs for either E or R group are consistent with the two premises. They are the reasons why the helix scores h, g, π could be used to assign a helix to a particular type.
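How a pair (μ, σ) is obtained from a data set, and how the separation between two helix types can be expressed, is sketched below in Python. The fitting procedure actually used is not specified above, so the maximum-likelihood estimates (sample mean and population standard deviation) are an assumption, as are the function names `fit_normal` and `separation_in_sigmas` and the choice to measure the separation in units of the larger σ.

```python
import statistics


def fit_normal(samples):
    """Maximum-likelihood normal fit: returns (mu, sigma)."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma


def separation_in_sigmas(fit_a, fit_b):
    """Distance between two fitted means in units of the larger sigma."""
    (mu_a, sigma_a), (mu_b, sigma_b) = fit_a, fit_b
    return abs(mu_a - mu_b) / max(sigma_a, sigma_b)
```

Applied to, for example, the radius data sets of two helix types, a separation of several σ indicates that the radius term alone already discriminates between them.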

Table 1. The normal distribution parameters (μ, σ) for r, p, t.

The residues have been divided into five groups (H, G, I, E, R) based on the secondary structure elements assigned by dssp. The last row (δ_P) shows the difference in μ between H and the other four groups. The unit for both r, p is Å while t is in degree.

https://doi.org/10.1371/journal.pone.0129674.t001

Fig 1. The r, p, t statistics for five different groups (H, G, I, E, R).

The five triples of figures (a, b, c), (d, e, f), (g, h, i), (m, n, o) and (p, q, r) are respectively the r, p, t distributions for the α-helix (H), 3₁₀-helix (G), π-helix (I), β-sheet (E) and the other (R) groups. Except for figures j, k and l, whose y-axes are the scale for a normal distribution, the y-axes in all the other figures are the number of residues. For each row, the x-axis is respectively the radius (r in Å), pitch (p in Å) and turn angle (t in degree). There is a small peak in figures d, e, f from the left-handed helices. Figures j, k and l display the fitted r, p, t normal distribution curves for the three helix types, where each line segment has a length of 2.0σ and σ is the standard deviation. All the overlaps are well below the 2σ lines except for the t statistics, where the overlap between the α-helix normal curve and the 3₁₀-helix normal curve is slightly above the 2σ line.

https://doi.org/10.1371/journal.pone.0129674.g001

From the statistics for the helix scores h, g, π obtained on all the helices in 𝕊 assigned by the program dssp we set h_T = 20.0, g_T = 6.0, π_T = 14.0. They are used respectively in α, 3₁₀ and π helix assignment. Similarly, from the statistics for the axis angle a obtained on the same set 𝕊 we set a_T = 20.0°, a_G = 10.0°, a_I = 20.0°. They are also used respectively in α, 3₁₀ and π helix assignment. From the distributions shown in Table 2 and Fig 2, the default value for the threshold a_max (Fig 2b) is set to 40.0°, a value that is well beyond the distribution for a in the set of 382 ultra-high resolution structures, each with a resolution ≤ 1.0Å. A statistical analysis of the axis angle on all the helices assigned by the program dssp on set 𝕊 shows that more than 83% of the C-terminal residues of the helices have an axis angle less than 40.0° (S1 Fig), so the same threshold is used for the termination of α-helix assignment. The default value for the threshold h_max is set to 160.0, well above the h_T = 20.0 used for α-helix assignment, because a merge step is only triggered for the part of a helix that has large local deviations from a genuine helical curve. The thresholds a_max, h_max and δ_max decide whether two adjacent helices should be merged. We set δ_max = 0.3 based on a statistical analysis of the δ distribution on the same set 𝕊. The threshold δ_G, which plays an important role in how far the C-terminus of a 3₁₀-helix is to be extended, is set to 0.12, based on the statistics that no more than 2.0% of the α-helices have a δ value larger than 0.12.

Table 2. The statistics of a, Δ and δ.

The data are obtained from a set of the dssp assigned α-helices on a set of 382 ultra-high resolution (≤ 1.0Å) x-ray structures. Each column presents respectively a data range and the number of residues inside that range. Both RMSDs, Δ and δ, have unit Å while a is in degree.

https://doi.org/10.1371/journal.pone.0129674.t002

Fig 2. A histogram of helix axis angle a (a) and the threshold a_max (b).

The x-axis in (a) is the angle in degree while the y-axis is the number of residues. In (b) the axis angle of residue 28 (colored in red and located in the middle of a segment, the top figure) in a protein (pdbid 4CXF) has a₂₈ = 37.72° while the angle of residue 53 (colored in red and located in the middle of a segment, the bottom figure) in a protein (pdbid 1SQG) has a₅₃ = 44.95°. With a threshold of a_max = 40.0°, the first segment is assigned by our algorithm as a single α-helix while the second segment is divided into two different helices. In contrast both segments are assigned as a single helix by dssp.

https://doi.org/10.1371/journal.pone.0129674.g002

3.2 The helix assignment

We have applied our algorithm as well as the dssp and stride programs to assign α, 3₁₀ and π helices for the 25,806 protein structures in ℕ. For each helix type, the total numbers of helices and residues assigned by each program are listed in Table 3. On set ℕ our algorithm assigns 178,104 α-helices with lengths from 4 to 212 residues, the most frequent length being 10 residues; 70,208 3₁₀-helices with lengths from 3 to 14 residues, the most frequent length being 3 residues; and 6,600 π-helices with lengths from 5 to 18 residues, the most frequent length being 5 residues. Due to the difficulty of obtaining a local copy of the programs palsse, stick, xtlsstr, kaksi and sst, we have tested them on the set 𝕋 of 100 selected x-ray structures by uploading each one to a web server [23] to obtain their assignments (Table 4). The comparison with the program disicl is made on the same set 𝕋. We are not able to compare our algorithm with the rest of the published assignment programs because we could not get a working copy of any of them.

Table 3. Helix assignment on a residue and a helix basis by our algorithm, dssp and stride.

The third column for each helix type presents respectively the range in helix length and the length of the most frequently appeared helices.

https://doi.org/10.1371/journal.pone.0129674.t003

Table 4. The assignments on 𝕋 by our algorithm and the nine previous programs.

The assignments are made for a set of 100 x-ray structures with different resolutions. The first row is the total number of residues. All the other rows are the agreement between a pair of programs in percentage. The percentage is computed as where n is the number of residues assigned by both programs while n₁ and n₂ are respectively the numbers of residues assigned only by the first and second programs.

https://doi.org/10.1371/journal.pone.0129674.t004
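A pairwise agreement percentage of the kind reported in Table 4 can be sketched in Python as follows. The caption defines n, n₁ and n₂ but the expression itself is not reproduced, so the ratio 100 · n / (n + n₁ + n₂) used here is an assumption of this sketch; the function name `agreement_percent` and the representation of an assignment as a set of residue identifiers are likewise hypothetical.

```python
def agreement_percent(first, second):
    """Agreement between two helix assignments given as sets of residue ids.

    ASSUMPTION: 100 * n / (n + n1 + n2), where n counts residues assigned by
    both programs and n1, n2 count residues assigned by only one of them.
    """
    first, second = set(first), set(second)
    n = len(first & second)
    n1 = len(first - second)
    n2 = len(second - first)
    total = n + n1 + n2
    if total == 0:
        raise ValueError("neither program assigned any residue")
    return 100.0 * n / total
```

Under this assumed definition two programs that assign exactly the same residues agree at 100%, and residues that neither program assigns do not influence the value.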

3.3 The algorithmic implementation

We have implemented our helix assignment algorithm in C++ and included it as a module in our structure analysis and visualization program written in C++/Qt/OpenGL. The default values for the two parameters, d_r and d_p, required for the computations of the helix parameters r, p are set to 0.25 and the step size for both intervals, [r_m − d_r, r_m + d_r] and [p_m − d_p, p_m + d_p], is 0.01. The program is available upon request.
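The size of the search implied by these defaults can be made explicit with a short Python sketch. The function name `search_grid` is hypothetical, the starting value 2.3 in the usage note is a placeholder, and the end points of each interval are assumed to be included in the search.

```python
def search_grid(center, half_width=0.25, step=0.01):
    """Candidate values in [center - half_width, center + half_width]."""
    # Integer step counts avoid the drift of repeated floating-point addition.
    k = round(half_width / step)
    return [center + j * step for j in range(-k, k + 1)]
```

With a half-width of 0.25 and a step of 0.01 each interval contains 51 candidate values, so `len(search_grid(2.3)) ** 2` gives 2,601 (r, p) pairs, each requiring one SVD superposition per quadruple of C_α atoms.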

4 Discussion

In this paper we have developed a two-step algorithm that follows the division of the assignment problem into two sub-problems: a minimization problem and a restraint satisfaction problem. The first step solves the minimization problem by a newly developed helical curve fitting algorithm. From the best-fit helical curves we then calculate the helix scores, axis angle and C_α RMSDs for each residue, which are in turn used as restraints for helix assignment. In the following we first discuss the rationale and motivation behind the two-step algorithm and the two premises of the algorithm. Then we discuss the advantages of our algorithm over the previous programs through detailed comparisons of the assignments made for both ℕ and 𝕋, and through structure classification of the assigned helices. Finally we illustrate the biological significance of the assignment of a helix to a particular type by showing a correlation between the π-helices assigned by our algorithm and protein-ligand binding sites.

4.1 A two-step solution to the helix assignment problem and the two premises

As has been well documented [18, 19], there exist large discrepancies (up to 37%) among the previous helix assignment programs. These programs share two features: (1) none of them uses a genuine helical curve as a model, and (2) they provide a one-step solution to the assignment problem formulated as a restraint satisfaction problem, where the restraints are either the hydrogen bonds between backbone atoms or geometrical restraints such as backbone ϕ/ψ angles and virtual C_α bond lengths and angles. Because of the error in the computed hydrogen bond energy or the variations in the geometrical restraints, programs that use different types of restraints are expected to generate different assignments for the same protein (see Table 4 for examples). In contrast, we divide the assignment problem into two sub-problems, the first being a minimization problem. The algorithmic solution to the minimization problem generates a series of best-fit helical curves that are in turn used as input to the assignment algorithm. The accurate computation of these helical curves leads to high accuracy and consistency: unlike the previous geometrical restraint-based programs, the strict requirement that all helices fit reasonably well to a helical curve greatly reduces the possible variation in the input restraints, which in turn leads to the structural uniformity of the assigned helices.

The statistical analyses of the helical parameters r, p, t for all the residues in 𝕊 show that in general their distributions for different types of secondary structure elements have almost no overlap (Fig 1j–1l). In addition, for each helix type the axis angle a and the RMSDs Δ, δ are usually rather small (Table 2). These observations (computational results) are consistent with the two premises of our assignment algorithm. They are the reasons why our algorithm could not only assign a backbone segment to a helix but also output its helix type.
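
To make the notion of a helical curve concrete, the sketch below places successive C_α positions on an ideal circular helix around the z-axis. The textbook α-helix values used here (radius 2.3 Å, rise 1.5 Å per residue, 100° rotation per residue) and all function names are assumptions chosen for illustration; they are not the fitted parameters reported in this paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def helix_points(n: int, radius: float = 2.3, rise: float = 1.5,
                 turn_deg: float = 100.0) -> List[Point]:
    """n points on a circular helix around the z-axis, one per residue."""
    turn = math.radians(turn_deg)
    return [(radius * math.cos(i * turn),
             radius * math.sin(i * turn),
             rise * i) for i in range(n)]


ca = helix_points(4)                 # a quadruple of successive C-alpha positions
d12 = math.dist(ca[0], ca[1])        # distance between neighbouring C-alpha atoms
pitch = 1.5 * 360.0 / 100.0          # rise per full turn for these textbook values
```

For these values the distance between neighbouring points is about 3.8 Å, the familiar virtual C_α bond length, which is a quick sanity check that the curve is a plausible model for a backbone trace.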

4.2 The comparisons with the previous programs in helix assignment

In this section we compare the helix assignments by our algorithm, dssp and stride on ℕ (25,806 structures) as well as the assignments by our algorithm and nine previous programs on 𝕋 (100 structures). We also illustrate the differences in assignment by examples.

For the α-helices in ℕ, our assignment agrees very well with both dssp and stride (Table 5). Specifically, the agreement with dssp is 95.3% on a residue basis and 95.1% on a helix basis, and with stride 95.4% and 90.4%. In addition, the most frequently occurring helices from our algorithm and stride have the same length (10 residues) (Fig 3a). However, the most frequently occurring helices assigned by dssp have a length of only 4 residues. This large discrepancy may originate from the fact that dssp relies on only two i, i + 4 hydrogen bonds for assigning a 4-residue helix; two such bonds are likely not strong enough to constrain the C_αs to a helical curve, and thus our algorithm does not assign them as a helix (Fig 3b). In addition to hydrogen bonds, stride also uses backbone ϕ/ψ angles as restraints to exclude such 4-residue helices. In fact, the criteria used by stride seem so restrictive that it excludes most of the 4-residue helices assigned by both our algorithm and dssp (Fig 3a). Though the total numbers of 3₁₀-helices from the three programs, 68,093 (dssp), 70,974 (stride) and 70,208 (our algorithm), are rather similar (Table 3), there are obvious differences at both the residue and helix levels, with the best agreement between our algorithm and either dssp or stride being ≤ 60%. The discrepancies suggest that the few hydrogen bonds in a 3₁₀-helix, whose average length is only 3.3 residues, are not strong enough to constrain the C_αs to a helical curve. The same reason may explain the ≥ 25% difference in π-helix assignment between our algorithm and dssp, since the most frequently occurring π-helices have a length of only 5 residues. For reasons unknown to us, stride fails to assign most π-helices: the agreement between our algorithm and stride is only 2.9%. The discrepancies also suggest that, especially for 3₁₀ and π-helices, the existence of the characteristic hydrogen bonding pattern between protein backbone atoms used by the hydrogen bond-based programs is not a sufficient condition for the backbone segment to fit well to a helical curve.
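
The residue-level agreement quoted above can be sketched as a simple set computation. The snippet assumes that each program's assignment is available as a set of residue numbers and that the agreement is the ratio of n to N implied by the definitions in the caption of Table 5 (n residues assigned by both programs, N residues assigned by our algorithm); the function name and the toy residue numbers are hypothetical.

```python
from typing import Set


def residue_agreement(ours: Set[int], other: Set[int]) -> float:
    """Percentage of residues assigned by `ours` that `other` also assigns."""
    if not ours:
        raise ValueError("reference assignment is empty")
    n = len(ours & other)      # residues assigned by both programs
    N = len(ours)              # residues assigned by the reference program
    return 100.0 * n / N


ours = set(range(10, 30))        # toy helix covering residues 10..29
dssp_like = set(range(12, 31))   # toy helix covering residues 12..30
agreement = residue_agreement(ours, dssp_like)
```

Note that this measure is asymmetric: swapping the two arguments changes the denominator, so the reference program has to be stated whenever a percentage is reported.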

Table 5. The comparisons of helix assignment on ℕ by our algorithm, dssp and stride.

The third column (in boldface) is the agreement on a residue basis computed as n/N, where N is the total number of residues assigned by our program and n is the number of residues assigned by both our algorithm and either dssp or stride. The agreement on a residue basis on π-helix assignment between our algorithm and stride is very poor, only 2.9%. The 4th to 10th columns present various agreements (disagreements) on a helix basis. The columns A, B and C show respectively the agreement with at most one residue difference, with the exclusion of the N-terminal residue and with the exclusion of the C-terminal residue. The column A+B+C (in boldface) sums up the agreements on a helix basis. The next two columns, D and E, present respectively the helices assigned only by our algorithm and the helices assigned only by either dssp or stride. The last column, F, shows the set of helices each of which has been assigned by either dssp or stride as a single helix but is divided into at least two helices by our algorithm (see Figs 4 and 5). All the data are in percentage.

https://doi.org/10.1371/journal.pone.0129674.t005

Fig 3. The distributions of the lengths of the α-helices from our algorithm, dssp and stride (a), and an example of a 4-residue α-helix assigned by dssp (b).

The x-axis in (a) is the helix length while the y-axis is the number of helices with that particular length. The two arrows point to the most frequently occurring helix lengths assigned by dssp and by both our algorithm and stride. The right figure (b) depicts a dssp-assigned 4-residue α-helix in a protein (pdbid 1CC5) that is not assigned to a helix by our algorithm.

https://doi.org/10.1371/journal.pone.0129674.g003

In addition to set ℕ, a set 𝕋 of 100 protein structures with different resolutions has been selected for comparison with nine previous programs, both hydrogen bond-based and geometrical restraint-based. We were able to obtain local copies of only four of these programs (dssp, stride, p-sea and disicl) and not of the other five. Since we could make p-sea work on fewer than half of the structures in ℕ, we only present its assignment on 𝕋. The comparison with the program disicl is also made only on 𝕋. As shown in Table 4, our algorithm agrees with either dssp or stride better than the other programs do. The agreements between any pair of the six geometrical restraint-based programs (p-sea, kaksi, palsse, stick, xtlsstr and disicl) are between 61.4% and 82.0%, and their agreements with the three hydrogen bond-based programs (dssp, stride and sst) are between 71.2% and 83.7%. The most recent geometrical restraint-based program, disicl, developed mainly for the analysis of trajectories from molecular dynamics simulations, performs below the average as judged by its assignment agreements with dssp (76.8%), our algorithm (77.7%) and the seven other programs (61.4%–79.5%). Our algorithm tends to assign a lower number of helices than the previous programs do, likely because of the strict requirement that a backbone segment must fit a genuine helical curve to be a helix. Compared with the other geometrical restraint-based programs, the helices assigned by our algorithm tend to be longer, likely because of the merge step taken by our algorithm. For the comparisons on 𝕋 we do not separate the assigned helices into different types because the helix assignment downloaded from the web server does not have type information. In summary, if we use the assignment made by dssp as the standard, our algorithm is more accurate than any of the six geometrical restraint-based programs with which we have compared it.

The differences in assignment exist not only in the number of residues or helices being assigned but also in the details of assignment, such as the division of a single helix by one program into several helices by another, helix type swaps and differences in helix termination at the C-terminus. Specifically, our algorithm may divide a single helix assigned by either dssp or stride into several helices of possibly different types (Table 5, Figs 4 and 5). Furthermore, an α-helix or a part of an α-helix assigned by either dssp or stride may be assigned as either a 3₁₀ or a π-helix, or vice versa (Figs 4 and 5). Such division happens when the energy difference between the three pairs of residues, (i, i + 3), (i, i + 4) and (i, i + 5), is relatively small (S2 Fig). A hydrogen bond-based algorithm may have difficulty making a proper helix type assignment because of the error in the computed hydrogen bond energy. As shown in Fig 2b and Fig 4a, some of the α-helices assigned by dssp have large bends in the middle. Such helices are absent in our assignment because the residues in such highly bent positions must have a local axis angle a ≥ a_max, where a_max = 40.0° (Fig 2a). Less frequently, dssp may divide a continuous backbone segment into two helices while our algorithm assigns it as a single one (Fig 4b). The two examples shown in Fig 5 illustrate (1) that for some dssp-assigned α-helices our algorithm tends to assign a part of them as a 3₁₀-helix, and (2) the difference in the assignment of the C-terminal regions of helices. The division of a single continuous α-helix assigned by the previous programs into several helices of different types, though less intuitive at first sight, is in fact a more accurate description of a protein structure and, as detailed later, should be useful for the discovery of structure-function relationships in proteins.

Fig 4. The illustration of the differences in assignment by our algorithm and dssp.

In (a) a segment (residues 148–180) in a protein (pdbid 1DI1) assigned as a single α-helix by dssp is shown on the left with residue 163 colored in yellow. Our algorithm divides the same segment into three different helices (the right figure): α-helix, 3₁₀-helix, α-helix. The first helix assigned by our algorithm starts with residue 150, since both residues 148 and 149 have h_i > h_T, and stops at residue 163 (colored in yellow), since its h_i > h_T and no merge is triggered because a_i = 85.12° > a_max = 40.0° for the three residues 164 (colored in red), 165 and 166. The residues 164–166 all have h_i > h_T, so the next helix starts with residue 167. In contrast to dssp, our algorithm assigns the segment of residues 167–171 as a 3₁₀-helix (colored in purple), and assigns the remaining residues 172–182 as an α-helix. Compared with dssp, the C-terminus has been extended by two residues (A181 and Q182, colored in blue). In (b) dssp divides the segment (347–358) in a protein (pdbid 3I32) into two helices, an α-helix (347–353, green) and a π-helix (354–358, red), based on the strength of the hydrogen bonding interaction. In contrast, our algorithm assigns the segment 347–357 as a single α-helix and excludes the last residue 358.

https://doi.org/10.1371/journal.pone.0129674.g004

Fig 5. The illustration of the differences in assignment by our algorithm and dssp.

In (a) dssp assigns the entire segment (51–65, excluding P50) in a protein (pdbid 3OY9) as an α-helix. Our algorithm divides it into two helices: a 3₁₀-helix (51–52, red) and an α-helix (53–63, green, purple, yellow). The α-helix stops at N63 since the C_α RMSD δ values for residues 64 and 65 are respectively 0.541 and 0.431, neither of which is less than d_max = 0.3. In contrast, the dssp-assigned helix extends to residue 65. However, as shown in the left figure, the C_α coordinates of both residues 64 and 65 clearly deviate from a helical curve. In (b) a segment of residues 153–172 in a protein (pdbid 1MHY) is assigned as a single α-helix by dssp while our algorithm divides it into four helices: 3₁₀-helix (154–156, yellow), α-helix (157–163, green), 3₁₀-helix (164–166, purple) and α-helix (167–171, green). However, a careful examination of the hydrogen bond energies for these residues in fact suggests that they could also be assigned to a 3₁₀-helix even by the dssp standard (S2 Fig).

https://doi.org/10.1371/journal.pone.0129674.g005

Though the r, p, t, Δ distributions for the five groups (H, G, I, E and R) have almost no overlap (Fig 1j–1l), the largest peaks in the r, p, t distributions for the three helix types do overlap with the minor peaks in the r, p, t distributions for the residues in group R (peaks B and C of Fig 1p, peak D of Fig 1q and peak F of Fig 1r). These overlaps may make it difficult for our algorithm (and likely also for the previous programs) to distinguish a helix from a segment of protein backbone that is typically assigned as either a loop or a turn. To reduce the possibility of mistaking a turn or a loop for a helix, a term is added to the helix score (Eq 2) to guarantee that the helical curve not only fits optimally to a quadruple of C_αs locally but also fits sub-optimally to all the C_αs of an entire helix. Similarly, the C_α RMSD δ is used as a restraint in 3₁₀-helix assignment to distinguish it from a turn or loop. The largest discrepancy in assignment by different programs is in the C-terminal region of a helix because of the structural similarity between the C-terminal region of a helix and a turn or loop. To improve the accuracy of the C-terminal assignment and to have consistent criteria for the termination of a helix, we have used a C_α RMSD threshold (δ_G) for the termination of a 3₁₀-helix. In addition to the two thresholds h_max = 160.0 and a_max = 40.0° used for the merge of two adjacent α-helices and the termination of an α-helix, we also include a C_α RMSD threshold δ_max = 0.3 to make sure that the part of a helix to be merged does not deviate substantially from a genuine helical curve.
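
How such thresholds cut a backbone into helices can be illustrated with a deliberately simplified sketch. It treats a residue as helix-compatible when its axis angle is below a_max = 40.0° and its C_α RMSD δ is below δ_max = 0.3, and then groups consecutive compatible residues. The actual algorithm also uses the helix score and a merge step, both omitted here; the function name, the minimum segment length and the per-residue values are invented for illustration (two of the values merely echo those quoted in the captions of Figs 4 and 5).

```python
from typing import List, Optional, Tuple

A_MAX = 40.0      # axis-angle threshold in degrees (value from the text)
DELTA_MAX = 0.3   # C-alpha RMSD threshold (value from the text)


def split_segments(angles: List[float], rmsds: List[float],
                   min_len: int = 3) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive residues
    whose axis angle and RMSD both stay below the thresholds."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, (a, d) in enumerate(zip(angles, rmsds)):
        ok = a < A_MAX and d < DELTA_MAX
        if ok and start is None:
            start = i                          # open a new candidate segment
        elif not ok and start is not None:
            if i - start >= min_len:           # keep only long enough runs
                segments.append((start, i - 1))
            start = None
    if start is not None and len(angles) - start >= min_len:
        segments.append((start, len(angles) - 1))
    return segments


# Invented values: a sharp bend at index 5 and a poorly fitting C-terminus.
angles = [5, 8, 6, 7, 9, 85.12, 10, 6, 7, 8, 12, 15]
rmsds = [0.05, 0.04, 0.06, 0.05, 0.07, 0.10, 0.05, 0.06, 0.04, 0.08, 0.541, 0.431]
segments = split_segments(angles, rmsds)
```

In this toy input the bend splits one long run into two segments and the last two residues are excluded, mirroring qualitatively the helix division and the C-terminal truncation discussed for Figs 4 and 5.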

In summary, one advantage of our algorithm is better assignment accuracy and consistency because it is required by our problem formulation and algorithm that all the helices must fit reasonably well to a helical curve. In contrast, for dssp and other hydrogen bond-based programs as long as there exists the characteristic hydrogen bonding pattern for a helix type the participating residues are assigned to that particular helix type. Different hydrogen bonding patterns are distinguished by the relative hydrogen bonding strength between the three pairs of residues (i, i + 3), (i, i + 4) and (i, i + 5). In the twilight region where the difference in strength is small, the hydrogen bond-based programs have difficulty making correct type assignment. Similarly for the geometrical restraint-based programs the error and variation in the input restraints reduce their assignment accuracy and consistency.

4.3 A clustering analysis of the assigned helices

One key feature that distinguishes our algorithm from the previous programs is that the helices assigned by our algorithm must fit a helical curve reasonably well and are thus expected to be structurally uniform. To confirm the uniformity we have clustered several sets of the helices assigned by our algorithm, dssp and p-sea on ℕ. The clustering is performed on sets of helices having the same length. Six sets of clusters of α-helices are shown in Fig 6. The first three sets of helices have a length of 12 residues and the other three sets a length of 24 residues. In general, the number of clusters in a set with the same helix length assigned by our algorithm is about half the number of clusters in the corresponding set by dssp. Though the total numbers of clusters by our program and p-sea are similar, p-sea is able to assign only about half as many helices of the same length. As shown in Fig 6, the structures of the helices assigned by p-sea are clearly less uniform than those by our algorithm, as evidenced by the appearance in the former of several clusters consisting only of outliers. See S3 Fig for the comparisons of the clusters on four sets of 3₁₀ and π-helices assigned respectively by our algorithm and dssp. In summary, the helices assigned by our algorithm are structurally more uniform than those assigned by the previous algorithms.

Fig 6. The clusters of α-helices by our algorithm, dssp and p-sea.

The sets of helices on the left have a length of 12 residues while the sets on the right have a length of 24 residues. The 12-residue set (11,756 helices) and 24-residue set (1,211 helices) by our algorithm are classified respectively into 12 and 17 clusters. The dssp-assigned 12-residue set (12,631 helices) and 24-residue set (1,285 helices) are classified respectively into 21 and 35 clusters, while the p-sea-assigned 12-residue set (5,306 helices) and 24-residue set (574 helices) are classified respectively into 10 and 24 clusters. The clusters are produced using our geometric clustering algorithm [22]. The RMSD threshold for clustering is 1.5 Å.

https://doi.org/10.1371/journal.pone.0129674.g006

4.4 The correlation between π-helices and protein-ligand binding sites

Our algorithm is able to assign a helix to a particular type. As detailed above, for α-helices the agreement between our algorithm and either dssp or stride is excellent. However, for both 3₁₀ and π-helices the agreements are between 56.9% and 74.1%. An interesting application to structure-function relationships [26] is to see whether there exists any correlation between the location of a π-helix and a protein-ligand binding site. Out of the 25,806 protein structures in ℕ, our algorithm has assigned 6,600 π-helices from 4,329 protein structures, 3,811 of which have a bound ligand. We compute the number of π-helices that are within a certain distance range of the bound ligand. The distance between a ligand and a π-helix is defined as the shortest distance between any ligand atom and any protein atom that belongs to the π-helix. As shown in Fig 7a, there exists a strong correlation between the location of a π-helix and a protein-ligand binding site: 38.6% of all the π-helices assigned by our algorithm are less than 6.0 Å away from a ligand binding site. Furthermore, as shown in Fig 7b, our algorithm is able to assign more of such π-helices than dssp. Such a correlation should be helpful for the discovery of structure-function relationships in proteins.
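
The distance definition above translates directly into a minimum over atom pairs. The sketch below assumes atoms are given as (x, y, z) tuples in Å; the function names and toy coordinates are hypothetical, and the brute-force double loop is chosen for clarity rather than speed.

```python
import math
from typing import List, Sequence, Tuple

Atom = Tuple[float, float, float]


def helix_ligand_distance(helix_atoms: Sequence[Atom],
                          ligand_atoms: Sequence[Atom]) -> float:
    """Shortest distance between any helix atom and any ligand atom."""
    return min(math.dist(h, l) for h in helix_atoms for l in ligand_atoms)


def fraction_near_ligand(helices: List[Sequence[Atom]], ligand: Sequence[Atom],
                         cutoff: float = 6.0) -> float:
    """Fraction of helices closer than `cutoff` (in Å) to the ligand."""
    near = sum(1 for h in helices if helix_ligand_distance(h, ligand) < cutoff)
    return near / len(helices)


ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
helix_a = [(4.5, 0.0, 0.0), (8.0, 0.0, 0.0)]      # nearest atom pair is 3.0 Å apart
helix_b = [(0.0, 20.0, 0.0), (0.0, 25.0, 0.0)]    # nearest atom pair is 20.0 Å apart
d_a = helix_ligand_distance(helix_a, ligand)
frac = fraction_near_ligand([helix_a, helix_b], ligand)
```

For whole-database scans a spatial index such as a cell list or k-d tree would avoid the quadratic cost per helix, but the result is the same minimum distance.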

Fig 7. A histogram of a π-helix vs a protein-ligand binding site.

Figure (a) shows a histogram of the π-helices assigned by our algorithm vs their distance to a protein-ligand binding site. The x-axis is the distance (in Å) between the ligand and a π-helix while the y-axis is the number of π-helices. Figure (b) shows an example illustrating the difference in π-helix assignment by our algorithm and dssp. Our algorithm assigns three π-helices labeled π₁, π₂ and π₃, all of which are in the ligand binding site, whereas dssp fails to assign π₁ though it is able to identify both π₂ and π₃. In this figure, α-helices are in green, 3₁₀-helices are in orange and the ligand is shown as spheres.

https://doi.org/10.1371/journal.pone.0129674.g007

5 Conclusion

We have divided the protein helix assignment problem into a minimization problem and a restraint satisfaction problem and developed an assignment algorithm that rigorously follows helix geometry. The application of the algorithm to the set of protein structures available in the current version of the PDB and the detailed comparisons of assignments made by the algorithm and the nine previous programs show that our algorithm is able to assign more accurately not only α-helices but also 3₁₀ and π-helices as well as the left-handed helices. The clustering analyses of the helices assigned by our algorithm and the previous programs confirm that the helices from our algorithm are structurally more uniform than those from the previous programs. The accurately assigned helices and the clusters, as well as the common structural features shared by all the helices in a cluster, should be particularly useful for protein structure classification and prediction as well as secondary structure prediction. The accurate assignment of a helix to a particular type should be helpful for the discovery of structure-function relationships in proteins.

Supporting Information

S1 Fig. The distributions of C_α RMSD δ and helix axis angle of the residues in the C-terminal regions of dssp assigned helices.

The x-axis is the helix axis angle in degrees and the y-axis is the C_α RMSD δ in Å. The region inside the rectangle defined by the two intervals, [0.0°, 40.0°] for the axis angle and [0.0, 0.1 Å] for the C_α RMSD δ, includes 83.5% of all the 29,014 data points computed on set 𝕊. What is shown here is the data for residue i of the two C-terminal residues i, i + 1 of a dssp-assigned helix.

https://doi.org/10.1371/journal.pone.0129674.s001

(EPS)

S2 Fig. The hydrogen bond energy used by dssp for the helix assignment.

Figure S2 is extracted from the dssp assignment for the protein 1MHY (pdbid), the same protein as shown in Fig 5b of the main paper. This example illustrates that in the twilight region, where the differences in hydrogen bond strength are small, a hydrogen bond-based program such as dssp may have difficulty making a correct helix type assignment. The three red rectangles indicate donor interactions while the three blue ones indicate acceptor interactions. The program dssp assigns all the residues from 152 to 172 as a single α-helix while our algorithm assigns the segment from 154 to 156 and the segment from 164 to 166 as two different 3₁₀-helices. A careful examination of the hydrogen bond energies for the residues in the six rectangles in fact suggests that they may also be assigned to a 3₁₀-helix even by the dssp standard.

https://doi.org/10.1371/journal.pone.0129674.s002

(EPS)

S3 Fig. A clustering analysis of 3₁₀ and π-helices.

The figure shows the clusters generated by our geometric clustering algorithm [22] on sets of the 3₁₀ and π-helices assigned respectively by our algorithm and dssp. The two sets of 3₁₀-helices on the left (a) have a length of 4 residues while the two sets of π-helices on the right (b) have a length of 5 residues. The 4-residue set (8,861 3₁₀-helices) and 5-residue set (4,554 π-helices) by our algorithm are classified respectively into 4 and 7 clusters (the two upper figures). The dssp-assigned 4-residue set (8,563 3₁₀-helices) and 5-residue set (5,365 π-helices) are classified respectively into 29 and 25 clusters (the two lower figures). The RMSD thresholds for the clustering of 3₁₀-helices and π-helices are respectively 0.3 Å and 0.75 Å.

https://doi.org/10.1371/journal.pone.0129674.s003

(EPS)

Author Contributions

Conceived and designed the experiments: CC LW. Performed the experiments: CC SX LW. Analyzed the data: CC SX. Wrote the paper: LW CC.

References

1. Pauling L, Corey RB. Configurations of polypeptide chains with favored orientations around single bonds: two new pleated sheets. Proceedings of the National Academy of Sciences of the United States of America. 1951;37(11):729. pmid:16578412
2. Kendrew J, Dickerson R, Strandberg B, Hart R, Davies D, Phillips D, et al. Structure of myoglobin. Nature. 1960;185:422–427.
3. Dickerson RE, Drew HR, Conner BN, Wing RM, Fratini AV, Kopka ML. The anatomy of A-, B-, and Z-DNA. Science. 1982;216(4545):475–485. pmid:7071593
4. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637. pmid:6667333
5. Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics. 1995;23(4):566–579.
6. Fodje M, Al-Karadaghi S. Occurrence, conformational features and amino acid propensities for the π-helix. Protein engineering. 2002;15(5):353–358. pmid:12034854
7. Konagurthu AS, Lesk AM, Allison L. Minimum message length inference of secondary structure from protein coordinate data. Bioinformatics. 2012;28(12):i97–i105. pmid:22689785
8. Levitt M, Greer J. Automatic identification of secondary structure in globular proteins. Journal of molecular biology. 1977;114(2):181–239. pmid:909086
9. Richards FM, Kundrot CE. Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure*. Proteins: Structure, Function, and Bioinformatics. 1988;3(2):71–84.
10. Labesse G, Colloc’h N, Pothier J, Mornon JP. P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Computer Applications in the Biosciences. 1997; 291–295. pmid:9183534
11. Sklenar H, Etchebest C, Lavery R. Describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis. Proteins: Structure, Function, and Bioinformatics. 1989;6(1):46–60.
12. Majumdar I, Krishna SS, Grishin NV. PALSSE: A program to delineate linear secondary structural elements from protein structures. BMC bioinformatics. 2005;6(1):202. pmid:16095538
13. Taylor WR. Defining linear segments in protein structure. Journal of molecular biology. 2001;310(5):1135–1150. pmid:11502001
14. King SM, Johnson WC. Assigning secondary structure from protein coordinate data. Proteins: Structure, Function, and Bioinformatics. 1999;35(3):313–320.
15. Martin J, Letellier G, Marin A, Taly JF, De Brevern AG, Gibrat JF. Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC Structural Biology. 2005;5(1):17. pmid:16164759
16. Nagy G, Oostenbrink C. Dihedral-Based Segment Identification and Classification of Biopolymers I: Proteins. Journal of Chemical Information and Modeling. 2014;54(1):266–277. pmid:24364820
17. Cuff JA, Barton GJ. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics. 1999;34(4):508–519.
18. Andersen CA, Rost B. Secondary structure assignment. Structural Bioinformatics. 2009;44:459–484.
19. Colloc’h N, Etchebest C, Thoreau E, Henrissat B, Mornon JP. Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Protein engineering. 1993;6(4):377–382. pmid:8332595
20. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR, et al. The protein data bank. European Journal of Biochemistry. 1977;80(2):319–324. pmid:923582
21. Novotny M, Kleywegt GJ. A survey of left-handed helices in protein structures. Journal of molecular biology. 2005;347(2):231–241. pmid:15740737
22. Xu S, Zou S, Wang L. A Geometric Clustering Algorithm and Its Applications to Structural Data. In: Research in Computational Molecular Biology. Springer; 2014. 356–370.
23. Klose DP, Wallace BA, Janes RW. 2Struc: the secondary structure server. Bioinformatics. 2010;26(20):2624–2625. pmid:20739308
24. Goriely A, Hausrath A, Neukirch S. The differential geometry of proteins and its applications to structure determination. Biophysical Reviews and Letters. 2008;3(1–2):77–101.
25. Shimanouchi T, Mizushima Si. On the helical configuration of a polymer chain. The Journal of Chemical Physics. 1955;23(4):707–711.
26. Wang L, Li Y, Yan H. Structure-function relationships of cellular retinoic acid-binding proteins: quantitative analysis of the ligand binding properties of the wild-type proteins and site-directed mutants. Journal of Biological Chemistry. 1997;272(3):1541–1547. pmid:8999826

[ref1] 1. Pauling L, Corey RB. Configurations of polypeptide chains with favored orientations around single bonds: two new pleated sheets. Proceedings of the National Academy of Sciences of the United States of America. 1951;37(11):729. pmid:16578412
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Kendrew J, Dickerson R, Strandberg B, Hart R, Davies D, Phillips D, et al. Structure of myoglobin. Nature. 1960;185(422):427–1960.
View Article
Google Scholar

[6] View Article

[7] Google Scholar

[ref3] 3. Dickerson RE, Drew HR, Conner BN, Wing RM, Fratini AV, Kopka ML. The anatomy of A-, B-, and Z-DNA. Science. 1982;216(4545):475–485. pmid:7071593
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref4] 4. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637. pmid:6667333
View Article
PubMed/NCBI
Google Scholar

[13] View Article

[14] PubMed/NCBI

[15] Google Scholar

[ref5] 5. Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics. 1995;23(4):566–579.
View Article
Google Scholar

[17] View Article

[18] Google Scholar

[ref6] 6. Fodje M, Al-Karadaghi S. Occurrence, conformational features and amino acid propensities for the π-helix. Protein engineering. 2002;15(5):353–358. pmid:12034854
View Article
PubMed/NCBI
Google Scholar

[20] View Article

[21] PubMed/NCBI

[22] Google Scholar

[ref7] 7. Konagurthu AS, Lesk AM, Allison L. Minimum message length inference of secondary structure from protein coordinate data. Bioinformatics. 2012;28(12):i97–i105. pmid:22689785
View Article
PubMed/NCBI
Google Scholar

[24] View Article

[25] PubMed/NCBI

[26] Google Scholar

[ref8] 8. Levitt M, Greer J. Automatic identification of secondary structure in globular proteins. Journal of molecular biology. 1977;114(2):181–239. pmid:909086
View Article
PubMed/NCBI
Google Scholar

[28] View Article

[29] PubMed/NCBI

[30] Google Scholar

[ref9] 9. Richards FM, Kundrot CE. Identification of structural motifs from protein coordinate data: Secondary structure and first-level supersecondary structure*. Proteins: Structure, Function, and Bioinformatics. 1988;3(2):71–84.
View Article
Google Scholar

[32] View Article

[33] Google Scholar

[ref10] 10. Labesse G, Colloc’h N, Pothier J, Mornon JP. P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Computer Applications in the Biosciences. 1997; 291–295. pmid:9183534
View Article
PubMed/NCBI
Google Scholar

[35] View Article

[36] PubMed/NCBI

[37] Google Scholar

[ref11] 11. Sklenar H, Etchebest C, Lavery R. Describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis. Proteins: Structure, Function, and Bioinformatics. 1989;6(1):46–60.
View Article
Google Scholar

[39] View Article

[40] Google Scholar

[ref12] 12. Majumdar I, Krishna SS, Grishin NV. PALSSE: A program to delineate linear secondary structural elements from protein structures. BMC bioinformatics. 2005;6(1):202. pmid:16095538

[ref13] 13. Taylor WR. Defining linear segments in protein structure. Journal of molecular biology. 2001;310(5):1135–1150. pmid:11502001

[ref14] 14. King SM, Johnson WC. Assigning secondary structure from protein coordinate data. Proteins: Structure, Function, and Bioinformatics. 1999;35(3):313–320.

[ref15] 15. Martin J, Letellier G, Marin A, Taly JF, De Brevern AG, Gibrat JF. Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC Structural Biology. 2005;5(1):17. pmid:16164759

[ref16] 16. Nagy G, Oostenbrink C. Dihedral-Based Segment Identification and Classification of Biopolymers I: Proteins. Journal of Chemical Information and Modeling. 2014;54(1):266–277. pmid:24364820

[ref17] 17. Cuff JA, Barton GJ. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics. 1999;34(4):508–519.

[ref18] 18. Andersen CA, Rost B. Secondary structure assignment. Structural Bioinformatics. 2009;44:459–484.

[ref19] 19. Colloc’h N, Etchebest C, Thoreau E, Henrissat B, Mornon JP. Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Protein engineering. 1993;6(4):377–382. pmid:8332595

[ref20] 20. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR, et al. The protein data bank. European Journal of Biochemistry. 1977;80(2):319–324. pmid:923582

[ref21] 21. Novotny M, Kleywegt GJ. A survey of left-handed helices in protein structures. Journal of molecular biology. 2005;347(2):231–241. pmid:15740737

[ref22] 22. Xu S, Zou S, Wang L. A Geometric Clustering Algorithm and Its Applications to Structural Data. In: Research in Computational Molecular Biology. Springer; 2014. p. 356–370.

[ref23] 23. Klose DP, Wallace BA, Janes RW. 2Struc: the secondary structure server. Bioinformatics. 2010;26(20):2624–2625. pmid:20739308

[ref24] 24. Goriely A, Hausrath A, Neukirch S. The differential geometry of proteins and its applications to structure determination. Biophysical Reviews and Letters. 2008;3(1–2):77–101.

[ref25] 25. Shimanouchi T, Mizushima Si. On the helical configuration of a polymer chain. The Journal of Chemical Physics. 1955;23(4):707–711.

[ref26] 26. Wang L, Li Y, Yan H. Structure-function relationships of cellular retinoic acid-binding proteins: quantitative analysis of the ligand binding properties of the wild-type proteins and site-directed mutants. Journal of Biological Chemistry. 1997;272(3):1541–1547. pmid:8999826
