Using First Passage Statistics to Extract Environmentally Dependent Amino Acid Correlations

Benjamin D. Greenbaum; Pradeep Kumar; Albert Libchaber

doi:10.1371/journal.pone.0101665

Abstract

In this work, we study the first passage statistics of amino acid primary sequences, that is the probability of observing an amino acid for the first time at a certain number of residues away from a fixed amino acid. By using this rich mathematical framework, we are able to capture the background distribution for an organism, and infer lengths at which the first passage has a probability that differs from what is expected. While many features of an organism's genome are due to natural selection, others are related to amino acid chemistry and the environment in which an organism lives, constraining the randomness of genomes upon which selection can further act. We therefore use this approach to infer amino acid correlations, and then study how these correlations vary across a wide range of organisms under a wide range of optimal growth temperatures. We find a nearly universal exponential background distribution, consistent with the idea that most amino acids are globally uncorrelated from other amino acids in genomes. When we are able to extract significant correlations, these correlations are reliably dependent on optimal growth temperature, across phylogenetic boundaries. Some of the correlations we extract, such as the enhanced probability of finding, for the first time, a cysteine three residues away from a cysteine or glutamic acid two residues away from an arginine, likely relate to thermal stability. However, other correlations, likely appearing on alpha helical surfaces, have a less clear physiochemical interpretation and may relate to thermal stability or unusual metabolic properties of organisms that live in a high temperature environment.

Citation: Greenbaum BD, Kumar P, Libchaber A (2014) Using First Passage Statistics to Extract Environmentally Dependent Amino Acid Correlations. PLoS ONE 9(7): e101665. https://doi.org/10.1371/journal.pone.0101665

Editor: Dmitri Boudko, Rosalind Franklin University, United States of America

Received: May 22, 2014; Accepted: June 6, 2014; Published: July 7, 2014

Copyright: © 2014 Greenbaum et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. We use whole genome sequence data. All genomes were downloaded from the UCSC genome browser and the UCSC microbial genome browser. The complete list of organisms is listed in our Supplementary Table 1.

Funding: The work was supported by the Simons Center for Systems Biology at the Institute for Advanced Study. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

First passage statistics provide a natural mathematical framework for analyzing the likelihood of the first occurrence of an event after some initial event [1]. While many analytical results have been derived using these distributions in several fields, here we present a novel application of these distributions to a problem in genome distributions. Namely, what is the best method, both in terms of the underlying mathematics and empirical application to sequence datasets, to infer the set of amino acid correlations in proteins that are dependent on the environment in which they function.

The genome sequencing of many extremophiles has created an opportunity to probe basic, yet practical questions about how an organism's physiochemical environment affects its genome [2], [3]. Extremophiles have been sequenced across vast evolutionary distances, representing a broad range of environmental conditions. Hence these organisms provide a deep-field lens for resolving how variations in physiochemical environment alter genome characteristics. Motivated by this, several authors have noted the effects of optimal growth temperature (OGT) on various amino acid features [4]–[9].

Globally, the frequencies of most amino acids are unrelated to the context of the amino acid words in which they appear [10]. Moreover, one can show that the observed OGT dependence of amino acid words can often be explained in terms of the OGT dependence of individual amino acids or small subwords [11]. Consequently, we argue that first passage distributions are a natural mathematical language for inferring environmentally dependent correlations between discontiguous residues. In the genomic context, these distributions are the probability of finding an amino acid for the first time at a specific number of residues away from a given amino acid.

We observe a universal exponential background distribution for amino acid first passage distributions across a set of 76 organisms, chosen to represent the range of well-characterized optimal growth temperatures among organisms with fully sequenced proteomes. The exponential background distribution implies that typically there is little genome wide correlation between amino acids. We use this observation to infer an empirical exponential background distribution - essentially inferring the decay parameter. We can then discover non-random correlations when there are observed first passage lengths that vary significantly from the background distribution. With this method, we hypothesize that we can reliably extract a set of environmentally dependent correlations, and in fact of all of those correlations we deem significant are OGT dependent. Some likely relate to known thermal adaptations, such as disulfide bonding and salt-bridge formation. Yet our approach also captures unexplained, OGT-dependent effects. Hence careful exploration of these issues may yield unforeseen phenomena from extremophile genomes.

Results

First Passage Distributions

We analyzed first passage distributions across the entire proteomes of 76 organisms (Table S1) [12]–[13]. These organisms were chosen so that a wide range of organisms with both fully sequenced proteomes and experimentally well-characterized OGTs were included. As a result, psychrophiles, mesophiles and thermophiles are all well sampled across the known range of organism growth temperatures. We wanted to examine correlations directly dependent on OGT. First passage statistics were chosen to take into account the OGT dependence of amino acids in the background distribution, while possessing the statistical power to resolve discontiguous correlations over large lengths. These distributions are defined as the probability that the first occurrence of amino acid occurs residues away from amino acid without having occurred beforehand. This is denoted by

where denotes the residue a distance from [1].

For a given chain one can calculate the typical first passage distribution in either the N-terminus or C-terminus direction, which we respectively denote by and . If amino acids were uncorrelated from their neighbors, this would reduce to a geometric distribution, the discrete analog of an exponential distribution [14]. In this case the probability would only depend on the frequency of and should, therefore, reflect the OGT dependent properties of individual amino acids, allowing the OGT dependence of outliers to be clearly established. However, while many genome wide amino acid probabilities are independent, some probabilities are not [10]. Our approach addresses that possibility. As shown in the Methods Section, the log of our distribution often becomes linear after a few residues of separation, with a slope dependent on and not . We therefore assume as our theoretical background distribution for , where is derived from the best linear fit to the logarithm of the first passage distribution versus . Practically, we derive it over values of less than 50, due to noise induced by finite gene size.

As an example Fig. 1 shows the number of first passages as a function of for CC and RE for two organisms, a mesophile, humans, and the thermophile T. petrophila. In both cases, and stand out from the background distribution. In the human genome, which is both longer and has longer genes, additional structures at larger appear. However, the relative height of the peak to the background distribution is higher for the thermophile, a pattern we explore further.

Download:

Figure 1. Number of first passages from cysteine to cysteine in (a) T. petrophila and (b) human proteomes, and from arginine to glutamic acid in (c) T. petrophila and (d) human proteomes.

https://doi.org/10.1371/journal.pone.0101665.g001

Previous authors have explored different measures of pairwise correlation between residues in proteomes [15]–[16]. Liang, et al. calculated the frequency of chains of length beginning with amino acid that terminate with amino acid , divided by the frequency of in the proteome. They recover a small subset of the correlations we find via first-passage statistics, with no measure of OGT dependence for individual correlations. However, they also declare many correlations significant which would not be considered significant by our approach, suggesting a higher false-positive rate in addition to less sensitivity. We believe this is because many of the amino acid correlations they noted, that we would deem insignificant, have individually OGT dependent frequencies. As our approach takes such effects into account explicitly by fitting the empirical background distribution rather than using a model clearly related to individual amino acid frequencies, these correlations would not be deemed significant.

Rosato, et al. calculated a similarly motivated odds ratio by comparing the number of times amino acid occurs residues away from amino acid and dividing this by the expected number of times this would occur in a random proteome with the same length, number of proteins, and amino acid frequencies [17]. The only non-nearest neighbor correlation they were able to identify was that of a cysteine three residues away from another cysteine, whereas our approach reveals many other sources of such discontiguous correlation, all of which correlate to OGT.

Our method does not assume individual amino acid independence, even though that is often a reasonable assumption, and separates true discontiguous correlations from those due to many small words by excluding events where previously occurred. We believe this is a key advantage over previous approaches. It derives significance from deviations to the long-range exponentiality of these distributions, even when the geometric distribution built from individual amino acid frequencies is inadequate. As shown in the Methods Section, typically depends on and varies with OGT when the frequency of depends on OGT, and does not depend on . If is more abundant at high OGT amino acids will typically pass to faster, and if it is less abundant more slowly. is a strong exception. Though less frequent at high OGT, cysteines average return length is faster - indicating that as cysteines become rare having them nearby at high OGT is more significant.

As a result of our improved background, all of our significant deviations from the expected exponential distribution depend on OGT. We extract these events from the logarithm of the ratio of the empirical value of to its expected value from the underlying exponential background, termed . This quantity is essentially the logarithm of the amplitude of the peaks, such as those demonstrated in Figure 1, above the background distribution – representing the logarithm of a signal to noise ratio. Table 1 looks at the most commonly over-represented pairs, while Table 2 looks at the same information for under-represented pairs. Those listed have a logarithm of the amplitude height above background that is at least three standard deviations from the from the exponential background distribution within an individual organism's proteome for a minimum of 25 of the organisms studied. is the most significant, by a substantial margin, both in magnitude and number of organisms in which it is over-represented. Several over-represented pairs fall into categories, such as having a positively and negatively charged side chain (RE, KE, ER, EK). Leucine, the amino acid whose frequency is higher than expected from its mass, is frequently in over-represented pairs (LE, LQ, LK, LR) [10].

Download:

Table 1. Over-represented First Passage Statistics.

https://doi.org/10.1371/journal.pone.0101665.t001

Download:

Table 2. Under-represented First Passage Statistics.

https://doi.org/10.1371/journal.pone.0101665.t002

We examine in detail a set of correlations along the N-Terminal direction, all of which have C-Terminal analogs. We focus on four cases from Table 1: , (which has similar behavior to ), (the most commonly over-represented leucine containing pair), and , a correlation previously unobserved to our knowledge. These pairs have the highest peak values above background and have distinct behavior from each other.

Dependence of Quantities on Habitat

To determine which signals vary with OGT, we plot, in Figure 2, the OGT for all organisms as a function of the logarithm of the peak amplitudes above background. We also calculate the linear correlation (Pearson) and Spearman correlation for these quantities, along with p-values, as described in the methods section. Figure 2 shows the dependence between OGT and with a 0.7054 linear correlation (1.1267×10⁻¹² p-value) and 0.6993 Spearman correlation (2.1146×10⁻¹² p-value), with a 0.7724 linear correlation (1.1267×10⁻¹⁶ p-value) and 0.7841 Spearman (5.4453×10⁻¹⁷ p-value), and with a 0.8013 linear correlation (3.5598×10⁻¹⁸ p-value) and 0.8155 Spearman (2.9905×10⁻¹⁹ p-value). For , OGT dependence holds for all non-eukaryotes. Eukaryotes are in the far left cluster. Previous results did not show a measure of strength of correlation between OGT and a signal to noise parameter, which would be analogous to the quantity we define below. The figures shown in references [15]–[17] indicate that these approaches show a noisier dependence between the statistical quantities defined by those authors and OGT.

Download:

Figure 2. OGT as a function of the logarithm of the real to expected first passage probability, essentially the non-random parts of the amplitudes for peaks such as those in Figure 1.

These are plotted for (a) CC at three residues, (b) RE at three residues, (c) LE at two residues, and (d) LK at two residues, where blue circles indicate OGT ranging from 5°C to about 60°C, and red circles represent hyperthermophilic organisms with OGT above 60°C.

https://doi.org/10.1371/journal.pone.0101665.g002

The significance of has clear interpretations. Increased clustering of cysteines may lead to an increase in disulfide bonds that can be used more frequently per residue for thermal stability or correlated effects from metal-ion binding are more prevalent [18]–[26]. As noted, cysteines become rarer at high OGT, yet seem to retain their importance for disulfide bonding [10]. For instance, many high-temperature environments tend to be more acidic. Since low pH inhibits thiol-disulfide exchange, one may expect that cysteine clustering would increase stability in the proteome of an organism in an acidic environment. Hence, there is an important tension between the rarity of cysteines at high OGT, and their potential increase in importance as a stabilizier.

Less noted, is the OGT dependence of the . The two are oppositely charged, and may represent the stabilization of proteins via salt-bridge formation [27]–[30]. Intriguingly, we find that a great deal of information can be drawn from the unusual metabolism of organisms farthest from the best-fit line between and OGT. Those furthest from that line in the upper half-plane are S. azorense and I. aggregans, and in the lower half-plane are G. metallireducens and G. sulfurreducens. This suggests a possible method by which unusual metabolic properties could be uncovered by deviation from a correlation line. Both the interpretation of why that particular amino acid correlation is proportional to OGT, and why sulfur respiration seems reflected in outliers, contain biological insight. While the former likely relates to an organism's utilization of salt-bridge formation, the later suggests a statistical measure for prediction of novel metabolism, which has practical uses such as in microbial cleaning of contaminated soil [31]–[32].

has the strongest linear correlation with OGT. Surprisingly, the plot of versus OGT (Fig. 2d), which has a slightly weaker, though highly significant, temperature dependence (0.6278 linear with 1.2765×10⁻⁹ p-value, 0.6042 Spearman with 7.5086×10⁻⁹ p-value), clusters these organisms into two main groups. One set, in red, contains most thermophiles while the other, colored blue, contains lower OGT organisms. This suggests a genome wide statistical property can classify thermophiles from non-thermophiles. Moreover, the quantity lacks an obvious interpretation.

Disscusion

We have shown that first passage distributions identified multiple OGT dependent statistical measures, indicating that they are a natural language for the inference of environmentally dependent amino acid correlations. We found a general exponential decay of first passage distributions. As indicated in Table S2, the inferred exponential first passage distribution often is, though not always, what one would expect if an amino acid is uncoupled from its neighbors.

We believe this is analogous to the rapid mixing that can generate exponential first passage distributions in a Markov Chain [33]–[35]. When an amino acid is not independent, it could appear in small, weak local groupings, enhancing the probability of staying near a small set of states before undergoing a transition to a forbidden state, making the decay time longer when independence is violated [33]. Moreover, in microbes, noise due to small genome size may mask richer behavior such as short-range deviations from exponentiality, resulting in stretched exponential distributions. Future studies should also compare the OGT dependent discontiguous correlations found by this measure to other approaches to inferring long-range correlations from primary amino acid sequences [36]–[37].

First passage distributions identified multiple OGT dependent statistical measures, many of which were unknown or previously less well characterized. Deviants from the general pattern of exponential decay indicate significant correlations that appear to be related to physical interactions such as disulfide bonding, from an enhancement of cysteines three amino acids away from cysteine, and salt-bridge formation, from sulfur metabolism. Moreover they identified a set of OGT dependent measures lacking a clear physical explanation, implying that this approach may be used to identify temperature sensitive amino acid correlations to uncover important properties of proteins in extreme environments.

Other than the cysteine peak, likely related to disulfide bonding, the first passage statistics singled out in Table 2a may also indicate specific enhancements of alpha helices that appear on the surface of a protein [38]. The significant values at a residue distance of two, one of which was the stringest observed correlation, all contain one hydrophobic and one hydrophilic residue, likely indicating one residue facing the inside and one facing the outside of a protein alpha helix. For those at length three and four, other than the cysteine peak at three residues, all residues are polar (REK), indicating alpha-helices facing the outside of a protein. In both cases, there would be an interaction with the external environment generating the observed OGT dependence, along with predictions about which specific residue patterns are enhanced. Further sequencing and characterization of extremophile habitats, may uncover new, unexpected effects, while furthering our understanding of what is a typical protein in a given environment.

Methods

Correlations

In general correlation coefficients were calculated using the “corr” functions in Matlab for the Pearson and Spearman correlation coefficients. P-values were calculated using Student's t-test for the Pearson correlation and a permutation test for the Spearman correlation, also using the Matlab “corr” function. Typically both coefficients are quite close in value. When no other method is indicated in the text an “r” value indicates the Pearson correlation.

Empirical Evidence for Exponential Distribution

To test the hypothesis that the appropriate background distribution for the amino-acid first passage distributions defined in the text is indeed exponential, we performed the following set of tests. Each test was performed independently for each organism. Organisms were never compared under the same statistical test for these p-value calculations.

For each first passage distribution, , we calculated the linear correlation coefficient between the value of and the logarithm of the first passage probability. Again, this was done separately for each organism. A p-value was assigned to each correlation to determine the significance of that linear correlation using Student's t-distribution for a transformation for the correlation. Both tasks performed using the function “corrcoef” in Matlab, using Pearson correlation.

In Table S2, for , the above mentioned p-values are explored for each of the 400 amino acid pairs that were tested separately for each organism. Even if multiple hypothesis testing is strictly taken into account, by treating each of the 76 cases of 400 hypothesis separately for a p-value cutoff of 0.05/30400 for a -log(p-value) of 13.318, there are only a 11 cases where the linear correlation is insignificant. All of them involve low frequency amino acids, and therefore may be due to large fluctuations. The minimum, maximum and medium values of -log(p-value) are shown in Table S2 for the 400 possible values of . Of the significant values, eight of them are between tryptophan and cysteine, two are between cysteine and tryptophan, and one is between methionine and cysteine. All occur in extremophiles where cysteines are rare and have large fluctuations.

Also shown in this table are the estimates for the average first passage, as derived from the background distribution, the standard deviation, and minimum and maximum values. Specifically we show the case . In most cases the values are at least tens of residues (than the typical length at which amino acid word statistics are significant [10]), and depend on only. To establish the hypothesis that dependence is typically on rather than , we compared the median value of to the values of using the Kruskal-Wallis test, and also compare them to the values of . When compared to there was a very significant p-value of less than 10⁻⁷¹. When compared to there was no significance. This would imply that an analogous process the “mixing” mechanism for exponentiality in Markov processes is responsible for this effect [33]–[35]. In this case the distribution becomes stationary well before all states are explored.

If all amino acids were independently distributed according to their empirical frequencies, the first passage distribution would be a geometric distribution determined by the frequency of amino acid . In that case the independent decay constant, , would be

Associated with this is a geometric decay length, . If , the observed exponential distribution decays more slowly than in the case when amino acids are independently distributed and, likewise, it decays would decay faster if . In Table S2, we show the median value of the ratio . In many cases the ratio is quite close to 1, indicating that the amino acids are basically independent of each other. However for those quantities to be calculated correctly, occurrences due to significant peaks would have to be removed, and a new background frequency inserted into the above formula.

Temperature Dependence of Exponential Decay

The Spearman correlation between the average expected length of first passage and OGT is shown in the final column of Table S2. All distributions where is equal to V, E or Y have a negative correlation less than −0.5. These amino acids are more abundant at high OGT, and therefore an amino acid will typically arrive at them more quickly when this occurs, causing the mean length of first passage to decrease. Likewise, H, T and Q, which become rare at higher OGT, all have longer means. The strongest exception to this rule is . Despite the fact that cysteines typically become rare at higher OGT, its mean length of first passage to another cysteine becomes shorter. The implication is that at high OGT, while C becomes less frequent it is more and more important that cysteines appear near each other. In addition to the significant spike at , cysteines likely have many weak correlations with each other at higher OGT at different length scales. Such an effect only reinforces taking into account the background distribution in a manner that does not assume complete independence of residues.

Supporting Information

Table S1.

Organisms studied, along with optimal growth temperature (OGT).

https://doi.org/10.1371/journal.pone.0101665.s001

(XLSX)

Table S2.

Range of first-passage statistics for four hundred amino acid pairs. Maximum, minimum and median values are listed for within organism p-values and typical first-passage lengths across the organisms studied. The median values for the ratio of the observed first-passage length to what one would expect from a geometric distribution, and the Spearman correlation are also listed.

https://doi.org/10.1371/journal.pone.0101665.s002

(XLSX)

Acknowledgments

BDG was the Eric and Wendy Schmidt member at the Institute for Advanced Study while this work was initiated and would like to thank them for their support. He would like to thank Arnold Levine, Tsvi Tlusty, and Gérard Ben Arous for helpful discussions. The authors would like to thank Professor Talluri Sekhar at GITAM University for his comments regarding the alpha-helical relevance of significant first passages, which are included in the discussion.

Author Contributions

Conceived and designed the experiments: BG AL. Performed the experiments: BG. Analyzed the data: BG AL PK. Wrote the paper: BG PK AL.

References

1. Chung K (1967) Markov Chains with Stationary Transition Probabilities. New York: Springer.
2. Horikoshi K, Antranikian G, Bull AT, Robb FT, Stetter KO (Eds.) (2011) Extremophiles Handbook. Tokyo: Springer.
3. Kumar P, Libchaber A (2013) Pressure and temperature dependence of growth and morphology of Escherichia coli: experiments and stochastic model. Biophys Jour, in press.
4. Zeldovich KB, Berezovsky IN, Shakhnovich EI (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comp Biol 3: e5.
- View Article
- Google Scholar
5. Singer GA, Hickey DA (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317: 39–47.
- View Article
- Google Scholar
6. Klipcan L, Safro I, Temkin B, Safro M (2006) Optimal growth temperature of prokayotes correlates with class II amino acid composition. FEBS Lett 580: 1672–1676.
- View Article
- Google Scholar
7. De Farias ST, Bonato MC (2002) Preferred codons and amino acid couples in hyperthermophiles. Gen Biol 3: preprint0006–preprint0006.18.
- View Article
- Google Scholar
8. Bragg JG, Thomas D, Baudouin-Cornu P (2006) Variation among species in proteomic sulpher content is related to environmental conditions. Proc R Soc. B 273: 1293–1300.
- View Article
- Google Scholar
9. Suhre K, Claverie JM (2003) Genomic correlates of hyperthermostability, an update. J Biol Chem 19: 513–23.
- View Article
- Google Scholar
10. Greenbaum BD, Kumar P, Libchaber A (2013) Amino acid distributions and the effect of optimal growth temperature. arXiv:1309.4761.
11. Burge C, Campbell AM, Karlin S (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc Nat Acad Sci 4: 1358–1362.
- View Article
- Google Scholar
12. Meyer L, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, et al. (2013) The UCSC genome browser database: extensions and updates. Nucl Acids Res 41: D64–D69.
- View Article
- Google Scholar
13. Schneider K, Pollard KS, Baertsch R, Pohl A, Lowe TM (2006) The UCSC archaeal genome browser. Nucl Acids Res 34: D407–D410.
- View Article
- Google Scholar
14. Feller W (1971) Introduction to Probability Theory and Its Applications, Vol II . New York: Wiley.
15. Liang H, Huang C, Ko M, Hwang J (2005) Amino acid coupling patterns in thermophillic proteins. Prot Struct, Funct, and Bioinfo 59: 58–63.
- View Article
- Google Scholar
16. Zhou X, Wang Y, Li W (2007) Differences in amino acid composition and coupling patterns between mesophillic and thermophillic proteins. Amino Acids 34: 25–33.
- View Article
- Google Scholar
17. Rosato V, Pucello N, Giuliano G (2002) Evidence for cysteine clustering in thermophillic proteomes. Trends in Genet 18: 278–281.
- View Article
- Google Scholar
18. Dokmanic I, Sikic M, Tomic S (2008) Metals in proteins: correlation between the metal-ion type, coordination number and the amino-acid residues involved in the coordination. Acta Cryst D64: 257–263.
- View Article
- Google Scholar
19. Marino SM, Gladyshev VN (2012) Analysis and functional prediction of reactive cysteine residues. J Biol Chem 287: 4419–4425.
- View Article
- Google Scholar
20. Rietsch A, Beckwith J (1998) The genetics of disulfide bond metabolism. Annu Rev Genet 32: 163–184.
- View Article
- Google Scholar
21. Bebby M, O'Connor BD, Ryttersgaard C, Boutz DR, Perry LJ, et al. (2005) The genomics of disulfide bonding and protein stabilization in thermophiles. PLoS Biol 3: 1549–1558.
- View Article
- Google Scholar
22. Wouters MA, Fan SW, Haworth NL (2010) Disulfides as redox switches: from molecular mechanisms to functional significance. Antiox and Redox Signal 12: 53–91.
- View Article
- Google Scholar
23. Ladenstein R, Ren B (2006) Protein disulfides and protein disulfide oxioreductases in hyperthermophiles. FEBS Jour 273: 4170–4185.
- View Article
- Google Scholar
24. Ladenstein R, Ren B (2008) Reconsideration of an early dogma saying “there is no evidence for disulfide bonds in proteins from archaea”. Extremophiles 12: 29–38.
- View Article
- Google Scholar
25. Rosenzweig A (2002) Metallochaperones: bind and deliver. Chem and Biol 9: 673–677.
- View Article
- Google Scholar
26. Dudev T, Lim C (2008) Metal binding affinity and selectivity in metalloproteins: insights from computation studies. Ann Rev Biophys 37: 97–116.
- View Article
- Google Scholar
27. Kumar S, Ma B, Tsai CJ, Nussinov R (2000) Electrostatic strengths of salt Bridges in thermophillic and mesophillic glutamate dehydrogenase monomers. Prot: Struct, Funct, and Bioinfo 38: 368–383.
- View Article
- Google Scholar
28. Chan CH, Yu TH, Wong KB (2011) Stabilizing salt-bridge enhances protein thermostability by reducing the heat capacity change of unfolding. PLoS ONE 6: e21624.
- View Article
- Google Scholar
29. Chakravarty S, Varadarajan R (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochem 41: 8152–8161.
- View Article
- Google Scholar
30. Bosshard HR, Marti DN, Jelesarov I (2004) Protein stabilization by salt bridges. J Mol Recog 17: 1–16.
- View Article
- Google Scholar
31. Barton LL, Fauque GD (2009) Biochemistry, physiology and biotechnology of sulfate-reducing bacteria. Advan in App Microbiol 68: 41–98.
- View Article
- Google Scholar
32. Muyzer G, Stams AJ (2008) The ecology and biotechnology of sulphate-reducing bacteria. Nat Rev Microbiol 6: 441–54.
- View Article
- Google Scholar
33. Keilson J (1979) Markov chain models – rarity and exponentiality. New York: Springer.
34. Aldous D (1982) Markov chains with almost exponential hitting times. Stoc Proc And Their App 12: 305–310.
- View Article
- Google Scholar
35. Aldous D (1983) Random walks on finite groups and rapidly mixing markov chains. Seminare de Probabilitiés, XVII, New York: Springer-Verlag, p. 243–297.
36. Halabi N, Rivoire O, Leiber S, Ranganathan R (2009) Protein sectors: evolutionary units of three-dimensional structure. Cell 138: 774–786.
- View Article
- Google Scholar
37. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D structure computed from evolutionary sequence variation. PloS One 6: e28766.
- View Article
- Google Scholar
38. Kumar S, Tsai CJ, Nussinov R (2000) Factors enhancing protein thermostability. Protein Engineering 13: 179–191.
- View Article
- Google Scholar

[ref1] 1. Chung K (1967) Markov Chains with Stationary Transition Probabilities. New York: Springer.

[ref2] 2. Horikoshi K, Antranikian G, Bull AT, Robb FT, Stetter KO (Eds.) (2011) Extremophiles Handbook. Tokyo: Springer.

[ref3] 3. Kumar P, Libchaber A (2013) Pressure and temperature dependence of growth and morphology of Escherichia coli: experiments and stochastic model. Biophys Jour, in press.

[ref4] 4. Zeldovich KB, Berezovsky IN, Shakhnovich EI (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comp Biol 3: e5.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref5] 5. Singer GA, Hickey DA (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317: 39–47.
View Article
Google Scholar

[8] View Article

[9] Google Scholar

[ref6] 6. Klipcan L, Safro I, Temkin B, Safro M (2006) Optimal growth temperature of prokayotes correlates with class II amino acid composition. FEBS Lett 580: 1672–1676.
View Article
Google Scholar

[11] View Article

[12] Google Scholar

[ref7] 7. De Farias ST, Bonato MC (2002) Preferred codons and amino acid couples in hyperthermophiles. Gen Biol 3: preprint0006–preprint0006.18.
View Article
Google Scholar

[14] View Article

[15] Google Scholar

[ref8] 8. Bragg JG, Thomas D, Baudouin-Cornu P (2006) Variation among species in proteomic sulpher content is related to environmental conditions. Proc R Soc. B 273: 1293–1300.
View Article
Google Scholar

[17] View Article

[18] Google Scholar

[ref9] 9. Suhre K, Claverie JM (2003) Genomic correlates of hyperthermostability, an update. J Biol Chem 19: 513–23.
View Article
Google Scholar

[20] View Article

[21] Google Scholar

[ref10] 10. Greenbaum BD, Kumar P, Libchaber A (2013) Amino acid distributions and the effect of optimal growth temperature. arXiv:1309.4761.

[ref11] 11. Burge C, Campbell AM, Karlin S (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc Nat Acad Sci 4: 1358–1362.
View Article
Google Scholar

[24] View Article

[25] Google Scholar

[ref12] 12. Meyer L, Zweig AS, Hinrichs AS, Karolchik D, Kuhn RM, et al. (2013) The UCSC genome browser database: extensions and updates. Nucl Acids Res 41: D64–D69.
View Article
Google Scholar

[27] View Article

[28] Google Scholar

[ref13] 13. Schneider K, Pollard KS, Baertsch R, Pohl A, Lowe TM (2006) The UCSC archaeal genome browser. Nucl Acids Res 34: D407–D410.
View Article
Google Scholar

[30] View Article

[31] Google Scholar

[ref14] 14. Feller W (1971) Introduction to Probability Theory and Its Applications, Vol II . New York: Wiley.

[ref15] 15. Liang H, Huang C, Ko M, Hwang J (2005) Amino acid coupling patterns in thermophillic proteins. Prot Struct, Funct, and Bioinfo 59: 58–63.
View Article
Google Scholar

[34] View Article

[35] Google Scholar

[ref16] 16. Zhou X, Wang Y, Li W (2007) Differences in amino acid composition and coupling patterns between mesophillic and thermophillic proteins. Amino Acids 34: 25–33.
View Article
Google Scholar

[37] View Article

[38] Google Scholar

[ref17] 17. Rosato V, Pucello N, Giuliano G (2002) Evidence for cysteine clustering in thermophillic proteomes. Trends in Genet 18: 278–281.
View Article
Google Scholar

[40] View Article

[41] Google Scholar

[ref18] 18. Dokmanic I, Sikic M, Tomic S (2008) Metals in proteins: correlation between the metal-ion type, coordination number and the amino-acid residues involved in the coordination. Acta Cryst D64: 257–263.
View Article
Google Scholar

[43] View Article

[44] Google Scholar

[ref19] 19. Marino SM, Gladyshev VN (2012) Analysis and functional prediction of reactive cysteine residues. J Biol Chem 287: 4419–4425.
View Article
Google Scholar

[46] View Article

[47] Google Scholar

[ref20] 20. Rietsch A, Beckwith J (1998) The genetics of disulfide bond metabolism. Annu Rev Genet 32: 163–184.
View Article
Google Scholar

[49] View Article

[50] Google Scholar

[ref21] 21. Bebby M, O'Connor BD, Ryttersgaard C, Boutz DR, Perry LJ, et al. (2005) The genomics of disulfide bonding and protein stabilization in thermophiles. PLoS Biol 3: 1549–1558.
View Article
Google Scholar

[52] View Article

[53] Google Scholar

[ref22] 22. Wouters MA, Fan SW, Haworth NL (2010) Disulfides as redox switches: from molecular mechanisms to functional significance. Antiox and Redox Signal 12: 53–91.
View Article
Google Scholar

[55] View Article

[56] Google Scholar

[ref23] 23. Ladenstein R, Ren B (2006) Protein disulfides and protein disulfide oxioreductases in hyperthermophiles. FEBS Jour 273: 4170–4185.
View Article
Google Scholar

[58] View Article

[59] Google Scholar

[ref24] 24. Ladenstein R, Ren B (2008) Reconsideration of an early dogma saying “there is no evidence for disulfide bonds in proteins from archaea”. Extremophiles 12: 29–38.
View Article
Google Scholar

[61] View Article

[62] Google Scholar

[ref25] 25. Rosenzweig A (2002) Metallochaperones: bind and deliver. Chem and Biol 9: 673–677.
View Article
Google Scholar

[64] View Article

[65] Google Scholar

[ref26] 26. Dudev T, Lim C (2008) Metal binding affinity and selectivity in metalloproteins: insights from computation studies. Ann Rev Biophys 37: 97–116.
View Article
Google Scholar

[67] View Article

[68] Google Scholar

[ref27] 27. Kumar S, Ma B, Tsai CJ, Nussinov R (2000) Electrostatic strengths of salt Bridges in thermophillic and mesophillic glutamate dehydrogenase monomers. Prot: Struct, Funct, and Bioinfo 38: 368–383.
View Article
Google Scholar

[70] View Article

[71] Google Scholar

[ref28] 28. Chan CH, Yu TH, Wong KB (2011) Stabilizing salt-bridge enhances protein thermostability by reducing the heat capacity change of unfolding. PLoS ONE 6: e21624.
View Article
Google Scholar

[73] View Article

[74] Google Scholar

[ref29] 29. Chakravarty S, Varadarajan R (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochem 41: 8152–8161.
View Article
Google Scholar

[76] View Article

[77] Google Scholar

[ref30] 30. Bosshard HR, Marti DN, Jelesarov I (2004) Protein stabilization by salt bridges. J Mol Recog 17: 1–16.
View Article
Google Scholar

[79] View Article

[80] Google Scholar

[ref31] 31. Barton LL, Fauque GD (2009) Biochemistry, physiology and biotechnology of sulfate-reducing bacteria. Advan in App Microbiol 68: 41–98.
View Article
Google Scholar

[82] View Article

[83] Google Scholar

[ref32] 32. Muyzer G, Stams AJ (2008) The ecology and biotechnology of sulphate-reducing bacteria. Nat Rev Microbiol 6: 441–54.
View Article
Google Scholar

[85] View Article

[86] Google Scholar

[ref33] 33. Keilson J (1979) Markov chain models – rarity and exponentiality. New York: Springer.

[ref34] 34. Aldous D (1982) Markov chains with almost exponential hitting times. Stoc Proc And Their App 12: 305–310.
View Article
Google Scholar

[89] View Article

[90] Google Scholar

[ref35] 35. Aldous D (1983) Random walks on finite groups and rapidly mixing markov chains. Seminare de Probabilitiés, XVII, New York: Springer-Verlag, p. 243–297.

[ref36] 36. Halabi N, Rivoire O, Leiber S, Ranganathan R (2009) Protein sectors: evolutionary units of three-dimensional structure. Cell 138: 774–786.
View Article
Google Scholar

[93] View Article

[94] Google Scholar

[ref37] 37. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D structure computed from evolutionary sequence variation. PloS One 6: e28766.
View Article
Google Scholar

[96] View Article

[97] Google Scholar

[ref38] 38. Kumar S, Tsai CJ, Nussinov R (2000) Factors enhancing protein thermostability. Protein Engineering 13: 179–191.
View Article
Google Scholar

[99] View Article

[100] Google Scholar

Using First Passage Statistics to Extract Environmentally Dependent Amino Acid Correlations

Using First Passage Statistics to Extract Environmentally Dependent Amino Acid Correlations

Correction

Figures

Abstract

Introduction

Results

First Passage Distributions

Dependence of Quantities on Habitat

Disscusion

Methods

Correlations

Empirical Evidence for Exponential Distribution

Temperature Dependence of Exponential Decay

Supporting Information

Table S1.

Table S2.

Acknowledgments

Author Contributions

References