Reader Comments

Referee Comments: Referee 1

Posted by PLOS_ONE_Group on 22 Nov 2007 at 09:53 GMT

Reviewer 1's Review

-----

Review for Shultzaberger et al. [Paper #07-PONE-RA-01652] “Determining physical constraints in transcriptional initiation complexes using DNA sequence analysis”.

The manuscript by Shultzaberger et al. aims to determine constraints on the binding sites of two yeast transcription factors, Cbf1 and Met31, involved in the synthesis of sulfur-containing amino acids in yeast. The approach is based on an information-theoretic formalism previously used to examine physical constraints in several systems in E. coli. The basic approach of combining this type of sequence analysis with available expression data seems promising. The authors identify a spacing bias of the Cbf1 and Met31 binding sites that helps to improve computational identification of target genes directly regulated by these two factors. However, the data concerning the distance between sites and site orientation is less convincing, and it is not clear to me what is gained by the introduction of the surprisal functions to this analysis. I have reservations regarding the statistical significance of the results and consequently the interpretation of the data. Additionally, I found the manuscript to be unclear at times, particularly concerning methodological details.

-----

N.B. These are the general comments made by the reviewer when reviewing this paper in light of which the manuscript was revised. Specific points addressed during revision of the paper are not shown.

RE: Referee Comments: Referee 1

mbeisen replied to PLOS_ONE_Group on 10 Jan 2008 at 22:27 GMT

This wasn't the complete review (it is posted below). We responded to the reviewer's critiques in a substantially revised manuscript; our response is posted in a subsequent comment.

REVIEW:
Review for Shultzaberger et al. [Paper #07-PONE-RA-01652] “Determining physical
constraints in transcriptional initiation complexes using DNA sequence analysis”.

The manuscript by Shultzaberger et al. aims to determine constraints on the binding
sites of two yeast transcription factors, Cbf1 and Met31, involved in the synthesis of
sulfur-containing amino acids in yeast. The approach is based on an information-theoretic
formalism previously used to examine physical constraints in several systems in E. coli.
The basic approach of combining this type of sequence analysis with available expression
data seems promising. The authors identify a spacing bias of the Cbf1 and Met31 binding
sites that helps to improve computational identification of target genes directly regulated
by these two factors. However, the data concerning the distance between sites and site
orientation is less convincing, and it is not clear to me what is gained by the introduction
of the surprisal functions to this analysis. I have reservations regarding the statistical
significance of the results and consequently the interpretation of the data. Additionally, I
found the manuscript to be unclear at times, particularly concerning methodological
details. I have detailed my questions and any areas of confusion below.


Section 3.2: Searching algorithm

1) Paragraph 1. The authors write, “For our initial analysis of Cbf1 and Met31, we used a
flat spacing distribution, where all spacing have the same gap surprisal”. The precise flat
surprisal function used is not revealed until the second to last paragraph of the Results
section. It would be much clearer if this could be discussed earlier in the manuscript.
Specific questions regarding this surprisal function are discussed later. As well, details
regarding how the distance used in the surprisal function is calculated could similarly be
provided earlier, possibly in the Methods. That the distance was calculated between the
zero positions of the binding components was not given until the middle of section 4.2.

2) Paragraph 5. “Microarray expression data for sulfur amino acid pathway-affected cells
were then averaged for the top 20 genes in our ranking.” Is this fold change in log2 or
log10 units?


Figure 2.

3) There are 29 genes listed for each heat map. Are the 20 top-scoring genes (based on
the informational model of the binding sites) the 20 at the top? What are the 29 shown?
To double check then, are the scores the average of the expression changes for these 29
or just the top 20?


Section 4.2: Orientation and ordering

4) Paragraph 2. “By searching for Cbf1 and Met31 sites together, with a maximum
spacing of 100 bases between the zero positions of the binding components and the downstream component could [sic] be a maximum of 1000 bases upstream of the gene
start, the prediction was better.” Was this ‘1000 bases upstream’ the same constraint as
used for the Cbf1 and Met31 sites alone?

5) Paragraph 3 “...even though the sites appeared low in the ranking (0.78 and -1.49 vs.
0.28 and -0.64).” First question, the numbers 0.78 and -1.49 don’t correspond with those
in the figure. I assume the authors are referring to the 0.86, -1.65 numbers. As well, to
what does “low in the ranking” refer?

6) Last paragraph. Again, the numbers (0.99 and -1.70) don’t correspond to any in the
figure.

7) Are the average expression change values statistically significant? Some p-value-type
estimate of the significance of these scores is needed.

Section 4.3: Spacing Constraints

8) There is a rather large parameter space being explored here for these two constraints
(20 x 5000 = 100,000 possible sets of parameter values), but only the 20 top genes are
used to define the expression change value. My first question concerns the robustness of
the values determined: how do these values change when 30 or 40 top genes are used as
opposed to 20? Related to this is whether some evidence for the significance of these
estimates could be given. Is the expression fold change of ~1.4 in Figure 4 significant or
is it simply the result of a handful of high-scoring genes that can strongly influence a
value averaged over only 20 genes?

9) Last paragraph. The average expression changes when the two spacing constraints are
used were 1.39 and -2.21 for induction and repression data respectively. Are these
numbers significantly better than the values of 1.16 and -2.09 presented in Figure 2
without any constraints? Admittedly these are higher and lower respectively, but you’ve
added additional parameters to the equation and this small improvement might simply be
due to further parameterization (i.e. throwing out some bad data points).

Section 4.4: Optimal Model

10) Bottom of paragraph 3. “Our set had an average helical positioning greater than 99
percent of random sets”. How is this ‘average helical positioning’ calculated and
compared to that of the random sets?

11) Top of paragraph 4. The gap surprisal is defined as GS(d) = -log2(1/56) = 5.81 bits.
The 1/56 results from the range of 13 to 68 bases for the min and max constraint
distances. But these values were determined after all the optimization was done. So, how
was the surprisal determined prior to this optimization when the scores were being used
to rank the genes?

12) Paragraph 4. So the gap and orientation surprisal become a constant 6.81 bits which
is subtracted from each Rsequence(Met4) score (Equation 5). Why are these even
included given that they can’t be better determined? What do we gain in this analysis
from using these metrics? Because these are really under-determined constants I’m not
sure how to interpret the 18.0 bits of information assigned to the Rsequence(Met4). Had
more than the 20 top genes been used these optimal parameters would likely have been
different and this number of 18.0 bits (via the surprisal) would be different, so how
should it be interpreted?


RE: RE: Referee Comments: Referee 1

mbeisen replied to mbeisen on 10 Jan 2008 at 22:29 GMT

Thank you for considering the manuscript "Determining physical constraints in transcriptional initiation complexes using DNA sequence analysis" for publication in PLoS ONE. The reviewer's comments were well thought out and helped us improve the paper. We address each of the reviewer's comments (which are in bold) below.


Section 3.2: Searching algorithm

1) Paragraph 1. The authors write, “For our initial analysis of Cbf1 and Met31, we used a flat spacing distribution, where all spacing have the same gap surprisal”. The precise flat surprisal function used is not revealed until the second to last paragraph of the Results section. It would be much clearer if this could be discussed earlier in the manuscript. Specific questions regarding this surprisal function are discussed later. As well, details regarding how the distance used in the surprisal function is calculated could similarly be provided earlier, possibly in the Methods. That the distance was calculated between the zero positions of the binding components was not given until the middle of section 4.2.

This is a good point. We have changed this sentence and have added an additional sentence in the Methods section.

"For our initial analysis of Cbf1 and Met31, we used a flat spacing distribution where all spacings have the same gap surprisal value of GS(d) = - log2 (1/(dmax - dmin +1)), where dmin is the shortest spacing between Met31 and Cbf1, and dmax is the longest spacing. The distance between Met31 and Cbf1 is calculated between the zero positions of the binding components as with previous flexible models."

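As a worked check of the flat gap surprisal defined in the sentence above, here is a minimal Python sketch. The function name `flat_gap_surprisal` is ours and purely illustrative; it is not taken from the manuscript or its software. It simply evaluates GS(d) = -log2(1/(dmax - dmin + 1)) for a given spacing range.

```python
import math


def flat_gap_surprisal(d_min: int, d_max: int) -> float:
    """Gap surprisal in bits when every spacing in d_min..d_max (inclusive) is equally likely."""
    if d_max < d_min:
        raise ValueError("d_max must be >= d_min")
    n_spacings = d_max - d_min + 1  # number of allowed spacings
    return -math.log2(1.0 / n_spacings)


# Range quoted in the review (13 to 68 bases -> 56 spacings).
gs_13_68 = flat_gap_surprisal(13, 68)
# Wider range reported after averaging the top 30 genes (9 to 68 bases -> 60 spacings).
gs_9_68 = flat_gap_surprisal(9, 68)

print(f"GS for 13-68 bases: {gs_13_68:.2f} bits")
print(f"GS for 9-68 bases:  {gs_9_68:.2f} bits")
```

For the 13 to 68 base range this reproduces the 5.81 bits discussed in point 11; widening the range to 9 to 68 bases changes the value by about 0.1 bits.
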
2) Paragraph 5. “Microarray expression data for sulfur amino acid pathway-affected cells were then averaged for the top 20 genes in our ranking.” Is this fold change in log2 or log10 units?

We have added the following sentence after the above sentence.

"All values averaged were log2 of the expression fold change between affected and unaffected cells."

Figure 2.

3) There are 29 genes listed for each heat map. Are the 20 top-scoring genes (based on the informational model of the binding sites) the 20 at the top? What are the 29 shown? To double check then, are the scores the average of the expression changes for these 29 or just the top 20?

We now average the top 30 scoring genes instead of the top 20. The reason for this is in response to comment number 8 (below). We have changed the figure to show the top 30 genes, so the number of genes in the figure is now consistent with the number of genes averaged.


Section 4.2: Orientation and ordering

4) Paragraph 2. “By searching for Cbf1 and Met31 sites together, with a maximum spacing of 100 bases between the zero positions of the binding components and the downstream component could [sic] be a maximum of 1000 bases upstream of the gene start, the prediction was better.” Was this ‘1000 bases upstream’ the same constraint as used for the Cbf1 and Met31 sites alone?

This is another good point. We added the following sentence to this paragraph.

"For both Cbf1 and Met31, we only considered binding sites within 1000 bases upstream of the closest gene start."

5) Paragraph 3 “…even though the sites appeared low in the ranking (0.78 and -1.49 vs. 0.28 and -0.64).” First question, the numbers 0.78 and -1.49 don’t correspond with those in the figure. I assume the authors are referring to the 0.86, -1.65 numbers. As well, to what does “low in the ranking” refer?

We would like to thank the reviewer for catching this. The values in the text corresponded to an older analysis and had not been updated. Since we now average the top 30 genes instead of the top 20, the values in the figure and the text have both been changed and are now consistent with each other. “Low in the ranking” is a vague statement, and we have removed it from the paper.

6) Last paragraph. Again, the numbers (0.99 and -1.70) don’t correspond to any in the figure.

The values are now consistent.

7) Are the average expression change values statistically significant? Some p-value-type estimate of the significance of these scores is needed.

We have added the following paragraph to the end of section 4.2 in the results section.

“To test whether the average expression values that we observed are statistically significant, we randomly chose 10,000 sets of 30 genes from the genome and averaged their expression change values. We did this for both the induced and repressed data sets. Both sets gave similar normal distributions with a mean of -0.015 and SD of 0.11 for the induced data set and a mean of 0.001 and SD of 0.094 for the repressed data set. For the best organization of sites in Figure 2, an expression change of 0.95 and -1.57 would be 8.7 and 16.7 standard deviations from the mean respectively. The probability of selecting a set of 30 genes with an average expression change this high randomly would be less than 1 × 10^-8.”
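
The random-set test described in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: the helper names (`random_set_null`, `z_score`) are hypothetical, and the `genome` values are synthetic stand-ins for genome-wide log2 fold changes. Only the null means, SDs, and observed averages passed to `z_score` come from the text.

```python
import random
import statistics


def random_set_null(expression, set_size=30, n_sets=10_000, seed=0):
    """Mean and SD of the average expression change over random gene sets."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(expression, set_size))  # sample without replacement
        for _ in range(n_sets)
    ]
    return statistics.fmean(means), statistics.stdev(means)


def z_score(observed, null_mean, null_sd):
    """Distance of an observed average from the null mean, in standard deviations."""
    return (observed - null_mean) / null_sd


# Synthetic stand-in data (ASSUMPTION: not the microarray data from the paper).
data_rng = random.Random(1)
genome = [data_rng.gauss(0.0, 0.6) for _ in range(6000)]
null_mean, null_sd = random_set_null(genome, n_sets=2000)
print(f"synthetic null: mean={null_mean:.3f}, SD={null_sd:.3f}")

# z-scores computed from the null parameters and averages reported in the text.
z_induced = z_score(0.95, -0.015, 0.11)
z_repressed = z_score(-1.57, 0.001, 0.094)
print(f"induced: {z_induced:.1f} SD, repressed: {z_repressed:.1f} SD")
```

With the reported null parameters, the two z-scores come out close to the 8.7 and 16.7 standard deviations quoted above.
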

Section 4.3: Spacing Constraints

8) There is a rather large parameter space being explored here for these two constraints (20 x 5000 = 100,000 possible sets of parameter values), but only the 20 top genes are used to define the expression change value. My first question concerns the robustness of the values determined: how do these values change when 30 or 40 top genes are used as opposed to 20? Related to this is whether some evidence for the significance of these estimates could be given. Is the expression fold change of 1.4 in Figure 4 significant or is it simply the result of a handful of high-scoring genes that can strongly influence a value averaged over only 20 genes?

These values are fairly robust. When we averaged over the top 30 genes instead of the top 20, our results differed only slightly: the spacing range expanded from 13 to 68 bases to 9 to 68 bases. When we averaged over the top 40 genes, the spacing range was also 9 to 68 bases. By averaging only the top 20 genes we were slightly over-fitting our model. The larger spacing range picked up three additional genes in the top 30: GSH1, SAM1 and YER080W. GSH1 and SAM1 had the expected expression pattern and are both known to be involved in sulfur assimilation. The YER080W expression pattern was not as striking. We decided to redo all of the analysis in the paper by averaging the top 30 genes instead of 20.


9) Last paragraph. The average expression changes when the two spacing constraints are used were 1.39 and -2.21 for induction and repression data respectively. Are these numbers significantly better than the values of 1.16 and -2.09 presented in Figure 2 without any constraints? Admittedly these are higher and lower respectively, but you’ve added additional parameters to the equation and this small improvement might simply be due to further parameterization (i.e. throwing out some bad data points).

We are not sure how to show that these are significantly better. According to the random distribution of average expression changes mentioned in Point 7, these values are an additional standard deviation from the mean. It is important to note that there are constraints on the model in Figure 2 (spacing range of 1-100, maximum distance 1000).

Section 4.4: Optimal Model

10) Bottom of paragraph 3. “Our set had an average helical positioning greater than 99 percent of random sets”. How is this ‘average helical positioning’ calculated and compared to that of the random sets?

We admit that this description is difficult and we have tried to simplify this section. The paragraph now reads:

“To test whether there is a tendency for Cbf1 and Met31 to bind on the same face of the DNA, we plotted the relative spacing between the two sites on a cosine wave with the same period as B-form DNA, 10.6 bases (Fig. 5). We plotted the spacings of 19 of the 23 top ranking genes (all sites except for Reb1, Gar1, Idh1 and YER080W) and YHR112C, Mxr1, Met10, and YML018C which had both a strong flexible information and expression change. To determine what the optimal phase of the cosine wave was, we plotted each spacing on a cosine wave and calculated the average height of all spacings on the helix. That is, if all spacings were at the top of the cosine wave (occurred in multiples of 10.6 bases) then the average helical location would be high. We determined the phase of the cosine wave that gave the highest average helical location of these 23 spacings, and found the optimal phase to peak at -13.86 bases relative to the Met31 zero position. To see if the relative placement of these spacings on the cosine wave is higher than expected, we determined the average helical location of random sets of 23 Cbf1/Met31 pairs. Our set had an average helical positioning greater than 95 percent of random sets.”
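
The helical-phase calculation in this paragraph can be illustrated with a short Python sketch. Several things here are assumptions rather than the manuscript's method: the function names are hypothetical, the best phase is found in closed form (the mean of cos(theta - phase) is maximized at the direction of the mean vector) instead of by scanning phases, random sets are drawn as uniform integer spacings with the phase re-optimized for each set, and the `toy_spacings` values are invented. The phase is reported modulo one helical turn, so it is not directly comparable to the -13.86 bases quoted above.

```python
import math
import random

HELICAL_PERIOD = 10.6  # bases per turn of B-form DNA, as stated in the text


def best_phase_and_height(spacings, period=HELICAL_PERIOD):
    """Phase (in bases) maximizing the mean cosine height, and that maximal height."""
    angles = [2.0 * math.pi * d / period for d in spacings]
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    phase = math.atan2(mean_sin, mean_cos) * period / (2.0 * math.pi)
    return phase, math.hypot(mean_cos, mean_sin)


def fraction_random_below(spacings, d_min, d_max, n_sets=2000, seed=0):
    """Fraction of random spacing sets whose best average height is below the observed one."""
    rng = random.Random(seed)
    _, observed = best_phase_and_height(spacings)
    below = 0
    for _ in range(n_sets):
        random_set = [rng.randint(d_min, d_max) for _ in spacings]
        if best_phase_and_height(random_set)[1] < observed:
            below += 1
    return below / n_sets


# Invented spacings lying near multiples of 10.6 bases (NOT data from the paper).
toy_spacings = [21, 32, 42, 53, 64, 11, 21, 32]
toy_phase, toy_height = best_phase_and_height(toy_spacings)
toy_fraction = fraction_random_below(toy_spacings, 9, 68)
print(f"phase={toy_phase:.2f} bases, height={toy_height:.3f}, fraction below={toy_fraction:.3f}")
```

Spacings that fall at multiples of the helical period give an average height near 1, while spacings scattered around the helix give a value near 0.
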


11) Top of paragraph 4. The gap surprisal is defined as GS(d) = -log2(1/56) = 5.81 bits. The 1/56 results from the range of 13 to 68 bases for the min and max constraint distances. But these values were determined after all the optimization was done. So, how was the surprisal determined prior to this optimization when the scores were being used to rank the genes?

This is addressed in point 1. We define how the gap surprisal can be determined for any range.

12) Paragraph 4. So the gap and orientation surprisal become a constant 6.81 bits which is subtracted from each Rsequence(Met4) score (Equation 5). Why are these even included given that they can’t be better determined? What do we gain in this analysis from using these metrics? Because these are really under-determined constants I’m not sure how to interpret the 18.0 bits of information assigned to the Rsequence(Met4). Had more than the 20 top genes been used these optimal parameters would likely have been different and this number of 18.0 bits (via the surprisal) would be different, so how should it be interpreted?

The reviewer is right that there is no advantage in this instance for using the gap and orientation surprisal for individual analysis of Met4 sites over simple boolean classifiers, but the information content (Rseq(Met4)) we present does appear to be a reasonable estimate. By using the larger range resulting from averaging the top 30 or 40 genes, the information content only changed by 0.1 bits. This difference is minor. From an evolutionary standpoint, knowing the information content for this flexible system is interesting, and using the gap and orientation surprisal is the best way to do that.

The fact that we do so well in predicting Met4 sites by just taking into account the strength of the sites, and not effects due to spacing is interesting. In bacteria, transcription factors interact at shorter distances, and the effect of spacing on stability is greater (since it is more difficult to bend a short piece of DNA than a large one). For coordinated binding of Met4, the summed affinity of Cbf1 and Met31 is what dominates the stability of the initiation complex, suggesting that the energetic penalties associated with spacing have been considerably decreased for eukaryotes. We have now modified the Abstract and Discussion to emphasize this point.


Thank you again for considering this paper. We look forward to your response.