iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties

Wei Chen; Hao Lin; Peng-Mian Feng; Chen Ding; Yong-Chun Zuo; Kuo-Chen Chou

doi:10.1371/journal.pone.0047843

Abstract

Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning is still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth studying nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to a 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. It was observed by a cross-validation test on a benchmark dataset that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web-server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions in their own computers, the predictor's code can be easily downloaded from the web-server. It is anticipated that iNuc-PhysChem may become a useful high throughput tool for both basic research and drug design.

Citation: Chen W, Lin H, Feng P-M, Ding C, Zuo Y-C, Chou K-C (2012) iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties. PLoS ONE 7(10): e47843. https://doi.org/10.1371/journal.pone.0047843

Editor: Peter Csermely, Semmelweis University, Hungary

Received: July 25, 2012; Accepted: September 21, 2012; Published: October 29, 2012

Copyright: © 2012 Chen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by Grants from The National Nature Scientific Foundation of China (No. 61100092), The Scientific Research Startup Foundation of North China Coal Medical University (No. 10101115), Foundation of Scientific and Technological Department of Hebei Province (No. 11275532), The Scientific Research Foundation of Sichuan Province (No. 2009JY0013) and The Fundamental Research Funds for the Central Universities (No. ZYGX2009J081). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In eukaryotic cells, genomic DNA is highly compacted into several levels of chromatin structures that ultimately make up the chromosomes. At the lowest level of compaction, a ∼147 bp DNA sequence is tightly wrapped around the histone-octamer core (Fig. 1) into the elementary structural unit of chromatin, known as nucleosome [1]. The packaging of DNA around the histone-octamer modulates the accessibility of genomic regions to regulatory proteins. There are close relationships between nucleosome positioning and key cellular processes, as demonstrated in mRNA splicing, DNA replication, and DNA repair [2], [3], [4]. Consequently, revealing the mechanism involved in controlling nucleosome positioning is fundamentally important for in-depth understanding the subsequent steps of gene expression.

Download:

Figure 1. A schematic illustration to show the basic architecture of nucleosome.

Nucleosomes form the fundamental repeating units of eukaryotic chromatin (upper panel), each of them consists of approximately 147 base pairs of DNA wrapped in 1.67 left-handed superhelical turns around a histone octamer consisting of 2 copies each of the core histones H2A, H2B, H3, and H4 (lower panel).

https://doi.org/10.1371/journal.pone.0047843.g001

High-resolution genome-wide nucleosome maps are now available for several model organisms, such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens [5], [6], [7], [8], [9]. These high-resolution data provide unprecedented opportunities for further investigating the roles of nucleosome positioning in gene regulation. However, experimental approach is expensive to perform genome-wide analysis of nucleosome distribution. In this regard, computational methods can be applied to the entire genome without this kind of disadvantage. Since the report of the nucleosome positioning code (∼10 bp repeating pattern of dinucleotides AA-TT-TA/GC) in yeast [8], lots of theoretical works have been done attempting to elucidate nucleosome occupancy signals that determine the preference of a particular region in binding to histones and forming a nucleosome [10], [11], [12]. Although of great interest and value, sequence-based predictions of nucleosome positioning have been limited in their accuracy and resolution, and to which extent nucleosome positioning in vivo is really dictated by the DNA sequence [10] is still an issue of controversy [13].

It was reported by Miele et al. [7] that DNA physical-chemical properties may determine nucleosome occupancy. Moreover, the recent study by Nozaki et al. [14] also suggested the existence of a highly bendable, fragile structure for nucleosomal DNA, implying that nucleosomal sequences indeed have distinct structural properties when compared with linker sequences.

In view of this, the present study was initiated in an attempt to develop a new method for predicting nucleosomal sequences based on the physicochemical properties of DNA.

According to a recent review [15], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (1) construct or select a valid benchmark dataset to train and test the predictor; (2) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps.

Materials and Methods

1. Benchmark Dataset: Nucleosomal and Linker Sequences

The reference genome sequence of Saccharomyces cerevisiae was obtained from the Saccharomyces Genome Database (http://www.yeastgenome.org/). The nucleosome positions of Saccharomyces cerevisiae were derived from the published data obtained by Lee et al. [16], where each of the 1,206,683 DNA fragments in the dataset constructed by these authors was assigned a nucleosome formation score using a lasso model, with the high or low score to reflect its high or low propensity in forming nucleosome, respectively. The low score can also be interpreted as the propensity to inhibit the formation of nucleosome. To prepare a high quality benchmark dataset, 5,000 fragments of 150 bp with the highest scores were selected as the nucleosome-forming sequence samples to construct the positive set , and 5,000 fragments of 150 bp with the lowest scores were selected as the nucleosome-inhibiting (or linker) sequence samples to construct the negative set ; i.e., the benchmark dataset in this study can be formulated as(1)where represents the symbol for “union” in the set theory, and(2)For the convenience of readers, the 5,000 sequences in and 5,000 sequences in are given in the Information S1.

2. Feature Vectors based on DNA Physicochemical Properties

Owing to their important roles in various different biological processes, the intrinsic physicochemical properties of DNA sequences have been intensively studied by many investigators [17], [18], [19], [20], [21]. In the present study, the following twelve DNA physicochemical properties are to be considered: (1) A-philicity [22], (2) base stacking [23], (3) B-DNA twist [24], (4) bendability [25], (5) DNA bending stiffness [26], (6) DNA denaturation [27], (7) duplex disrupt energy [28], (8) duplex free energy [29], (9) propeller twist [30], (10) protein deformation [31], (11) protein-DNA twist [31], and (12) Z-DNA [32].

In order to quantitatively analyze the physical and chemical properties of the DNA sequence samples, we firstly converted the retrieved nucleosomal and linker sequences into numerical profiles according to the following schemes as validated by Florquin et al. [18]. The detailed procedures are as following steps.

Step 1.

For any 2 base pair (bp) piece of DNA, there is a corresponding numerical value associated with any one of the aforementioned 12 physicochemical properties. Since the values of the 12 properties were at different levels, to make them easier to be handled, we normalized them into the range by means of the following equation(3)where is the original value of the i-th DNA physicochemical property (i = 1, 2, …, 12) for the j-th (j = AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) dinucleotide (see Information S2); while the corresponding normalized value (see Table 1).

Download:

Table 1. The normalized values for the 12 physicochemical properties of dinucleotide.

https://doi.org/10.1371/journal.pone.0047843.t001

Step 2.

By means of a sliding window [33], [34] approach with a window size of 2 bp and a step size of 1 bp, a DNA sequence was replaced by the corresponding normalized physicochemical values. Thus, each of the sequences in was translated into a numerical vector consisting of components, i.e., a 149-D (dimensional) numerical vector.

Step 3.

After going through the above step with all the 12 physicochemical properties, each of the sequences in was translated to 12 different 149-D vectors corresponding to the 12 physicochemical features. By combining the 12 vectors, we obtain an integrated vector containing components; i.e., each of the nucleosomal sequences in can be formulated as a 1788-D vector(4)where the first 149 components were derived from the property “A-philicity” or P(1), the second 149 components from the property “base stacking” or P(2), the last 149 components from the property “Z-DNA” or P(12) (cf. Table 1), and T the transposing operator.

3. Covariant or Quadratic Discriminant Function

The covariant discriminant (CD) or quadratic discriminant (QD) function has been widely used in the realm of bioinformatics, such as protein structural class prediction [35], [36], [37], protein coding region identification [38], protein subcellular location prediction [39], [40], splice site prediction [41], membrane protein type and location prediction [42], out membrane protein prediction [43], enzyme family class prediction [44], antimicrobial peptide classification [45], and prediction of protein cellular attributes [46].

Its formulation can be briefly described as follows. Suppose the standard feature vectors for the DNA sequences in and are, respectively, expressed by(5)where(6)where is the u-th component of the feature vector for the k-th sequence in the positive dataset , that for the k-th sequence in the negative dataset , the total number of DNA sequences in , that in , and the total number of components in a feature vector. For the current case, we have (cf. Information S1) and (cf. Eq.4).

Thus, whether a query DNA sequence belongs to the nucleosome-forming subset or nucleosome-inhibiting subset will be judged by(7)where is the argument of that minimizes , which is defined by(8)where(9)is the squared Mahalanobis distance [47], [48], [49] between and ,(10)is the covariance matrix [50] for the subset , the elements in are given by(11) is the inverse matrix of , and is the determinant of the matrix . Therefore, the covariance discriminant function is also called the modified Mahalanobis discriminant function [51], [52]. More description about the covariance discriminant function and its application in biology can be found in a review [50].

4. Performance Evaluation

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling (K-fold cross-validation) test, and jackknife test. However, as elaborated in ref. [53] and demonstrated by Eqs.28–32 of [15], among the three cross-validation methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (see, e.g., [54], [55], [56], [57], [58], [59], [60]). However, since the current study would involve feature selection as described below, to reduce the computational time, the 5-fold cross-validation test would be adopted as done by many investigators using SVM (Support Vector Machine) as the prediction engine.

Also, to use a more intuitive and easier-to-understand method to measure the prediction quality, according to the definition [33], [61], the rates of correct predictions for the nucleosome-forming dataset and the nucleosome-inhibiting dataset are respectively defined by(12)where is the total number of nucleosome-forming sequences concerned and the number of nucleosome-forming sequences missed in prediction; the total number of nucleosome-inhibiting sequences concerned and the number of nucleosome-inhibiting sequences missed in prediction. The overall success prediction rate is given by(13)It is clear from Eqs.12–13 that, if and only if none of nucleosome-forming sequences and nucleosome-inhibiting sequences are mispredicted, i.e., and , we have the overall success rate . Otherwise, the overall success rate would be smaller than 1.

5. Feature Selection

Inclusion of redundant and noisy information would cause poor prediction results and increase computational time. To improve the prediction quality and gain deeper insights into the physicochemical properties of nucleosomal sequences, we performed feature selection using the wrapper-type feature selection algorithm called “fselect.py”, which can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools. The basic idea of this algorithm is to rank each of the features involved according to a score as elaborated by Chen and Lin [62]. The ranked feature with a higher score indicates that it is a more highly relevant one for the target to be predicted. Based on the ranked features, we used the Incremental Feature Selection (IFS) [63] to determine the optimal number of features. During the IFS procedure, features in the ranked feature set were added one by one from higher to lower rank. A new feature set was composed when one feature had been added. Thus, the N feature sets thus formed would be composed of N ranked features. The feature set can be formulated as(14)For each of the N feature sets, a CD prediction model (cf. Eq.7) was constructed and examined with the 5-fold cross-validation on the benchmark dataset. By doing so, we obtained an IFS curve in a 2D Cartesian coordinate system with index as its abscissa (or X-coordinate) and the overall success rate as its ordinate (or Y-coordinate). The optimal feature set is defined by(15)with which the IFS curve reaches its peak. In other words, in the 2D coordinate system, when the value of is the maximum. Thus, we can use the features in Eq.15 to build the final predictor.

The predictor established by going through all the above procedures is called iNuc-PhysChem. Meanwhile, a user-friendly web-server for the predictor was also established as will be describe at the end of the paper.

Results and Discussion

1. Graphic Profiles of Nucleosome and Non-nucleosome Sequences

Different from the previous methods [10], [11], [12] that were mostly based on the sequence compositional features, we carried out a graphic profile comparison between nucleosomal and linker (non-nucleosomal) sequences in order to explore the specific features possessed by nucleosomal sequences. Using graphic approaches to study biological problems can provide an intuitive picture or useful insights for revealing complicated relations in these systems, as demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions [64], [65], [66], [67], inhibition of HIV-1 reverse transcriptase [68], [69], protein folding kinetics [70], drug metabolism systems [71], and using wenxiang diagram or graph [72] to study protein-protein interactions [73], [74], [75]. To introduce graphic approach for the current study, let us use the conversion scheme [18] to transform the nucleosome and non-nucleosome sequences into the numerical vectors (cf. Eq.4). To intuitively show the difference between these two different types of sequences, a graphic expression of the standard feature vector (cf. Eq.5) for the nucleosomal sequences and that for the non-nucleosomal sequences are given in Fig. 2, which consists of 12 panels corresponding to 12 physicochemical properties of DNA sequences (cf. Section 2 of Materials and Method). The curves in the “A-philicity” panel reflect the first 149 components in the two standard feature vectors, those in the “base stacking” panel reflect the second 149 components, and so forth. It is interesting to note that, except for the “B-DNA twist” panel and “Protein-DNA twist” panel, the differences between the nucleosomal and non-nucleosomal sequences are quite remarkable in all the other 10 panels. These findings suggest that the two physicochemical properties might play a less role in distinguishing nucleosomal and non-nucleosomal sequences than the other 10 properties.

Download:

Figure 2. Graphic profiles to show the difference between nucleosomal (red) and linker (blue) sequences.

It contains 12 panels drawn according to their standard feature vectors (cf. Eq.5), with each to reflect one of the 12 physicochemical features as marked at the bottom of each panel.

https://doi.org/10.1371/journal.pone.0047843.g002

2. Comparison of the 12 Properties in Classification Performance

In order to compare the 12 physicochemical properties for the classification performance, the feature vector Eq.4, standard vector Eq.5, and classifier Eq.7 were reduced from the original 1788-D working space to twelve 149-D sub-working spaces. Each of the sub-working spaces corresponds to one of the 12 physicochemical properties. Shown in Fig. 3 are their success rates in the classification performance when examined by the 5-fold cross-validation on the benchmark dataset . As can be seen from Fig. 3, the success rates obtained by using the “B-DNA twist” and “protein-DNA twist” properties are indeed remarkably lower than those by most of the other properties, quite consistent with the graphic profile analysis of last section.

Download:

Figure 3. Comparison of success rates based on different physicochemical properties.

The orange column shows , the rate of correct prediction for the nucleosome-forming dataset (cf. Eq.10); the blue column shows , the rate of correct predictions for the nucleosome-inhibiting dataset; the purple column shows , the overall success rate (cf. Eq.11).

https://doi.org/10.1371/journal.pone.0047843.g003

3. Selection of Position Specific DNA Features

To identify the key features for nucleosomal sequence prediction, we used the wrapper-type feature selection algorithm and IFS approach as described in Section 5 of Materials and Method.

By adding the ranked features one by one according to the scores calculated by fselect.py, we built 1,788 individual CD predictors for the 1,788 sub-feature sets. We then tested the prediction performance for each of the 1,788 predictors and plotted the IFS curve as shown in Fig. 4, from which we can see that, when the top ranked 884 features were used, the overall success rate reached its peak, i.e., (cf. Eq.13), with for the nucleosome-forming sequences and for the nucleosome-inhibiting sequences (cf. Eq.12).

Download:

Figure 4. A plot to show the IFS procedure.

When the top 884 of the 1,788 features were used to perform prediction, the overall success rate reached its peak of 0.967.

https://doi.org/10.1371/journal.pone.0047843.g004

In other words, we have (cf. Eq.15) and the optimal feature set for the current biological system should be(16)To provide an overall view, a distribution of the 12 physicochemical features and their roles for the prediction model is given in Fig. 5, where the green boxes indicate the features that were not contained in the optimal feature set S₈₈₄. The red and purple boxes indicate the features that were included in the optimal feature set S₈₈₄: features in red boxes were positively correlated with nucleosomal sequences, while those in purple boxes were negatively correlated with nucleosomal sequences.

Download:

Figure 5. A distribution overall view for the 12 physicochemical features.

The features that were included in the optimal feature set S₈₈₄ are shown in the red and purple boxes: the former was positively correlated with nucleosomal sequences, while the latter negatively correlated with nucleosomal sequences. Those features that were not in the optimal feature set S₈₈₄ are shown in the green boxes.

https://doi.org/10.1371/journal.pone.0047843.g005

4. Comparison with Existing Methods

Based on the 2-mer absolute frequency of nucleotides, Zhang et al. [76] proposed a model to distinguish nucleosomal and linker sequences. When tested by the 5-fold cross-validation on the benchmark dataset, their method achieved an overall success rate of 95.70%, which is lower than that by the present method.

Furthermore, our model trained on the yeast data was also applied to the human genome. According to the human reference genome (hg 18), we randomly extracted 1000 nucleosomal and 1000 linker sequences from the high-resolution experimental data of human CD 4⁺ T cell [9]. Our model achieved an overall success rate of 98.5% for classifying the experimentally confirmed nucleosomal and linker sequences in the human genome. This result is higher than 93.8% obtained by using the model proposed by Peckham et al. [10], which has also been applied to predict human nucleosomal sequences by Gupta et al. [12]. All these results indicate that it is a quite promising approach by incorporating the DNA physicochemical features for predicting the nucleosomal sequences, and also suggest a conserved mechanism of nucleosome positioning across genomes.

Different with most current nucleosome positioning prediction methods that were solely relied on local sequence compositional information, in this study we developed a new method by incorporating the physicochemical features of DNA sequences. Our rationale to do so is that, different from the other nucleotide information, the physicochemical properties might affect DNA binding of regulatory proteins, either directly by hampering or favoring complex formation, or indirectly through the modulation of the chromatin structure and hence the DNA accessibility [77]. Therefore, the current method may become a useful vehicle for in-depth studying nucleosomes.

5. Web-Server Guide

For the convenience of the vast majority of experimental scientists, below let us give a step-by-step guide on how to use the iNuc-PhysChem web-server to get their desired results without the need to follow the complicated mathematic equations that were presented just for the integrity in developing the predictor.