DNA Methylation Patterns Facilitate the Identification of MicroRNA Transcription Start Sites: A Brain-Specific Study

Tapas Bhadra; Malay Bhattacharyya; Lars Feuerbach; Thomas Lengauer; Sanghamitra Bandyopadhyay

doi:10.1371/journal.pone.0066722

Abstract

Predicting the transcription start sites (TSSs) of microRNAs (miRNAs) is important for understanding how these small RNA molecules, known to regulate translation and stability of protein-coding genes, are regulated themselves. Previous approaches are primarily based on genetic features, trained on TSSs of protein-coding genes, and have low prediction accuracy. Recently, a support vector machine based technique has been proposed for miRNA TSS prediction that uses known miRNA TSS for training the classifier along with a set of existing and novel CpG island based features. Current progress in epigenetics research has provided genomewide and tissue-specific reports about various phenotypic traits. We hypothesize that incorporating epigenetic characteristics into statistical models may lead to better prediction of primary transcripts of human miRNAs. In this paper, we have tested our hypothesis on brain-specific miRNAs by using epigenetic as well as genetic features to predict the primary transcripts. For this, we have used a sophisticated feature selection technique and a robust classification model. Our prediction model achieves an accuracy of more than 80% and establishes the potential of epigenetic analysis for in silico prediction of TSSs.

Citation: Bhadra T, Bhattacharyya M, Feuerbach L, Lengauer T, Bandyopadhyay S (2013) DNA Methylation Patterns Facilitate the Identification of MicroRNA Transcription Start Sites: A Brain-Specific Study. PLoS ONE 8(6): e66722. https://doi.org/10.1371/journal.pone.0066722

Editor: Walter Lukiw, Louisiana State University Health Sciences Center, United States of America

Received: April 1, 2013; Accepted: May 2, 2013; Published: June 24, 2013

Copyright: © 2013 Bhadra et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: TB gratefully acknowledges Department of Science and Technology, India, for awarding him the INSPIRE Fellowship to carry out his PhD. research work. SB gratefully acknowledges the financial support from the Swarnajayanti project grant no. DST/SJF/ET-02/2006-07 of the Department of Science and Technology, Government of India. Part of the work was conducted when SB visited the Max Planck Institute for Informatics, Saarbrucken, Germany, in 2011-12 on a Humboldt Fellowship for Experienced Researchers. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

MicroRNAs (miRNAs) are a class of short (22 nt) non-coding RNAs that control the translation and stability of protein-coding genes [1]. They regulate genes through translational repression or post-transcriptional regulation [2], [3]. Thus, miRNAs are important in many cellular functions and accountable for many diseases [4]. It is known that miRNAs exert regulatory activities in their mature stage, which is reached after cellular processing of primary miRNAs (pri-miRNAs) and precursor miRNAs (pre-miRNAs) transcribed from the DNA. Pri-miRNAs are much longer transcripts that are first transcribed from the DNA. The removal of a portion of pri-miRNA by the nuclear RNase III enzyme Drosha produces the pre-miRNA, a – nt intermediate [5]. Finally, the pre-miRNAs become mature miRNAs by the operation of another RNase III enzyme Dicer. The mature miRNAs, along with RISC, bind to the untranslated regions (UTRs) of mRNAs and regulate their expression. A significant amount of information is available about the loci of pre-miRNAs and mature miRNAs. But due to the inadequate information on experimentally validated transcription start sites (TSSs), which manifest the transcription initiation loci of pri-miRNAs, very little is known about pri-miRNA transcripts. The in silico prediction of TSSs in the upstream region of pre-miRNAs can contribute significantly to identifying such transcripts. Moreover, recent findings suggest that pri-miRNAs can also take part in the regulation of genes [6]. Therefore, the identification of the pri-miRNA transcripts is of substantial relevance.

In the last few years, the area of prediction of pri-miRNA transcripts has been attracting the attention of researchers [7]–[12]. Understandably, the major focus in this direction is on intragenic miRNAs, i.e. miRNAs located within a gene, as they are co-transcribed with their host genes. Limited work has been conducted for studying the TSSs of intergenic miRNAs, those located between genes. A recent study highlights that miRNA TSSs are different from the TSSs of genes and therefore need specific prediction models [13]. A classification model based on support vector machines (SVM) [14] with a multi-objective optimization based feature selection has been proposed in [13] where known miRNA TSSs are used for training the classifier.

As reported in a current study, intronic, exonic and intergenic regions of DNA exhibit distinct epigenetic characteristics [15]. As of now, only genetic features are considered for TSS identification of miRNAs. But with the development in epigenetics, several new forms of genomewide data have become available. Incorporating features that are based on epigenetic footprints in the DNA appears to be relevant in such studies. There are recent studies in which putative promoters of miRNAs have been identified by analyzing epigenetic features [16]. However, the prediction of exact TSS is a somewhat different problem. In the current analysis, we have collected a large set of genetic and epigenetic features (even though epigenetic footprints in the DNA are also genetic features [17]), some of which are novel, to predict TSSs of human miRNAs. In particular, features based on DNA methylation are employed for the first time for miRNA TSS recognition, to the best of our knowledge. This type of epigenetic modification is of particular relevance, as its influence on promoter regulation has been established before in numerous studies (e.g. reviewed in [18]). Baer et al. have recently reported extensive DNA hypermethylation and hypomethylation in miRNA promoters (identified manually) in association with aberrant miRNA expression in chronic lymphocytic leukemia [16]. To facilitate such studies, we have proposed here a machine learning approach to precise TSS identification. Furthermore, in higher vertebrates DNA methylation nearly exclusively appears in the CpG context, where the methylated state of this dinucleotide is the default case [19], [20]. Unmethylated CpGs are often found clustered in so called CpG islands [21], [22], which play an important role in gene regulation. To test whether this relationship also exists for miRNAs, we have included several features based on CpG island characterizations into the analysis [13].

Notably, the epigenetic modifications are tissue-specific [23]. Therefore, the miRNAs expressed in a type of tissue should exhibit distinct epigenetic features. Here, we utilize the available brain-specific methylation data for the prediction of TSSs of miRNAs expressed in the brain. We employ a classifier model based on a Random Forest (RF) [24]. The information on brain tissue-specificity has been collected from available literature. Several recently experimentally validated primary transcripts and associated TSSs have been used for this purpose. Features based on methylation patterns in the genomic region around the TSSs are employed. CpG island based features, in addition to a number of genetic features, are also included [13]. We use a recently proposed feature selection method based on Variable Weighted Maximal Relevance Minimal Redundancy criterion [25]. Finally, the classifier is assessed by cross validation and further tested on independent data.

Results

First, experiments were conducted to determine whether methylation based information is essential for identifying TSSs of miRNAs expressed in the brain. In the second part of the study, we have analyzed the importance of each of five different categories of features. Next, we have applied the VWMRmR feature selection algorithm and constructed the classification model based on the training dataset with reduced dimensionality. Finally, the performance of the proposed model was compared with those of some other approaches using the prediction results on an independent test dataset.

Selection of the Best Feature Set

Many genomic regions across the entire genome, that appear to be CpG islands due to repeat elements [22], [26], might increase the number of false positives during promoter prediction. So we study only the non-repetitive part of the sequence, as done in [13], for CpG island determination. Current studies on several organisms show that promoters exhibit specific methylation patterns [15], [27]. Inspired by these, we have conducted an experiment to observe whether the inclusion of methylation-based features improves the classification performance for miRNA TSS prediction or not. For this purpose, we have prepared two types of dataset corresponding to two different feature sets NMPLSCI and NMPLSCIMT. Each of these datasets has samples of which correspond to brain-tissue specific TSS samples while are negative TSS samples. Subsequently, we have trained two separate RF models based on each of the two datasets. The average five-fold cross-validation results, computed over ten independent runs of these two models have been listed in Table 1.

Table 1. Performance of the brain-tissue specific miRNA TSS prediction model with and without methylation-based features alongside the other features.

https://doi.org/10.1371/journal.pone.0066722.t001

As can be seen from Table 1, the feature set combination with MT provides better results than the other feature set in terms of all five evaluation criteria, i.e., accuracy, sensitivity, specificity, precision and MCC. This result demonstrates that including methylation-based features improves the prediction capability of the proposed model, and it indicates that tissue-specific methylation analysis is important.

Significance Analysis of Features

To assess the importance of the different features including the methylation based features (MT) introduced in the present study, the F-scores [28] are computed. If the numbers of positive and negative samples are $n_{+}$ and $n_{-}$, respectively, then the F-score of the $i$-th feature is computed as

$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_{+} - 1}\sum_{k=1}^{n_{+}}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_{-} - 1}\sum_{k=1}^{n_{-}}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \qquad (1)$$

Here, $\bar{x}_i^{(+)}$, $\bar{x}_i^{(-)}$ and $\bar{x}_i$ stand for the mean values of the $i$-th feature over the set of the entire positive samples, the entire negative samples and total samples, respectively. Again, $x_{k,i}^{(+)}$ denotes the $i$-th feature of the $k$-th positive sample and $x_{k,i}^{(-)}$ represents the $i$-th feature of the $k$-th negative sample. A larger value of F-score is an indicator of a more discriminative feature. All features were ranked based on their F-scores, where a larger value gains a lower (better) rank. The summary of the F-score analysis for different feature subsets is shown in Table 2. In this table, rankwise importance of all the five aforementioned feature sets is displayed separately. Additionally, the NM feature set is partitioned into five different subsets, namely, NM-CG (all possible n-mers containing CG as a substring), NM-1 (for 1-mers), NM-2 (for 2-mers), NM-3 (for 3-mers) and NM-4 (for 4-mers). As can be seen from Table 2, the class of special features (S) comprises better discriminators (as highlighted by the ranks) of the TSS pattern than the other classes. Note that, only three features out of the total belong to the category of special features (S). Even though these few features may not be sufficient by themselves to identify brain-specific miRNA TSS, this analysis underlines the importance of their inclusion. It is also evident from the table that all CpG island based features have been ranked within the top in the total ranked list. This observation once again confirms that they are very useful for TSS prediction of miRNAs [13]. Furthermore the average rank of the NM-CG features is which is less than half of the average rank found using NM. This signifies that NM-CG is also an effective feature set. In fact, recent reports highlight that epigenetic marks also depend upon DNA sequences [17].
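The F-score ranking described above can be illustrated with a minimal Python sketch. It assumes the standard two-class F-score of Eq. (1) and samples stored as plain lists of feature values; the function names `f_score` and `rank_features` are illustrative and are not taken from the original study.

```python
from statistics import mean


def f_score(pos, neg):
    """F-score of one feature, given its values in positive and negative samples."""
    n_pos, n_neg = len(pos), len(neg)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("each class needs at least two samples")
    mean_pos, mean_neg = mean(pos), mean(neg)
    mean_all = mean(list(pos) + list(neg))
    # Numerator: squared distance of each class mean from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the unbiased within-class variances.
    within = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_rows, neg_rows):
    """Return (feature indices sorted by decreasing F-score, list of scores)."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores
```

The first index in the returned order corresponds to rank 1, i.e. the most discriminative feature under this criterion.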

Table 2. Analysis of the importance of features by F-score.

https://doi.org/10.1371/journal.pone.0066722.t002

A major drawback of the F-score is that the mutual information among different features is ignored [28]. To overcome this deficit, we have applied VWMRmR on the full feature set, which produces a sorted ranked list of the features. The summary of the analysis of feature importance for the same ten feature subsets (as was shown in Table 2) is provided in Table 3. Similar to the analysis of features importance by F-score, this table also confirms that features in the “S” category need to be included in the miRNA TSS feature set. This analysis confirms that CI is a good feature subset. Additionally, almost the same observation is found about NM-CG like the F-score analysis. The methylation features appear to gain in importance as compared to the F-score analysis. Indeed, in the top features, now there are MT features compared to only feature appeared in the F-score analysis. Also for the VWMRmR the best rank for a methylation feature is obtained at position , whereas for F-score this value was .

Table 3. Analysis of the importance of features by VWMRmR feature selection.

https://doi.org/10.1371/journal.pone.0066722.t003

Performance Evaluation on an Independent MiRNA TSS Dataset

There are several gene TSS prediction tools developed to date [29]–[31]. Almost all are based on machine learning approaches trained on TSS samples of protein-coding genes. However, recent investigations suggest that miRNA TSS prediction can be improved by training on miRNA-specific datasets [13]. Therefore, we have tested our model, which incorporates tissue specificity and methylation features, on an independent test set.

The performances of three existing gene TSS prediction algorithms were compared with that of our proposed brain-specific miRNA TSS prediction model on an independent miRNA TSS dataset described in the Materials section. The first method, CoreBoost_HM, is a recently developed RNA polymerase II core-promoter prediction tool that is entirely dedicated to the human genome [29]. In this tool, explicit features based on genome-wide histone modification are incorporated together with features relating to DNA sequence. The second tool, Dragon TSS Desert Masker (DDM), is a well-known gene TSS prediction tool that not only recognizes large segments of mammalian genomes as non-TSS locations (NTL) but also identifies true TSSs with high accuracy [30]. This research also reveals that approximately above of the human genome are most likely NTLs. The classification results employing the DDM tool are obtained by setting the sensitivity level (approx. percentage of real TSSs not masked) to medium (%). The last tool, Easy Promoter Prediction Program (EP3), is a core promoter prediction model developed using large-scale structural features of DNA [31]. In this tool, the default window size of is used for obtaining the classification results.

The comparative performance of the methods has been assessed in terms of five evaluation criteria, namely, accuracy, sensitivity, specificity, precision and MCC, on the independent test dataset. The classification results of these four prediction models are listed in Table 4. It can be observed from the table that the proposed prediction model outperforms all other prediction tools in terms of three evaluation criteria, i.e., accuracy, sensitivity and MCC. The accuracy, sensitivity, specificity, precision and MCC of the proposed model are %, %, %, % and , respectively. Although the specificity and the precision obtained using CoreBoost_HM are higher than those found using our miRNA TSS model, its sensitivity value ( = %) is extremely low as compared to that of our model. In comparison with DDM, the proposed model provides better results in each of the aforementioned five evaluation criteria. Although the specificity and precision obtained with EP3 are higher than those of the proposed approach, the power of EP3 to recognize true TSSs is very poor. The proposed model is the only one that achieves greater than sensitivity as well as specificity. To summarize, incorporation of methylation data is found to be effective in predicting TSSs of miRNAs expressed in the brain.

Table 4. Comparison of the performance of three existing gene TSS prediction algorithms along with our proposed method in predicting brain-tissue specific miRNA TSS.

https://doi.org/10.1371/journal.pone.0066722.t004

Discussion

The present article deals with the problem of predicting TSSs of miRNAs by incorporating several novel epigenetic features along with the other existing relevant sequence based features. The study focuses on brain-specific miRNAs since the methylation data is available for brain tissue. A sophisticated RF classification model has been constructed using a brain-specific miRNA TSS dataset. The positive samples in this miRNA TSS dataset were collected from a recent miRNA TSS database designed using high-throughput sequencing data. We have evaluated the prediction capability of the brain-specific TSS prediction model using an independent miRNA dataset. The performance of this model is compared to those of some other existing machine learning based gene TSS prediction models. The computational results demonstrate that the proposed model performs very well as compared to existing methods, being the only one that provides both a sensitivity and specificity above .

In the future, we plan to include additional epigenetic features like histone modification and activation of small non-coding RNAs. We are also trying to collect additional positive samples in order to assemble a well-balanced brain-specific miRNA TSS training dataset. Studies on other tissues are another important direction of future work.

Materials

A set of brain specific miRNAs was collected by a literature survey. Then, the reported TSSs were divided into training and test sets as described below. Furthermore, the feature set used is described in detail.

Sample Collection

We have carried out an extensive literature survey to collect more than eighty brain-specific miRNAs (see Text S1 for more details). We have extracted the positive TSS samples corresponding to these miRNAs and further prepared an effective negative set for training the TSS prediction model. We have also accumulated a separate set of TSS samples for further testing purposes. The methylation data is obtained from MethylomeDB [23], which reports genomewide methylation patterns based on the hg18 genome assembly. We have mapped all the data resources used in this study to the hg18 genome build.

A few recent studies attempted to experimentally verify the TSSs of miRNAs. A detailed review on this can be found in [32]. Chien et al. were the first to apply high-throughput sequencing to identify miRNA TSSs [12]. They provide exact TSS information, rather than a region, for human miRNAs. From this large set of miRNAs, human miRNAs, which correspond to different TSS loci, are identified as brain-specific based on our literature survey (see Text S1). The methylation map we used is given in hg18 at single-base resolution, so we have converted the other data resources to this build. Since the TSS information has been mapped to the hg19 genome build, we have further mapped it to the hg18 version using the Lift-Over tool of GALAXY [33]. We extract a bp stretch of genomic sequence, that includes bp upstream and bp downstream region around each miRNA TSS, from the UCSC genome browser (NCBI36/hg18 genome build) [34]. All these brain-specific samples comprise positive training data for the prediction model. To our knowledge, no benchmark set with negative samples for brain-tissue specific miRNA TSS is available in the literature. In recent papers, the importance of adding negative samples for making a robust biological prediction model has been highlighted. For the TSS prediction problem, we have selected negative samples (in the form of bp sequence) randomly from the entire genome in such a way that no known miRNA lies within a region of kb either upstream or downstream of the corresponding sample loci, as no TSS is likely to be found at a locus that is within kb of the end of the corresponding miRNA [11]. In this way, a total of samples ( positive samples and negative samples) have been collected as the training data. Several existing and novel features have been extracted from these TSS samples, as described later in this section, to comprise the final training dataset.
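The negative-sample selection described above amounts to rejection sampling over the genome. The following Python sketch shows one way this could look, under several assumptions: known miRNAs are reduced to single `(chromosome, position)` points, chromosomes are drawn in proportion to their length, and `sample_len` and `exclusion` are hypothetical parameters to be set to the window size and exclusion distance used in the study. The function name `sample_negative_loci` is illustrative.

```python
import random


def sample_negative_loci(chrom_lengths, mirna_loci, n_samples,
                         sample_len, exclusion, seed=0, max_attempts=100000):
    """Draw random windows that keep at least `exclusion` bases from any miRNA.

    chrom_lengths: dict mapping chromosome name to its length in bases.
    mirna_loci:    list of (chromosome, position) pairs for known miRNAs.
    Returns a list of (chromosome, start, end) windows, end exclusive.
    May return fewer than n_samples if max_attempts is exhausted.
    """
    rng = random.Random(seed)
    chroms = sorted(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]
    negatives = []
    for _ in range(max_attempts):
        if len(negatives) == n_samples:
            break
        # Pick a chromosome proportionally to its length, then a start site.
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(0, chrom_lengths[chrom] - sample_len + 1)
        end = start + sample_len
        # Reject the window if any known miRNA lies inside the padded window.
        too_close = any(
            c == chrom and start - exclusion < pos < end + exclusion
            for c, pos in mirna_loci
        )
        if not too_close:
            negatives.append((chrom, start, end))
    return negatives
```

This sketch does not prevent two negative windows from overlapping each other; a production pipeline would likely add that check and work with full miRNA intervals rather than points.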

We prepared an independent set of test data for validating the performance of the classifier. For this purpose, we have used the information provided in Marson et al. [9]. They report several miRNA TSSs defined over a stretch of bp or more. The data for only the brain-specific miRNAs are considered here. A region around the center of the bp stretch is taken as a positive TSS sample. Ninety such positive samples have been collected. Ninety negative samples have also been collected as described earlier. This provides a set of independent test samples.

Description of Features

For the prediction of brain-tissue specific miRNA TSS, a large number of features has been generated based on diverse sequence characteristics as well as epigenetic properties. Some of these were used in [13], while some are new. These can be grouped into five different categories as follows:

1. N-mer Features (NM).

The frequencies of n-mers (for n = 1, 2, 3 and 4) are collected from a sequence by considering only its valid subsequence segments. A valid subsequence is a portion of a given sequence that contains only the four bases ‘A’, ‘T’, ‘G’ and ‘C’. In contrast, an ‘N’ is used to denote an undefined base. As the n-mer based features are taken from diverse samples, they are normalized by dividing by the length of the corresponding valid sequence segment. In this way, a total of 340 (= 4 + 16 + 64 + 256) features are obtained.
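A minimal Python sketch of this feature extraction is given below. It assumes that n-mers are not counted across undefined bases and that every count is divided by the total number of valid bases in the sample; the function name `nmer_features` is illustrative.

```python
import re
from itertools import product


def nmer_features(seq, max_n=4):
    """Normalized n-mer frequencies (n = 1..max_n) over the valid parts of seq."""
    seq = seq.upper()
    # Valid segments are maximal runs of A/C/G/T; 'N' and other symbols split them.
    segments = re.findall(r"[ACGT]+", seq)
    valid_len = sum(len(s) for s in segments)
    counts = {
        "".join(p): 0
        for n in range(1, max_n + 1)
        for p in product("ACGT", repeat=n)
    }
    for seg in segments:
        for n in range(1, max_n + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    if valid_len == 0:
        return {k: 0.0 for k in counts}
    return {k: c / valid_len for k, c in counts.items()}
```

With the default `max_n=4` the returned dictionary has one entry per possible 1-mer, 2-mer, 3-mer and 4-mer.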

2. Palindromic Features (PL).

The occurrences of several palindromic subsequences with half length , , and are extracted from the valid portion of the given sample sequence. Similar to the n-mer feature generation, their frequencies are normalized by dividing each of them by the length of the corresponding valid sequence portion. In this manner, a total of features are collected.

3. Special Features (S).

We include three over-represented special subsequence patterns that are frequent in promoters [35]. The different forms of these three patterns are: G**G, G**G**G and GC**GC**GC in which the wildcard character ‘*’ represents either one of A/T/C/G. Analogously to the above two feature categories, these three features are also normalized.
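The three wildcard patterns translate directly into regular expressions. The sketch below is one possible reading, assuming that overlapping occurrences are counted and that counts are normalized by the number of valid (A/C/G/T) bases, as for the previous feature categories; the names `SPECIAL_PATTERNS` and `special_features` are illustrative.

```python
import re

# Each '*' in the source patterns becomes one position matching any valid base.
SPECIAL_PATTERNS = {
    "G**G": r"G[ACGT]{2}G",
    "G**G**G": r"G[ACGT]{2}G[ACGT]{2}G",
    "GC**GC**GC": r"GC[ACGT]{2}GC[ACGT]{2}GC",
}


def special_features(seq):
    """Normalized counts of the three special patterns in seq."""
    seq = seq.upper()
    valid_len = sum(1 for base in seq if base in "ACGT")
    feats = {}
    for name, pattern in SPECIAL_PATTERNS.items():
        # A zero-width lookahead lets matches overlap, e.g. GAAGTTG holds two G**G.
        hits = sum(1 for _ in re.finditer(f"(?={pattern})", seq))
        feats[name] = hits / valid_len if valid_len else 0.0
    return feats
```

Because the character class excludes 'N', a pattern occurrence that would span an undefined base is not counted.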

4. CpG Island Based Features (CI).

According to Gardiner-Garden et al., a genomic region that contains a higher density of G+C and CpG than the average in the whole genome is called a CpG island [21]. A large fraction of human promoters has high CpG content [36]. Some studies related to CpG islands emphasize that unmethylated CpGs are frequently found in clusters inside the CpG islands [21], [22]. This cluster formation plays a significant role in determining the patterns of gene regulation. Usually, CpG islands are characterized by two feature values, the value of CpG O/E (CpG observed over expected ratio) and G+C content (cumulative occurrence of C and G). These values are calculated along the sequence with a sliding window approach. Determining a suitable window length is a challenging job. In a recent study of Hackenberg et al., the problem of choosing the ad hoc value for the length of the examined region has been addressed [37]. A number of CpG-related studies highlight that CpG islands can be better characterized by considering only the non-repetitive portion of the sequence rather than the entire sequence [13]. This is possibly because many regions that comprise repeat elements (like Alu repeats), which are abundant in the genome, resemble CpG islands [22], [26]. Therefore many false-positive regions may come into view as CpG-rich promoters. Inspired by this observation, both the CpG O/E and G+C pair values are computed from the non-repeated portion of the given region of interest. These values can be calculated either with overlapping or non-overlapping sliding windows. Following an earlier observation [13], we have considered non-overlapping windows of lengths { bp, bp, bp and bp} over the entire region of interest. The CpG O/E value is calculated as $\frac{\#\mathrm{CpG}}{\#\mathrm{C} \times \#\mathrm{G}} \times N$, where $\#\mathrm{X}$ is the number of occurrences of X and $N$ denotes the length of the non-repeated sequence analyzed. On the other hand, G+C content is calculated as $\frac{\#\mathrm{C} + \#\mathrm{G}}{N}$. In this way, a total features have been defined.
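The windowed computation can be sketched in Python as follows. The sketch assumes the standard Gardiner-Garden definitions of CpG O/E and G+C content, and it assumes that repeats are soft-masked in lowercase (the UCSC convention) so that they can be dropped before windowing. The `window` argument is a hypothetical parameter standing in for each of the window lengths used in the study, and both function names are illustrative.

```python
def strip_repeats(seq):
    """Keep only uppercase A/C/G/T, dropping soft-masked repeats and 'N'."""
    return "".join(base for base in seq if base in "ACGT")


def cpg_window_features(seq, window):
    """Return a (CpG O/E, G+C content) pair for each non-overlapping window."""
    feats = []
    for start in range(0, len(seq) - window + 1, window):
        w = seq[start:start + window]
        n = len(w)
        c, g = w.count("C"), w.count("G")
        cpg = w.count("CG")
        # CpG O/E = (#CpG * N) / (#C * #G); defined as 0 if C or G is absent.
        obs_exp = cpg * n / (c * g) if c and g else 0.0
        gc_content = (c + g) / n
        feats.append((obs_exp, gc_content))
    return feats
```

A typical call would first apply `strip_repeats` to the sample sequence and then run `cpg_window_features` once per window length, concatenating the resulting pairs into the CI feature vector.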

5. Methylation Based Features (MT).

DNA methylation is a common epigenetic modification of cytosines in CpG dinucleotides. Unmethylated CpGs cluster in CpG islands. We use the recently published database MethylomeDB [23] to compute MT features for the positive and negative samples of bp regions. This database offers genome-wide DNA methylation profiles corresponding to brain-tissue of both human and mouse. There are a total of human brain samples corresponding to three different cortical regions, namely, dorsolateral prefrontal cortex (dlPFC), ventral prefrontal cortex (vPFC) and auditory cortex (AC). Among these samples, ( dlPFC, vPFC and AC) are schizophrenia disease samples whereas ( dlPFC, vPFC and AC) are non-psychiatric controls. For the present research work, we have analyzed only the methylation patterns from non-psychiatric controls. For each of the specified regions, the methylation score is computed based on the methylated sites falling within that region. Let be the probability of a site () being methylated, within the region under consideration, and be the sequence read coverage. Then, the feature value is computed aswhere denotes the count of CpG islands in the region studied. The rationale behind this normalized score is to give importance to higher methylation probability and penalizing it for lower read coverage (see Text S1 for more details). In this way, total MT features are generated, one for each of the non-psychiatric control samples.

Methods

The feature selection algorithm, the RF based classification model and the brain-tissue specific TSS prediction models are described in the following subsections.

Feature Selection Algorithm

For many real-life applications, feature selection is necessary because a lot of the features are irrelevant or redundant [38]. Feature selection algorithms differ in the strategy employed for searching for feature subsets and in the score that measures the importance of a feature subset. Mutual information is widely used in feature selection algorithms due to its ability to identify non-linear dependence between two features. Mutual information between two random variables measures the mutual dependence between the two variables. The Variable Weighted Maximal Relevance Minimal Redundancy criterion based feature selection (VWMRmR) [25] is a recently proposed algorithm that utilizes an existing normalized variant of mutual information [39] to compute both the class relevance as well as the average redundancy of the candidate feature. Earlier approaches like the Maximal Relevance Minimal Redundancy criterion based feature selection algorithm (mRMR) [40], Normalized Mutual Information based Feature Selection (NMIFS) [41] and Improved Normalized Mutual Information based Feature Selection (INMIFS) [42] weight the class relevance and the average redundancy equally and keep these two weights fixed throughout the steps of feature selection. The VWMRmR approach is a weighted version of the mRMR method in which the weight of the average redundancy is continuously increased with respect to the number of features that have already been selected while a fixed weight value is set for the class relevance. The performance of the VWMRmR has been evaluated to be superior to several other existing mutual information based feature selection algorithms, namely, maximal relevance based feature selection (MR), mRMR and INMIFS, based on analyses of six real-life high dimensional datasets. In this article we have selected the topmost features according to VWMRmR.
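The general idea of a relevance-minus-weighted-redundancy criterion can be sketched in Python as below. This is not the published VWMRmR algorithm: the normalization of mutual information (division by the smaller entropy) and the linear weight schedule `redundancy_weight` are assumptions chosen for illustration, and the features are assumed to be already discretized. All function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def normalized_mi(xs, ys):
    """Mutual information divided by the smaller entropy (assumed variant)."""
    denom = min(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0


def weighted_mrmr(columns, labels, k, redundancy_weight=lambda m: 1.0 + 0.1 * m):
    """Greedy selection; the redundancy weight grows with the number selected.

    columns: list of feature columns, each a list of discrete values.
    redundancy_weight: hypothetical schedule, a function of the number of
    features already selected.
    """
    relevance = [normalized_mi(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(k, len(columns)):
        best, best_score = None, float("-inf")
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = sum(
                normalized_mi(columns[j], columns[s]) for s in selected
            ) / len(selected)
            score = relevance[j] - redundancy_weight(len(selected)) * redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

In this sketch a feature that duplicates an already selected one is penalized by its full redundancy, so a less relevant but non-redundant feature can be preferred; the penalty becomes stricter as more features are selected.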

The RF based Classification Model

An RF has been trained for the purpose of building a classification model. The WEKA software [43] has been used for this purpose. There are two important parameters that need to be set, i.e., numFeatures (the number of features to be employed in each random selection) and numTrees (the number of decision trees to be produced). For the purpose of validation, we have set both of these values to based upon sensitivity analysis. The performance of the corresponding RF model has been assessed using five-fold cross validation and this was repeated five times to obtain a single mean estimate. Five evaluation criteria, namely accuracy ($Acc$), sensitivity ($Sn$), specificity ($Sp$), precision ($Pr$) and Matthews correlation coefficient ($MCC$), are used. These are defined as follows:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad Pr = \frac{TP}{TP + FP},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where $TP$, $TN$, $FP$ and $FN$ denote the number of true positives, true negatives, false positives and false negatives, respectively.
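The five evaluation criteria follow directly from the confusion-matrix counts. A small Python sketch, assuming the standard definitions of these measures and returning 0 whenever a denominator vanishes (a convention chosen here, not stated in the study), is:

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision and MCC from counts."""

    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, with hypothetical counts of 40 true positives, 45 true negatives, 5 false positives and 10 false negatives, the accuracy is 0.85, the sensitivity 0.8 and the specificity 0.9.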

Proposed Brain-tissue Specific MiRNA TSS Prediction Model

From the training data, a set of features was extracted, as described earlier. Then the VWMRmR algorithm [25] was applied to select the top features. These were used to train an RF-based classifier as already described. This model was used for brain-tissue specific miRNA TSS prediction. Here we have posed the problem of TSS identification as a binary classification problem. The capability of this model was assessed using independent test data as described in the Results section.

Supporting Information

Text S1.

Details about the collection of brain-specific miRNAs, preparation of miRNA TSS dataset, and the construction of the methylation-based feature score.

https://doi.org/10.1371/journal.pone.0066722.s001

(PDF)

Author Contributions

Conceived and designed the experiments: MB TL SB. Performed the experiments: TB MB LF. Analyzed the data: TB MB LF. Contributed reagents/materials/analysis tools: TB MB LF TL SB. Wrote the paper: TB MB LF TL SB.

References

1. Fabian MR, Sonenberg N, Filipowicz W (2010) Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem 79: 351–379.
2. Bartel DP (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116: 281–297.
3. Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell 136: 215–233.
4. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, et al. (2009) miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 37: D98–D104.
5. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36: D154–D158.
6. Trujillo RD, Yue SB, Tang Y, O’Gorman WE, Chen CZ (2010) The potential functions of primary microRNAs in target recognition and repression. EMBO J 29: 3272–3285.
7. Saini HK, Griffiths-Jones S, Enright AJ (2007) Genomic analysis of human microRNA transcripts. Proc Natl Acad Sci U S A 104: 17719–17724.
8. Fujita S, Iba H (2008) Putative promoter regions of miRNA genes involved in evolutionarily conserved regulatory systems among vertebrates. Bioinformatics 24: 303–308.
9. Marson A, Levine SS, Cole MF, Frampton GM, Brambrink T, et al. (2008) Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134: 521–533.
10. Ozsolak F, Poling LL, Wang Z, Liu H, Liu XS, et al. (2008) Chromatin structure analyses identify miRNA promoters. Genes Dev 22: 3172–3183.
11. Corcoran DL, Pandit KV, Gordon B, Bhattacharjee A, Kaminski N, et al. (2009) Features of mammalian microRNA promoters emerge from polymerase II chromatin immunoprecipitation data. PLoS One 4: e5279.
12. Chien CH, Sun YM, Chang WC, Chiang-Hsieh PY, Lee TY, et al. (2011) Identifying transcriptional start sites of human microRNAs based on high-throughput sequencing data. Nucleic Acids Res 39: 9345–9356.
13. Bhattacharyya M, Feuerbach L, Bhadra T, Lengauer T, Bandyopadhyay S (2012) MicroRNA transcription start site prediction with multi-objective feature selection. Stat Appl Genet Mol Biol 11: Article 6.
14. Vapnik V (1995) The nature of statistical learning theory. New York: Springer.
15. Zemach A, McDaniel IE, Silva P, Zilberman D (2010) Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328: 916–919.
16. Baer C, Claus R, Frenzel LP, Zucknick M, Park YJ, et al. (2012) Extensive promoter DNA hypermethylation and hypomethylation is associated with aberrant microRNA expression in chronic lymphocytic leukemia. Cancer Res 72: 3775–3785.
17. Schübeler D (2012) Epigenetic islands in a genetic ocean. Science 338: 756–757.
18. Novik KL, Nimmrich I, Genc B, Maier S, Piepenbrock C, et al. (2002) Epigenomics: genome-wide study of methylation phenomena. Curr Issues Mol Biol 4: 111–128.
19. Deaton AM, Bird A (2011) CpG islands and the regulation of transcription. Genes Dev 25: 1010–1022.
20. Illingworth RS, Bird A (2009) CpG islands - ‘A rough guide’. FEBS Lett 583: 1713–1720.
21. Gardiner-Garden M, Frommer M (1987) CpG islands in vertebrate genomes. J Mol Biol 196: 261–282.
22. Takai D, Jones PA (2002) Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci U S A 99: 3740–3745.
23. Xin Y, Chanrion B, O’Donnell AH, Milekic M, Costa R, et al. (2012) MethylomeDB: a database of DNA methylation profiles of the brain. Nucleic Acids Res 40: D1245–D1249.
24. Breiman L (2001) Random forests. Mach Learn 45: 5–32.
25. Bandyopadhyay S, Bhadra T, Maulik U. Variable weighted maximal relevance minimal redundancy criterion for feature selection using normalized mutual information. Communicated.
26. Zhao Z, Han L (2009) CpG islands: Algorithms and applications in methylation studies. Biochem Biophys Res Commun 382: 643–645.
27. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462: 315–322.
28. Chen YW, Lin CJ (2006) Combining SVMs with various feature selection strategies. In: Feature extraction, foundations and applications, Springer. 315–324.
29. Wang X, Xuan Z, Zhao X, Li Y, Zhang MQ (2008) High-resolution human core-promoter prediction with Coreboost HM. Genome Res 19: 266–275.
30. Schaefer U, Kodzius R, Kai C, Kawai J, Carninci P, et al. (2010) High sensitivity TSS prediction: Estimates of locations where TSS cannot occur. PLoS One 5: e13934.
31. Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y (2008) Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 18: 310–323.
32. Bhattacharyya M, Das M, Bandyopadhyay S (2012) miRT: A database of validated transcription start sites of human microRNAs. Genomics Proteomics Bioinformatics 10: 310–316.
33. Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, et al. (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 19: 1–21.
34. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, et al. (2004) The UCSC table browser data retrieval tool. Nucleic Acids Res 32: D493–D496.
35. Anand A, Pugalenthia G, Fogel GB, Suganthan PN (2010) Identification and analysis of transcription factor family-specific features derived from DNA and protein information. Pattern Recognit Lett 31: 2097–2102.
36. Saxonov S, Berg P, Brutlag DL (2006) A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci U S A 103: 1412–1417.
37. Hackenberg M, Previti C, Luque-Escamilla PL, Carpena P, Martínez-Aroza J, et al. (2006) CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinformatics 7: 446.
38. Duda RO, Hart PE, Stork DG (2000) Pattern Classification. New York: John Wiley and Sons.
39. Strehl A, Ghosh J (2002) Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617.
40. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27: 1226–1238.
41. Estevez PA, Tesmer M, Perez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20: 189–201.
42. Vinh LT, Thang ND, Lee YK (2010) An improved maximum relevance and minimum redundancy feature selection algorithm based on normalized mutual information. In: Proceedings of the 10th Annual International Symposium on Applications and the Internet. Yongin, South Korea, 395–398.
43. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, et al. (2009) The WEKA data mining software: An update. SIGKDD Explor 11: 10–18.

