Conceived and designed the experiments: ZP. Performed the experiments: FT BM. Analyzed the data: BM FT SB ZP. Contributed reagents/materials/analysis tools: ZP SB. Wrote the paper: ZP BM SB FT.
The authors have declared that no competing interests exist.
Chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-Seq) has emerged as a powerful tool for genome-wide profiling of the binding sites of proteins associated with DNA, such as histones and transcription factors. However, no peak calling program has gained consensus acceptance by the scientific community as the preferred tool for ChIP-Seq data analysis. Analyzing the large data sets generated by ChIP-Seq studies remains highly challenging for most molecular biology laboratories.
Here we profile H3K27me3 enrichment sites in rice young endosperm using the ChIP-Seq approach and analyze the data using four peak calling algorithms (FindPeaks, PeakSeq, USeq, and MACS). Comparison of the four algorithms reveals that these programs produce very different peaks in terms of peak size, number, and position relative to genes. We verify the peak predictions using ChIP-PCR to evaluate the accuracy of peak prediction of the four algorithms. We discuss the approach of each algorithm and compare similarities and differences in the results. Despite their differences in the peaks identified, all of the programs reach similar conclusions about the effect of H3K27me3 on gene expression. Its presence either upstream or downstream of a gene is predominately associated with repression of the gene. Additionally, GO analysis finds that a substantially higher ratio of genes associated with H3K27me3 were involved in multicellular organism development, signal transduction, response to external and endogenous stimuli, and secondary metabolic pathways than the rest of the rice genome.
Chromatin immunoprecipitation (ChIP) coupled with high throughput sequencing (ChIP-Seq) has emerged as one of the most promising tools for profiling protein-DNA binding sites and chromatin modifications on a genome-wide scale
Previous studies have shown that the repressive function of histone 3 modification H3K27me3 is conserved between plants and animals although the modification patterns and the mechanisms by which H3K27me3 is established or maintained may be different
In this report, we identified H3K27me3 modification sites within rice (
Chromatin was isolated from young rice endosperm and fragmented to a size range of 150 to 400 bp. The solubilized chromatin fragments were immunoprecipitated with antibodies against H3K27me3 (Millipore). The recovered DNA fragments were processed for DNA sequencing by Illumina. The reads produced by the Illumina Genome Analyzer were 36 base pairs long. Two DNA samples were analyzed. One was the immunoprecipitated DNA sample enriched for H3K27me3, while the other was a control consisting of sonicated genomic DNA fragments. A total of 10,999,931 ChIP reads were produced, and SeqMap
Four peak calling programs (MACS
This image displays a 200 kb region of rice chromosome 1. The top track indicates the positions of all genes identified by TIGR, v6. The next two tracks display the distribution of ChIP and Input (control) reads respectively. The following four tracks display the predicted peaks by each of the four examined peak calling programs. The sequence read numbers were normalized to ensure that the ChIP and the Input had identical read numbers over the total genome. Therefore, the height of the graph in this figure directly correlates with the read number in the region to visually display the DNA enrichment.
Program | Peak Count | Base pair coverage | Percent coverage | Mean peak bandwidth | Standard deviation |
PeakSeq, 200 | 71,269 | 170,523,735 | 43.8% | 2392.7 | 3518.7 |
PeakSeq, 350 | 43,343 | 213,874,615 | 55.0% | 4934.5 | 6805.2 |
PeakSeq, 589 | 23,760 | 268,102,711 | 68.9% | 11283.8 | 14890.5 |
FindPeaks | 41,516 | 35,140,770 | 9.0% | 846.4 | 429.7 |
USeq | 9,094 | 21,031,355 | 5.4% | 2312.7 | 3176.2 |
MACS | 15,738 | 12,227,095 | 3.1% | 777.9 | 831.5 |
This table shows some basic statistics about the peaks identified by each of the peak calling programs. The percent coverage indicates the percentage of the genome identified as part of a peak. To compute the percentage, a genome size of 389 Mb was used
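The statistics in this table can be derived from a list of peak intervals. The following is a minimal sketch, not the authors' script: the function name `peak_stats`, the half-open `(start, end)` coordinate convention, the single-chromosome input, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def peak_stats(peaks, genome_size):
    """Summarize half-open (start, end) peak intervals on one coordinate axis."""
    widths = [end - start for start, end in peaks]
    # Merge overlapping intervals so that shared bases are counted only once.
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(peaks):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return {
        "peak_count": len(peaks),
        "bp_coverage": covered,
        "percent_coverage": 100.0 * covered / genome_size,
        "mean_width": mean(widths),
        "sd_width": pstdev(widths),  # assumption: population SD
    }

# Toy example with hypothetical intervals on a 1,000 bp "genome".
stats = peak_stats([(0, 100), (50, 150), (300, 400)], genome_size=1000)
```

As a consistency check on the table itself, dividing the 35,140,770 bp covered by FindPeaks by a 389 Mb genome gives about 9.0%, which agrees with the reported percent coverage.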
The results show that the average peak bandwidth called by PeakSeq ranges from 2,393 bp to 11,284 bp depending on the max_gap parameter. Smaller values of the max_gap parameter result in shorter peak bandwidths. In the remainder of this manuscript, all PeakSeq results use max_gap = 200 and the program is referred to as PeakSeq(200). The peaks produced by FindPeaks have an average bandwidth of 846 bp, while those identified by USeq average 2,313 bp, similar to those identified by PeakSeq(200). The peak bandwidth identified by MACS is the shortest at 778 bp on average.
FindPeaks identified 41,516 peaks covering 9.0% of the genome, USeq identified 9,094 peaks covering 5.4% of the genome, and MACS identified 15,738 peaks covering 3.1% of the genome. In contrast, PeakSeq identified from 23,760 to 71,269 peaks, covering from 43.8% to 68.9% of the genome, depending on the max_gap value.
The identified peaks were further characterized by their frequency of being identified by the four programs and the variation in peak height for each program (
Program | One Support | Two Support | Three Support | Four Support | Peak Height |
PeakSeq | 43.19% | 23.27% | 15.84% | 17.71% | 10.51 |
FindPeaks | 0.94% | 37.32% | 20.79% | 40.96% | 17.85 |
USeq | 0.01% | 16.05% | 38.01% | 45.92% | 17.87 |
MACS | 0.00% | 40.95% | 27.87% | 31.18% | 14.05 |
Table entries show the overlap of peaks identified by each program with those identified by other programs. Each column entry gives the percent of base pairs belonging to peaks identified by the specified program that were also identified as in peaks by other programs. Peak height gives the mean difference between the highest and lowest read counts of the peaks identified by that program.
To correlate gene expression with the H3K27me3 peaks, we carried out a gene expression profile study using an Affymetrix rice whole genome array with the cDNAs of rice endosperm at the same developmental stage as that in the ChIP experiment. TIGR 6 identifies 73,403 genes (including some alternate splice forms), but only 35,522 genes were represented on the gene expression microarray. In order to investigate the relationship between gene expression and ChIP-Seq peaks, only those genes with gene expression values were included in the analysis. These genes were grouped into four categories based on the location of H3K27me3 peaks relative to the gene: “within” (a peak within the gene), “upstream” (a peak less than 2 kb upstream of the gene), “downstream” (a peak less than 2 kb downstream of the gene), and “none” (no peak within the gene or within 2 kb upstream or downstream of it). In some cases, a gene could be assigned to multiple categories. We found that this involved a relatively small percentage of the genes (about 10.47% with USeq) and therefore concluded that assigning each gene to a single category would not significantly alter the analysis results.
These pie charts detail the portion of genes falling into each of the four categories based on their position relative to ChIP-Seq peaks: Upstream, Downstream, Within and No peak. The label above each pie chart indicates the program with which the pie chart is associated. In all cases, the blue portion represents the genes with an upstream peak, the red represents genes with a downstream peak, the green represents genes with a peak within the gene, and the purple represents genes with no peak identified.
The expression values were discretized into three categories: high, middle, and low as described in
This figure illustrates the conditional probabilities of gene expression versus peak classification for each of the four peak identification programs. For example, the top bar for MACS indicates that approximately 20% of the genes with a peak upstream had high expression, while about 50% of those with a peak upstream had low gene expression.
Expression Value | Peak Location | FindPeaks | PeakSeq | USeq | MACS |
High | Downstream | 1.00 | 1.00 | 1.00 | 1.00 |
High | None |  | 1.00 |  |  |
High | Upstream | 0.98 | 1.00 | 1.00 | 1.00 |
High | Within | 1.00 |  | 0.66 | 1.00 |
Low | Downstream | 0.65 |  |  |  |
Low | None | 1.00 |  | 1.00 | 1.00 |
Low | Upstream |  |  |  |  |
Low | Within |  | 1.00 | 0.17 |  |
Middle | Downstream |  | 0.73 | 0.22 | 0.59 |
Middle | None | 1.00 | 0.94 | 0.52 | 0.13 |
Middle | Upstream | 0.72 | 0.84 | 0.47 | 0.67 |
Middle | Within | 0.02 | 0.02 | 0.70 | 0.80 |
This table shows the results of using a hypergeometric test to test for statistical significance between the expression value for genes and the peak classification for genes. The expression value and peak location indicate the classification of the genes for expression and peak classification, respectively. The remaining four columns provide the calculated p-values based on peak classifications for each of the peak identification programs. The statistically significant cells are bold. A cut-off of 10^-2 was used to identify significant relationships.
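To illustrate the kind of calculation behind these p-values, the sketch below computes a one-sided hypergeometric tail probability. The exact formulation used for the table (which tail, which gene population) is not stated here, so the setup is an assumption: `N` is the number of genes with expression values, `K` the number in a given expression class, `n` the number in a given peak class, and `k` the number in both. The function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    This is the probability of seeing at least k genes of the expression class
    among the n genes of the peak class if the two labels were independent.
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy numbers (hypothetical): 10 genes, 3 "low", 4 with an upstream peak,
# and all 3 "low" genes among those 4.
p_toy = hypergeom_enrichment_p(3, 10, 3, 4)
```

Under this one-sided formulation, a combination that is depleted rather than enriched yields a p-value close to 1, which would be consistent with the many 1.00 entries in the table.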
In order to further evaluate the peak algorithm results, we selected 18 genes with H3K27me3 peaks and 5 genes without peaks identified visually on the genome browser according to the sequence reads profile. Then we performed ChIP followed by semi-quantitative PCR experiments for these 23 genes.
Antibodies for H3K27me3 were used for ChIP experiments with the chromatin isolated from rice endosperm.
Gene | PCR Result | FindPeaks | MACS | PeakSeq | USeq |
LOC_Os01g04800 | + | + | − | + | + |
LOC_Os01g18440 | + | + | − | + | + |
LOC_Os01g18584 | + | + | − | + | + |
LOC_Os02g07430 | + | + | − | + | + |
LOC_Os02g45850 | + | + | − | + | + |
LOC_Os03g63810 | + | + | − | + | − |
LOC_Os04g41229 | + | − | − | + | − |
LOC_Os05g11130 | + | + | − | + | + |
LOC_Os05g20930 | + | − | − | + | − |
LOC_Os05g48990 | + | − | − | + | − |
LOC_Os06g11330 | + | + | − | + | + |
LOC_Os07g13260 | + | + | − | + | + |
LOC_Os08g02160 | + | − | − | + | − |
LOC_Os08g06370 | + | + | − | + | + |
LOC_Os09g24490 | + | + | − | + | + |
LOC_Os10g39130 | + | − | − | + | − |
LOC_Os11g29870 | + | + | + | + | + |
LOC_Os12g10540 | + | + | + | + | + |
LOC_Os03g09930 | − | − | − | − | − |
LOC_Os01g10504 | − | − | − | − | − |
LOC_Os12g43640 | − | − | − | − | − |
LOC_Os12g44380 | − | − | − | − | − |
LOC_Os07g40570 | − | − | − | + | − |
This table shows the result of the PCR analysis. The first column indicates the TIGR rice gene name. The second column shows the PCR result. The remaining columns indicate the classification of the gene by each of the programs. A “+” indicates a ChIP-enriched gene, while a “−” indicates an unenriched gene.
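The agreement between each program and the PCR results can be tallied directly from this table. The sketch below transcribes the calls in row order (ASCII "-" stands in for the minus sign used in the table); the variable and function names are illustrative. Because the tested genes were selected visually from the read profile, these tallies describe agreement on this gene set only and should not be read as unbiased accuracy estimates.

```python
# Calls transcribed from the table in row order: the first 18 genes are
# PCR-positive and the last 5 are PCR-negative.
pcr = "+" * 18 + "-" * 5
calls = {
    "FindPeaks": "++++++-+--++-++-++" + "-----",
    "MACS": "-" * 16 + "++" + "-----",
    "PeakSeq": "+" * 18 + "----+",
    "USeq": "+++++--+--++-++-++" + "-----",
}

def agreement(pred, truth):
    """Return (true positives, positives, true negatives, negatives)."""
    tp = sum(p == "+" and t == "+" for p, t in zip(pred, truth))
    tn = sum(p == "-" and t == "-" for p, t in zip(pred, truth))
    return tp, truth.count("+"), tn, truth.count("-")

summary = {name: agreement(pred, pcr) for name, pred in calls.items()}
for name, (tp, pos, tn, neg) in summary.items():
    print(f"{name}: {tp}/{pos} PCR-positive genes called, "
          f"{tn}/{neg} PCR-negative genes left uncalled")
```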
We next analyzed the GO annotations assigned to the genes with significant enrichment. The GO is divided into three distinct categories: (i) the biological processes (BP) in which the gene product participates; (ii) the molecular functions (MF) that describe the gene product activities, such as catalytic or binding activities, at the molecular level; and (iii) the cellular component (CC) where the gene product can be found. We used agriGO
This figure shows the significant biological process GO annotations for genes with low gene expression and a MACS peak upstream. The top line in each of the boxes lists the GO identifier of the term and the statistical significance (multiple hypothesis corrected p-value, lower is more significant) of that annotation. The middle lines are a description of the GO term. The four numbers on the bottom line are the number of genes with low expression and an upstream peak that had the annotation, the number of genes with low expression and an upstream peak that had any annotation (always 247), the total number of genes that had the annotation and the total number of genes that had any annotation (always 30241). The color of the box is an indication of the significance of the term. White boxes are not significant. The higher level it is, the more significant the GO term is. The color of the arrows indicates the relationship among the GO terms. Black signifies “is_a.” Orange is “part_of;” red is “positive_regulate.” Purple is “regulate,” while green is “negative_regulate.” Long dashes indicate “two significant nodes,” and short dashes mean “one significant node.”
This figure shows the significant molecular function GO annotations for genes with low gene expression and a MACS peak upstream. The notation and coloring are the same as that described in
This figure shows the significant cellular component GO annotations for genes with low gene expression and a MACS peak upstream. The notation and coloring are the same as that described in
We analyzed the performance of four different peak identification programs with ChIP-Seq data from rice endosperm. The programs produced quite different peaks in terms of peak size, number and position relative to genes. We evaluated the peak predictions using ChIP-PCR and compared the accuracy of peak prediction of these algorithms. PeakSeq identifies a large number of peaks, which cover from 44% to 69% of the genome. However, the identified peaks were not very precise, as shown by the ChIP-PCR tests and by comparison of the read profiles of the ChIP and Input samples in GBrowse. While the peaks identified by MACS were supported by other peak calling programs, this program might miss many true peaks, as shown in our ChIP-PCR verification. The peaks identified by USeq are also reliable, but the program only identifies large peaks; smaller peaks were either not detected or merged with nearby peaks. FindPeaks identified a large number of peaks of various sizes, and the identified peaks were mostly reliable as judged by ChIP-PCR and the average peak height. However, FindPeaks does not take the control data into consideration while identifying the peaks. To acquire accurate peak calling results, the control data need to be considered. A simple compensation method is to subtract the peaks generated by the same program from the control data alone, after normalization. A variety of differences in these algorithms, such as use of a control dataset and statistical corrections for read counts, can account for the differences in their results. Given that each program has advantages and disadvantages, it is best to analyze the data with multiple programs to verify the results. Meanwhile, it is still necessary to develop more ChIP-Seq data analysis tools to eventually identify a program that best fits the requirements of regular molecular biology laboratories for ChIP-Seq studies.
Deng's group published an H3K27me3 whole-genome profile from Nipponbare seedling shoots in the four-leaf stage
Despite the differences, all four of the ChIP-Seq data analysis programs reach similar conclusions about the effect of H3K27me3 on gene expression. Its presence upstream of a gene is often associated with repression of expression (
All plants used in this study were rice strain
The chromatin was extracted from endosperm following the protocol of Gendrel
ChIP experiments were carried out as described by Gendrel et al
Input and ChIP samples were processed following Illumina's protocol from the ChIP DNA Sample Prep Kit. Briefly, 10 ng of input and ChIP-enriched DNA was subjected to end repair, addition of “A” bases to 3′ ends, ligation of adapters, agarose gel size selection for fragments with an average size of about 186 bp, and PCR amplification to produce a DNA library of adapter-modified fragments. DNA sequencing was carried out using the Illumina/Solexa Genome Analyzer sequencing system at a concentration of 2 to 4 pM. Cluster amplification, linearization, blocking and sequencing primer reagents were provided in the Solexa Cluster Amplification kits and were used according to the manufacturer's specifications.
The generated short reads were mapped onto the genome using SeqMap
Each of the four peak calling programs (MACS
All the ChIP-Seq data were deposited in NCBI's Gene Expression Omnibus (
The support value of a peak by different programs was calculated on a base-pair basis. We labeled each base pair with the programs that identified that base pair as belonging to a peak. We call the number of programs which identified a base pair as belonging to a peak the supporters of that base pair. Then, we counted the number of base pairs with supporters of four (so all programs identified that base pair as belonging to a peak), three, two, and one and listed these support programs with the peak. Finally, we calculated the percentage of base pairs for each support value for each program. For example, if one program identified a total of 200 base pairs as belonging to peaks and of those 200, 60 had a support value of 2, then 30% of the base pairs for that program had support 2.
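The base-pair support calculation described above can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function name, the half-open `(start, end)` interval convention, and the per-base sets (practical only for small examples, not for a whole genome) are assumptions.

```python
from collections import Counter

def support_percentages(peaks_by_program):
    """Map each program to {support level: percent of its peak base pairs}.

    peaks_by_program maps a program name to a list of half-open (start, end)
    intervals on a single coordinate axis.
    """
    # Base positions that each program places inside a peak.
    covered = {
        name: {pos for start, end in peaks for pos in range(start, end)}
        for name, peaks in peaks_by_program.items()
    }
    # Number of programs (supporters) for every base pair.
    supporters = Counter(pos for bases in covered.values() for pos in bases)
    n_programs = len(peaks_by_program)
    result = {}
    for name, bases in covered.items():
        if not bases:
            result[name] = {}
            continue
        counts = Counter(supporters[pos] for pos in bases)
        result[name] = {
            level: 100.0 * counts[level] / len(bases)
            for level in range(1, n_programs + 1)
        }
    return result

# Mirrors the worked example in the text: 200 bp in peaks, 60 of them
# supported by two programs, giving 30% at support 2.
example = support_percentages({"A": [(0, 200)], "B": [(140, 200)]})
```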
The height of peaks was calculated on a per-peak basis. We counted the number of ChIP-Seq reads mapping to each position in the peak. The minimum number of reads was subtracted from the maximum number of reads in the peak. We then calculated the mean of this difference over all peaks for each program.
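A minimal sketch of this peak height calculation is shown below. It assumes that per-position read counts are already available as a list `depth` indexed by genomic position and that peaks are half-open `(start, end)` intervals; both the data layout and the function name are hypothetical.

```python
from statistics import mean

def mean_peak_height(peaks, depth):
    """Mean over peaks of (maximum read count - minimum read count) in the peak."""
    heights = []
    for start, end in peaks:
        window = depth[start:end]
        heights.append(max(window) - min(window))
    return mean(heights)

# Toy read-depth profile with two hypothetical peaks.
toy_depth = [0, 1, 5, 9, 4, 1, 0, 2, 2, 8, 3]
toy_height = mean_peak_height([(1, 6), (7, 11)], toy_depth)
```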
The immature rice seeds were harvested 6–7 days after pollination. After removal of the embryo, the seeds were ground to a fine powder in liquid nitrogen and resuspended in RNA extraction buffer (50 mM Tris-Cl, pH 8.0, 150 mM LiCl, 5 mM EDTA, pH 8.0, 1% SDS). The mixture was extracted twice with phenol-chloroform and once with chloroform. Five volumes of TRIzol were added to the aqueous phase, which was then extracted once with chloroform. The RNA was precipitated with isopropanol. After washing with 70% ethanol, the RNA was dissolved in DEPC-H2O. After digestion with RNase-free DNase, the RNA was quantified with the NanoDrop method and its quality was assessed with an Agilent 2100 Bioanalyzer. A total of 5 µg of RNA was used as starting material for the microarray experiment. The cRNA probe was labeled and hybridized to the gene chips according to the manufacturer's instructions (Affymetrix). The raw microarray data were extracted from the chip images using the Gene Chip Operating Software (Affymetrix). All the microarray data are MIAME compliant and the raw data have been deposited in NCBI's Gene Expression Omnibus (
Each raw expression score was log transformed, and then a z-score was computed for each gene:
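A conventional z-score of this kind can be written as below. This is a sketch under assumptions: $x_g$ denotes the raw expression score of gene $g$, $n$ the number of genes with expression values, and $\mu$ and $\sigma$ the mean and standard deviation of the log-transformed scores over all $n$ genes. The base of the logarithm and the choice of population versus sample standard deviation are not specified in the text.

```latex
z_g = \frac{\log x_g - \mu}{\sigma}, \qquad
\mu = \frac{1}{n} \sum_{g=1}^{n} \log x_g, \qquad
\sigma = \sqrt{\frac{1}{n} \sum_{g=1}^{n} \left( \log x_g - \mu \right)^2}
```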
The z-scores were then discretized into three categories: “high expression,” “middle expression” and “low expression” based on the number of standard deviations from the mean as shown below:
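The discretization step can be sketched in code as follows. The cut-off actually used is not given in this text, so the threshold `k` (in standard deviations) is a hypothetical parameter, as are the function name, the base-2 logarithm, and the use of the population standard deviation.

```python
import math
from statistics import mean, pstdev

def discretize_expression(raw_scores, k):
    """Label each gene 'high', 'middle' or 'low' from its z-score.

    raw_scores: dict mapping gene -> positive raw expression score.
    k: hypothetical cut-off in standard deviations (not taken from the source).
    """
    logged = {gene: math.log2(score) for gene, score in raw_scores.items()}
    mu = mean(logged.values())
    sigma = pstdev(logged.values())
    labels = {}
    for gene, value in logged.items():
        z = (value - mu) / sigma
        if z > k:
            labels[gene] = "high"
        elif z < -k:
            labels[gene] = "low"
        else:
            labels[gene] = "middle"
    return labels

# Toy scores for five placeholder genes, with an illustrative cut-off of 1 SD.
toy_labels = discretize_expression(
    {"g1": 1, "g2": 4, "g3": 4, "g4": 4, "g5": 16}, k=1.0
)
```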
The genes were partitioned into four disjoint sets based on where the peaks were located relative to the gene. If a gene body overlapped any peak, then the gene was labeled “within” (because the peak is within the gene). If any peak fell at most 2 kb upstream of a gene, it was labeled “upstream” (because the peak is upstream of the gene). Likewise, if any peak fell at most 2 kb downstream of a gene, it was labeled “downstream” (because the peak is downstream of the gene). Genes with no peaks within, upstream or downstream were labeled “none” or “no peak.” Each gene received only a single label. The precedence of the labels was “within,” “upstream,” “downstream” and “none.” This analysis was performed for each of the peak identification programs. We also tested using a window of 1 kb upstream of a gene and 500 bp downstream of a gene and obtained similar results.
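The labeling rule and its precedence can be sketched as below. Several details are assumptions made for illustration: coordinates are half-open intervals on a single chromosome, "upstream" and "downstream" are resolved using the gene's strand, and a peak counts for a flank if it overlaps the 2 kb window at all. The function name is hypothetical.

```python
def classify_gene(gene_start, gene_end, strand, peaks, window=2000):
    """Return 'within', 'upstream', 'downstream' or 'none' for one gene.

    peaks: list of half-open (start, end) intervals on the same chromosome.
    strand: '+' or '-'; decides which flank is upstream.
    """
    def overlaps(a_start, a_end):
        return any(p_start < a_end and p_end > a_start for p_start, p_end in peaks)

    left = (max(0, gene_start - window), gene_start)
    right = (gene_end, gene_end + window)
    upstream, downstream = (left, right) if strand == "+" else (right, left)

    # Precedence of labels: within, then upstream, then downstream, then none.
    if overlaps(gene_start, gene_end):
        return "within"
    if overlaps(*upstream):
        return "upstream"
    if overlaps(*downstream):
        return "downstream"
    return "none"
```

Passing `window=1000` for the upstream flank and a separate 500 bp downstream flank would require a small extension of this sketch; it is kept symmetric here for brevity.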
Conditional probability is the probability that a proposition is true given that another proposition is true
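In this setting, the quantity of interest is the probability of an expression class given a peak class, estimated as the number of genes with both labels divided by the number of genes with that peak class. The sketch below shows that estimate; the function name and the toy gene counts are illustrative and are not taken from the data.

```python
from collections import Counter

def expression_given_peak(labels):
    """Estimate P(expression class | peak class) from per-gene label pairs.

    labels: iterable of (peak_class, expression_class) pairs, one per gene.
    Returns a nested dict: probs[peak_class][expression_class].
    """
    labels = list(labels)
    joint = Counter(labels)
    marginal = Counter(peak for peak, _ in labels)
    probs = {}
    for (peak, expr), count in joint.items():
        probs.setdefault(peak, {})[expr] = count / marginal[peak]
    return probs

# Toy data: 10 genes with an upstream peak and 4 genes with no peak.
toy = (
    [("upstream", "low")] * 5
    + [("upstream", "middle")] * 3
    + [("upstream", "high")] * 2
    + [("none", "high")] * 4
)
toy_probs = expression_given_peak(toy)
```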
In order to verify the peaks identified by the programs, we selected eighteen genes that were classified as having nearby peaks and five genes that were classified as having no nearby peaks, based on the sequence read profile in the genome browser, for ChIP-PCR verification. Semi-quantitative PCR reactions were performed for these genes; the primers used are listed in
Functional categorization of genes was carried out according to the GO rules
All the microarray data are MIAME compliant, and all the microarray and ChIP-Seq data were deposited in NCBI's Gene Expression Omnibus (
We thank Dr. Juliet Tan, LSBI, MSU for technical help in transcription profile analysis. This research was approved for publication as Journal Article J-10944 of the MAFES, Mississippi State University.