Search
Advanced Search
Average Rating (0 User Ratings)
    • Currently 0/5 Stars.
    See all categories
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
    Rate This Article
We are still in beta! Help us make the site better and report bugs.

Open Access

Research Article

A High Quality Draft Consensus Sequence of the Genome of a Heterozygous Grapevine Variety

Riccardo Velasco1#*, Andrey Zharkikh2#, Michela Troggio1#, Dustin A. Cartwright1,2, Alessandro Cestaro1, Dmitry Pruss2, Massimo Pindo1, Lisa M. FitzGerald2, Silvia Vezzulli1, Julia Reid2, Giulia Malacarne1, Diana Iliev2, Giuseppina Coppola1, Bryan Wardell2, Diego Micheletti1, Teresita Macalma2, Marco Facci1, Jeff T. Mitchell2, Michele Perazzolli1, Glenn Eldredge2, Pamela Gatto1, Rozan Oyzerski2, Marco Moretto1, Natalia Gutin2, Marco Stefanini1, Yang Chen2, Cinzia Segala1, Christine Davenport2, Lorenzo Demattè1, Amy Mraz5, Juri Battilana1, Keith Stormo5, Fabrizio Costa1, Quanzhou Tao5, Azeddine Si-Ammour1, Tim Harkins4, Angie Lackey4, Clotilde Perbost3, Bruce Taillon3, Alessandra Stella6, Victor Solovyev9, Jeffrey A. Fawcett7, Lieven Sterck7, Klaas Vandepoele7, Stella M. Grando1, Stefano Toppo8, Claudio Moser1, Jerry Lanchbury2, Robert Bogden5, Mark Skolnick2, Vittorio Sgaramella6, Satish K. Bhatnagar2, Paolo Fontana1, Alexander Gutin2, Yves Van de Peer7, Francesco Salamini6, Roberto Viola1

1 IASMA Research Center, San Michele all'Adige, Trento, Italy, 2 Myriad Genetics Inc, Salt Lake City, Utah, United States of America, 3 454 Life Sciences Corporation, Branford, Connecticut, United States of America, 4 Roche Diagnostics Corporation, Roche Applied Science, Indianapolis, Indiana, United States of America, 5 Amplicon Express Inc., Pullman, Washington, United States of America, 6 Technology Park Lodi, Lodi, Italy, 7 Department of Plant Systems Biology, VIB, Gent University, Gent, Belgium, 8 Department of Biological Chemistry, Padova University, Padova, Italy, 9 Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom

Abstract

Background

Worldwide, grapes and their derived products have a large market. The cultivated grape species Vitis vinifera has potential to become a model for fruit trees genetics. Like many plant species, it is highly heterozygous, which is an additional challenge to modern whole genome shotgun sequencing. In this paper a high quality draft genome sequence of a cultivated clone of V. vinifera Pinot Noir is presented.

Principal Findings

We estimate the genome size of V. vinifera to be 504.6 Mb. Genomic sequences corresponding to 477.1 Mb were assembled in 2,093 metacontigs and 435.1 Mb were anchored to the 19 linkage groups (LGs). The number of predicted genes is 29,585, of which 96.1% were assigned to LGs. This assembly of the grape genome provides candidate genes implicated in traits relevant to grapevine cultivation, such as those influencing wine quality, via secondary metabolites, and those connected with the extreme susceptibility of grape to pathogens. Single nucleotide polymorphism (SNP) distribution was consistent with a diffuse haplotype structure across the genome. Of around 2,000,000 SNPs, 1,751,176 were mapped to chromosomes and one or more of them were identified in 86.7% of anchored genes. The relative age of grape duplicated genes was estimated and this made possible to reveal a relatively recent Vitis-specific large scale duplication event concerning at least 10 chromosomes (duplication not reported before).

Conclusions

Sanger shotgun sequencing and highly efficient sequencing by synthesis (SBS), together with dedicated assembly programs, resolved a complex heterozygous genome. A consensus sequence of the genome and a set of mapped marker loci were generated. Homologous chromosomes of Pinot Noir differ by 11.2% of their DNA (hemizygous DNA plus chromosomal gaps). SNP markers are offered as a tool with the potential of introducing a new era in the molecular breeding of grape.

Introduction

Grapes (67 million t; http://faostat.fao.org/site/336/DesktopDefault.aspx) and their derivatives have a large and expanding worldwide market. Grapes can be grown at latitudes from 50°N to 40°S and up to 3,000 meters above sea level, with almost 98% of grape vineyards planted with Vitis vinifera L. ssp. sativa cultivars of Eurasian origin. Ever since the development of wine-making in Iran between 5,440 and 5,000 B.C. [1], wine has been an important component of many cultures. It has been celebrated by the Ecclesiates, by Horace, Goethe, Jefferson and the Nobel laureate J. C. Cela. A traditional icon of the Mediterranean diet [2], the grape has more recently been extensively cultivated in the New World and its cultivation is now moving to Asia. Given grape's content of resveratrol, quercitin and ellagic acid, grape products may contribute to reducing the incidence of cardiovascular and other diseases [3].

V. vinifera ssp. sativa, domesticated from the wild ssp. sylvestris [4], bears hermaphroditic self-fertilizing flowers. However, outbreeding by means of wind and insect pollination is the norm. As a result, cultivars are highly heterozygous and carry many deleterious recessive mutations [5]. Inbreeding depression is severe, so that sterility often ensues from the second or third generation of selfing. All wild Vitis species have 38 chromosomes (n = 19) and most interspecies hybrids are fertile [5]. The high chromosome number suggests a paleopolyploid state of the genome [6], an argument recently presented in the frame of a recent partial assembly of the grape genome [7] but still remaining controversial.

Grape has the potential to become a model organism for fruit trees. The species can be transformed [8] and micropropagated via somatic embryogenesis [9]. Compared to other perennials, the genome size is relatively small, 475 Mb [10], similar to rice (Oryza sativa, 430 Mb; [11]), barrel medic (Medicago truncatula, 500 Mb, http://medicago.org/) and black cottonwood poplar (Populus trichocarpa, 465 Mb; [12]).

In this paper we report a high-quality draft sequence of the grapevine genome. The genome is derived from the Pinot Noir clone ENTAV 115, a variety grown in a range of soils for the production of red and sparkling wines. The sequence provides information on the overall organization, gene content and structural components of the DNA of the 19 LGs of V. vinifera. The Sanger sequencing method was used to generate 6.5X coverage of the genome. This has been integrated with sequence reads generated by a scalable, highly parallel sequencing by synthesis (SBS) method with throughput significantly greater than capillary electrophoresis. The 4.2X coverage provided by SBS was crucial in identifying polymorphic sites and in closing most of the gaps between DNA contigs. This is the first project which utilizes both the longer Sanger and shorter SBS methods to determine the sequence of a large eukaryotic genome.

Results and Discussion

Sequencing and assembly

The DNA of V. vinifera was extracted from young shoots and sequenced and assembled using the whole genome shotgun (WGS) method. Two techniques were adopted: the Sanger dye primer sequencing of paired reads [13] and 454 (SBS) of unpaired reads [14], which provided 6.5X and 4.2X genome coverage respectively (see Materials and Methods).

In order to develop criteria for assembly, a preliminary experiment was conducted to assess heterozygosity: it was found to correspond to approximately 1 SNP per 0.1 Kb and 1 in/del per 0.45 Kb (see Text S1). The assembly program [11] was accordingly modified to accept a specified level of mismatches in overlapping sequences (details in Materials and Methods and in Text S1). The program also incorporated information on clone size, which ranged from 2 to 130 Kb (Table S1).

The assembly started with unique sequences and progressively included sequences with a higher degree of repetitiveness. To avoid merging repeats into a single genomic sequence, the overlapping unique sequence contigs were merged if the rate of polymorphism did not exceed 2% and if the resulting sequence coverage of the overlap did not exceed 150% of the average coverage (see Text S1). These criteria were modified so that contigs with many supporting links were merged. In most cases, this procedure produced a correct assembly.

Applying the procedure to about 6.6 M reads from Sanger sequencing, 90.6% of which represented paired clone ends, 211,374 initial seed contigs of unique sequences were generated. By using long clone links with non-repetitive clone ends, seed contigs were ordered into metacontigs (ordered assembly of contigs, referred to as supercontigs or scaffolds in other publications). After the sequences were merged into 120,000 contigs, data were combined with 4.2 genome-equivalents of SBS data. This helped to identify polymorphic sites and closed 25% of the remaining gaps between contigs. After removal of 10,847 contigs composed only of tandemly repeated sequences and disposal of 7,003 contigs shorter than 1,000 bp, the iterative assembly produced 58,611 contigs (Figure S1 and Table S2) corresponding to 530.9 Mb of genomic DNA. 44,179 of the 58,611 contigs were assembled into 2,093 metacontigs and the remaining 14,432 contigs were singletons. The final assembled sequences are deposited at the EMBL/Genbank/DDBJ databases (accession numbers: AM423240-AM489403, data released 2006-12-19). Metacontig data are available at http://genomics.research.iasma.it. The removed contigs represented mostly centromeric and rRNA gene sequences. Based on their read coverage, their sizes were estimated as 14.5 Mb and 16.3 Mb, respectively.

Cultivated V. vinifera is highly heterozygous. As a result, many of the resulting contigs were consensus sequences derived from an alignment of the two haplotypes. The set of Pinot Noir chromosome pairs included a considerable number of haplotype-specific gaps (sequences present in one haplotype but not in the other; on this issue see also the ‘Pinot Noir genome structure and evolution’ section). The total length of the 1,042,174 identified gaps corresponded to 48.9 Mb. In some chromosomal regions, the two alternative haplotypes were too different for the algorithm employed during assembly to combine them into a single contig. Such separated contigs corresponded to the hemizygous DNA (22,061 contigs with the total length of 65.1 Mb). The total size of the genome represented by different homologous chromosomes can be estimated as twice the length of the sequences represented by the two haplotypes merged into a consensus (416.8×2 = 833.6 Mb), plus the sequence length represented by hemizygous DNA and gaps, respectively 65.1 and 48.9 Mb. After including the centromeric and rRNA regions (14.5×2+16.3×2 = 61.6 Mb), the size of the diploid genome was subsequently estimated to be 1,009.2 Mb, which gives an average 504.6 Mb per haploid genome (Table 1).

thumbnail

Table 1. Number and sizes of assembled sequences in Mb.

doi:10.1371/journal.pone.0001326.t001

A region of 403,443 bp (preliminary experiment; see Text S1) was used to monitor the correctness of the assembly. Thirty four of the 37 contigs which mapped to the preliminary experiment sequence belonged to the metacontig assembled from the full genome sequence and were in the correct order. The remaining three contigs were not included because they contained repetitive clone links. Twenty two of the 36 boundaries between adjacent contigs were overlapping but not aligned due to large heterozygous inserts. The remaining 14 contig pairs corresponded to gaps: nine short gaps between 52 and 354 bp and five gaps larger than 500 bp. The largest gap (2.4 Kb) contained tandem repeats. Most of the gaps were associated to heterozygous inserts of repetitive elements. The total gap size, 8,067 bp, corresponded to about 2% of the region considered.

Metacontig integration into the genetic map

The next phase of the assembly involved positioning metacontigs in the genome using a genetic map developed at the Istituto Agrario di San Michele all'Adige (IASMA). Genetic mapping was based on 94 individuals derived from a F1 Syrah X Pinot Noir cross where the latter was the pollen donor. The map contained 1,006 markers [15], which were used both to anchor BAC contigs to a physical map (http://genomics.research.iasma.it) and to order metacontigs along linkage groups (LGs).

A set of 799 additional SNP markers was developed based on polymorphic sites identified in contigs and was used to anchor and orient metacontigs to LGs. This genetic map included 1,767 molecular markers arranged in 19 LGs covering 1,276 cM (Figure S2; http://genomics.research.iasma.it). The SNP-based markers were also helpful in merging the adjacent metacontigs not previously merged because of repetitive or low-quality links between them.

Integration of the DNA sequence and genetic map of LG4 is shown in Figure 1 (other LGs are in Figure S2). Table 2 summarizes the state of metacontig anchoring to the genetic map. The 2,093 metacontigs covered 477.1 Mb of genomic DNA. Of these, 435.1 Mb were anchored to the 19 LGs and 81.1% of these were oriented by two or more genetic markers (see Text S1). The smallest LG is covered by 26 metacontigs, the largest by 21 metacontigs. The order of markers established by meiotic recombination-based methods was almost co-linear with the metacontigs. In total, 82% of the genomic sequence was mapped to LGs. Most of the unmapped sequences were contained in 1,696 short metacontigs and singleton contigs with multiple tandem repetitive sequences. The assembly of metacontigs and facilitation of their placement on the genome using a genetic map avoided issues related to physical mapping.

thumbnail

Figure 1. V. vinifera genomic metacontigs anchored to the LGs.

V. vinifera genomic metacontigs (yellow bars) positioned along LG 4 of the Syrah x Pinot Noir genetic map. The map was assembled according to Cartwright et al. [113]. On the left are marker names and positions, in centimorgans, from Troggio et al. [15] (http://genomics.research.iasma.it). Most metacontigs were anchored to the map using markers with unique sequence locations: SSRs, BAC-end sequences and SNP-based markers derived from either ESTs or assembled sequences of the two haplotypes of the Pinot Noir genome. Metacontigs without bridge markers were anchored based on their association to other metacontigs (details in Materials and Methods). Approximate size in Kb of each metacontig is indicated on the right. Gaps separating metacontigs are of undefined size.

doi:10.1371/journal.pone.0001326.g001
thumbnail

Table 2. Correspondence, based on 1,356 markers, between the draft genome sequence of V. vinifera, presented in this paper, and the most advanced genetic map produced at IASMA.

doi:10.1371/journal.pone.0001326.t002

Gene annotation and gene content

Five quality levels were adopted for transcript assignment (see Materials and Methods): i) transcripts confirmed by tentative consensus sequences (TCs) and gene predictions (8,110); ii) transcripts confirmed by TCs aligned to the genome (8,160) and among transcripts not confirmed by TC; iii) the retained transcripts predicted at the exon level by different methods (4,028); iv) transcripts which were positive in gene prediction methods with differences at the exon level but with correct gene boundaries (308); v) transcripts which were found by different methods with contrasting results: only genes encoding proteins with significant similarities to known proteins were accepted (8,979). In total 29,585 genes were predicted. Grape gene content is comparable to Arabidopsis (26,819) and markedly different compared with rice (41,046) and poplar (45,555) genomes.

Gene annotation followed a consensus approach. More than 79% of the genes predicted for the grape genome were annotated. Conserved putative grape genes were searched by the BLAST program with rice, poplar and Arabidopsis as references. A decision tree was implemented and used to carry this out. Sets of gene clusters with different levels of similarities among species as well as unique and putative species-specific genes were built. Using strict rules for homology determination, the subset of grape specific genes amounted to 16,859 (Figure 2).

thumbnail

Figure 2. Comparison of four plant genomes based on gene homology.

All genes were compared each other as all-vs-all similarity searches using BLAST. Genes predicted for poplar, Arabidopsis and rice are respectively from www.genome.jgi-psf.org; www.arabidopsis.org; www.tigr.org. Grape gene estimates have been carried out on 58,611 assembled contigs. Genes of similar length with over 60% of similarity alignments at protein level were considered homologous using BLOSUM62 matrix [123]. The frequencies of sequences shared among species are reported on the right.

doi:10.1371/journal.pone.0001326.g002

Functional classification of the predicted genes was carried out by an automatic procedure. The manually revised final classification (Figure S3) shows the functional classes and their percentage in the gene set. Putative grape-specific genes were not characterized by a particular annotation profile or by relative abundance in the functional classes. A slight numerical difference in favour of grape was noted for genes related to lignin biosynthesis and to berry specific pectins. These metabolic pathways are less significant in Arabidopsis and poplar respectively. Genes relative to disease resistance and wine quality are discussed in further detail below.

Disease resistance genes

Resistance to parasites in plants is controlled by the non-host and gene-for-gene pathways [16]. The non-host type was discovered only recently [17], [18]. The gene-to-gene pathway is frequently present in cultivated plants displaying dominant resistance genes, responsible for the initiation of signal transduction leading to deployment of defense mechanisms [19]. The majority of R proteins contain a nucleotide binding site (NBS) and a carboxy-terminal leucine-rich repeat (LRR) domain. The NBS is part of a conserved domain acting as a molecular switch for the signal transduction. The LRR is credited with recognition specificity akin to an antibody-like detector of pathogen effectors [20]. At the N-terminus NBS-LRR proteins carry either the coiled coil (CC) domain or a domain homologous to the Toll/Interleukin-1 Receptor (TIR, [21]), allowing classification of NBS genes into two groups, the CC-NBS-LRR, present in all angiosperms, and the TIR-NBS-LRR, specific to dicotyledonous species [22].

Based on resistance domain analyses, the grape genome was found to contain 341 NBS genes (Figure 3 and Table S3), whereas 207 were found in Arabidopsis [21] and 398 in poplar [12]. The 233 NBS genes which contain the LRR domain can be grouped in 5 major clades (1 to 5 in Figure 3A). The clades were comprised of CC-NBS-LRR, the dictot-specific TIR-NBS-LRR and their truncated structures as follows: (1) mainly TIR-NBS-LRR; (2) and (3) mainly CC-NBS-LRR; (4) mainly NBS-LRR; and (5) CC-NBS-LRR. The CC-NBS-LRR group included 84 genes in grape, 51 in Arabidopsis and 119 in poplar, while the TIR-NBS-LRR group included 37 genes in grape, 64 in poplar and 83 in Arabidopsis. In addition, the grape NBS gene family included 5 truncated TIR-NBS genes, 112 truncated NBS-LRR genes and 103 genes characterized only by the NBS domain (Table S3).

thumbnail

Figure 3. Chromosomal organization of disease resistance genes of V. vinifera.

A) Phylogenetic analysis of NBS-LRR protein sequences of V. vinifera present in Pinot Noir. The phylogeny of these genes is based on a distance-matrix neighbour-joining analysis (Clustal X, [124]; bootstrap of 1000) after alignment of sequences by TCoffee (version 5.05, [125]). The phylogenetic clades, in general, correspond to the classification based on protein domains (however, see text and Table S3). B) Genes assigned to LGs are represented by dots. Their gene number is specified in LG-specific insets and in Table S3. NBS clades (see A above) contain mainly genes of the following classes: (1) TIR-NBS-LRR in blue; (2) CC-NBS-LRRa in green; (3) CC-NBS-LRRb in yellow; (4) NBS-LRR in cyan; (5) CC-NBS-LRR in red. Other resistance genes, belonging to NBS and TIR-NBS groups, are represented by the open and filled dots, respectively. Resistance-related genes different from NBS genes are shown in black. The size of each LG is given in Mb (on the right), whereas markers of the genetic map ([15] and http://genomics.research.iasma.it) are shown on the left, together with the interval in cM between the two closest markers in each gene cluster.

doi:10.1371/journal.pone.0001326.g003

Besides NBS genes, the grape genome contains several signalling components of plant disease response which are encoded by genes EDS1, PAD4, COI1, MPK4, JAR1, ETR1 and NDR1, known to be recruited by resistance gene products (Table S3). The NPR1 gene, a regulator of the systemic acquired response to pathogens [23], is present in one copy in grape and in Arabidopsis, but has five copies in poplar. Likewise, RAR1 and EIN2 are present in single copies in the grape genome.

Genes encoding the pathogenesis-related proteins (PRs, [24]) include nine copies of PR-1, eight of PR-2, five of PR-3, one copy of PDF1, one of PDF2, and several copies of PR5 and protease inhibitor-like genes (Figure 3B and Table S3).

In addition, the grape genome contains eight genes similar to the MLO gene for mildew resistance in barley, compared to the 15 MLO-like genes known for Arabidopsis [25]. MLO proteins belong to a large family of seven-transmembrane domain proteins specific to plants, encoded by genes homologous to barley MLO [25]. MLO recessive alleles confer an effective resistance against mildew pathogens. Furthermore, the powdery mildew non-host resistance-related genes PEN1, PEN2 and PEN3 [17], [18] were found in 5, 5 and 10 copies, respectively.

In grape, the disease-related genes represent a significant part of the genome. In spite of this, many grape varieties, including Pinot Noir, are susceptible to several fungi, such as grey mould (Botrytis cinerea), downy mildew [26] and powdery mildew [27], which have to be kept under control by heavy fungicide treatments. The failure to mount an effective defense response is probably due to a defective pathogen recognition. It is known that NBS-LRR genes are undergoing diversifying selection [28], e.g., variation in the sequence of the Arabidopsis gene RPS2 shows a signature consistent with pathogen-stimulated selection [29]. Moreover, the extent of variation in the activity of NBS-LRR genes may have been affected by balancing selection [30][33]. Grape alleles of the same resistance genes did not co-evolve in the presence of the agriculturally most important grape pathogens [34]. Indeed, allelic variation due to SNPs present in functional resistance domains was associated with the phenotypic divergence between resistant and susceptible genotypes only when susceptible V. vinifera and resistant non-vinifera clones were considered [34]. In addition, the long time interval necessary for the grape to complete one generation, together with its vegetative propagation, makes it difficult to match the evolutionary rates of microbial or insect pests, which in vineyards are boosted by massive use of chemicals [35]. Such detailed knowledge of the grape genome will serve to accelerate the development of genetic strategies to counter crop loss due to dynamic and genetically diverse pathogens.

The TIR-NBS-LRR genes are preferentially located in LG 18, the CC-NBS-LRR genes in LGs 9 and 13 and the truncated NBS genes in LGs 12 and 13 (Figure 3B). The NBS genes are also organized in clusters and superclusters. As noted in Arabidopsis [36], each cluster may include NBS genes of different phylogenetic lineages, although they frequently consist of tandem repeats of the same gene. The heterogeneity of NBS clusters has been discussed and interpreted as a consequence of evolutionary events such as ectopic recombination, chromosomal translocation and gene-cluster remobilization. This type of genome evolution is difficult to explain other than in terms of a hypothesis where a positive selection for cluster complexity provides the basic materials for the generation of new resistance specificities [37].

Several clusters of NBS genes mapped to chromosomal regions where genetic resistance to fungal diseases, such as downy and powdery mildew, were previously assigned (Figure 3B). This included LGs 12 and 18 [38] and LGs 14 and 15 [27], [38], [39]. Thus, the genome sequence of grape indicates candidate NBS genes responsible for extant variation and provides a starting point for breeding grape varieties resistant to important pathogens.

Phenolic and terpenoid pathways

Grape secondary metabolites, particularly polyphenols, have a strong influence on wine quality [40]. Most phenolics derive from phenylalanine via phenylalanine ammonia-lyase (PAL). They encompass a range of structural classes and biological functions and include lignins, phenolic acids such as hydroxycinnamic and hydroxybenzoic acids, and polyphenols such as flavonoids and stilbenes.

Flavonoids are the most common plant phenolics. In flowers and fruits they attract pollinators and seed dispersers and are particularly involved in UV-scavenging and disease resistance [41]. Flavonoids contribute to human health [42]. The flavonoid skeleton, synthesized by chalcone synthase (CHS), is converted to chalcones, flavanones, flavonols, flavanols, anthocyanins and proanthocyanidins (condensed tannins). In red grape, flavanols and anthocyanins are abundant, the latter accumulating mostly in the berry skin and the former in the seeds [43]. In the last decade considerable effort has been made in identifying and cloning grape flavonoid biosynthetic genes [44][47]. The grape genome sequence now offers the opportunity of compiling an exhaustive overview of the phenylpropanoid pathway.

Gene predictions corresponding to all those genes known to encode enzymes of the pathway could now be found. These include C4H and 4CL (acronyms are explicated in note 1 of Figure 4A) which were not previously identified in grape. The majority of genes were organized in large (PAL, F3'5'H) or small (CHS, F3H, FLS, LAR) gene families, the remainder consisting of single copy genes (C4H, 4CL, CHI, F3'H, DFR, LDOX, ANR, UFGT) (Figure 4A; Table S4).

thumbnail

Figure 4. V. vinifera pathways for phenolic and terpenoid biosynthesis.

A) V. vinifera general pathway for phenolics biosynthesis leading to stilbenes (C6-C2-C6) and flavonoids (C6-C3-C6). For each enzyme, the gene copy number is reported in brackets. Genes were identified by similarity search using BLAST where the references were the sequences of phenolic biosynthetic genes previously characterized in grape and in other plant species. Putative homologues and gene copy numbers were determined by comparing aligned amino acid sequences based on a threshold of 80% similarity between the grape sequences, and 60% similarity between grape and other species. For the large StSy, PAL and F3'5'H families, phylogenetic analysis was performed with MEGA4 package [126] after aligning with ClustalW [127]. The following enzymes involved in the pathway are shown: PAL, phenylalanine ammonia-lyase; C4H, cinnamate 4-hydroxylase; 4CL, 4-coumarate-CoA ligase; CHS, chalcone synthase; StSy, stilbene synthase; RSGT, resveratrol glucosyltransferase; CHI, chalcone isomerase; F3H, flavanone 3-hydroxylase; F3'H, flavonoid 3′-hydroxylase; F3'5'H, flavonoid 3′,5′-hydroxylase; DFR, dihydroflavonol-4-reductase; FLS, flavonol synthase; LDOX, leucoanthocyanidin dioxygenase; LAR, leucoanthocyanidin reductase; ANR, anthocyanidin reductase; UFGT, UDP-glucose:flavonoid 3-O-glucosyltransferase; OMT, O-methyltransferase; ACCase, acetyl CoA carboxylase. PA refers to proanthocyanidins. Enzymatic steps that have not been experimentally confirmed are marked with an asterisk (*). B) Steps of plastidic isoprenoid pathway and monoterpenoids biosynthesis. For each enzyme, the gene copy number is reported in brackets. Gene annotation was performed as described in Material and methods. Abbreviations: G3P, glyceraldehyde 3-phosphate; DXP, 1-deoxy-D-xylulose-5-phosphate; MEP, 2-C-methyl-D-erythritol 4-phosphate; CDP-ME, 4-diphosphocytidyl-2C-methyl-D-erythritol; CDP-MEP, 4-diphosphocytidyl-2Cmethyl-D-erythritol 2-phosphate; ME-cPP, 2C-ethyl-erythritol 2,4-cyclodiphosphate; HMBPP, 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate; IPP, isopentenyl pyrophosphate; DMAPP, dimethylallyl pyrophosphate. The enzymes in the pathway are indicated in blue: DXS, 1-deoxy-D-xylulose 5-phosphate synthase; DXR 1-deoxy-D-xylulose 5-phosphate reductase; ISPD, 4-diphosphocytidyl-2-C-methyl-D-erythritol synthase; ISPE, 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase; ISPF, 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase; ISPG, 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase and ISPH 2-C-methyl-D-erythritol 2,4-cyclodiphosphate reductase (ISPG and ISPH are probably the same enzyme and convert directly MEcPP in IPP and DMAPP); ISPA, geranyltransferase; IDI, isopentenyl diphosphate delta-isomerase; PT, prenyltransferase; LIMS, limonene synthase; LIS, linalool synthase; GES, geraniol synthase; TES, α-terpineol synthase; TPS-CIN, myrcene/(E)-beta-ocimene synthase; CYTP450, cytochrome P450 hydroxylase; ALDH, aldehyde dehydrogenase; NADPDH, NADP dehydrogenase. Enzymatic steps that have not been experimentally confirmed are marked with an asterisk (*).

doi:10.1371/journal.pone.0001326.g004

Within the phenylpropanoid pathway, relatively large gene families have been described for poplar compared to Arabidopsis [48]. Our results highlight some significant differences, such as the number of PAL and F3'5'H gene copies which were even greater in grape. In general, grape and poplar secondary metabolism exhibits a tendency toward gene family expansion. Conversely, in Arabidopsis all enzymes of the central flavonoid metabolism, except for FLS, are encoded by single genes [41]. This is consistent with the noted low metabolic investment in flavonoids of Arabidopsis, a species which reproduces without the need for insect pollination and has no perennial woody habit.

In grape, as in a few other species, the condensation of p-coumaroyl-CoA with malonyl-CoA gives rise to stilbenes via stilbene synthase (StSy; [49]). Among stilbenes, monomers and oligomers (viniferins) of resveratrol contribute to resistance to fungal pathogens [50]. Resveratrol has gained attention due to its alleged beneficial effects on human health [51]. Stilbene synthase belongs to a large family: the analysis of the grape genome predicts at least 21 copies. This number agrees well with a recent StSy sequence analysis in infected grape leaves [26] but it differs from the one predicted in the PN40024 grape genome sequence [7]. Most of these copies, as well as most PAL genes, are clustered in LG 16. Further, several peroxidase genes were predicted, some of which could participate in the formation of viniferins, as previously suggested [50]. Recently, a resveratrol glucosyltransferase putatively involved in piceid synthesis has been isolated and biochemically characterized in V. labrusca grape berry [52]. Our analysis revealed that its homolog in Pinot Noir (99% sequence similarity) is present as a single gene mapping on LG 3.

Terpenoids are among the most abundant and structurally diverse group of natural metabolites. Volatile and non-volatile terpenes are essential for plant growth and development (e.g., gibberellin phytohormones), but they are also key players in the interaction of plants with the environment [53]. The substrates for the biosynthesis of about 22,000 terpenes are isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP). The mevalonate (MVA) and the mevalonate-independent DOXP/MEP pathways are responsible for the synthesis of IPP and DMAPP in the cytosolic and plastidic compartments respectively [54]. DOXP/MEP is the dominant route for monoterpene biosynthesis in the grape berry [55]. Three prenyltransferases produce terpene precursors, prenyl diphosphates, geranyl diphosphate (GPP), farnesyl diphosphate (FPP) and geranylgeranyl diphosphate (GGPP). Terpene synthases (TPS) catalyze the formation of hemiterpenes [51], monoterpenes (C10), sesquiterpenes (C15) or diterpenes (C20) from the substrates DMAPP, GPP, FPP or GGPP respectively (Figure 4B).

All TPSs are similar in physico-chemical properties. Moreover, the close sequence relatedness of their genes prevents discrimination of their catalytic functions, supporting a rapid divergence of catalytic activity of closely related TPS genes [53]. Three classes of TPSs are described and only classes II and III are specific for the plant secondary metabolism [56]. Forty seven TPS genes participate in the secondary metabolism in poplar [12], while in grape only 35 TPSs were identified, a number close to the 32 found in Arabidopsis. In the grape genome, they are located mainly on LGs 9, 10 and 19 (class I TPs on LGs 7, 9, 10 and 19, Table S5).

Several higher plant genes of the terpenoid pathways have been cloned [57], but only a few of them had previously been identified in grape [58]. Having the complete sequence of the grape genome, 124 genes related to the terpenoid pathway were identified (Table S5). Of these, 110 were mapped to all LGs. Functionally, 24 are related to carotenoids, 24 to abscisic acid metabolism, 10 to gibberellin hormones, and 6 cover steps of the core terpenoids pathway: 5 prenyltransferases and 1 isopentenyl diphosphate delta-isomerase. For the MVA and non-MVA pathways, nine (4 DXS, DXR, ISPD, ISPE, ISPF, ISPH) and eight (2 AACT, HMGS, 3 HMGR, MK/MVK, MVD) putative genes were identified respectively.

Plant monoterpens are preferentially confined to specialized organs. They play an important role in defense as well as acting as allelopathic agents and attractants for pollinators [59]. In grape, monoterpenes contribute to wine free volatiles: typical components of the aroma-rich grape varieties are linalool, geraniol, nerol, citronellol and α-terpineol, which are stored in exocarps and vacuoles. Monoterpene biosynthesis has not yet been studied because several metabolic steps may take place without enzymatic catalysis. Moreover, the knowledge of mechanisms controlling monoterpene synthase activity is still largely incomplete. In the grape genome four monoterpene synthase genes were identified encoding linalool synthase, limonene synthase, myrcene synthase and α-terpineol synthase.

Transcription factors

In grape, 2004 TF genes were identified (Figure 5A and Table S6) which represent 6.7 % of the genome, similar to the 6% for Arabidopsis [60], 4.8 % for rice [61] and 6% for poplar [62]. Among the grape TF genes, 80.6% are present in marker-anchored metacontigs (Figure 5A).

thumbnail

Figure 5. Distribution of transcription factors along the 19 V. vinifera LGs.

A) Distribution of 1,617 transcription factors along the 19 V. vinifera LGs inferred from the positions of anchored metacontigs. Different colours of the histograms corresponds to the different TF classes. B) Distribution of transcription factor clusters over the grape genome. TF organization in LGs 2, 5, 6, 7, 8, 10, 13, 16, 17 is presented. For each LG, markers of the genetic map, developed by Troggio et al. [15] (see also http://genomics.research.iasma.it) are reported on the left together with the interval in cM between the two closest markers for each TF cluster. TF types are reported on the right.

doi:10.1371/journal.pone.0001326.g005

Sixty-two families of TF genes were found, a number similar to the 64 for Arabidopsis, 62 for rice and 63 for poplar [63]. TF families like MYB, AP2/EREBP, bHLH and MADS-box include a large number of members [11], [60]. We compared the number of genes in each of the 60 grape TF families in common to the other three plant genomes: finding a nearly linear correlation (Figure 6). Thus the organization and number of TFs seem to be highly conserved in plant genomes. TF distribution in the grape genome (Figure 5A) indicates that only LGs 7 and 18 have a higher than average TF content. Clusters of AP2/EREBP genes are repeated in tandem on LGs 2, 7, 10 and 16; CCAAT genes on LGs 6, 8, 10 and 13; MADS-box genes on LGs 5 and 13; Myb genes on LG 8 and 17 (Figure 5B).

thumbnail

Figure 6. Scatter plot of the distribution of V. vinifera transcription factors.

For each of the 60 families (1983 genes) of V. vinifera TFs (X-axis) (log base 2 transformed), family members have been plotted against the corresponding number reported for three other genomes: A) A. thaliana (http://arabtfdb.bio.uni-potsdam.de/v1.1), B) P. trichocarpa (http://poplartfdb.bio.uni-potsdam.de/v2.0) and C) O. sativa (http://ricetfdb.bio.uni-potsdam.de/v2.i). The degree of the correlation among TF gene numbers is indicated by the Pearson correlation value (r). Each scatter plot shows the TF families which were statistically over- or under- represented in pair-wise comparisons (χ2 tests were applied to untransformed data; p = 0.05).

doi:10.1371/journal.pone.0001326.g006

Across the species mentioned, MYB (279) are the most abundant [11], [64]. They play a role in controlling the accumulation of secondary metabolites in the grape berry [65][67]. A gene from this TF family is also known to play a key role in the regulation of anthocyanins and flavonols during the non-climacteric ripening of strawberry [68]. Non-climacteric ripening (occurring in fruit such as strawberry and grapevine) is a process characterized by the absence in respiratory pick and ethylene bursts, two phenomena typical of the climacteric fruits ripening.

In the grape genome were also found 143 leucine-zipper genes. Together with EREBP TFs they contribute to the plant's defense response [69]. In tepary and common bean, a bZIP gene plays an important role in the response to water deficit and in the regulation of abscisic acid levels. [70]

In the grape genome, the MADS-box family is also over-represented. These TFs regulate flowering-related phenomena, as well as other metabolisms [71]. MADS-box TFs may have been important during plant evolution because they allow plant reproductive structures to adapt to variations in climatic conditions [72]. It was found that two tandem MADS-box genes (MADS-RIN and MADS-MC) regulate fruit ripening and inflorescence determinancy in the climacteric fruit tomato. Mutation at rin locus caused a failure in the normal ripening physiology [73].

A ripening mechanism common to both climacteric and non-climacteric species, such as grape, has been hypothesized [74]. In support of this, we identified two TF classes in grape, AP2/EREBP and EIL, which contribute to ethylene signalling during ripening of climacteric fruits, and also found ethylene receptors belonging to ETR/ERS families.

Repetitive elements

Matching the sequences of assembled contigs with original reads made it possible to characterize each DNA segment by the number of matching reads (see Materials and methods). For the read coverage of 10.7X, a DNA segment was considered unique when represented by 15 or fewer matches. Moderately repeated sequences (2 to 8 copies per genome) were expected to have 16–100 matches. Sequences with more than 100 matches were considered highly repetitive. They were masked before gene prediction, thus excluding most of the coding parts of repetitive elements from the putative gene set.

Dispersed highly repetitive DNA sequences were identified by an iterative procedure, and the resulting collection of 90,483 repetitive segments were grouped into 136 types. Members of each type were translated and compared to each other and the similarity scores were used in a UPGMA-like clustering. The similarity tree consisted of eight clusters lacking a common root (Figure S4), each of which was assigned to the known classes of repetitive DNA sequences (Table S7).

Grape transposable elements (TEs), totalling 108.5 Mb, represent the most abundant set of repeats. The repeats were included in group I (retrotransposons: Copia, Gypsy, LINE) and group II (DNA transposons: Mutator, CACTA, hAT) according to Feschotte et al. [75]. The most abundant TEs were Gypsy/athila-like elements followed by Copia elements. DNA transposons were represented by 9,562 copies (7.1 Mb). The TEs seem to be more abundant in grape compared to poplar [12], Arabidopsis and rice [11]. Putatively autonomous TEs were identified by significant BLAST analysis against the Uniprot database. TEs without a significant BLAST hit were attributed to the non-autonomous group (Table S7). Out of 136 repeat types, 20 were classified as long tandem repeats with a unit size from 100 to 430 bp. They were grouped into ten major sub-classes.

Short tandem repeats (microsatellites) were also identified. Their thresholds, number of copies and total DNA length are reported in Table S8. Microsatellites cover 2.1 Mb, including the telomeric repeats (TTTAGGG). Out of 171 contigs with identified telomeric sequences, 42 had telomeric ends. In the linkage map, they represent potential markers for telomeres.

An alternative estimation of the length of identified repetitive DNA was performed using the number and total length of reads matching repeat sequences, identified above. This new estimate gave a value of 138.5 Mb, corresponding to 27.4% of the 504.6 Mb genome size.

Non-coding RNAs

MicroRNAs.

MicroRNAs (miRNAs) and trans-acting siRNAs (ta-siRNAs) have a significant role in plant development and stress response [76], [77]. The majority of the 1220 plant miRNAs listed in the miRBAse [78] are from Arabidopsis (184), rice (243) and poplar (215). A BLAST search of sequences similar to the Arabidopsis miRNAs genes was performed on the grape genome. Allowing for three or fewer mismatches, 143 miRNA genes representing 28 families ([78]; Table 3, Table S9) were identified.

thumbnail

Table 3. Distribution of miRNA encoding genes of V. vinifera on LGs and number of their putative target genes grouped in families.

doi:10.1371/journal.pone.0001326.t003

Three types of miRNAs (miR827, miR828 and miR846) were not previously found outside Arabidopsis, and were considered “non-conserved” miRNAs [79], [80]. However, these genes are present in the grape genome, indicating that they were either lost in the lineage leading to Populus or are missing from its genome assembly. The miRNA passenger strands (miRNAs*) are highly conserved between grape and Arabidopsis [79]. Sequences predicted to produce ta-siRNAs [81] are conserved in several plant species, grape included.

Putative grape miRNAs and siRNAs target the same classes of genes as they do in Arabidopsis, rice and poplar: transcription factor genes, genes involved in stress response and nutrient uptake, genes for RNA silencing and the non coding RNA TAS3 (Table 3). In grape, 56 RNA-dependent DNA polymerase genes are potentially targeted by miR396 and miR846, a phenomenon not reported in other plant species.

BLAST searches identified four Dicer-like proteins (Helicase, RNAse IIIa/b domains), nine Argonautes (PAZ/PIWI domains), and six RNA-dependent RNA polymerases (RdRp domains), indicating the presence in the grape genome of a complex RNA processing machinery (Figure S5).

Transfer RNA.

The tRNAscan-SE program [82] identified 719 putative tRNA genes. 163 of them are pseudogenes, 3 are suppressors for the TAA codon while 553 correspond to 52 anticodons for all amino acids (Table S10).

Small nuclear RNA.

Non-coding RNAs include five major and four minor snRNA families, all components of splicing factors. The Arabidopsis snRNA list of Wang and Brendel [83] was used to search for similar sequences in grape. We found 89 snRNA genes and pseudogenes (75 in Arabidopsis) (Table S11). Several snRNA genes were clustered in the genome.

Ribosomal RNA.

Large rRNA units consist of two segments, one hosting the genes 18S rRNA, 5.8S and 28S, the second containing three arrays of tandem repeats. In grape the length of the rRNA unit is around 10.8 Kb. The variable segment includes three arrays of tandem repeats: about 40 copies of a 44–45 bp repeat, three copies of a 150 bp repeat and 5.5 copies of a 193 bp repeat. The unit is repeated 1450–1550 times in the genome (16.1 Mb). rRNA units may contain insertions of retrotransposons of three different lengths (2870, 2950, and 5800 bp). Retroelements in rRNA sequences may cause transposition of rRNA sequences.

The DNA sequence for the small ribosomal RNA unit (1,250 bp) contains two genes for 5S rRNA, 120 bp each with a single nucleotide difference between them. In the genome the unit was represented by 170–180 copies. Together, large rRNA and 5S rRNA sequences were estimated to amount to 16.3 Mb.

Small nucleolar RNA.

Based on the Arabidopsis snoRNA genes [84], 166 sequences representing 79 families were found in grape (Table S12). Most of the grape snoRNA genes (110) are clustered; 62 snoRNA genes were located inside 34 genes encoding six ribosomal proteins and one eIF-4F factor.

Pinot Noir genome structure and evolution

The existence of structural diversity between homologous chromosomes within plant species has been reported [85]. This type of molecular variation seems to be common in allogamous plants