Conceived and designed the experiments: GRC LB TP RO GEM. Performed the experiments: FP RON RO MIT SJH ME MA IM MB. Wrote the paper: IM LB RO GRC FT MB FP. Carried out
The authors have declared that no competing interests exist.
The growing accessibility to genomic resources using next-generation sequencing (NGS) technologies has revolutionized the application of molecular genetic tools to ecology and evolutionary studies in non-model organisms. Here we present the case study of the European hake (
The European hake (
Recently, major technological advances have opened innovative opportunities in the application of molecular genetic tools to study evolutionary processes in natural populations of marine species
The growing accessibility of high-throughput sequencing technologies and the concurrent development of innovative bioinformatics tools have enabled wide-scale SNP discovery based on transcriptome sequencing, even in species for which genomic resources are still limited or absent
The present study reports the muscle transcriptome characterization and the discovery and validation of a large set of SNP loci for European hake, based on NGS technologies. Existing NGS technologies provide a variety of approaches for transcriptome characterization, although the two most popular platforms are the Roche 454 FLX
Muscle tissue samples for transcriptome sequencing were collected from four geographic regions (
In brackets the number of specimens in the discovery panel (454 and GAII) and the number of individuals that have been subsequently genotyped in the validation step. NTHS: North Sea (59°19′N, 1°39′E); ATIB: Iberian Atlantic coast (43°20′N, 8°56′W); TYRS: Tyrrhenian Sea (42°32′N, 10°9′E); AEGS: Aegean Sea (40°19′N, 24°33′E).
Total mRNA was obtained from eight individuals (two for each sampling location,
Total RNA was purified from muscle tissue from five individual samples, two for NTHS, and one from each of the three other sampling sites (
The sequences were de-multiplexed based on the specific barcoding tags and binned per individual sample. In order to obtain an optimal
After renaming and trimming of the first base, GAII short reads were assembled
In order to characterize 454 and GAII contigs several approaches were explored, using the Basic Local Alignment Search Tool (BLAST) against various protein and nucleotide databases. The Blastn option was used against the NCBI nucleotide database (cut-off e-value of <1.0 E−5), against all annotated transcripts in the draft genomes of
The Gene Ontology (GO) terms (“Cellular Component”, “Biological Process” and “Molecular Function”) were recovered using the Blastx search tool implemented in the software Blast2GO
A pipeline was developed to characterize the SNP mutations at the amino acid level. To obtain the putative reading frame all SNP-containing contigs were compared against six peptide sequence databases (Ensembl genome assembly for
To evaluate whether SNP-containing contigs were significantly enriched for specific GO terms compared to all annotated hake contigs, the Gossip package
After SNP detection,
Intron-exon boundary prediction was performed using fish genome and transcriptome sequence resources following two parallel approaches. In the first, SNP-containing contigs were compared against five high-quality draft fish genomes (Ensembl genome assemblies for
A final evaluation step of putative SNPs was performed by direct visual inspection of contigs using the assembly viewer software Eagleview
A total of 1,536 candidate SNPs were selected to be validated by high-throughput genotyping. Genomic DNA was extracted from fin clip tissues of 207 individuals sampled from the same four locations of origin of specimens used to derive the libraries (AEGS, TYRS, ATIB, NTHS, see
Different variables were defined to analyze results of SNP discovery and genotyping. The first two are categorical variables that refer to the outcome of individual SNP assays.
Binomial logistic regression, implemented in SPSS ver. 12.0, was used to evaluate several predictor variables on two dependent dichotomous variables:
The Receiver Operating Characteristic (ROC) curve, a widely used method for evaluating the discriminating power of a diagnostic test, was used to assess the significance of specific variables. ROC analysis was implemented in MedCalc ver. 11.5.1.0.
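As an illustration of what the area under the ROC curve measures, the following minimal sketch computes it as the fraction of (successful, failed) assay pairs in which the successful assay has the higher score. The function name `roc_auc` and the score values are hypothetical and are not taken from the study, which used MedCalc for this analysis.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical design scores for assays that converted vs. assays that failed.
converted = [0.9, 0.8, 0.7, 0.6]
failed = [0.85, 0.5, 0.4]

auc = roc_auc(converted, failed)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to a predictor that performs no better than chance, and 1.0 to perfect discrimination.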
Approximately 100 Mega base pairs (Mbp) were obtained using 454 sequencing technology, whereas nearly 4,000 Mbp were produced using the GAII sequencer (
454 | GAII | |
|
506,772 | 50,685,405 |
|
206 (6–457) bp | 74 bp |
|
462,489 | 50,685,405 |
|
5,702 | 3,756 |
|
331 (100–5,103) bp | 190 (100–3,063) bp |
|
4,221 (74.03) | 2,644 (70.39) |
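The annotated fractions in the last row of the table (4,221 of 5,702 contigs for 454; 2,644 of 3,756 for GAII) can be compared with a chi-square test on a 2×2 table. The sketch below is an assumption about how such a test might be computed, not the authors' documented procedure. The source does not say whether a continuity correction was applied; with the Yates correction included here, the statistic comes out close to the value of 14.83 reported in the text.

```python
import math


def chi2_yates(a, b, c, d):
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # The continuity correction cannot push the numerator below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Rows: platform (454, GAII); columns: annotated, not annotated.
annotated_454, total_454 = 4221, 5702
annotated_gaii, total_gaii = 2644, 3756

chi2 = chi2_yates(annotated_454, total_454 - annotated_454,
                  annotated_gaii, total_gaii - annotated_gaii)
p = p_value_df1(chi2)
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
```

When the difference between the two proportions is very small, the corrected numerator is clipped at zero and the statistic is exactly 0, with p = 1.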
After pre-processing steps (adaptor clipping and read quality filtering), 462,489 454 reads were assembled
A total of 4,221 454-contigs (74.03%) showed a significant Blast match against at least one species sequence database among those searched. Annotation by similarity was possible for 2,644 GAII contigs (70.39%), significantly less (χ2 = 14.83, p<0.001) than 454 contigs. A potential explanation for this observation lies in the shorter average length of GAII contigs, which likely reduces the overall probability of obtaining positive Blast hits. In any case, the percentage of annotated contigs is higher for both 454 and GAII data when compared to other studies reporting transcriptome sequencing in teleost fish (40–63% of total annotated contigs,
A total of 4,034 candidate SNPs were identified
On the x-axis, number of SNPs per contig; on the y-axis, the percentage of contigs showing a specific number of SNPs.
454 | GAII | |
|
4,034 | 8,606 |
|
889 | 2,384 |
|
617.9 (101–5103) bp | 212.3 (100–3063) bp |
|
89 (4–3,678) | 674 (8–33,079) |
|
3,621 | 4,684 |
|
0.76 | 0.82 |
|
3,437 (94.92%) | 4,637 (99%) |
|
1,322 (38.46%) | 3,389 (73.09%) |
|
851 (24.76%) | 468 (10.09%) |
*Intron/exon boundary pipeline result.
Annotation by similarity of SNP-containing contigs further confirmed that 454-contigs show a higher percentage of putatively annotated sequences (86% vs 71%, χ2 = 72.4, p<0.0001). Blast searches also identified 218 candidate SNPs (positioned in 62 different contigs) that are presumably located on the hake mitochondrial genome. GO term analysis showed a significant enrichment of specific GO terms when comparing the annotations of SNP-containing contigs against all unique transcripts obtained for the hake muscle transcriptome (
Differential distribution of GO terms in SNP-containing contigs (test set) compared to all contigs (reference set) in 454 data (A) and GAII data (B).
A set of 1,536 SNPs (
454 | GAII | |
|
966 (829) | 707 (570) |
|
944 (817) | 684 (557) |
|
409 (334) | 296 (221) |
|
259 (195) | 200 (136) |
|
130 (97) | 73 (45) |
|
110 (83) | 60 (37) |
|
20 (14) | 13 (8) |
|
150 (139) | 96 (85) |
|
535 (483) | 388 (336) |
In brackets the number of SNPs after excluding the set of common loci.
*Data referring to nuclear SNPs.
Excluding mitochondrial loci, 1,501 unique nuclear SNPs were scored, with nearly identical percentages of successful assay conversion for the two sets of data (409/944, 43.33% (454) and 296/684, 43.27% (GAII), χ2 = 0, p = 1). This raises the question of how SNP conversion rates compare to those observed in other studies that reported on high-throughput SNP discovery and validation in non-model species. This remains largely unexplored, and the few available data are not homogeneous. Most studies report
Of the successfully converting assays, a slightly (non-significantly) higher percentage (67.5% vs 63.3%, χ2 = 1.18, p = 0.272) of truly polymorphic sites was detected in GAII-SNPs (
While similar conversion rates and percentages of polymorphic loci were obtained for both NGS technologies applied to hake SNP discovery, 454-SNPs showed a significantly higher number of Blast matches against at least one fish model coding sequence (
Box plot of minor sequence allele frequency (MSAF) in the discovery panel and minor allele frequency (MAF) in the validation panel for 454 data (A) and GAII data (B).
Box plot of observed heterozygosity (Ho) calculated for the discovery panel and the validation panel of 454 data (A) and GAII data (B).
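For reference, the two validation-panel quantities shown in these plots can be computed from diploid genotype calls as in the minimal sketch below. The genotype list is hypothetical and `maf_and_ho` is an illustrative helper, not part of the study's pipeline. The discovery-panel MSAF, which is derived from sequence reads rather than genotypes, is not covered here.

```python
from collections import Counter


def maf_and_ho(genotypes):
    """Return (minor allele frequency, observed heterozygosity) for one SNP.

    `genotypes` is a list of two-character strings such as "AG";
    missing calls are assumed to have been removed beforehand.
    """
    allele_counts = Counter(allele for g in genotypes for allele in g)
    total_alleles = sum(allele_counts.values())
    if len(allele_counts) > 1:
        maf = min(allele_counts.values()) / total_alleles
    else:
        maf = 0.0  # monomorphic locus
    ho = sum(1 for g in genotypes if g[0] != g[1]) / len(genotypes)
    return maf, ho


# Hypothetical genotype calls for eight individuals at a single locus.
genotypes = ["AA", "AG", "AG", "GG", "AA", "AA", "AG", "AA"]
maf, ho = maf_and_ho(genotypes)
print(f"MAF = {maf:.4f}, Ho = {ho:.3f}")
```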
To evaluate the predictive value and thus the potential usefulness of different parameters, all variables were examined together using binomial logistic regression analysis. For the success rate of Illumina genotyping assay conversion/clustering, inclusion of all predictor variables significantly improves model fit for 454 data (χ2 = 34.944, degrees of freedom (df) 9, p<0.001) and marginally for GAII data (χ2 = 14.641, df 7, p<0.05). The ability to predict the outcome of individual SNP assays is relatively low, with an overall correct classification rate of 60.1% for 454 data and 56.7% for GAII data. Backward stepwise deletion of predictor variables identifies five best predictors (
B |
Wald |
df | P |
|
|
1.735 | 14.607 | 1 |
|
|
−0.361 | 6.545 | 1 |
|
|
1.415 | 5.418 | 1 |
|
|
0.075 | 2.919 | 1 | 0.088 |
|
−0.007 | 5.205 | 1 |
|
|
−2.339 | 13.067 | 1 | 0.000 |
B: regression coefficient for individual variable; Wald: Wald χ2 statistic; P: associated probability. (
B |
Wald |
df | P |
|
|
0.938 | 2.506 | 1 | 0.113 |
|
1.203 | 8.795 | 1 |
|
|
−1.206 | 7.427 | 1 |
|
|
−2.196 | 6.785 | 1 | 0.009 |
B: regression coefficient for individual variable; Wald: Wald χ2 statistic; P: associated probability.
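To make the table columns easier to interpret, the sketch below shows how a Wald χ2 statistic with one degree of freedom maps to its probability, and how a coefficient B converts to an odds ratio. This follows the standard relationship for logistic regression and is not output from the SPSS analysis; the helper names are illustrative. The inputs are the B and Wald values of the two rows above whose P values are printed.

```python
import math


def wald_p_value(wald, df=1):
    """Upper-tail chi-square probability for a Wald statistic (df = 1 only)."""
    if df != 1:
        raise ValueError("this sketch only handles df = 1")
    return math.erfc(math.sqrt(wald / 2))


def odds_ratio(b):
    """Odds ratio associated with a one-unit increase in the predictor."""
    return math.exp(b)


# (B, Wald) pairs taken from rows of the table above with a reported P.
rows = [(0.938, 2.506), (-2.196, 6.785)]
for b, wald in rows:
    print(f"B = {b:+.3f}  OR = {odds_ratio(b):.3f}  "
          f"Wald = {wald:.3f}  P = {wald_p_value(wald):.3f}")
```

A positive B gives an odds ratio above 1 (the predictor increases the odds of the outcome), and a negative B gives an odds ratio below 1.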
Receiver Operating Characteristic (ROC) analysis was carried out to evaluate the significance of the ADT score as a predictive variable. The estimated area under the ROC curve is 0.539±0.015, which is significantly different (z statistic = 2.6, p<0.01) from the value expected by chance (0.5), confirming the ADT score as a significant predictor of assay conversion. The optimal ADT value, which gives the best balance between specificity and sensitivity, is 0.735. It should be noted that, while significant, the overall performance of the ADT score as a diagnostic factor is rather limited.
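The reported z statistic follows directly from the area under the curve and its standard error. The sketch below reproduces that arithmetic, assuming a two-sided normal test against the chance value of 0.5; the function name is illustrative and the calculation is not taken from MedCalc itself.

```python
import math


def auc_z_test(auc, se, null=0.5):
    """z statistic and two-sided p-value for an AUC against a null value."""
    z = (auc - null) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = auc_z_test(auc=0.539, se=0.015)
print(f"z = {z:.1f}, p = {p:.4f}")
```

The test is significant, but an area only 0.039 above chance is consistent with the limited diagnostic performance noted in the text.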
Binomial logistic regression analysis was also implemented to evaluate the outcome of successful SNP assays,
B |
Wald |
df | P |
|
|
−0.127 | 6.369 | 1 |
|
|
6.977 | 2 |
|
|
|
0.676 | 4.232 | 1 |
|
|
0.841 | 4.983 | 1 |
|
|
−0.224 | 3.489 | 1 | 0.062 |
|
0.802 | 10.433 | 1 |
|
|
−0.044 | 8.907 | 1 |
|
|
−2.740 | 5.915 | 1 | 0.015 |
B: regression coefficient for individual variable; Wald: Wald χ2 statistic; P: associated probability.
B |
Wald |
df | P |
|
|
2.247 | 8.272 | 1 |
|
|
0.345 | 2.735 | 1 | 0.098 |
|
1.805 | 5.162 | 1 |
|
|
−2.188 | 5.581 | 1 |
|
|
−1.577 | 1.361 | 1 | 0.243 |
B: regression coefficient for individual variable; Wald: Wald χ2 statistic; P: associated probability.
Finally, all experimentally validated SNPs were analysed using a bespoke pipeline, which was developed to compare SNP-containing contigs with known protein sequences and to predict protein-coding regions and the corresponding putative reading frame. It was possible to obtain this information reliably for approximately half of the validated SNPs (
In the present study, two different NGS methods were applied to high-throughput SNP discovery in the muscle transcriptome of a non-model fish species. Overall, the comparison revealed that despite substantial differences in sequence throughput, average sequencing depth, and sequence read length, similar results were obtained after SNP experimental validation in terms of assay conversion rate and percentage of polymorphic loci. GAII technology yields a larger number of candidate SNPs, but the majority of them are not suitable for SNP genotyping due to short flanking regions. On the other hand, 454-SNPs are less numerous, but are located on longer contigs, which are more easily annotated and screened for putative introns in the candidate region using a comparative genomic approach. Although the platforms we evaluated have recently been upgraded and superseded by later versions, our findings should remain valid and relevant. Indeed, during the past few years Illumina and 454 platforms have progressed rapidly in read length and in yield (Mb produced per run), mainly resulting in increased throughput; however, raw sequencing error rates have not decreased along with the improvement of instruments and chemistry
The SNP markers developed in the current study represent novel tools for a broad range of future applications in population genetic studies focusing on the European hake. A deeper understanding of the ecological and evolutionary dynamics of European hake populations across the entire distribution range provides the necessary basis for sound management and conservation policy, aimed at promoting sustainable fisheries and preventing overexploitation and illegal fishing activities.
We would like to thank all the members of the FishPopTrace Consortium for their input, and Marco Martini and Michele Drigo for their help with ROC analysis. Sampling was made possible by the generous collaboration of Paolo Sartor from the CBMI (Consorzio per il Centro Interuniversitario di Biologia Marina ed Ecologia Applicata “G. Bacci”, Italy), Corrado Piccinetti and Marco Stagioni from the University of Bologna (Italy), Audrey Geffen from the University of Bergen (Norway) and Grigorios Krey from the National Agricultural Research Foundation (Greece). We thank Pernille K. Andersen (Aarhus University, Denmark) for sequencing sample and library management. We are particularly grateful to Jann Martinsohn and Eoin MacAoidh from the European Commission's Joint Research Center, Institute for the Protection and Security of the Citizen (Italy).