One of the authors is employed by a commercial company (Shanhai Genome Biotechnology Company, Shanghai, China). The authors declare that this does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.
Conceived and designed the experiments: YP QW. Performed the experiments: QC ZZ ZC RL XX CY HY XZ. Analyzed the data: QC YY YM ZW PH YT. Contributed reagents/materials/analysis tools: HY FY YZ ZZ. Wrote the paper: QC QW.
Next-generation sequencing (NGS) approaches are widely used in genome-wide genetic marker discovery and genotyping. However, current NGS approaches are not easy to apply to general outbred populations (human and some major farm animals) for SNP identification because of the high level of heterogeneity and phase ambiguity in the haplotype. Here, we report a new method for SNP genotyping, called genotyping by genome reducing and sequencing (GGRS), to genotype outbred species. Through an improved procedure for library preparation and a marker discovery and genotyping pipeline, the GGRS approach can genotype outbred species cost-effectively and with high reproducibility. We also evaluated the efficiency and accuracy of our approach for high-density SNP discovery and genotyping in the pig, a species with a large genome (2.8 Gb), in which more than 70,000 single nucleotide polymorphisms (SNPs) can be identified for an expenditure of only $80 (USD)/sample.
Genetic variants, particularly single nucleotide polymorphisms (SNPs), are the basis of genetics and enable study of the genetic mechanism underlying human diseases and agriculturally important traits. Recently, next generation sequencing (NGS) technology has enabled the discovery of hundreds of thousands of SNPs and validation by “parallel” sequencing with plummeting cost
The RAD-seq method, which sequences target regions deeply (>20×) and enables markers to be genotyped accurately for outbred populations, is expensive and labor-intensive for high-throughput SNP detection because of the high sequencing depths and complex library preparation protocol
Humans and some major farm animals (cattle, sheep and pigs) are outbred species with a high level of heterogeneity. Their genome size and structure differ considerably from those of inbreeding plant species, so library preparation and the method of marker discovery and genotyping must be adapted accordingly. A flexible and cost-effective protocol that can be implemented in outbred populations is therefore required.
In order to balance the cost and accuracy of genome-wide marker discovery and genotyping, we report a medium sequencing depth (5–20×/site per individual, on average) approach called genotyping by genome reducing and sequencing (GGRS) to address the challenge of genotyping an outbred population. The GGRS approach is based mainly on a simple library preparation procedure and a marker discovery and genotyping pipeline. With GGRS, genome complexity can be reduced easily and reproducibly, which ensures sufficient sequence coverage, especially for species with large genomes. The goal of this article is to describe the approach and evaluate its efficiency and accuracy for high-density SNP discovery and genotyping in an outbred population.
The library preparation procedure for the GGRS approach was adapted from RAD-seq and GBS to suit genotyping of outbred populations.
Blue block, an example of a genomic region containing restriction sites; red block, a variation in the cut site, at 2000 bases in individual 1 and at 700 bases in individual 2, that is not cut by the restriction enzyme. Individuals 1 and 2 are light pink and light green, respectively. The red word Yes with gray shading shows that the step is necessary in the protocol. Both GBS and GGRS are simpler than RAD-seq; furthermore, GGRS replaces two clean-up steps of GBS with size selection.
Compared to GBS, our GGRS approach further improves the procedure to make it more suitable for an outbred population while retaining simplicity, speed and reproducibility. The GGRS approach includes three improvements: 1) fragments were selected by gel electrophoresis based on the genomic properties of an outbred population, which decreased cost; 2) one set of adapter-barcodes was designed to meet the requirements of depth and coverage to attain greater genotype accuracy; and 3) the adapter-removal and PCR clean-up steps were merged into the final PCR gel-purification step to reduce variation in fragment number between individuals.
Approval by the Institutional Animal Care and Use Committee of Shanghai Jiao Tong University (contract no. 2011–0033) was given for all experimental procedures involving animals in the present study.
The GGRS procedure was initially established for a large genome pig species (about 2.8 Gbp). DNA samples were obtained from 36 Landrace pigs and 36 Large White pigs. High molecular weight genomic DNAs were extracted from ear tissue using the Multisource Genomic DNA extraction kit (Axygen Biotechnology (Hangzhou) Co., Ltd, China).
According to the different levels of linkage disequilibrium (LD) between breeds from different regions, haploblocks are up to 10 kb in Chinese breeds and up to 400 kb in European breeds
Only one type of adapter, following the standard Illumina sequence, was used for paired-end DNA libraries, with a set of barcodes of 4–8 base modifications (
where XXXXX and YYYYY denote the barcode and the reverse barcode complementary sequences, respectively. When annealed, the plus and minus strand oligonucleotides formed a double-stranded, divergent Y-shaped structure. The adapter-barcodes were ligated to both ends of the digested fragments by complementary overhang sequences (
Stock solutions of 100 μM plus and minus strand adapters were made in 10 mM Tris–HCl (pH 7.8). Solutions of the appropriate adapter pairs were mixed 1∶1 (v/v) to give a final concentration of 50 μM, and the annealing reaction was then run in an ABI Veriti thermocycler. The mixture was heated at 95°C for 2 min, ramped down to 25°C at a rate of 0.1°C/s, kept at 25°C for 30 min and then allowed to cool naturally to 4°C. The 5′ end phosphorylation of the annealed mixture was performed in a thermocycler. The 10.0 μl reaction mixture, which included 1.0 μl of 10×T4 PNK buffer, 1.0 μl of 10 mM ATP, 0.1 pM (about 2.5 μg) annealed mixture and 10 U of T4 PNK (NEB Co., USA), was incubated at 37°C for 1 h. After heat inactivation at 65°C for 20 min, the phosphorylated adapter was diluted to a concentration of 2 ng/μl (0.07 pM/μl) before the ligation reaction.
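The timing of the annealing programme follows directly from the temperatures and ramp rate given above. The short Python sketch below works through that arithmetic and the 1∶1 dilution; the function names are illustrative only, and the final passive cooling to 4°C is left out because no rate is stated for it.

```python
# Annealing profile taken from the text: 95 C for 2 min, ramp to 25 C at
# 0.1 C/s, hold at 25 C for 30 min. Passive cooling to 4 C is not timed.

def ramp_seconds(start_c: float, end_c: float, rate_c_per_s: float) -> float:
    """Time needed to move between two temperatures at a constant ramp rate."""
    return abs(start_c - end_c) / rate_c_per_s


def annealing_seconds() -> float:
    """Programmed duration of the annealing run (denature + ramp + hold)."""
    denature = 2 * 60
    ramp = ramp_seconds(95.0, 25.0, 0.1)
    hold = 30 * 60
    return denature + ramp + hold


def mixed_concentration_um(stock_um: float, n_equal_volumes: int = 2) -> float:
    """Concentration of one oligo after mixing equal volumes of several stocks."""
    return stock_um / n_equal_volumes


if __name__ == "__main__":
    print(f"ramp time: {ramp_seconds(95.0, 25.0, 0.1):.0f} s")
    print(f"programmed annealing time: {annealing_seconds() / 60:.1f} min")
    print(f"each strand after 1:1 mixing: {mixed_concentration_um(100.0):.0f} uM")
```

Under these values the ramp alone takes about 700 s, so the programmed part of the run lasts a little under 45 min.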
We diluted genomic DNA to ∼50 ng/μl (quantified by PicoGreen, Promega, USA) and transferred 100 ng of DNA from each sample into 72-well PCR plates. The DNA was digested with 10 U of restriction enzyme
Primer 1.1
5′AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Primer 2.1
5′CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
The primer pairs contained the primer sequences and oligonucleotide sequences complementary to those attached to the flow cell. This design was essential for paired-end sequencing using only one set of adapter-barcodes.
The thermal cycling protocol for sequencing library preparation was: 95°C for 5 min, then 20 cycles of 95°C for 30 s, 65°C for 30 s and 72°C for 30 s, and a final elongation step at 72°C for 10 min. Each of five separate libraries was independently purified from the PCR mixture using a DNA gel extraction kit (Axygen Biotech (Hangzhou) Co., Ltd, China), and the five libraries were then pooled into one sequencing library. The quality of the sequencing library was evaluated with an Agilent 2100 Bioanalyzer. Sequencing libraries were used for next-generation sequencing if the fragments were in the range of 300 to 400 bp; if not, they were reconstructed using the protocol described above. One 72-plex sequencing library per flow-cell lane was sequenced on an Illumina HiSeq 2000 instrument in paired-end (2×100 bp) mode; the sequencing process is described in detail by the manufacturer (Illumina).
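Because 72 barcoded samples share one lane, each read has to be assigned back to its sample by the barcode at its 5′ end before analysis. The Python sketch below illustrates one simple way to do this for variable-length (4–8 base) barcodes. It is not the pipeline used in this study: the barcode sequences and sample names are hypothetical, and matching is exact, with no tolerance for sequencing errors.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical barcodes for illustration only; the real set is not listed here.
BARCODES: Dict[str, str] = {
    "ACGT": "sample_01",
    "TGCAG": "sample_02",
    "GATTACA": "sample_03",
}


def assign_sample(read: str, barcodes: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (sample, read with barcode trimmed), or None if nothing matches.

    Longer barcodes are tried first so that a short barcode that happens to be
    a prefix of a longer one cannot shadow it.
    """
    for barcode in sorted(barcodes, key=len, reverse=True):
        if read.startswith(barcode):
            return barcodes[barcode], read[len(barcode):]
    return None


def demultiplex(reads: Iterable[str], barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Group trimmed reads by sample; unmatched reads go to 'unassigned'."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        hit = assign_sample(read, barcodes)
        if hit is None:
            bins["unassigned"].append(read)
        else:
            sample, trimmed = hit
            bins[sample].append(trimmed)
    return dict(bins)


if __name__ == "__main__":
    example_reads = ["ACGTTTTT", "TGCAGAAAA", "GATTACACC", "NNNNNN"]
    for sample, seqs in demultiplex(example_reads, BARCODES).items():
        print(sample, seqs)
```

A production demultiplexer would normally also allow one mismatch in the barcode and check the restriction-site remnant that follows it; both are omitted here to keep the sketch short.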
We filtered sequences according to several rules for subsequent analysis
In this study, although we used the paired-end method for sequencing, we aligned filtered reads with the pig reference genome (SGSC Sscrofa9.2) by the single-end mapping method because of the imperfection and breed specificity of the reference genome. We aligned reads to the reference genome using Burrows-Wheeler Aligner (BWA) with the default settings
The successfully aligned reads were simultaneously mapped to the pig reference genome using SAMtools with default settings to discover SNPs
Missing genotypes were imputed using the iBLUP software developed by our group. The iBLUP software is available at
Sequences were collected in lanes of a single flow cell at 72-plex from an outbred pig population, a genetically diverse species with a large genome. A total of 380,971,530 raw reads were generated in one flow-cell lane of an Illumina HiSeq 2000 sequencer for the pig population, of which 361,611,915 (94.9%) passed the filtering rules and were considered high-quality reads. The maximum and minimum numbers of reads per individual were 10,077,526 and 1,570,923, respectively, and the average was 5,022,387. One individual yielded only 229,575 reads, which may be due to random error (
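The summary figures in this paragraph can be reproduced from the totals it reports. The Python sketch below recomputes the high-quality fraction and the mean read count per individual, and shows one possible way to flag a low-yield sample. The 10%-of-mean cut-off is an arbitrary assumption made for illustration, not a rule used in this study.

```python
# Totals reported in the text for one 72-plex lane.
RAW_READS = 380_971_530
HIGH_QUALITY_READS = 361_611_915
SAMPLES = 72


def fraction(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`."""
    return part / whole


def mean_reads(total_reads: int, n_samples: int) -> float:
    """Average number of reads per individual."""
    return total_reads / n_samples


def is_low_yield(reads: int, mean: float, min_fraction_of_mean: float = 0.1) -> bool:
    """Flag a sample whose read count is far below the lane mean.

    The default cut-off (10% of the mean) is a hypothetical choice.
    """
    return reads < min_fraction_of_mean * mean


if __name__ == "__main__":
    mean = mean_reads(HIGH_QUALITY_READS, SAMPLES)
    print(f"high-quality fraction: {fraction(HIGH_QUALITY_READS, RAW_READS):.1%}")
    print(f"mean reads per individual: {mean:,.0f}")
    print("229,575-read sample flagged:", is_low_yield(229_575, mean))
    print("1,570,923-read sample flagged:", is_low_yield(1_570_923, mean))
```

With this cut-off the 229,575-read individual is flagged while the individual with 1,570,923 reads is not.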
The average base quality score, a measure of the probability that a base was called correctly, was at least 20 (base-calling error rate of 1 in 100), and the average quality score of the first 65 bp was at least 30 (base-calling error rate of 1 in 1000) (
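The correspondence between quality scores and error rates quoted here follows the standard Phred definition, Q = −10·log10(P), where P is the probability that a base call is wrong. A minimal Python sketch of the conversion, with illustrative function names:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score corresponding to a base-calling error probability."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (20, 30):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g} (about 1 in {round(1 / p)})")
```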
The sequencing results demonstrated that our GGRS pipeline yielded high quality scores and small variation in read number between individuals by reducing genome complexity. Of the high-quality reads, 318,218,485 (88%) were aligned to the pig reference genome sequence using BWA with default settings. The majority of the unaligned reads were identified as pig genome sequences by a Basic Local Alignment Search Tool (BLAST) search against the National Center for Biotechnology Information (NCBI) nt database using default settings (data not shown), which indicates that a large number of gaps exist in the version of the reference sequence used. These sequences were used to produce consensus sequences that slightly supplement the reference genome. In general, the assigned reads of each individual were uniformly distributed across chromosomes, although there were a few regional variations. The density distribution of high-quality non-redundant reads across the autosomes and the X chromosome is shown in
In order to reduce false-positive SNPs, we applied the following three criteria: 1) the number of aligned reads per site per individual used to build a reliable sequence was more than 5×
In the present pig sequencing experiment, 71,072 SNPs were discovered and 46% missing genotypes were imputed using the iBLUP program. On average, the density of putative SNPs across chromosomes was ∼0.33 SNPs/10 kb. The maximum and minimum were 0.79 on chromosome 6 and 0.19 on X chromosome, respectively (
For large-genome outbred species, reducing genome complexity, optimizing barcodes and keeping the procedure simple are the key points in sequencing library preparation.
The step of selecting fragments appears to be necessary for outbred species in order to obtain sufficient sequencing depth for more accurate SNP calling and genotyping. Currently, most available genome reducing and sequencing methods are based on restriction enzymes (REs)
Our GGRS procedure employed one set of adaptors. Both ends of each digested fragment were ligated to an identical barcode-adaptor, instead of one end being ligated to a common adaptor and the other to a barcode-adaptor. This barcode-adaptor design is beneficial for outbred populations because it increases fragment consistency between individuals. GGRS is more appropriate than existing RAD protocols for parallel genotyping of large numbers of samples because of its simpler procedure, and the optimized library preparation protocol saves cost and labor. Moreover, the simplified protocol allows small amounts of DNA (about 100 ng, or even less) to be used for library preparation, which is important for studying rare materials.
Recently, a streamlined restriction site-associated DNA genotyping method called 2b-RAD has been published
Our GGRS pipeline can process 504 samples per run (72 samples/lane × 7 lanes/flow cell, with one lane as control), and >70,000 pig SNPs can be identified in a short time for an expenditure of $80 (USD)/sample. Up to 288 samples per lane (2016 samples/flow cell) could potentially be sequenced as the read density of upgraded HiSeq 2000 sequencers increases. These improvements will accelerate the reduction of the genotyping cost to <$20 (USD)/sample.
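The throughput figures above follow from the lane layout, and the projected cost can be approximated with a simple proportional model. The Python sketch below assumes an eight-lane flow cell with one control lane, as described, and further assumes that the per-sample cost is dominated by a fixed lane cost shared equally among the samples in that lane. That second assumption is a simplification introduced here; it ignores per-sample library preparation costs.

```python
LANES_PER_FLOW_CELL = 8  # seven sample lanes plus one control lane, per the text
CONTROL_LANES = 1


def samples_per_run(samples_per_lane: int) -> int:
    """Samples genotyped in one flow cell, excluding the control lane."""
    return samples_per_lane * (LANES_PER_FLOW_CELL - CONTROL_LANES)


def scaled_cost_per_sample(cost_now: float, plex_now: int, plex_new: int) -> float:
    """Per-sample cost if a fixed lane cost is split among more samples.

    Assumes sequencing dominates the cost; library preparation is ignored.
    """
    return cost_now * plex_now / plex_new


if __name__ == "__main__":
    print("samples per run at 72-plex:", samples_per_run(72))
    print("samples per run at 288-plex:", samples_per_run(288))
    print(f"projected cost at 288-plex: ${scaled_cost_per_sample(80.0, 72, 288):.0f}/sample")
```

Under this model, moving from 72 to 288 samples per lane brings the cost from $80 to about $20 per sample, which is in line with the target stated above.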