Conceived and designed the experiments: BLF GDJ MDS RF. Analyzed the data: BLF GDJ MDS SH. Contributed reagents/materials/analysis tools: BLF GDJ MDS. Wrote the paper: BLF SH.
The authors have declared that no competing interests exist.
In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased considerably, and it is now possible to genotype over 1 million SNP markers across the genome. This advancement in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have implicated over 1500 SNPs as associated with disease traits. However, the SNPs identified from these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci.
A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the individuals and then apply genotype imputation methods to impute markers in the remaining individuals. A potentially attractive alternative would be to impute based on the 1000 Genomes Project; however, this has the drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation using a reference panel consisting of sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the 1000 Genomes Project.
Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines which take advantage of the recent advances in genotype imputation methodology: Select the largest and most diverse reference panel for sequencing and genotype as many “anchor” markers as possible.
In the last five years, the capabilities and technology for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased significantly. Current genome-wide SNP arrays can genotype over one million SNP markers across the genome. This advancement in technology has led to an increased number of completed and on-going genome-wide association studies (GWAS) for various complex disease and drug-related phenotypes. These GWAS have resulted in more than 350 publications and over 1500 SNPs implicated for association with multiple (>80) disease phenotypes or traits.
Indirect association, as a result of linkage disequilibrium (LD), is a key factor in the success of genetic association studies. Because of LD, a disease-susceptibility SNP need not be genotyped, as long as it is “tagged” by a SNP or set of SNPs that are genotyped (i.e., SNPs in LD with the disease-susceptibility SNP are genotyped). Recently this concept has been further exploited by the introduction of methods to impute genotypes at untyped markers, based on genotypes at typed markers and information about LD within the region.
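To make the notion of one SNP "tagging" another concrete, the following minimal sketch computes the squared correlation between two SNPs from unphased genotypes coded as minor-allele counts (0, 1, or 2). This genotype correlation is a commonly used approximation to the haplotype-based r² measure of LD; the function name and the coding are assumptions made for illustration and are not taken from the study.

```python
from statistics import mean


def genotype_r2(g1, g2):
    """Squared Pearson correlation between two SNPs.

    g1, g2: per-subject genotypes coded as minor-allele counts (0, 1, 2),
    in the same subject order. This is an illustrative approximation to
    haplotype-based r^2 that does not require phased data.
    """
    if len(g1) != len(g2) or len(g1) == 0:
        raise ValueError("genotype vectors must be non-empty and equal length")
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        # A monomorphic SNP carries no information about the other SNP.
        return 0.0
    return (cov * cov) / (var1 * var2)
```

A value near 1 means that the typed SNP is a good proxy for the untyped one; a value near 0 means that genotyping one tells little about the other.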
One approach for following up replicated findings from a GWAS would be to determine all genetic variation within the locus, especially rarer variants not currently included on GWAS SNP arrays, as they may play an important role in the etiology of the disease
In this manuscript we explore the use of the recently developed genotype imputation method implemented in MACH
To explore various approaches for imputation of untyped markers using a reference panel determined from sequencing data, we utilized a recently completed sequencing study for a gene which we will denote as
As a consequence of
 | | African American | | White non-Hispanic American | | Han Chinese American | |
Marker | Position | ObsHET | MAF | ObsHET | MAF | ObsHET | MAF |
1 | 1270 | 0.385 | 0.203 | 0.415 | 0.303 | 0.474 | 0.416 |
5 | 1541 | 0.365 | 0.182 | 0.427 | 0.307 | 0.479 | 0.417 |
6 | 1753 | 0 | 0 | 0.021 | 0.01 | 0 | 0 |
8 | 1811 | 0.021 | 0.01 | 0.031 | 0.026 | 0 | 0 |
9 | 1812 | 0.062 | 0.031 | 0.104 | 0.062 | 0.26 | 0.193 |
10 | 1962 | 0 | 0 | 0.021 | 0.01 | 0 | 0 |
11 | 1968 | 0 | 0 | 0 | 0 | 0.135 | 0.068 |
16 | 2829 | 0.021 | 0.01 | 0 | 0 | 0 | 0 |
17 | 3092 | 0 | 0 | 0 | 0 | 0.021 | 0.01 |
18 | 3145 | 0 | 0 | 0 | 0 | 0.021 | 0.01 |
19 | 3150 | 0.083 | 0.052 | 0 | 0 | 0 | 0 |
21 | 3456 | 0.521 | 0.396 | 0.417 | 0.312 | 0.469 | 0.411 |
22 | 3525 | 0 | 0 | 0 | 0 | 0.052 | 0.036 |
24 | 4399 | 0.115 | 0.057 | 0 | 0 | 0.146 | 0.083 |
25 | 4467 | 0.052 | 0.026 | 0 | 0 | 0.146 | 0.083 |
27 | 4893 | 0.042 | 0.021 | 0 | 0 | 0 | 0 |
28 | 5016 | 0.01 | 0.005 | 0 | 0 | 0 | 0 |
29 | 5031 | 0 | 0 | 0.052 | 0.026 | 0 | 0 |
33 | 5523 | 0.053 | 0.026 | 0.083 | 0.042 | 0.26 | 0.193 |
37 | 5974 | 0.021 | 0.011 | 0.042 | 0.021 | 0 | 0 |
39 | 6166 | 0.385 | 0.203 | 0.469 | 0.432 | 0.365 | 0.224 |
40 | 6237 | 0.021 | 0.01 | 0 | 0 | 0 | 0 |
41 | 6265 | 0 | 0 | 0 | 0 | 0.062 | 0.031 |
42 | 6311 | 0.01 | 0.005 | 0 | 0 | 0 | 0 |
44 | 6862 | 0 | 0 | 0.052 | 0.026 | 0 | 0 |
45 | 7036 | 0.031 | 0.016 | 0 | 0 | 0 | 0 |
48 | 7262 | 0 | 0 | 0 | 0 | 0.073 | 0.057 |
57 | 7975 | 0.031 | 0.016 | 0 | 0 | 0 | 0 |
58 | 8057 | 0.021 | 0.01 | 0.094 | 0.047 | 0.011 | 0.005 |
59 | 8187 | 0.021 | 0.01 | 0 | 0 | 0 | 0 |
60 | 8230 | 0.021 | 0.01 | 0 | 0 | 0 | 0 |
*SNP Marker in HapMap; used as typed genotypes in all samples (i.e., markers on a GWAS SNP array).
MAF = minor allele frequency based on imputed “dosage” or expected genotype, position = physical base-pair location of the SNP based on build 36, ObsHET = observed heterozygote rate.
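The two summary statistics defined in the footnote above can be computed as in the following minimal sketch. It assumes that each dosage is the expected count (between 0 and 2) of one designated allele per subject and that observed genotypes are coded 0, 1, or 2; the function names are hypothetical.

```python
def maf_from_dosage(dosages):
    """Minor allele frequency estimated from imputed dosages.

    dosages: per-subject expected allele counts in [0, 2] for one
    designated allele. The frequency is folded so that the returned
    value never exceeds 0.5.
    """
    if not dosages:
        raise ValueError("at least one dosage is required")
    freq = sum(dosages) / (2 * len(dosages))
    return min(freq, 1 - freq)


def observed_het(genotypes):
    """Observed heterozygote rate: fraction of subjects with genotype 1."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    return sum(1 for g in genotypes if g == 1) / len(genotypes)
```

Because dosages are expected values rather than hard calls, the resulting MAF can be non-zero even when no subject has a most likely genotype carrying the minor allele.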
The initial goal of the 1000 Genomes Project was to sequence the entire genome in approximately 1200 individuals as a means of documenting all human genetic variation. Recently, the number of samples to be sequenced has increased to over 2000
To explore various approaches for imputing untyped markers to augment sequence data, using a reference panel determined from sequencing a portion of the study participants, we will utilize sequence data available for
In assessing the use of sequence data for genotype imputation, the experiment varied four factors: (1) the proportion of samples sequenced (i.e., the size of the reference data); (2) the number of markers genotyped for all subjects (“anchor” SNPs) and how these markers were selected; (3) whether imputation was based on the unphased genotypes or the most-likely phased haplotypes of the sequenced participants (“reference panel”); and (4) whether imputation was based on race-specific reference haplotypes or all reference haplotypes (regardless of race). The various simulation scenarios investigated using
To assess the accuracy of the various simulation scenarios, we
Factors Varied for Imputation | Scenario | ||||
1 | 2 | 3 | 4 | ||
Most likely phase reference haplotypes | |||||
Unphased genotypes | |||||
Race specific reference haplotypes | |||||
All reference haplotypes, regardless of race | |||||
Three | |||||
Five | |||||
Seven | |||||
Ten | |||||
Tag SNPs | |||||
SNPs on GWAS array |
In Scenario 2, imputation was completed by race (as for Scenario 1), but imputation was based on all reference haplotypes for all three races. To assess the variation in haplotype assignment and impact on imputation accuracy, imputation Scenario 3 was completed, by race, with only unphased genotypes for the sequenced samples (i.e., no phased reference haplotypes used). It should be noted that not all genotype imputation methods allow for the use of only unphased genotypes and may require reference haplotypes. Lastly, Scenario 4 assessed the impact of the number of “anchor” markers genotyped for all subjects, with either 3 markers (Scenario 4.3), 5 markers (Scenario 4.5), or 7 markers (Scenario 4.7) used in the imputation with the same design as Scenario 1.
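The reference-panel construction underlying these scenarios can be sketched as follows: within each race, a random fraction of subjects is treated as "sequenced" (the reference panel) and the remainder keep only their anchor genotypes. This is a minimal illustration under stated assumptions; the function names, the subject identifiers, and the use of a fixed random seed are hypothetical and do not describe the actual pipeline used in the study.

```python
import random


def split_reference_panel(subjects_by_race, proportion, seed=0):
    """Randomly assign a fraction of each race to the reference panel.

    subjects_by_race: dict mapping a race label to a list of subject IDs.
    proportion: fraction of subjects per race treated as sequenced.
    Returns (reference, target), two dicts keyed by race.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed so a given split is reproducible
    reference, target = {}, {}
    for race, ids in subjects_by_race.items():
        n_ref = max(1, round(proportion * len(ids)))
        chosen = set(rng.sample(ids, n_ref))
        reference[race] = [s for s in ids if s in chosen]
        target[race] = [s for s in ids if s not in chosen]
    return reference, target


def pooled_reference(reference):
    """Combine the reference subjects of all races (as in Scenario 2)."""
    return [s for ids in reference.values() for s in ids]
```

Under this sketch, a race-specific design (Scenario 1) would impute each race's target subjects from `reference[race]` only, whereas a pooled design (Scenario 2) would use `pooled_reference(reference)` for every race. Changing `seed` yields a different reference set of the same size.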
For the
To mimic the candidate gene scenario, we consider the common situation where the anchor markers, genotyped on all subjects, would be LD based tag SNPs. Therefore, to determine anchor markers for
Sequencing of
The proportion of the sample used as the reference panel is displayed on the X-axis and the mean SNP imputation quality score is displayed on the Y-axis.
The proportion of the sample used as the reference panel is displayed on the X-axis and minimum SNP imputation quality score is displayed on the Y-axis.
 | | Proportion Sequenced | | | | |
Scenario | Race | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
1 | AA | 0.962 | 0.965 | 0.973 | 0.962 | 0.974 |
1 | CA | 0.988 | 0.991 | 0.968 | 0.989 | 0.964 |
1 | HCA | 0.963 | 0.973 | 0.969 | 0.964 | 0.97 |
2 | AA | 0.975 | 0.971 | 0.975 | 0.972 | 0.971 |
2 | CA | 0.985 | 0.978 | 0.978 | 0.976 | 0.977 |
2 | HCA | 0.98 | 0.975 | 0.976 | 0.975 | 0.978 |
3 | AA | 0.969 | 0.971 | 0.973 | 0.975 | 0.97 |
3 | CA | 0.977 | 0.972 | 0.972 | 0.971 | 0.972 |
3 | HCA | 0.968 | 0.972 | 0.972 | 0.972 | 0.972 |
 | Number of Anchor Markers | | | |
Race | 3 | 5 | 7 | 11 |
AA | 0.966 | 0.972 | 0.973 | 0.974 |
CA | 0.952 | 0.960 | 0.962 | 0.964 |
HCA | 0.958 | 0.962 | 0.972 | 0.970 |
Table presents results for scenario with 50% of the samples sequenced.
Lastly, in general, as the proportion (or number) of samples sequenced increased (i.e., a larger set of reference haplotypes), the quality of the imputation increased. Thus, the accuracy of imputation depends not only on the number of markers genotyped on all subjects but also on the size of the reference sample. In addition to variation in results due to the number of markers genotyped on all subjects and the size of the reference set, two different reference sets of the same size (i.e., different samples selected for sequencing) resulted in slightly different imputation accuracy (data not shown).
The various scenarios for genotype imputation using the 1000 Genomes Project sequence data for polymorphic markers in
As completed for
The proportion of the sample used as the reference panel is displayed on the X-axis and the percent concordant between the “true” genotype and the imputed most likely genotype is displayed on the Y-axis.
The proportion of the sample used as the reference panel is displayed on the X-axis and the minimum SNP imputation quality score is displayed on the Y-axis.
The figures and table show that anchor SNPs based on a tag SNP approach dramatically outperformed the approach where anchor SNPs were based on a large SNP array (Scenario 4), in terms of both concordance rates and minimum SNP quality scores. In terms of the highest minimum quality score, imputation based on reference haplotypes for all races (Scenario 2) was the “best”; however, in terms of concordance between the observed and imputed most likely genotype, imputation based on race-specific reference haplotypes (Scenario 1) was “best”. The figures also show a slight increase in imputation performance as the size of the reference panel increased.
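The concordance measure referred to above, agreement between the "true" genotype and the imputed most likely genotype, can be sketched as below. The sketch assumes that the imputation output provides, for each subject, posterior probabilities for the three genotypes (0, 1, or 2 copies of an allele); the function names and input layout are hypothetical.

```python
def most_likely_genotype(probs):
    """Return the genotype (0, 1, or 2) with the highest posterior probability.

    probs: sequence of three probabilities (P(0), P(1), P(2)).
    Ties are resolved in favour of the lower genotype code.
    """
    if len(probs) != 3:
        raise ValueError("expected three genotype probabilities")
    return max(range(3), key=lambda g: probs[g])


def concordance(true_genotypes, posterior_probs):
    """Fraction of subjects whose most likely imputed genotype matches the truth."""
    if len(true_genotypes) != len(posterior_probs) or not true_genotypes:
        raise ValueError("inputs must be non-empty and of equal length")
    calls = [most_likely_genotype(p) for p in posterior_probs]
    matches = sum(1 for t, c in zip(true_genotypes, calls) if t == c)
    return matches / len(true_genotypes)
```

Concordance is evaluated on hard genotype calls, so it can rank scenarios differently from a dosage-based quality score, which is consistent with Scenarios 1 and 2 each being "best" on a different criterion.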
Lastly, because one goal of sequencing is to detect rare variants, we compared, for the three races, the relationship between the MAF estimated from the imputed dosage and the quality of imputation (
The MAF (group 1: 0≤MAF≤0.05, group 2: 0.05<MAF≤0.10, group 3: 0.10<MAF≤0.20, group 4: 0.20<MAF≤0.30, group 5: 0.30<MAF≤0.40, group 6: 0.40<MAF≤0.50) is displayed on the X-axis and the mean SNP imputation quality score is displayed on the Y-axis.
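The MAF grouping in this caption can be expressed as a small binning routine. The following sketch assigns each SNP to one of the six groups using the stated boundaries (upper bounds inclusive) and averages a per-SNP quality score within each group; the function names and the `(maf, quality)` record layout are assumptions made for illustration.

```python
import bisect
from collections import defaultdict

# Inclusive upper bounds of MAF groups 1 through 6, as given in the caption.
UPPER_BOUNDS = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]


def maf_group(maf):
    """Return the group number (1-6) for a minor allele frequency."""
    if not 0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    # bisect_left keeps a value equal to an upper bound in the lower group.
    return bisect.bisect_left(UPPER_BOUNDS, maf) + 1


def mean_quality_by_group(records):
    """Average quality score per MAF group.

    records: iterable of (maf, quality) pairs, one per SNP.
    Returns a dict mapping group number to mean quality; groups with no
    SNPs are omitted.
    """
    by_group = defaultdict(list)
    for maf, quality in records:
        by_group[maf_group(maf)].append(quality)
    return {g: sum(q) / len(q) for g, q in sorted(by_group.items())}
```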
In this manuscript, we present the use of a recently developed genotype imputation method for sequencing studies in which the reference panel consists of sequencing data for a fraction of the study participants. By sequencing only a portion of the samples for the follow-up of signals detected from a GWAS, followed by imputation in the remaining samples, one can significantly reduce the cost of localizing the putative variant involved in the etiology of complex disease and pharmacogenomic phenotypes. In addition, by utilizing sequence data on a portion of individuals in the study, we obtain a reference panel that is matched to the study population in terms of linkage disequilibrium, without relying on the assumption that the HapMap populations represent our study population (HapMap-based haplotypes being the current standard reference data used for genotype imputation).
Sequencing a portion of our study population also allows us to determine and assess association of rare variants not present in the HapMap database. Upon completion of the 1000 Genomes project (
Our results, based on a Sanger sequencing study of a candidate gene and preliminary data from the 1000 Genomes Project for
Select the largest and most diverse reference panel for sequencing, with respect to both haplotypes and phenotype. One can also use sequencing data from the 1000 Genomes Project in addition to sequencing a portion of the study participants (i.e., the reference panel consists of data from sequencing a portion of individuals with the disease/phenotype of interest and data from the 1000 Genomes Project).
Given that sequencing produces unphased genotypes, if possible, imputation should be carried out on the unphased genotypes in the reference panel as opposed to the most likely phase haplotypes to account for the uncertainty in haplotype assignment.
Genotype as many “anchor” markers as possible, because the number of markers genotyped on all subjects impacts accuracy. Therefore, additional genotyping of a few common SNP markers not already genotyped on all subjects, using a cost-effective platform such as Taqman, may be needed if the GWAS SNP array does not provide adequate coverage in the locus to be sequenced.
We would like to thank Linda Pelleymounter and Irene Moon for their contribution to the sequencing study of