Conceived and designed the experiments: JAR TS CEM. Performed the experiments: JAR CEM. Analyzed the data: JAR TS CEM. Contributed reagents/materials/analysis tools: JAR TS CEM. Wrote the paper: JAR TS CEM.
TS is an employee of Perkin Elmer. This does not alter the authors’ adherence to all the PLoS ONE policies on sharing data and materials.
Data from the 1000 Genomes Project (1KGP) and Complete Genomics (CG) have dramatically increased the numbers of known genetic variants and challenge several assumptions about the reference genome and its uses in both clinical and research settings. Specifically, 34% of published array-based GWAS studies for a variety of diseases utilize probes that overlap unanticipated single nucleotide polymorphisms (SNPs), indels, or structural variants. Linkage disequilibrium (LD) block length depends on the numbers of markers used, and the mean LD block size decreases from 16 kb to 7 kb when HapMap-based calculations are compared to blocks computed from 1KGP data. Additionally, when 1KGP and CG variants are compared, 19% of the single nucleotide variants (SNVs) reported from common genomes are unique to one dataset, likely a result of differences in data collection methodology, alignment of reads to the reference genome, and variant-calling algorithms. Together, these observations indicate that current research resources and informatics methods do not adequately account for the high level of variation that already exists in the human population, and that significant efforts are needed to create resources that can accurately assess personal genomics for health and disease and predict treatment outcomes.
A primary goal of the human genome project was to produce a high quality DNA sequence that could serve as a common reference for understanding the genetic basis of health and disease. The reference sequence has been a guiding principle for the development of a vast array of reagents, arrays, genotyping assays, computational tools, and clinical resources. Moreover, the reference sequence is the foundation for databases and bioinformatics algorithms that are used to define target regions for resequencing, perform genome wide association studies, or measure inter-species conservation. Thus, the reference sequence has become essential for clinical applications, and is used to determine alleles for risk, protection, or treatment-specific response in human disease
Because so much work is currently based on the concept of a standardized reference sequence, we have evaluated the extent to which our growing knowledge of human genome variation should alter this paradigm. New data emerging from the 1000 Genomes Project (1KGP)
Microarrays have been one of the most utilized tools in genetic research and are the basic platform for Genome-Wide Association Studies (GWAS). These arrays contain millions of DNA probes that are used to determine the genotypes of polymorphic loci. The optimal genotyping capacity is predicated on two basic assumptions concerning genomic sequences. First, the genotyped samples are completely identical at all locations within the probe, except for the targeted SNP. Second, the SNP must be one of the two variants for which the array was designed, as variants are assumed to be biallelic. If either of these two conditions is violated, then the microarray probe will not function as well, or at all, and can lead to false negative or false positive results for some individuals
Using the 1KGP dataset (25 million single nucleotide polymorphisms (SNPs) from 629 individuals), we evaluated the extent to which these assumptions are violated for each of four commonly used microarray platforms, the Illumina Human Omni1M-quad, Illumina Omni 2.5 M, Affymetrix 6.0, and the Affymetrix Axiom CEU array, which takes into account new knowledge concerning variation within the genome
Number of microarray probes overlapping un-probed SNPs, indels, and structural variants
Array | Number of autosomal SNP probes on the array | Un-probed SNPs | Indels | Structural Variants | Total number of probes affected by either un-probed SNPs or Indels | Total number of probes affected by either un-probed SNPs, Indels or Structural Variants |
Affymetrix 6.0 | 894,240 | 119,341 (13%) | 7,720 (1%) | 359,592 (40%) | 125,785 (14%) | 434,702 (49%) |
Affymetrix Axiom CEU | 607,555 | 100,339 (17%) | 11,983 (2%) | 247,944 (41%) | 111,150 (18%) | 312,736 (51%) |
Illumina 1 M | 940,876 | 144,431 (15%) | 9,996 (1%) | 378,809 (40%) | 153,192 (16%) | 469,417 (50%) |
Illumina 2.5 M | 2,390,395 | 379,271 (16%) | 35,494 (1%) | 944,510 (40%) | 410,355 (17%) | 1,191,584 (50%) |
Histogram showing the number of SNPs in upstream and downstream positions relative to the probed SNP on the Illumina 1 M array. The red line indicates the location of the probed SNP.
The number of probes on the Affymetrix Axiom CEU (blue) and Illumina 2.5 M (red) arrays that are found to contain an un-probed SNP for sub-samples of the 1KGP SNPs.
Next, we examined polyallelic SNPs, where there are more than 2 possible alleles at a locus. The 1KGP data contained 496 loci over 629 genomes, whereas the Complete Genomics (CG) data contained 61,153 loci in just 69 genomes, representing a 123-fold increase in detection. Polyallelic SNPs are more difficult to investigate because many variant calling pipelines are biased toward biallelic SNPs
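To make the definition of a polyallelic locus concrete, the following is a minimal sketch rather than the pipeline used in this study. It assumes genotypes are available as two-character strings per individual and counts the reference base as one of the alleles at a locus; the function name and the input layout are hypothetical.

```python
def polyallelic_loci(genotypes_by_locus, reference_alleles):
    """Return loci at which more than two distinct alleles are observed.

    genotypes_by_locus: dict mapping a locus key to a list of genotype
        strings such as "AG" (one string per individual); assumed layout.
    reference_alleles: dict mapping the same locus keys to the reference base.
    """
    result = []
    for locus, genotypes in genotypes_by_locus.items():
        # Assumption: the reference base counts as one allele at the locus.
        alleles = {reference_alleles[locus]}
        for gt in genotypes:
            alleles.update(gt)  # add both bases of the diploid call
        if len(alleles) > 2:
            result.append(locus)
    return result


# Toy data: the second locus carries A (reference), G and T.
toy_genotypes = {
    ("20", 1000): ["AA", "AG", "GG"],
    ("20", 2000): ["AG", "GT", "AA"],
}
toy_reference = {("20", 1000): "A", ("20", 2000): "A"}
print(polyallelic_loci(toy_genotypes, toy_reference))
```

A genotype such as "GT" at a locus whose reference base is A corresponds to the "heterozygous, dual non-reference" case: neither allele of the individual matches the reference.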
We then compared the “heterozygous, dual non-reference” SNP locations to lists of the probe bases on the standard microarrays and found thousands of instances where the microarray probe assumed a biallelic context but is instead polyallelic (
The impact of the above issues is that they can confound GWAS studies due to false homozygous or negative calls. Hence, we next sought to examine which of these problematic probes appear in published GWAS studies. 34% of the studies (1,708/4,972 from the UCSC genome browser) have significant hits with a microarray probe affected by neighboring un-probed SNPs, SVs, indels or polyallelic SNPs (
Widely used bioinformatics tools for examining the effect of a specific mutation on a protein’s structure, such as PolyPhen
We utilized the high-depth CG genomes to evaluate the extent to which there are multiple SNPs within an exon or the full coding regions of a gene. For each individual, there is an average of 6,077 genes (sd = 570) having multiple SNPs and an average of 3,320 (sd = 341) individual exons having multiple SNPs. The vast majority of exons with multiple SNPs have only two SNPs, but some exons have a larger number. BRCA1, one of the most well-studied breast cancer genes, contains numerous genetic variants. In the CG data, 36 of 69 individuals have multiple (2–5) non-synonymous variants within their BRCA1 gene.
Haplotype blocks can, in theory, reduce genotyping measurements by using one SNP to “tag” other SNPs
To further determine the extent to which both the number of SNPs and the number of individuals analyzed affected LD calculations, we separately selected random samples of different percentages of individuals and SNPs from the 1KGP data for chromosome 20. As the number of SNPs increased (with a constant number of individuals), the average size of an LD block decreased from 16 kb (sd = 26 kb) to 5.4 kb (sd = 13 kb) (
The mean length of LD blocks as the number of genotyped markers increases.
Gene | Number of Blocks/total size of Blocks from HapMap | Mean size of Blocks from HapMap | Number of Blocks/total size of Blocks from 1KGP | Mean size of Blocks from 1KGP |
BRCA1 | 4/73,185 bp | 18.3 kb | 7/75,015 bp | 10.7 kb |
JAK2 | 4/132,544 bp | 33 kb | 17/131,306 bp | 7.7 kb |
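The mean block sizes in the table follow directly from the block counts and total block sizes. The short sketch below recomputes them as a check; the numbers are copied from the table, and the dictionary layout and function name are illustrative choices only.

```python
# (number of blocks, total size of blocks in bp), copied from the table above.
BLOCKS = {
    ("BRCA1", "HapMap"): (4, 73_185),
    ("BRCA1", "1KGP"): (7, 75_015),
    ("JAK2", "HapMap"): (4, 132_544),
    ("JAK2", "1KGP"): (17, 131_306),
}


def mean_block_kb(n_blocks: int, total_bp: int) -> float:
    """Mean LD block size in kb: total block length divided by block count."""
    return total_bp / n_blocks / 1000


for (gene, source), (n_blocks, total_bp) in BLOCKS.items():
    print(f"{gene} {source}: {mean_block_kb(n_blocks, total_bp):.1f} kb")
```

For both genes the total length covered by blocks is nearly unchanged between HapMap and 1KGP, so the drop in mean size comes from the same region being split into more blocks.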
While it is understood that increasing variation will decrease LD block size, the impact of increased variation has not been documented. This decreased LD has strong implications for the diagnostic genotyping of these genes and association studies
An additional measure used for studying population variation is the presence of runs of homozygosity (ROH). Individuals from homogeneous populations share similar chromosomal segments, and therefore, many contiguous loci across the individuals’ genomes would be homozygous. Conversely, individuals from admixed populations show less homozygosity in their genome. This homozygosity is investigated using SNP data and a run of homozygous (ROH) SNPs
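As an illustration of the ROH concept only, the sketch below scans an ordered list of genotype calls for stretches of consecutive homozygous SNPs. It is not the procedure used in this study; the `min_snps` threshold, the genotype string format, and the function name are assumptions made for the example.

```python
def runs_of_homozygosity(genotypes, min_snps=3):
    """Return (start_index, end_index) pairs for runs of homozygous calls.

    genotypes: sequence of two-character genotype strings such as "AA" or
        "AG", ordered by genomic position (assumed layout).
    min_snps: minimum number of consecutive homozygous SNPs to report;
        an illustrative threshold, not a value taken from the text.
    """
    runs = []
    run_start = None
    for i, gt in enumerate(genotypes):
        if gt[0] == gt[1]:
            if run_start is None:
                run_start = i  # a new homozygous stretch begins here
        else:
            # A heterozygous call ends the current stretch, if any.
            if run_start is not None and i - run_start >= min_snps:
                runs.append((run_start, i - 1))
            run_start = None
    # Close a stretch that extends to the last SNP.
    if run_start is not None and len(genotypes) - run_start >= min_snps:
        runs.append((run_start, len(genotypes) - 1))
    return runs


calls = ["AA", "GG", "CC", "AG", "TT", "TT", "CT", "AA", "AA", "AA", "GG"]
print(runs_of_homozygosity(calls))
```

Because a single heterozygous call terminates a run, adding newly discovered SNPs to the marker set can only shorten or split the runs found by a scan of this kind.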
The variant calls produced by 1KGP and CG represent two fundamentally different approaches to interrogating the variation across human genomes. The 1KGP project aims to sequence 2,500 low-coverage genomes, and the release we examined contained genotypes for 629 genomes. In contrast, the CG set contains 69 high-depth fully sequenced genomes. Since these approaches are so distinct, it is worth comparing the overlap of SNP calls between the two sets (
Through efforts like the 1KGP and CG public data releases, we are getting a new view that human variation is much more extensive than previously thought. These data also expose several shortcomings of current microarray tools and alter the view of some basic tenets of the allelic variance of the human genome.
While genotyping arrays have been a main tool in genetic studies for almost two decades, only now can we observe that current arrays will do an imperfect job measuring variation in randomized populations due to the greater than anticipated extent of probe-affecting variation. The regions near known SNPs have been assumed to be largely free of indels, SNPs, and SVs, and also biallelic, but all of these assumptions are incorrect for a large fraction of probes used in most genotyping platforms. The largest percentage of the probes is affected by structural variants, which cover a substantial percentage of the genome, ranging from insertions and deletions to large-scale tandem repeats and copy number variants. When a probe matches one of these regions, the actual location that is interrogated in the genome is ambiguous. In individuals lacking the SV, the reference location is interrogated, while in individuals with the SV, the variant location is interrogated. This inconsistency will produce variable results between individuals. We acknowledge that neither 1KGP data nor arrays are perfect and therefore the exact list of probes that are found to be potentially problematic will always be a moving target. Nevertheless, the overall counts and distribution of problematic probes would be highly similar if the 1KGP data were error-free.
As the number of questionable probes increases with the number of SNPs identified, it is doubtful that a “perfect” microarray or set of “population-specific” microarrays could be constructed based on the variation found in 1KGP, because only a small fraction of the variability of the human genome will be found. Even when the 1KGP is done, it will only have sampled 2,500 people, representing a very small percentage of people on the planet. Hence, sequencing will likely always be a more robust method to assess known and unknown variation in genomes.
Similar to shortcomings in large-scale microarray-based genotyping, functional predictions based on simple categorization of variants as synonymous or non-synonymous are limited. When multiple SNPs within a gene or an exon exist, the variants can act together
Other assumptions about human genetics, developed through previous scans of genomic variation, are also being re-evaluated as we collect more data. The average LD length, often used to tag SNPs and impute variation, decreases in size with increasing numbers of variants. As LD length distributions are Poisson, the number of blocks that are shorter than the average increases significantly with increasing numbers of variants. Similarly, bioinformatics programs and analyses that use assumptions based on HapMap
Clearly, data collection and analysis are integrally connected. This work demonstrates that the best approaches for assessing global variation in the human population at both the data collection and analysis phases are at an early stage. Methods are still being developed and the best way to make global measurements is under debate. As evidenced by the 1KGP and CG data sets, each set contains similar numbers of SNPs (1KGP = 25,488,488, CG = 19,154,014), yet the two sets are clearly different in the numbers of genomes represented and average base coverage per genome. Hence, it is worthwhile considering the economic costs and benefits of the deep individual or shallow population approaches. The 1KGP approach, using low coverage sequencing on many genomes, identified more SNPs and is on course to identify many more SNPs when the expected 2,500 genomes from a broad spectrum of populations are completed. The high coverage CG approach on a few genomes has produced fewer SNPs but more detailed information for each genome. For example, thousands of polyallelic SNPs were identified within the CG data set, but were almost completely missed by 1KGP. One reason for this is that the low coverage per-genome of the 1KGP dataset requires pooled genotyping, which tends to bias against rare or singleton variations. Additional factors for this disparity include annotation techniques. The 1KGP variant calling pipeline discards tri- and tetraallelic variants as errors, because alleles are either assumed to be biallelic or simply do not occur with high enough frequency in pooled data.
The current human population is over seven billion and growing. New mutations have been accumulating for over 5000 generations at the rate of between one and 100 mutations per generation
Moreover, in applications like tumor profiling, gene expression, or other functional genomics assays, a single reference sequence can be problematic. For a cancer genome, the best reference genome to which tumor data should be aligned is a matched normal genome of the patient. This is the only way to be confident that driver mutations or rearrangements are novel in the tumor and not present in normal cells. In the case of RNA assays, cDNA reads should be mapped to the genome of the sample from which the RNA was isolated. Some regions with known high variability, like the MHC, already have alternative assemblies because a single reference sequence causes too many mapping biases. Other quantitative assays (RNA-Seq, small-RNA, ChIP-Seq, etc.) likely suffer from similar issues within individual samples but have not been systematically studied due to the high cost of creating individualized genome sequences
Nonetheless, a reference genome sequence is clearly needed for research. Without a point of reference and a common coordinate, or naming, system, research and clinical assay results cannot be reported in ways that allow for inter-lab comparisons and independent validation of research results. There are many important questions yet to be addressed as to how best to approach developing a universal reference sequence and establish best practices for using it. Addressing population and individual variability in a universal reference requires that we think about the genome, not as a single sequence, but rather as a union of differences. A basic coordinate system needs to be developed that can accommodate any indels and rearrangements, and analytical tools need to assume higher levels of differences than they do now. To begin addressing these issues, we need to have a much greater number of de novo assembled genomes from both evolutionarily distant and closely related individuals and improved methods for variant calling. Fortunately, much work is ongoing on both fronts
The combined pilot data from the 1000 Genomes Project:
Release 23 of HapMap containing variant calls for 90 CEU individuals based on human reference assembly hg18.
The Complete Genomics data consisted of the 69 publicly available genomes that were released in April 2011 and were downloaded from:
The list of GWAS studies from the UCSC browser
The 1KGP release includes data from 629 individuals and includes the variants identified by two of the four pipelines utilized by the 1KGP. Because of the variability between different software packages and a concern for false positives, the results of the four pipelines were merged to create a file including any call made by at least two of the pipelines. Further explanation of this process can be found at:
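The merging rule described above, keeping any call made by at least two pipelines, can be sketched as follows. This is an illustration of the rule, not the 1KGP implementation; the variant key format `(chromosome, position, alt_allele)` and the function name are assumptions.

```python
from collections import Counter


def consensus_calls(pipelines, min_support=2):
    """Return variants reported by at least `min_support` pipelines.

    pipelines: iterable of call sets, one per pipeline. Each call is a
        hashable key, here (chromosome, position, alt_allele); the key
        format is an assumption made for this sketch.
    """
    support = Counter()
    for calls in pipelines:
        # set() ensures a pipeline counts once per variant even if it
        # lists the same call twice.
        support.update(set(calls))
    return {variant for variant, n in support.items() if n >= min_support}


# Toy input: four pipelines with partially overlapping calls.
toy_pipelines = [
    [("20", 100, "A"), ("20", 200, "T")],
    [("20", 100, "A"), ("20", 300, "G")],
    [("20", 200, "T"), ("20", 200, "T")],
    [("20", 400, "C")],
]
print(sorted(consensus_calls(toy_pipelines)))
```

Raising `min_support` trades sensitivity for specificity: variants seen by a single pipeline, which include both false positives and genuinely rare calls, are dropped.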
For the CG data, only called SNP genotypes were used, and no-call loci were ignored. All sequence data were aligned to hg19. The RefSeq genes release 37.1 was used for the determination of coding regions and the Complete Genomics annotations were used to identify non-synonymous changes.
For the comparison of SNPs between the CG and 1KGP datasets, we took all of the hg19 coordinates for each genome, and then included a base-pair +1 and −1 for that location. This addition allowed for the potential single-base slippage
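The position matching with a one-base tolerance can be sketched as below. This is a minimal illustration under the assumption that both call sets are reduced to `(chromosome, position)` pairs on hg19; the function name and the `slack` parameter are hypothetical.

```python
def shared_positions(calls_a, calls_b, slack=1):
    """Return positions in calls_a with a call in calls_b within `slack` bp.

    calls_a, calls_b: iterables of (chromosome, position) tuples in the
        same coordinate system (hg19 in the text).
    slack: tolerance in base pairs on either side; 1 matches the +1/-1
        window described above.
    """
    lookup = set(calls_b)
    shared = set()
    for chrom, pos in calls_a:
        # Check the position itself and each neighbour within the window.
        if any((chrom, pos + d) in lookup for d in range(-slack, slack + 1)):
            shared.add((chrom, pos))
    return shared


set_a = [("20", 100), ("20", 200), ("20", 300)]
set_b = [("20", 101), ("20", 205), ("21", 300)]
print(shared_positions(set_a, set_b))
```

With `slack=0` the comparison reduces to an exact set intersection, which makes it easy to measure how many matches depend on the tolerance window.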
The Linkage Disequilibrium (LD) analysis was performed with a combination of PLINK
For the gene-specific analysis of BRCA1 and JAK2, the HapMap data were converted from hg18 to hg19 using the LiftMap tool (
The determination of problematic microarray probes was made by querying the 1KGP annotation files against the reference files for the arrays provided by the vendors. The counts of variants within exons and genes were determined using the consensus coding sequence CCDS
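The probe query can be pictured as an interval lookup: for each probe, ask whether any known variant other than the targeted SNP falls inside the probe's span. The sketch below is one possible way to do this with a sorted position list and binary search; it is not the code used in this study, and the probe record layout `(probe_id, start, end, target_pos)` is a hypothetical simplification of the vendor reference files.

```python
from bisect import bisect_left, bisect_right


def probes_with_unprobed_variants(probes, variant_positions):
    """Return ids of probes whose span contains a variant besides the target.

    probes: iterable of (probe_id, start, end, target_pos) with 1-based
        inclusive coordinates on a single chromosome (hypothetical layout).
    variant_positions: iterable of variant positions on that chromosome.
    """
    positions = sorted(set(variant_positions))
    affected = []
    for probe_id, start, end, target in probes:
        # Count variants with start <= position <= end.
        hits = bisect_right(positions, end) - bisect_left(positions, start)
        # Do not count the probed SNP itself if it appears in the list.
        if start <= target <= end:
            i = bisect_left(positions, target)
            if i < len(positions) and positions[i] == target:
                hits -= 1
        if hits > 0:
            affected.append(probe_id)
    return affected


toy_probes = [
    ("p1", 100, 150, 125),  # has a neighbouring variant at 130
    ("p2", 200, 250, 225),  # only its own target SNP is present
    ("p3", 300, 350, 350),  # no variants in range
]
toy_variants = [125, 130, 225, 400]
print(probes_with_unprobed_variants(toy_probes, toy_variants))
```

Indels and structural variants span intervals rather than single positions, so a fuller version would test interval overlap instead of point containment.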