Conceived and designed the experiments: DS SY SA TH JE. Performed the experiments: DS. Analyzed the data: DS AM TH GP. Contributed reagents/materials/analysis tools: DS SY BK SA AM TH GP JE. Wrote the paper: DS SA TH.
Current address: Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
Current address: Division of Preventive Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States of America
The authors have declared that no competing interests exist.
The vast majority of genetic risk factors for complex diseases have, taken individually, a small effect on the end phenotype. Population-based association studies therefore need very large sample sizes to detect significant differences between affected and non-affected individuals. Including thousands of affected individuals in a study requires recruitment in numerous centers, possibly from different geographic regions. Unfortunately, such a recruitment strategy is likely to complicate the study design and to raise concerns about population stratification.
We analyzed 9,751 individuals representing three main ethnic groups - Europeans, Arabs and South Asians - that had been enrolled from 154 centers involving 52 countries for a global case/control study of acute myocardial infarction. All individuals were genotyped at 103 candidate genes using 1,536 SNPs selected with a tagging strategy that captures most of the genetic diversity in different populations. We show that relying solely on self-reported ethnicity is not sufficient to exclude population stratification and we present additional methods to identify and correct for stratification.
Our results highlight the importance of carefully addressing population stratification and of “cleaning” the sample prior to analysis in order to obtain stronger signals of association and to avoid spurious results.
Complex diseases result from the intricate interactions of multiple environmental and genetic factors. In most cases, common genetic risk factors explain, individually, only a small proportion of the variance of quantitative traits and show modest associations between affected and non-affected individuals. Currently, most association studies include several hundred cases and controls from one single population, but the sample sizes are out of necessity increasing as a result of the expected relatively modest associations. In addition, the recent release of detailed descriptions of genetic diversity in non-European populations, such as those provided by the International HapMap project
We analyzed individuals recruited for the INTERHEART study
Argentina | 100 | |||
Australia | 433 | 5 | ||
Bahrain | 45 | 21 | ||
Bangladesh | 414 | |||
Botswana | 13 | 3 | ||
Brazil | 44 | |||
Canada | 109 | 2 | ||
Chile | 4 | |||
Colombia | 2 | |||
Croatia | 481 | |||
Egypt | 1037 | 1 | ||
Hungary | 152 | |||
India | 358 | |||
Iran | 460 | |||
Italy | 1 | 303 | ||
Japan | 2 | |||
Kenya | 1 | |||
Kuwait | 669 | |||
Malaysia | 1 | 58 | ||
Mozambique | 4 | 12 | ||
Nepal | 316 | |||
Pakistan | 1 | 966 | ||
Philippines | 2 | |||
Poland | 1301 | |||
Qatar | 20 | 1 | 56 | |
Russia | 539 | |||
Singapore | 1 | 46 | ||
South Africa | 5 | 58 | ||
Spain | 141 | |||
Sri Lanka | 190 | |||
Sult. Oman | 241 | |||
Sweden | 3 | 571 | 1 | |
Thailand | 2 | |||
U.S.A. | 53 | |||
UAE | 83 | 15 | 387 | |
Zimbabwe | 13 | 4 |
In addition, we genotyped the same SNPs in 1,062 individuals from the HGDP-CEPH Human Genome Diversity Cell Line Panel
Candidate genes were selected according to previous reports of association with MI or with one of the nine modifiable risk factors associated with MI
We retrieved the chromosome coordinates of each selected gene according to its refSeq annotation and included 10 kb of upstream and downstream DNA sequence to capture possible
In addition, we included all coding non-synonymous SNPs with a MAF larger than 5% (109 cSNPs, including 54 non-tSNPs) as well as SNPs that have been shown in the literature to be directly associated with MI, lipid metabolism or one of the other intermediate phenotypes relevant for the study of MI (145 SNPs, including 81 non-tSNPs). The final list of SNPs genotyped is shown in Supplemental
1,536 SNPs were genotyped using Illumina's GoldenGate technology based on allele-specific primer extension followed by highly multiplex PCR using universal primers
For each SNP, we determined whether two individuals from the same population sample shared 0, 1 or 2 allele(s) and averaged the allele sharing over all genotyped SNPs. We then compared the proportion of shared alleles for every pair-wise comparison within one population sample to a normal distribution and displayed the results in a quantile-quantile (QQ) plot.
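The allele-sharing measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes genotypes are coded as the count of one reference allele (0, 1 or 2), that missing calls are stored as `None`, and that sharing is counted identical-by-state. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def shared_alleles(g1, g2):
    """Number of alleles (0, 1 or 2) shared identical-by-state at one SNP.

    Assumes each genotype is coded as the count of one reference allele."""
    return 2 - abs(g1 - g2)

def allele_sharing(ind_a, ind_b):
    """Proportion of shared alleles, averaged over SNPs typed in both samples."""
    typed = [(a, b) for a, b in zip(ind_a, ind_b)
             if a is not None and b is not None]
    if not typed:
        raise ValueError("no SNP genotyped in both individuals")
    return sum(shared_alleles(a, b) for a, b in typed) / (2 * len(typed))

def pairwise_sharing(genotypes):
    """Map every pair of sample IDs to its allele-sharing proportion."""
    return {(i, j): allele_sharing(genotypes[i], genotypes[j])
            for i, j in combinations(sorted(genotypes), 2)}

# Toy genotypes for three hypothetical samples at five SNPs.
genotypes = {
    "s1": [0, 1, 2, 2, None],
    "s2": [0, 1, 2, 2, 1],
    "s3": [2, 1, 0, 1, 1],
}
sharing = pairwise_sharing(genotypes)
for pair, value in sharing.items():
    print(pair, round(value, 3))
```

In this toy example, samples `s1` and `s2` share every allele at the SNPs typed in both, which is the pattern expected for duplicated or nearly identical samples.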
After excluding identical, or nearly identical, samples (i.e., more than 99% of alleles shared, N = 170), we randomly selected 88 individuals from pairs that shared more than 83% of their alleles. This value corresponds to the relatedness cut-off empirically estimated (see
To detect whether cases were significantly more related to each other than controls were to each other (or vice versa), we compared, in each population sample, the distribution of allele sharing among cases with the distribution of allele sharing among controls. We calculated allele sharing for all pairs of cases and for all pairs of controls and tested the difference between the means of the two distributions with a Welch two-sample t-test. We assessed the significance of the t-statistic by 300 permutations: for each population sample, we randomly assigned the individuals to two groups (i.e., regardless of disease status) and tested the difference between the means of the two distributions consisting of all possible pair-wise comparisons within each group. To evaluate the power of these analyses, we used unrelated individuals from the Saguenay-Lac St-Jean region (SLSJ, Quebec, Canada) that have been genotyped at the same SNPs
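The permutation procedure can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the permuted groups keep the original case and control group sizes (the text does not specify this), the empirical p-value is two-sided, and the toy allele-sharing values are simulated, with case pairs deliberately made slightly more similar. All names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t-statistic (does not assume equal variances)."""
    return (mean(x) - mean(y)) / (variance(x) / len(x) + variance(y) / len(y)) ** 0.5

def within_group_sharing(ids, sharing):
    """All pair-wise allele-sharing values between members of one group."""
    return [sharing[frozenset(pair)] for pair in combinations(ids, 2)]

def permutation_test(cases, controls, sharing, n_perm=300, seed=0):
    """Observed t-statistic, permutation null distribution and two-sided p-value."""
    rng = random.Random(seed)
    observed = welch_t(within_group_sharing(cases, sharing),
                       within_group_sharing(controls, sharing))
    pooled = list(cases) + list(controls)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # assign individuals regardless of disease status
        g1, g2 = pooled[:len(cases)], pooled[len(cases):]
        null.append(welch_t(within_group_sharing(g1, sharing),
                            within_group_sharing(g2, sharing)))
    p_value = sum(abs(t) >= abs(observed) for t in null) / n_perm
    return observed, null, p_value

# Simulated toy data: six cases and six controls (hypothetical).
rng = random.Random(1)
cases = [f"case{i}" for i in range(6)]
controls = [f"ctrl{i}" for i in range(6)]
sharing = {}
for a, b in combinations(cases + controls, 2):
    value = rng.uniform(0.70, 0.75)
    if a in cases and b in cases:
        value += 0.05  # cases made more related to each other on purpose
    sharing[frozenset((a, b))] = value

observed, null, p_value = permutation_test(cases, controls, sharing)
print(f"t = {observed:.2f}, permutation p = {p_value:.3f}")
```

Permuting individuals rather than pairs matters here, because pair-wise values that involve the same individual are not independent and a standard t-distribution would not give a valid reference.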
To estimate population stratification at a gross level, we used the program STRUCTURE
We generated second generation population samples by first removing problematic samples and centers. We excluded: 1) one randomly chosen individual from each pair of related individuals (N = 131), 2) all individuals that were clustered by STRUCTURE among sub-Saharan Africans or East Asians (N = 104), and 3) all individuals from two centers that showed a very high proportion of problematic samples (including more than 10% discrepancies between reported and genetically-inferred sex, N = 719). In addition, all Nepalese and Iranian individuals were removed from, respectively, the South Asian and the Arab population samples (N = 776). Supplemental
We tested, separately in each population sample, the association between genotypes and ApoB levels in blood for each SNP by an analysis of variance (ANOVA). We used sex, age and waist circumference as covariates in these analyses and excluded individuals with diabetes (defined as self-reported diabetes, on medication pre-admission for diabetes, oral hypoglycemics, insulin or with HbA1c>7%) or on pre-admission medication for lowering cholesterol or blood pressure (inclusion of diabetic individuals led to the same strong associations with ApoB, data not shown). For some of the analyses, we also included as covariates the recruitment center and the coefficients of ancestry inferred by STRUCTURE for each individual (using the results obtained by analyzing all individuals from the three population samples together). To estimate whether multiple significant associations from the same region were independent or simply due to LD, we tested the associations hierarchically by successively including the genotypes of stronger associations as covariates.
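As a simplified illustration of the per-SNP test, the sketch below computes a one-way ANOVA F statistic for a quantitative trait across the three genotype classes. It deliberately omits the covariates used in the study (sex, age, waist circumference, recruitment center, ancestry coefficients), which would require a linear model, and it stops at the F statistic; a p-value would come from the F distribution with the returned degrees of freedom, for example with a statistics library. The genotype coding (0, 1, 2), function names and trait values are hypothetical.

```python
from statistics import mean

def genotype_groups(genotypes, phenotype):
    """Split trait values by genotype class (0, 1 or 2 copies of an allele)."""
    groups = {0: [], 1: [], 2: []}
    for g, y in zip(genotypes, phenotype):
        if g is not None and y is not None:  # skip missing data
            groups[g].append(y)
    return [groups[g] for g in (0, 1, 2)]

def one_way_anova_f(groups):
    """F statistic and degrees of freedom of a one-way ANOVA (no covariates)."""
    groups = [g for g in groups if g]  # ignore empty genotype classes
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data: nine hypothetical individuals, trait in arbitrary units.
snp = [0, 0, 0, 1, 1, 1, 2, 2, 2]
trait = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0]
f_stat, df_between, df_within = one_way_anova_f(genotype_groups(snp, trait))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The hierarchical analysis described above would correspond to refitting such a model with the genotypes of the strongest associated SNP added as a covariate, which this covariate-free sketch does not attempt.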
To estimate whether the datasets made of individuals of the same self-reported ethnicity were roughly genetically homogeneous, we calculated in each population sample the proportion of shared alleles between every pair of individuals. If individuals are sampled randomly from a homogeneous random-mating population, we expect every individual to be, on average, equally distant genetically from everybody else (since information from many unlinked loci is summarized). We thus plotted the distribution of allele sharing for all pair-wise comparisons within each population sample against a normal distribution (see
The graph shows the QQ plot of the distribution of all pair-wise measures of allele sharing against a normal distribution (the red line displays the expectation). The green line corresponds to the empirical cut-off used to identify related individuals (an allele sharing larger than 83%). The deviation on the left-hand side of the graph (i.e., low allele sharing) corresponds to pairs of individuals originating from different sub-populations.
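The QQ comparison and the relatedness cut-off can be sketched as follows. This is a minimal example under assumptions: the reference normal distribution is fitted with the sample mean and standard deviation of the observed values, and plotting positions of (i - 0.5)/n are used; neither choice is stated in the text. Only the 83% cut-off comes from the study; the names and the toy values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq(values):
    """Pairs of (theoretical, observed) quantiles against a fitted normal."""
    observed = sorted(values)
    n = len(observed)
    fitted = NormalDist(mean(observed), stdev(observed))
    # Plotting positions (i - 0.5) / n avoid probabilities of exactly 0 or 1.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

def flag_related(sharing, cutoff=0.83):
    """Pairs whose allele sharing exceeds the empirical relatedness cut-off."""
    return [pair for pair, value in sharing.items() if value > cutoff]

# Toy pair-wise allele-sharing values for four hypothetical samples.
sharing = {
    ("a", "b"): 0.71, ("a", "c"): 0.72, ("a", "d"): 0.70,
    ("b", "c"): 0.73, ("b", "d"): 0.69, ("c", "d"): 0.91,
}
qq_points = normal_qq(sharing.values())
related = flag_related(sharing)
print(related)
```

Points that fall well above the fitted line at the upper end of such a plot correspond to pairs that share more alleles than expected, and points below it at the lower end to pairs drawn from different sub-populations.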
The presence of closely related individuals can generate spurious results but is unlikely to strongly influence association studies unless they make up a large proportion of the dataset. On the other hand, the possibility that the cases are, on average, more closely related to each other than the controls are (or vice versa) is particularly worrying, since this difference in genealogy depth could potentially generate large numbers of false positives
“European” (blue dots), “Arabs” (green dots) and “South Asian” (pink dots) individuals are displayed according to their coefficients of ancestry in three populations (K = 3) as estimated by STRUCTURE using 127 SNPs. The coefficients of ancestry displayed separately for each population sample were inferred from a single analysis (i.e., all individuals combined) and are represented using the same axes. See also Supplemental
Based on the results of these analyses, we generated second generation datasets after exclusion of problematic samples. We randomly excluded one individual from each pair of related individuals, all individuals that were clustered by STRUCTURE among sub-Saharan Africans or East Asians and all Nepalese and Iranian individuals. In addition, we excluded all individuals from two centers that showed a very high proportion of problematic samples (including more than 10% discrepancies between reported and genetically-inferred sex). This reduced our sample sizes to 4,069 individuals in the European population sample (starting from 4,292), 2,450 in the South Asian sample (out of 2,900, including 316 Nepalese) and 1,399 individuals in the Arab sample (out of 2,559, including 460 Iranians).
We tested separately in each population sample the association between genotypes and Apolipoprotein B (ApoB) concentration (see
The plot shows the observed distribution of the p-values (y-axis) against the expectation under a model without any association (grey crosses and x-axis). The axes are in logarithmic scales. Red crosses correspond to the association between ApoB and the genotypes at one SNP without any correction. Blue crosses stand for the same tests using recruitment centers as additional covariates.
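The coordinates of such a plot can be computed as sketched below. This is an illustrative assumption about the construction, not the authors' script: under a model without any association the p-values are uniformly distributed, so the i-th smallest of n p-values is compared with i/(n + 1), and both axes are expressed as -log10(p). The function name and the toy p-values are hypothetical.

```python
import math

def pvalue_qq(pvalues):
    """Pairs of (expected, observed) -log10 p-values under no association."""
    observed = sorted(pvalues)
    n = len(observed)
    return [(-math.log10(rank / (n + 1)), -math.log10(p))
            for rank, p in enumerate(observed, start=1)]

# Toy p-values for four hypothetical SNPs.
points = pvalue_qq([0.5, 0.01, 0.2, 0.9])
for expected, obs in points:
    print(f"expected {expected:.2f}  observed {obs:.2f}")
```

Points lying above the diagonal indicate p-values smaller than expected by chance, which may reflect either true associations or residual stratification.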
In the European dataset (but not in the South-Asian and Arab datasets), the distribution of p-values shows a bump with a higher significance level for the SNPs with p<0.05 (74 SNPs) than we would expect by chance (Supplemental
To estimate the influence of stratification on the results obtained and the loss of power resulting from the reduction in sample size, we contrasted the results of the associations with ApoB concentration prior to and after “cleaning” in each dataset. In all population samples, we observe a reduction in the deviation of the p-value distribution from the expectation after removing outlier individuals and/or centers (i.e. in the second generation population samples). The changes are more dramatic in the Arab dataset than in the South-Asian and European datasets (Supplementary
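SOUTH ASIAN
Raw data (N = 2065), w/o centers | | | Raw data (N = 2065), w/ centers | | | Second generation population sample, w/o centers | | | Second generation population sample, w/ centers | | |
SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene |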
rs429358 | 2.03E-07 | APOE | rs429358 | 4.77E-08 | APOE | rs429358 | 1.90E-06 | APOE | rs429358 | 6.75E-07 | APOE |
rs6511720 | 2.92E-04 | LDLR | rs6511720 | 1.12E-04 | LDLR | rs6511720 | 9.52E-05 | LDLR | rs6511720 | 8.93E-05 | LDLR |
rs1713223 | 5.36E-04 | APOB | rs9650662 | 1.23E-03 | FDFT1 | rs1534863 | 4.04E-04 | FDFT1 | rs4762 | 4.22E-04 | AGT |
rs1534863 | 1.20E-03 | FDFT1 | rs1534863 | 1.35E-03 | FDFT1 | rs9650662 | 4.29E-04 | FDFT1 | rs1534863 | 4.47E-04 | FDFT1 |
rs9650662 | 1.25E-03 | FDFT1 | rs4762 | 2.67E-03 | AGT | rs4762 | 1.05E-03 | AGT | rs9650662 | 4.55E-04 | FDFT1 |
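ARAB
Raw data (N = 1527), w/o centers | | | Raw data (N = 1527), w/ centers | | | Second generation population sample, w/o centers | | | Second generation population sample, w/ centers | | |
SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene |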
rs7449650 | 2.07E-06 | LPA | rs429358 | 7.37E-06 | APOE | rs1328757 | 6.72E-05 | PCK1 | rs429358 | 3.26E-05 | APOE |
rs3931914 | 1.56E-05 | HMGCR''' | rs3758294 | 1.24E-04 | ABCA1 | rs429358 | 6.96E-05 | APOE | rs6511720 | 3.27E-04 | LDLR |
rs3758294 | 1.64E-05 | ABCA1 | rs5995472 | 5.23E-04 | LGALS2 | rs7449650 | 1.16E-04 | LPA | rs1328757 | 5.34E-04 | PCK1 |
rs429358 | 2.40E-05 | APOE | rs3827608 | 1.12E-03 | HMGCR''' | rs6511720 | 1.30E-04 | LDLR | rs865716 | 5.87E-04 | SCARB1 |
rs3827608 | 1.07E-04 | HMGCR''' | rs2291427 | 2.05E-03 | ALOX5 | rs8102912 | 2.55E-04 | LDLR | rs989892 | 6.97E-04 | SCARB1 |
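EUROPEAN
Raw data (N = 2595), w/o centers | | | Raw data (N = 2595), w/ centers | | | Second generation population sample, w/o centers | | | Second generation population sample, w/ centers | | |
SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene | SNP | p-value | Gene |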
rs429358 | 4.31E-10 | APOE | rs429358 | 1.51E-09 | APOE | rs429358 | 4.37E-10 | APOE | rs429358 | 1.79E-09 | APOE |
rs693 | 4.76E-05 | APOB | rs662799 | 2.66E-05 | APOA5 | rs693 | 9.50E-05 | APOB | rs662799 | 4.71E-05 | APOA5 |
rs2072560 | 3.10E-04 | APOA5 | rs2072560 | 3.33E-05 | APOA5 | rs2072560 | 1.65E-04 | APOA5 | rs2072560 | 5.66E-05 | APOA5 |
rs662799 | 5.40E-04 | APOA5 | rs651821 | 5.49E-05 | APOA5 | rs662799 | 2.34E-04 | APOA5 | rs651821 | 6.90E-05 | APOA5 |
rs7575840 | 6.69E-04 | APOB | rs693 | 5.87E-05 | APOB | rs651821 | 2.87E-04 | APOA5 | rs693 | 7.89E-05 | APOB |
see
locus also contains CTSB
locus also contains COL4A3BP
One of the main drawbacks of population-based association studies (in comparison to family-based association studies) is their susceptibility to population stratification
The INTERHEART study was originally designed as a “matched” case-control study but was unmatched in the analysis of nine modifiable risk factors
One of the most common arguments advanced to explain the lack of reproducibility in population-based association studies is the presence of undetected subpopulations in the sample, leading to spurious results (e.g.
Several studies have looked at the effect of stratification from a theoretical perspective and sometimes reached contradictory conclusions
Map showing the geographic origin of each INTERHEART individual analyzed in this study. Each pie graph shows if at least one individual with self-reported ethnicity defined as “European” (blue section), “South-Asian” (pink section) or “Arabs” (green section) has been recruited in the country (regardless of the number of individuals recruited, see Supplemental
(0.50 MB TIF)
The graphs show the distribution of individuals according to their coefficients of ancestry from each population (K = 3). The left panel corresponds to the assignments using 127 SNPs highly differentiated across populations, the right panel to the assignments using 133 randomly selected SNPs.
(0.20 MB TIF)
QQ plot of the distribution of pair-wise allele sharing among the South Asian (left panel) and Arab (right panel) individuals against a normal distribution.
(0.06 MB TIF)
Estimation of cryptic relatedness in Europeans. The graph displays the distribution of the t-statistic obtained in 300 tests of the difference in means between the distributions of allele sharing within two groups of randomly assigned individuals (Welch two-sample t-test). The red arrow shows the t-statistic obtained by testing the INTERHEART European cases vs. controls. The green arrow corresponds to the comparison of the distribution of pair-wise allele sharing among the Saguenay-Lac St-Jean (SLSJ) individuals vs. the allele sharing observed in Europeans from the INTERHEART study. The pink arrow shows the t-statistic obtained in the comparison of inter-sample allele sharing (i.e., one SLSJ individual compared to one European individual from INTERHEART) vs. the distribution of allele sharing in Europeans.
(0.07 MB TIF)
Effect of STRUCTURE on the distribution of the p-values for the associations between the genotypes and ApoB level in South-Asians. The plot shows the observed distribution of the p-values against the expectation under a model without any association (axes in logarithmic scales). Red crosses correspond to the association between ApoB and the genotypes at one SNP without any correction. Light blue crosses stand for the same tests using the coefficients of ancestry from STRUCTURE as additional covariates.
(0.08 MB TIF)
Distribution of the p-values for the associations between the genotypes and ApoB level in Europeans. Red crosses correspond to the non-corrected association between ApoB and the genotypes at one SNP. Blue crosses stand for the same tests after correcting for the signal of the five strongest associations (i.e., by conditioning the analyses on the genotypes at the five strongest associations).
(0.05 MB TIF)
Distribution of the p-values for the associations between the genotypes and ApoB level in raw and cleaned datasets. Each cross corresponds to the association between ApoB and the genotypes at one SNP, using the raw (x-axis) and the cleaned (y-axis) datasets. Green, pink and blue crosses stand for the tests in the Arab, South-Asian and European datasets, respectively.
(0.10 MB TIF)
Description of the SNPs included in this study.
(0.04 MB PDF)
Excluded samples
(0.01 MB PDF)
Outliers identified by STRUCTURE with substantial ancestry from South-East Asia or Sub-Saharan Africa
(0.01 MB PDF)
Tagging Efficiency
(0.42 MB DOC)
We thank Matthew McQueen and Michael Phillips for technical assistance and sample management, Fanny Chagnon for organizational guidance, Sumathy Rangarajan for helping with sample and questionnaire issues and the INTERHEART coordinators for sample collection.