Reader Comments

Selection of loci by stepwise regression is liable to generate spurious findings

Posted by deevybee on 20 Nov 2012 at 15:39 GMT

The bugbear of molecular genetics research is false positive findings. I’m concerned that this paper has not done enough to defend against them. The Mendelian randomisation approach is neat, but identification of significant associations relied on use of backwards stepwise regression, using Akaike’s information criterion (AIC) to select the best-fitting genetic model. This approach is notorious for yielding spurious findings. I wondered how serious a concern this was, and so I simulated a dataset with 10 genetic loci and one IQ measure, for 3000 participants (similar to the number of moderate drinkers in this sample). I used the <i>mvrnorm</i> command in R, as follows:
<i>mydata=mvrnorm(myN,mymeans,mycorr,empirical=FALSE)</i>

The variable <i>mycorr</i> specifies the correlation between variables, and I experimented with this, both using zero intercorrelation between all variables, and using correlations estimated from the HapMap published for this sample by Zuccolo et al (2009) (which, incidentally, are not entirely consistent with the linkages described in the current paper). This did not make much difference to the results; nor did it matter whether I worked with the random normal deviates generated by <i>mvrnorm</i>, or data transformed into allelic counts based on the minor allele frequencies reported in the paper. Regardless of the details of the simulation, the stepwise procedure using AIC was highly likely to select a best-fitting model that included one or more of the genetic loci as predictors – even though they were generated totally at random to be uncorrelated with IQ. Around 38% of runs selected a model with one locus, 27% selected a model with two loci, and 16% selected a model with three or more loci. I also experimented with creating a composite score, based on summing ‘risk loci’ identified in those models where two or more loci were selected, and taking into account the direction of the effect on IQ. These composites all showed robust association with the IQ measure, similar in magnitude to that reported by Lewis et al. The problem here is similar to ‘double dipping’ in neuroscience, where misleading findings result when an analysis focuses on a subset of variables selected on the basis of a prior statistical procedure with high measurement error (Kriegeskorte et al, 2010).
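For readers who want to try this, a minimal sketch of a simulation along these lines follows. It is not the script behind the percentages above: the number of runs, the seed, the variable names and the use of R's <i>step</i> function for backward selection are all assumptions, and only the zero-intercorrelation case is shown.

```r
# Sketch only: not the original script. All names and settings are hypothetical.
library(MASS)  # provides mvrnorm()

set.seed(1)          # arbitrary seed, for reproducibility
n_people <- 3000     # sample size quoted in the comment
n_loci   <- 10       # number of simulated loci
n_runs   <- 200      # assumed number of simulation runs

n_vars   <- n_loci + 1          # ten loci plus one IQ measure
my_means <- rep(0, n_vars)
my_corr  <- diag(n_vars)        # zero intercorrelation between all variables

count_selected <- function() {
  # Random normal deviates: the loci carry no information about IQ by construction
  sim <- as.data.frame(mvrnorm(n_people, my_means, my_corr, empirical = FALSE))
  names(sim) <- c(paste0("locus", seq_len(n_loci)), "iq")
  full <- lm(iq ~ ., data = sim)
  # Backward elimination by AIC (step() uses the penalty k = 2 by default)
  best <- step(full, direction = "backward", trace = 0)
  length(coef(best)) - 1        # predictors retained, excluding the intercept
}

n_selected <- replicate(n_runs, count_selected())
prop_any   <- mean(n_selected > 0)   # share of runs keeping at least one null locus

print(table(n_selected) / n_runs)
print(prop_any)
```

Under the usual large-sample approximation, backward elimination by AIC keeps a single null predictor whenever its likelihood-ratio statistic exceeds 2, which happens with probability of about 0.16 for a chi-square on one degree of freedom. Treating ten null loci as roughly independent gives 1 - 0.84^10, or about 0.82, as the chance that at least one survives, which is in line with the proportions reported above.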

One other question concerns the definition of moderate drinking, which covers a wide range of alcohol consumption, from less than one unit per week to just under one unit per day. I wondered why the researchers did not treat amount of drinking more quantitatively in their analysis, as this might have shown a dose-response relationship and given an indication of whether there is a safe level of alcohol consumption.

Drinking in pregnancy is a potentially avoidable risk factor for the foetus and many would argue that one should adopt a precautionary principle and advise all pregnant women against drinking. However, the downside of this approach, which I have seen in parents of children with developmental problems, is that mothers can be made to feel guilty about having had an occasional drink, and attribute their child’s subsequent problems to this factor. It is important that we get better evidence on just what risks there are, and the approach adopted here, taking genotype into account, is a potential advance over previous work. However, more confidence could be placed in the findings if robust statistical procedures had been adopted.

<b>Kriegeskorte, N., Lindquist, M. A., Nichols, T. E., Poldrack, R. A., & Vul, E.</b> (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. <i>Journal of Cerebral Blood Flow and Metabolism</i>, 30(9), 1551-1557. doi: 10.1038/jcbfm.2010.86
<b>Zuccolo, L., Fitz-Simon, N., Gray, R., Ring, S. M., Sayal, K., Smith, G. D., & Lewis, S. J.</b> (2009). A non-synonymous variant in ADH1B is strongly associated with prenatal alcohol use in a European sample of pregnant women. <i>Human Molecular Genetics</i>, 18(22), 4457-4466. doi: 10.1093/hmg/ddp388

No competing interests declared.

RE: Selection of loci by stepwise regression is liable to generate spurious findings

sjlewisbristol replied to deevybee on 20 Nov 2012 at 16:20 GMT

Thank you for your comment on our article. In light of your simulation I acknowledge that a stepwise regression may not have been the best way to select the variants used in our analysis. However, very little biological data is currently available on the interactions between genes in this pathway. As this is a biological pathway with a lot of redundancy and one which has been subject to heavy selection, we had a prior hypothesis that there would be epistatic interactions between the variants, and this was one way of selecting polymorphisms which were important.

Some further analysis of these variants, which for simplicity didn't make it into the final version of the paper, looked at the effect on IQ of haplotypes of all 10 of the polymorphisms we tested. This confirmed our findings from the stepwise regression that the four sites identified were the most important in terms of IQ and that there were complex interactions between the polymorphisms.

More importantly, the fact that we saw a strong interaction with mother's drinking, with no effect at all among children of non-drinking mothers, was confirmation that these alleles are acting via an alcohol metabolism pathway, rather than being spuriously associated with IQ.

In the paper we did not break down the analysis into smaller groups based on amount drunk, because we wanted to see if there was any effect of moderate drinking, which has not previously been shown, and subgroup analyses would have had reduced power to detect an effect. We have since re-run our analysis among the small group of women who reported drinking less than 1 unit throughout pregnancy and we found a similar effect to that which we reported in the paper, although admittedly some of these women may have under-reported their drinking during pregnancy.

No competing interests declared.

RE: Selection of loci by stepwise regression

tim_bates replied to sjlewisbristol on 20 Nov 2012 at 18:31 GMT


Nice to see Mendelian Randomization studies appearing for complex traits like IQ! Especially with alcohol metabolism for which Caucasians nearly all have high efficiency alleles :-)

Re the comment above: If the gene discovery process has delivered a set of 10 risk-SNPs for which the population has variance, then rather than test dozens of combinations of those SNPs with the attendant false discovery risk, I wonder whether a better analysis strategy would be to treat the SNPset design process as final.

Then you can simply score each individual's SNPset for a-priori metabolic exposure to alcohol, and proceed with the causally powerful part of the study: the Mendelian Randomization (i.e., infant IQ as a function of maternal drinking * maternal genetic metabolism risk).
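As a rough sketch of that analysis in R, using simulated null data only: the equal SNP weights, the allele frequencies, the binary drinking variable and all variable names are assumptions made for illustration, not taken from the paper.

```r
# Illustration only: simulated data, hypothetical names, equal SNP weights assumed.
set.seed(2)
n_people <- 3000
n_snps   <- 10
maf      <- rep(0.2, n_snps)   # assumed minor allele frequencies, not the paper's

# Allele counts (0, 1 or 2 copies of the minor allele) per person and SNP
geno <- sapply(maf, function(p) rbinom(n_people, size = 2, prob = p))

# A-priori score, fixed before looking at IQ: a simple unweighted allele count
risk_score <- rowSums(geno)

drinking <- rbinom(n_people, size = 1, prob = 0.5)  # 1 = drank in pregnancy (assumed coding)
iq       <- rnorm(n_people, mean = 100, sd = 15)    # null outcome, unrelated to the predictors

# drinking * risk_score expands to both main effects plus their interaction
fit <- lm(iq ~ drinking * risk_score)
print(summary(fit)$coefficients)
```

Because the score is defined before IQ is examined, the interaction term is tested once rather than across many SNP combinations.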

Be nice to have measured blood-alcohol in response to an alcohol challenge across the different haplotypes. Perhaps a sample like COGA has data suitable for that purpose.

No competing interests declared.

RE: RE: Selection of loci by stepwise regression

sjlewisbristol replied to tim_bates on 22 Nov 2012 at 09:47 GMT

Re: "Be nice to have measured blood-alcohol in response to an alcohol challenge across the different haplotypes"

It would be great to have this data, although we really need it in the fetus as there is a different expression profile of these genes in the fetus compared to adults.

No competing interests declared.