Reader Comments
A couple questions
Posted by dahinds on 24 Jan 2012 at 08:22 GMT
- the methods and Table 3 indicate that 3613 iControlDB controls were used, but the numbers used in each stage in Table 2 add to 3877. Can you confirm that no controls were used more than once?
- what criteria were used for excluding related samples? Were these applied to controls as well as cases, and were the NECS replication samples tested for relatedness with the NECS discovery samples?
- in the training/testing analysis of the risk model in the discovery data, did you repeat the entire GWAS and SNP selection for each of the 1000 training replicates, or did you reuse the set of (400?) SNPs selected from the full GWAS in all of the replicates?
Thanks for any clarifications!
RE: A couple questions
sebastiani replied to dahinds on 24 Jan 2012 at 16:16 GMT
Hi David
1) No control was used more than once across the discovery set and the two replication sets. Note that, in addition to the Illumina controls, we also used controls enrolled in the NECS, so the overall number of controls is larger than the number of controls from the Illumina iControlDB. These numbers are shown in Table 2.
2) We checked all subjects for relatedness as a step prior to the stratification analysis with Eigenstrat, and removed related individuals. In selecting unrelated subjects in the NECS, we chose the older of two siblings to be used as the case and the younger to be used as the control.
3) In the training/testing step we decided not to repeat the whole analysis, to speed up computations. The purpose of this step was really to decide which of the top associated SNPs were sufficient for prediction. This strategy seems to work fine in this particular application, but we are currently exploring improvements to the SNP selection strategy. We discuss this in the limitations section of the manuscript.
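For concreteness, here is a minimal sketch of the trade-off being discussed, under stated assumptions: pure-noise simulated genotypes (so true predictive accuracy is 50%), a deliberately crude score in place of the paper's actual risk model, and all function and variable names invented for illustration. It contrasts reselecting SNPs inside each train/test replicate against reusing one selection made on the full data:

import numpy as np

rng = np.random.default_rng(0)
n, p, k, reps = 200, 5000, 40, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)  # genotypes coded 0/1/2
y = rng.integers(0, 2, size=n)                     # random phenotype: no true signal

def select(Xs, ys, k):
    """Pick the k SNPs with the largest case/control mean genotype
    difference, plus the sign of that difference as a crude risk direction."""
    diff = Xs[ys == 1].mean(axis=0) - Xs[ys == 0].mean(axis=0)
    snps = np.argsort(np.abs(diff))[-k:]
    return snps, np.sign(diff[snps])

def mean_test_accuracy(reuse_full_selection):
    full = select(X, y, k)  # selection done once on ALL samples (the shortcut)
    accs = []
    for _ in range(reps):
        idx = rng.permutation(n)
        tr, te = idx[: n // 2], idx[n // 2:]
        snps, signs = full if reuse_full_selection else select(X[tr], y[tr], k)
        score_tr = (X[np.ix_(tr, snps)] * signs).sum(axis=1)
        thr = np.median(score_tr)
        score_te = (X[np.ix_(te, snps)] * signs).sum(axis=1)
        accs.append(((score_te > thr).astype(int) == y[te]).mean())
    return float(np.mean(accs))

print("SNPs reselected inside each replicate:", round(mean_test_accuracy(False), 3))  # typically ~ 0.50
print("SNPs selected once on the full data:  ", round(mean_test_accuracy(True), 3))   # typically well above 0.50

On data like this the first number typically sits near 0.50 while the second is clearly optimistic; redoing the selection inside every replicate is what removes the leak, which is the concern raised in the follow-up below.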
Thank you again for these questions!
Paola Sebastiani
RE: RE: A couple questions
dahinds replied to sebastiani on 24 Jan 2012 at 16:59 GMT
Thanks:
I do see that some controls came from the NECS, but I was looking specifically at the Illumina numbers. The text says that 3613 iControlDB controls were used, and the sum of row 3 of Table 3 is 3613. The text also says that 673 + 341 + 2863 iControlDB controls were used in the three stages, which agrees with Table 2, but those numbers sum to 3877.
RE: RE: RE: A couple questions
sebastiani replied to dahinds on 24 Jan 2012 at 23:26 GMT
Hi David
The difference is between the number of controls that were used for genetic matching (Table 3) and the overall number of controls used in the various analyses (Table 2). Not all controls were used for genetic matching.
Paola
RE: RE: A couple questions
dahinds replied to sebastiani on 25 Jan 2012 at 08:56 GMT
Hi Paola,
Given that you did not repeat the whole analysis in the training/test iterations, have you considered that this might result in overfitting, as discussed in your ref. 41?
- Dave
RE: RE: RE: A couple questions
sebastiani replied to dahinds on 25 Jan 2012 at 14:40 GMT
Hi Dave
I agree that some overfitting is inevitable. To try to limit this we chose, for example, to stop including SNPs when the sensitivity/specificity did not increase substantially. But the risk of overfitting remains, and that is why the overall accuracy of the model in the discovery set should not be given too much importance. The result in the two independent sets is what matters.
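As a purely illustrative sketch of such a stopping rule (simulated data; the 0.005 margin, the ranking statistic, and the toy score stand in for whatever the actual pipeline used), SNPs are added in the order of their training-set association and selection stops once sensitivity + specificity on a held-out half no longer improves substantially:

import numpy as np

rng = np.random.default_rng(1)
n, p, n_signal = 400, 200, 5
X = rng.integers(0, 3, (n, p)).astype(float)
risk = X[:, :n_signal].sum(axis=1)               # only the first 5 SNPs carry signal
y = (risk + rng.normal(0, 1.5, n) > np.median(risk)).astype(int)
tr, te = np.arange(0, n, 2), np.arange(1, n, 2)  # even/odd split into train and held-out halves

diff = X[tr][y[tr] == 1].mean(0) - X[tr][y[tr] == 0].mean(0)
ranked = np.argsort(-np.abs(diff))               # SNPs ranked by training-set association

def sens_plus_spec(snps):
    """Sensitivity + specificity on the held-out half for a crude signed-sum score."""
    score = (X[:, snps] * np.sign(diff[snps])).sum(axis=1)
    pred = (score > np.median(score[tr])).astype(int)
    sens = (pred[te][y[te] == 1] == 1).mean()
    spec = (pred[te][y[te] == 0] == 0).mean()
    return sens + spec

chosen, best = [], 0.0
for snp in ranked:
    val = sens_plus_spec(chosen + [int(snp)])
    if val - best <= 0.005:   # no substantial gain in sensitivity + specificity: stop
        break
    chosen, best = chosen + [int(snp)], val
print(f"kept {len(chosen)} SNPs; sens + spec on held-out half = {best:.3f}")

The margin (here 0.005) is what operationalizes "did not increase substantially"; a greedy rule like this caps model size, which limits, but as acknowledged above does not eliminate, overfitting from reusing the same ranking.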
Paola