Research Article

Identifying Selected Regions from Heterozygosity and Divergence Using a Light-Coverage Genomic Dataset from Two Human Populations

  • Taras K. Oleksyk,

    *E-mail: oleksyk@ncifcrf.gov

    Affiliations: Laboratory of Genomic Diversity, National Cancer Institute at Frederick, Frederick, Maryland, United States of America, Basic Research Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, Maryland, United States of America

  • Kai Zhao,

    Affiliation: Laboratory of Genomic Diversity, National Cancer Institute at Frederick, Frederick, Maryland, United States of America

  • Francisco M. De La Vega,

    Affiliation: Applied Biosystems, Foster City, California, United States of America

  • Dennis A. Gilbert,

    Affiliation: Applied Biosystems, Foster City, California, United States of America

  • Stephen J. O'Brien,

    Affiliation: Laboratory of Genomic Diversity, National Cancer Institute at Frederick, Frederick, Maryland, United States of America

  • Michael W. Smith

    Affiliations: Laboratory of Genomic Diversity, National Cancer Institute at Frederick, Frederick, Maryland, United States of America, Basic Research Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, Maryland, United States of America

  • Published: March 05, 2008
  • DOI: 10.1371/journal.pone.0001712

Reader Comments (16)


Equations of MSP and MSG

Posted by donthu on 17 Mar 2008 at 19:22 GMT

Congratulations on the wonderful work done in applying different measures to identify selected regions. I would like an explanation of the terms used in the equations for calculating MSP and MSG, which are used in the Fst equation.

In the equation for MSP, what does the term pbar A mean? In the equation for MSG there is a term n1, and I am not sure what it refers to. Please clarify.

Thank you,
Kiran



RE: Equations of MSP and MSG

oleksyk replied to donthu on 18 Mar 2008 at 17:43 GMT

Thank you for the kind words. The MSP and MSG terms come from the article by Akey et al. (Interrogating a high-density SNP map for signatures of natural selection, Genome Res. 12 (2002), pp. 1805–1814), which in turn draws on the paper by Weir and Cockerham (Estimating F-statistics for the analysis of population structure, Evolution 38: 1358-1370).

Let me explain what each term means in the context of the two samples I worked with, so that it is easier to follow if you want to apply it to your own example:

MSP is the observed mean square error for loci between populations:
MSP = count of European alleles * (frequency of European allele - (frequency of European allele + frequency of African allele)/2) squared + count of African alleles * (frequency of African allele - (frequency of African allele + frequency of European allele)/2) squared

MSG is the observed mean square error for loci within populations:
MSG = 1/(count of European alleles + count of African alleles - 2) * ((count of European alleles * frequency of European allele * (1 - frequency of European allele)) + (count of African alleles * frequency of African allele * (1 - frequency of African allele)))

nc is an average sample size that also incorporates the variance in sample sizes across the populations:
nc = (count of European alleles + count of African alleles) - (((count of European alleles) squared + (count of African alleles) squared) / (count of European alleles + count of African alleles))

Fst is then calculated as:
Fst=(MSP-MSG)/(MSP+(nc-1)*MSG)
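
For readers who prefer symbols, the verbal definitions above can be written compactly as follows. The symbols are shorthand introduced here for illustration and are not necessarily the notation of the paper or of Akey et al.: n_E and n_A are the counts of European and African alleles, p_E and p_A are the frequencies of the same allele in the two samples, and \bar{p} is the simple average of those two frequencies.

```latex
\begin{align*}
\bar{p} &= \frac{p_E + p_A}{2} \\
\mathrm{MSP} &= n_E\,(p_E - \bar{p})^2 + n_A\,(p_A - \bar{p})^2 \\
\mathrm{MSG} &= \frac{1}{n_E + n_A - 2}\,\bigl[\, n_E\,p_E\,(1 - p_E) + n_A\,p_A\,(1 - p_A) \,\bigr] \\
n_c &= (n_E + n_A) - \frac{n_E^2 + n_A^2}{n_E + n_A} \\
F_{ST} &= \frac{\mathrm{MSP} - \mathrm{MSG}}{\mathrm{MSP} + (n_c - 1)\,\mathrm{MSG}}
\end{align*}
```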

This is a point estimate of Fst at each SNP. Note, however, that this estimate can be negative; negative values are usually set to zero. In the above example, the allele frequency is taken to be the frequency of the allele that is the major allele in Europeans.

P.S. The above equations can be coded in the SAS DATA step language as follows:
/* nc: average sample size, adjusted for the variance in sample sizes */
nc=(cs_count+aa_count)-(((cs_count)**2+(aa_count)**2)/(cs_count+aa_count) );
/* msp: mean square between populations */
msp=cs_count*(csfreq-(csfreq+aafreq)/2)**2 + aa_count*(aafreq-(csfreq+aafreq)/2)**2;
/* msg: mean square within populations */
msg=1/(cs_count+aa_count-2)*((cs_count*csfreq*(1-csfreq))+(aa_count*aafreq*(1-aafreq)));
/* fst is set to 0 when either mean square is 0 */
if msp=0 or msg=0 then fst=0; else fst=(msp-msg)/(msp+(nc-1)*msg);
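
For anyone working outside SAS, the same calculation can be sketched in Python. This is only an illustrative translation of the statements above, not code from the paper: the function name `fst_point_estimate` and its argument names are hypothetical, and the final step that sets negative estimates to zero follows the remark in the text and is not part of the SAS statements.

```python
def fst_point_estimate(n_eur, p_eur, n_afr, p_afr):
    """Point estimate of Fst at one SNP from two population samples.

    n_eur, n_afr: counts of European and African alleles (hypothetical names)
    p_eur, p_afr: frequency of the same allele in each sample
    """
    # Simple (unweighted) average of the two frequencies, as in the formulas above.
    p_bar = (p_eur + p_afr) / 2

    # Mean square between populations.
    msp = n_eur * (p_eur - p_bar) ** 2 + n_afr * (p_afr - p_bar) ** 2

    # Mean square within populations.
    msg = (n_eur * p_eur * (1 - p_eur) + n_afr * p_afr * (1 - p_afr)) / (
        n_eur + n_afr - 2
    )

    # Average sample size, adjusted for the variance in sample sizes.
    nc = (n_eur + n_afr) - (n_eur ** 2 + n_afr ** 2) / (n_eur + n_afr)

    # Mirror the SAS statements: Fst is 0 when either mean square is 0.
    if msp == 0 or msg == 0:
        return 0.0

    fst = (msp - msg) / (msp + (nc - 1) * msg)

    # Assumption taken from the text: negative estimates are set to zero.
    return max(fst, 0.0)


# Made-up example: 48 alleles per sample, frequencies 0.75 and 0.25.
print(round(fst_point_estimate(48, 0.75, 48, 0.25), 3))  # approximately 0.387
```

With nearly equal frequencies (for example 0.50 and 0.52 with 48 alleles each), the raw estimate is negative and the function returns 0.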