Reader Comments

Reproduction probabilities are not relevant

Posted by StephenSenn on 13 Mar 2013 at 20:58 GMT

As the second of their limitations of P-values, Tressoldi et al give the fact that the P-value is likely to be quite different if an experiment is repeated. This is not a reasonable or sensible criticism. If a P-value is a statement about anything, it is a description of the compatibility of an experimental result with a null hypothesis. If any statement about a future experiment is of interest, the one that trumps all others is that about a future experiment of infinite size, since this would deliver a definitive statement about the truth of the hypothesis in question. Tressoldi et al are writing about experiments of the same size.

In a note in Statistics in Medicine (1) I pointed out that:
"It would be absurd if our inferences about the world, having just completed a clinical trial, were necessarily dependent on assuming the following:
1. We are now going to repeat this experiment.
2. We are going to repeat it only once.
3. It must be exactly the same size as the experiment we have just run.
4. The inferential meaning of the experiment we have just run is the extent to which it predicts this second experiment." (p2439)

There are many reasons to mistrust P-values but the repetition probability is not one of them.

Reference
1. Senn SJ. A comment on replication, p-values and evidence S.N.Goodman, Statistics in Medicine 1992; 11:875-879. Statistics in Medicine 2002; 21: 2437-2444.


Competing interests declared: I maintain a full declaration of all my interests here http://www.senns.demon.co.uk/Declaration_Interest.htm

RE: Reproduction probabilities are not relevant

gdcumming replied to StephenSenn on 14 Mar 2013 at 11:23 GMT

Thanks Stephen. I disagree. Replication is central in science, so I suggest it is reasonable to consider what information a statistical analysis gives about a replication--for example an idealised replication, just the same except with a new random sample of the same size. If statistical analysis technique A gives better information about such a replication than statistical analysis technique B, then that's a tick for A.

Considering p values, the extent to which p varies with replication is extremely large--probably surprisingly so to many researchers. Under reasonable assumptions, if an experiment gives two-tailed p = .05, then the 80% prediction interval (the 'p interval') for the one-tailed p in a replication experiment is (.00008, .44). Colloquially, just about any p value at all. It seems weird to me that we calculate p to 2 or 3 decimal places, and agonise over whether it is a whisker under or over .05, when it could so easily have been just about any other value.
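For readers who want to check the p interval quoted above, here is a minimal sketch under stated assumptions: a z test with known standard deviation, a replication with the same N, and no prior information about the true effect, so that the replication z differs from the observed z by a normal deviate with variance 2. The function name `p_interval` is purely illustrative and is not taken from the cited paper or software.

```python
from math import sqrt
from statistics import NormalDist


def p_interval(p_two_tailed, coverage=0.80):
    """Prediction interval for the one-tailed p of a same-size replication.

    Assumes a z test with known sigma and that z_rep - z_obs ~ N(0, 2).
    """
    std = NormalDist()
    # Observed z that corresponds to the two-tailed p value.
    z_obs = std.inv_cdf(1 - p_two_tailed / 2)
    # Half-width of the central `coverage` interval for the replication z.
    half = std.inv_cdf(0.5 + coverage / 2) * sqrt(2)
    z_lo, z_hi = z_obs - half, z_obs + half
    # A larger replication z means a smaller one-tailed p, so the ends swap.
    return 1 - std.cdf(z_hi), 1 - std.cdf(z_lo)


low, high = p_interval(0.05)
print(f"80% p interval: ({low:.5f}, {high:.2f})")
```

Under these assumptions the result should agree closely with the interval quoted in the comment; different assumptions (a t test, or an informative prior) would give somewhat different limits.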

I analyse and picture the distribution of p, and calculation of p intervals, and also argue that confidence intervals give much better information about replication than p values do, in:

<my name--the profanity filter won't allow me to put it here!>. (2008). Replication and p Intervals. p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better. Perspectives on Psychological Science, 3, 286-300.

There is also a simulation (The dance of the p values) at: www.tiny.cc/dancepvals
(Or go to YouTube and search for 'Dance p 3 Mar 09')

There is more in Chap 5 of my book, info (and the software for the dance) at: www.thenewstatistics.com

Geoff

No competing interests declared.

RE: RE: Reproduction probabilities are not relevant

StephenSenn replied to gdcumming on 14 Mar 2013 at 16:11 GMT

Thanks, Geoff, but I consider you are wrong to disagree. You are basically saying that the truthfulness of a witness is not the extent to which the witness's statements correspond to the truth but the extent to which the witness agrees with another equally reliable witness. This is a double disagreement standard and this is also not the standard used for confidence intervals. The confidence interval conventionally used is not a predictive interval for a future sample mean for a sample of the same size; it is a confidence interval for the true mean, that is to say the mean from a sample of infinite size. From a Bayesian perspective, to confuse the two is to confuse a posterior interval with a predictive interval: both have their value but the posterior interval is the evidential one.
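The distinction between the two intervals can be made concrete with a small sketch. It assumes a normal model with known standard deviation `sigma` and a flat prior, in which case the interval for the mean of a future sample of size `m` has half-width z * sigma * sqrt(1/n + 1/m). The function name and the numbers used (mean 10, sigma 5, n = 25) are hypothetical and chosen only for illustration.

```python
from math import inf, sqrt
from statistics import NormalDist


def interval_for_future_mean(xbar, sigma, n, m, level=0.95):
    """Interval for the mean of a future sample of size m, given a mean
    xbar observed on n values (known sigma, flat prior assumed)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # 1/n reflects uncertainty about the true mean; 1/m is the sampling
    # noise of the future sample, which vanishes as m grows.
    half = z * sigma * sqrt(1 / n + 1 / m)
    return xbar - half, xbar + half


for m in (25, 100, 10_000, inf):
    lower, upper = interval_for_future_mean(xbar=10.0, sigma=5.0, n=25, m=m)
    print(f"m = {m}: ({lower:.3f}, {upper:.3f})")
```

With `m` equal to `n` the interval is wider than the conventional confidence interval by a factor of sqrt(2); as `m` tends to infinity it shrinks to the conventional interval for the true mean.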

Another point I made in the paper I cited is the following. Suppose that, having got a P-value that corresponded to significance at a moderate level, you could pretty well guarantee that the next time you carried out a similar test the result would also be significant. Then in that case you would already have effectively the evidence from two studies without having performed the second one (and so on ad infinitum), a very unsatisfactory state of affairs. People sometimes make this mistake when looking at meta-analyses. The P-value is 0.06 and the confidence interval just includes zero and they then conclude that if only another study were added the result would be significant. This is to misunderstand both what P-values and confidence intervals say about uncertainty.

You will also find these issues discussed in chapter 12 of my book Statistical Issues in Drug Development (2nd edition), Wiley, 2007.

For detailed discussion of the distribution of P-values under null and various alternative hypotheses see
Senn, SJ, P-Values, in Encyclopedia of Biopharmaceutical Statistics, Chow, S. C., Ed., Marcel Dekker; 2003 pp. 685-695.

Competing interests declared: I maintain a full declaration of all my interests here http://www.senns.demon.co.uk/Declaration_Interest.htm

RE: RE: RE: Reproduction probabilities are not relevant

gdcumming replied to StephenSenn on 16 Mar 2013 at 09:40 GMT

Thanks again Stephen, and for that 2003 paper of yours, also 2002 in Statistics in Medicine. All very interesting. You make clear your strong reservations about p values, although not agreeing with some of my reasons for criticizing p.

After we have defined a 95% CI, correctly, in terms of a notionally infinite sequence of replications, 95% of which capture the population parameter, how should we think about the one CI we calculate from our data? We can re-state that definition, but that's of little practical help. We have little option but to interpret our single interval, and that's pragmatically reasonable if our interval is likely to be typical of the whole sequence--which in most cases it is (an exception: very small samples). Reasonable, even tho' we are in effect considering a frequentist CI as a Bayesian credible interval.

I think that we should leaven this pragmatic interpretation by always bearing in mind the infinite sequence (the dance of the CIs) and the fact that our interval 'might be red' (i.e. be one of the 5% of the dance that miss). A further useful way to think about our interval in terms of the whole dance is to consider it as a prediction interval--a 95% CI is, on average, an 83% prediction interval for the next replication mean. (Two refs below on that.) If we scan through the dance, in the long run 83% of intervals include the next sample mean. (Or do the same for every second result, to achieve independence; same answer.)
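The 83% figure can be checked with a short sketch, assuming normally distributed sample means and a known standard deviation (the published analyses use intervals based on an estimated standard deviation, for which the figure holds on average). The function names, the seed and the simulation settings below are illustrative assumptions only.

```python
import random
from math import sqrt
from statistics import NormalDist


def capture_analytic(level=0.95):
    """Chance that a CI contains the next same-size sample mean (known sigma)."""
    std = NormalDist()
    z = std.inv_cdf(0.5 + level / 2)
    # The difference of two independent sample means has variance 2 * se**2,
    # so the CI half-width z * se is z / sqrt(2) standard deviations of it.
    return 2 * std.cdf(z / sqrt(2)) - 1


def capture_simulated(n_reps=20000, mu=0.0, sigma=1.0, n=20, level=0.95, seed=1):
    """Scan a simulated sequence and count CIs that include the next mean."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)
    half = NormalDist().inv_cdf(0.5 + level / 2) * se
    # Draw the sample means directly: each is N(mu, se).
    means = [rng.gauss(mu, se) for _ in range(n_reps + 1)]
    hits = sum(abs(means[i + 1] - means[i]) <= half for i in range(n_reps))
    return hits / n_reps


analytic = capture_analytic()
simulated = capture_simulated()
print(f"analytic: {analytic:.3f}, simulated: {simulated:.3f}")
```

The analytic value comes to roughly 0.83, and the simulated proportion should land close to it, up to simulation noise.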

For those reasons I suggest it is useful to consider replications with the same N, as in the dance. (One of your specific points.)

Yes, as you say (another of your specific points), our CI does not give us any extra information about the next result--our CI represents all that our data can tell us about the parameter. Even so, it is useful as one approach to thinking about our CI to think about the whole sequence, and the extent to which our CI gives us some information about the whole sequence--noting that of course that information is probabilistic. Informally, our CI gives us a reasonable idea of how 'wide' the dance is, how much bouncing around there is in the sequence. CI width is our indicator of precision of estimation.

I suggest taking that line of thinking across to p values, and thinking of our p as one from the infinite dance. p intervals are extremely wide, so our single p gives us very little idea of the dance of the p values. An additional reason for considering replications with the same N is that we can consider the p interval as indicating (probabilistically, by specifying an interval) what other p values we could easily have obtained in our experiment, if the sampling variability happened to have fallen differently. In other words, not predicting the next experiment, but considering what result we may have obtained instead. Answer: pretty much any p value at all (in lots of typical situations).

I conclude that thinking about the dances gives us one strong reason for preferring CIs over p values. (Even tho' CIs and p share underlying theory and we can translate between them, given basic info like N, and our mean.)

Geoff

<my name>, G., Williams, J., & Fidler, F. (2004). Replication, and researchers’ understanding of confidence intervals and standard error bars. Understanding Statistics, 3, 299–311.
<my name>, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11, 217–227.

No competing interests declared.

RE: RE: RE: RE: Reproduction probabilities are not relevant

StephenSenn replied to gdcumming on 16 Mar 2013 at 22:04 GMT

Frequentist or Bayesian, there's nothing special about talking about another experiment the same size as the one you just ran. If this were important, which it's not, I showed in my Stats in Medicine letter that if the replication probability were a problem for P-values it would also be a problem for Bayesian statements. In fact it's not important. You may have chosen a rather inadequate sample size but will still need to know what the evidence you have says about the world and not what it says about some other equally inadequate experiment.

Imagine the following:
Politician: So your calculations give me information about my degree of support in the population?
Statistician: No. It's far more important for you to know what you would find in another randomly chosen sample of 500.

Competing interests declared: I maintain a full declaration of all my interests here http://www.senns.demon.co.uk/Declaration_Interest.htm