Reader Comments


The original findings did not provide a good scientific argument

Posted by gfrancis on 01 May 2013 at 19:28 GMT

The comment by Dijksterhuis and the reply by Shanks and Newell provide a good demonstration of the difficulties that are faced by replication efforts. As noted in the comments, some of the replication studies by Shanks et al. might be relatively low-powered. With low-powered studies it is always plausible that results fail to appear due to random sampling. On the other hand, even with low power it is surprising that so many studies should fail to find a statistically significant result if the effect were true. Note that these are statistical concerns that do not directly consider the issues related to experimental methods; those issues have to be addressed by subject matter experts.
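To illustrate that second point with hypothetical numbers (the actual power of each replication attempt depends on its own design and sample size): if each of k independent studies has power p for a true effect, the chance that every one of them fails to reach significance is (1-p)^k, which becomes small quickly. A one-line check in R, with purely illustrative values:

p <- 0.5      # assumed power of each replication attempt (hypothetical)
k <- 9        # assumed number of replication attempts (hypothetical)
(1 - p)^k     # probability that all k studies miss a true effect; here about 0.002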

Interestingly, the original findings by Dijksterhuis and van Knippenberg (1998), hereafter DK, also had relatively low power. Given the power of their experiments, it is surprising that they were uniformly successful in producing results that matched the theoretical conclusions. With random sampling, one would not expect such uniformly good results.

Consider Experiment 1 in DK. Subjects were split into three groups (no prime, professor prime, secretary prime). The mean proportions of correct answers to a questionnaire are given in the text, and the pooled standard deviation (12.78) can be computed from those means and the F value given for the main effect. Three results were described as important for the theoretical conclusion of DK: a main effect (power=0.84), a contrast between professor and secretary primes (power=0.89), and a contrast between professor and no primes (power=0.65). A third contrast between secretary and no primes was not relevant to the theory; also additional exploratory analyses are not considered here. These power values were all estimated from 100,000 simulated experiments with the means and standard deviation reported by DK (this approach was used for all of the power analyses reported here). The three effects of interest are correlated, but the power of a single experiment to produce all three effects can be estimated with the simulated experiments. The power is 0.61.
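For readers who want to reproduce this kind of calculation, a minimal R sketch of the simulation idea is below. The per-group sample sizes and group means are placeholders to be filled in from the DK text (only the pooled standard deviation, 12.78, is quoted above), and the contrasts are run here as simple t-tests rather than the pooled-error contrasts DK may have used. The same approach, with the appropriate means and pooled standard deviation, carries over to Experiments 2 and 3.

# Sketch of the simulated power calculation for Experiment 1 of DK
# (sample sizes and means are placeholders; substitute the values from the paper)
set.seed(1)
nsim <- 10000                                  # 100,000 was used for the reported values
n   <- c(none = 20, prof = 20, sec = 20)       # per-group sample sizes (placeholders)
mu  <- c(none = 50, prof = 60, sec = 46)       # group means, % correct (placeholders)
sdp <- 12.78                                   # pooled SD computed from the reported F
hits <- replicate(nsim, {
  g <- factor(rep(names(n), n))
  y <- rnorm(sum(n), mean = rep(mu, n), sd = sdp)
  pF  <- anova(lm(y ~ g))$"Pr(>F)"[1]                     # main effect
  pPS <- t.test(y[g == "prof"], y[g == "sec"])$p.value    # professor vs secretary
  pPN <- t.test(y[g == "prof"], y[g == "none"])$p.value   # professor vs no prime
  c(pF < .05, pPS < .05, pPN < .05, pF < .05 & pPS < .05 & pPN < .05)
})
rowMeans(hits)   # estimated power for each effect and for all three jointly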

Experiment 2 in DK has a similar design, but dropped the secretary prime and varied the duration of the professor prime. The mean proportions correct are given in the text, and the pooled standard deviation (10.44) can be computed from those means and the F value given for the main effect. Three results were important for the theoretical conclusion of DK: a main effect (power=0.96), a contrast between 9 minute and no primes (power=0.98), and a contrast between 9 minute and 2 minute primes (power=0.54). The power of a single experiment to produce all three effects can be estimated with the simulated experiments. The power is 0.54.

Experiment 3 in DK also has a similar design, but replaced the professor prime with a hooligan prime, which should reduce rather than increase scores. The mean proportions correct are given in the text, and the pooled standard deviation (10.08) can be computed from those means and the F value given for the main effect. Three results were important for the theoretical conclusion of DK: a main effect (power=0.84), a contrast between 9 minute and no primes (power=0.89), and a contrast between 9 minute and 2 minute primes (power=0.58). The power of a single experiment to produce all three effects can be estimated with the simulated experiments. The power is 0.55.

Experiment 4 in DK had a 2x2 between-subjects design that varied the type of prime (intelligent or stupid) and whether the prime was provided by a stereotype or a trait target. The mean proportions correct are given in the text, and the pooled standard deviation (12.87) can be computed from those means and an F value given in the text. Two results were judged by DK to be important. First, there was a significant effect for the type of prime (power=0.76). Second, there was not a significant difference between the stereotype and trait targets. It is difficult, but not uncommon, to use a nonsignificant result as support for a theoretical conclusion. For such a use, the meaningful probability reflects how often random samples in an experiment like this would not reject the null hypothesis. That probability is 0.57 (which is consistent with the original finding just missing traditional statistical significance, p=0.082). The probability that an experiment like this produces both the significant and the nonsignificant result is 0.43.
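The same simulation idea applies here, with the extra wrinkle that one of the two target results is a nonsignificant effect. A sketch under the same caveats (placeholder cell means and sample sizes, and the exact ANOVA DK reported may differ):

# Sketch for Experiment 4: 2x2 between-subjects design
set.seed(2)
nsim <- 10000
n   <- 15                                     # per-cell sample size (placeholder)
mu  <- c(55, 54, 45, 46)                      # cell means, % correct (placeholders)
sdp <- 12.87                                  # pooled SD computed from the reported F
res <- replicate(nsim, {
  prime  <- factor(rep(c("smart", "smart", "stupid", "stupid"), each = n))
  target <- factor(rep(c("stereotype", "trait", "stereotype", "trait"), each = n))
  y <- rnorm(4 * n, mean = rep(mu, each = n), sd = sdp)
  p <- anova(lm(y ~ prime * target))$"Pr(>F)"
  c(p[1] < .05, p[2] >= .05, p[1] < .05 & p[2] >= .05)
})
rowMeans(res)  # P(prime effect significant), P(target effect nonsignificant), P(both)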

The four experiments are statistically independent, so the probability of experiments like these all producing results that are consistent with the theory is the product of the power values. This value (0.077) is small enough that readers should be skeptical that the experiments in DK were run properly, analyzed properly, and reported fully. Given the uncertainty that should exist in random samples for these experiments, the reported results appear too good to be true. This kind of power analysis is very conservative (Francis, 2013), so such a low product of power values is very unlikely to occur by chance.
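The arithmetic of the combined probability is simply the product of the four values estimated above:

prod(c(0.61, 0.54, 0.55, 0.43))   # about 0.078, matching the 0.077 quoted above
                                  # (the small difference is rounding of the power values)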

This analysis does not reveal what kind of factors produced the too-successful experiments in DK. Likely candidates include researcher degrees of freedom that are sometimes described as questionable research practices or p-hacking. There may be a file drawer of non-significant experiments, so that what is published in DK is four out of eight (or more) experiments. Some of the experiments may have been run with an optional-stopping sampling approach, in which the data are analyzed as they are gathered and sampling stops as soon as the desired results appear; such an approach dramatically increases the Type I error rate. The theoretical conclusion may have been the result of hypothesizing after the results are known (HARKing), which would be contrary to the description in DK of theoretical predictions being validated by the experiments.

Whatever factors influenced the experimental results in DK, readers should be skeptical about the validity of the combined experiments and theoretical conclusion. In general, a few low-powered experiments cannot provide convincing evidence for any theoretical conclusion because there is so much uncertainty in the empirical measurements. Being skeptical does not mean readers have to reject the notion of unconscious priming, but it does mean that the arguments in DK are not scientifically convincing. Being skeptical also does not mean levying a charge of misconduct toward DK. Many scientists are unaware of how seemingly small choices in sampling, data analysis, and theorizing can make for an unconvincing scientific argument. In particular, successful replications only strengthen an argument if the experiments have high power. This observation also undermines the argument, raised in Dijksterhuis' comment, that many other labs have successfully replicated these kinds of experiments. If those successful replications have power similar to the experiments in DK, then the combined set of results becomes even more unbelievable. The argument in DK would be more convincing if the report included several experiments that did not significantly match the theoretical ideas.
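To see why further low-powered successes weaken rather than strengthen the case, note that each additional uniformly successful experiment multiplies the joint probability by its own power. With hypothetical additional replications of power 0.6 each:

0.077 * 0.6^(1:5)   # joint probability after 1 to 5 more uniformly successful
                    # low-powered replications (0.6 is an assumed, illustrative power)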

I don't know whether unconscious priming influences behavior, but the original findings in DK did not provide a good scientific argument that it does.

----
A marked up copy of DK that indicates which terms were used for the above analysis, an Excel file with calculations of the standard deviations, and R files for the power analysis simulations can be downloaded from

http://www2.psych.purdue....

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology. http://dx.doi.org/10.1016...

No competing interests declared.