On the importance of plotting; or — Psych. Science will publish anything

Edit (05/11/15): The article has been retracted.

Psychological Science recently published a paper — Sadness Impairs Color Perception [1] — which has been causing a bit of a ruckus on the statistics blogosphere. Psych Science already has a reputation for poor quality control, and this newest controversy centers on a claim of an effect with very low plausibility, supported by weak evidence and questionable statistical practices.

The paper describes two similar experiments, though we are primarily interested in the second. In this experiment, participants viewed one of two videos: either the famous death scene from The Lion King, or a desktop screensaver — call these the sad and neutral conditions, respectively. Afterwards, participants performed a color discrimination task requiring them to classify colors along either the red-green or the blue-yellow axis (this was within-subjects). The dependent measures are the accuracies in each of the color conditions (red-green and blue-yellow). As a validation, participants rated their emotional reaction to the video on an eight-point Likert scale. The authors report evidence of impaired blue-yellow discrimination in the sad condition compared to the neutral condition (with no corresponding effect on red-green discrimination). They’ve made some of the data publicly available (not the raw data, unfortunately; only subject means), so I decided to take a look at it myself.

Below is a reproduction of a plot found on page 4 of the paper, showing the mean accuracy in each condition, and for each color axis. Error bars are standard errors.
[Figure: reproduction of the authors’ bar plot]
The authors’ conclusions hinge on a pair of t-tests of the difference between the sad and neutral conditions. They report a significant difference in accuracy along the blue/yellow axis (t(128) = 2.05, p = .043), but not along the red/green axis (t(128) = 0.87, p = .38). I can replicate both p-values, so no mistakes there, at least in the computations.
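For the record, a df of 128 with two independent groups implies 130 participants and pooled-variance t-tests. A sketch of the comparison in R, with hypothetical vectors by_acc and rg_acc holding the blue/yellow and red/green accuracies and cond holding the video condition:

    # Pooled-variance two-sample t-tests (df = n1 + n2 - 2 = 128)
    t.test(by_acc[cond == "Sad"], by_acc[cond == "Neutral"], var.equal = TRUE)
    t.test(rg_acc[cond == "Sad"], rg_acc[cond == "Neutral"], var.equal = TRUE)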

A major problem is that the authors claim to have found an interaction between video condition and color axis, but they never actually tested that interaction; they just ran a pair of separate t-tests and got different results, and the difference between “significant” and “not significant” is not itself a significant difference. Had we done a different pair of tests, we would have come to a different conclusion. For example, let’s test the null hypotheses that there is no difference between red/green and blue/yellow accuracy in each of the video conditions. We find no difference in the neutral condition (t(128) = 0.26, p = 0.79) or in the sad condition (t(128) = 1.09, p = 0.28). So…now both of the dark gray bars are the same, and both of the light gray bars are the same. Where did the difference go? If the authors were to run an ANOVA with color axis and video condition as factors, they would conclude that there was no interaction:

                              Df Sum Sq Mean Sq F value Pr(>F)  
Color_Axis                     1  0.017 0.01745   0.806 0.3703  
Emotion_Condition              1  0.079 0.07893   3.644 0.0574 .
Color_Axis:Emotion_Condition   1  0.005 0.00527   0.243 0.6224  
Residuals                    256  5.545 0.02166                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that this analysis is not strictly correct, since color axis is actually a within-subjects variable; but since the authors neglect that fact when comparing their (non-independent) t-tests, I’ll neglect it here for consistency.
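Both ANOVAs are one-liners in R. A sketch, assuming a long-format data frame d with one row per participant per color axis (the column names are my own):

    # Between-subjects ANOVA, matching the table above
    summary(aov(Accuracy ~ Color_Axis * Emotion_Condition, data = d))

    # The stricter version, treating color axis as a repeated measure
    summary(aov(Accuracy ~ Color_Axis * Emotion_Condition +
                  Error(Subject / Color_Axis), data = d))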

I’m really only interested in the purported effect of sadness on blue/yellow discrimination, so from now on I’ll ignore the red/green axis entirely. I’m a fan of displaying raw data whenever possible, so let’s start by taking a look at the participant accuracy scores:
[Figure: density plots of the raw blue/yellow accuracy scores, by condition]
So…we’ve failed the interocular trauma test, at least. Already, we can see that the bar plots provided in the paper are misleading, especially the error bars. The data are extremely non-normal, and the observed mean difference seems to be driven by some scattered observations in the tails. Had the authors used a non-parametric test instead (say, a Wilcoxon rank-sum test), they would have found a non-significant difference (W = 1723.5, p = 0.06).
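That check is another one-liner, reusing the hypothetical vectors from the t-test sketch above:

    # Rank-based test; no normality assumption required
    wilcox.test(by_acc[cond == "Sad"], by_acc[cond == "Neutral"])
    # W = 1723.5, p-value = 0.06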

A more promising approach, I think, would use the participants’ own ratings of their response to the video, rather than condition membership. I divided participants into ‘High’ and ‘Low’ sadness levels based on whether their self-reported response to the video was above or below the midpoint of the Likert scale. Incidentally, a response of 8 (made by several participants) indicates that the sadness induced by the video was “the most you have ever felt in your life”, so the stimulus is apparently very compelling. The results are plotted below:
[Figure: density plots of blue/yellow accuracy, by self-reported sadness level]
Which doesn’t look so promising. Indeed, there appears to be no correlation at all between reported sadness level and accuracy:
[Figure: scatter plot of accuracy against self-reported sadness rating]
So it appears that their validation measure isn’t so great, even if there are group differences.
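For completeness, the split and the correlation check in R, again with hypothetical column names (Rating is the self-reported sadness score):

    # Split at the midpoint of the 1-8 Likert scale
    d$Sad_Level <- ifelse(d$Rating > 4.5, "High", "Low")

    # Rank-based correlation between reported sadness and accuracy
    cor.test(d$Rating, d$Accuracy, method = "spearman")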

Modeling sadness and color perception

I wanted to fit a full model of the data, rather than comparing hypothesis tests, so I coded up a quick model in Stan. The data are very non-normal, so the first step was to transform the accuracy scores as in a van der Waerden test. Basically, letting X = (x_1, \dots, x_N) be our vector of accuracy scores, we define

    \[ y_i = \Phi^{-1}\left( \frac{R(x_i)}{N+1} \right) \]

where R(x_i) is the rank of x_i (as in a Wilcoxon rank-sum test). The result is plotted below:
[Figure: density plot of the transformed accuracy scores]
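In R, the transform is a one-liner; a sketch, with x the vector of blue/yellow accuracy scores:

    # van der Waerden (normal scores) transform
    y <- qnorm(rank(x) / (length(x) + 1))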
The transformed data look normal enough for us to start doing some proper modelling. I settled on a simple linear model in Stan, with the color-axis effect varying as a random effect at the participant level (since it is measured within subjects).

Before we begin, a short note on the priors. It’s actually perfectly plausible that emotional state impacts task performance in general, so there’s no reason to disbelieve in a main effect of condition. It’s also plausible that there is a difference in accuracy between the red/green and blue/yellow color axes, since some colors may be more difficult to distinguish than others. The controversy is over the interaction: does “feeling blue” change the way participants see blue? If I were to construct a good prior, my first step would be to browse some previous work on perceptual disorders affecting color vision. Surely, if mood affects blue/yellow discrimination, the effect size must be much smaller than that of a proper perceptual disorder, so we could construct an informative prior that way. But since I don’t want to be accused of unfairly biasing the results against the authors’ conclusions, I’ll settle for something only weakly informative. So, letting i index the participant and j the color axis, the model is as follows:

    \begin{align*}
    y_{ij} &= \alpha + \beta_1\text{Cond}_i + \beta_{2j}\text{Color}_{ij} + \beta_3\text{Cond}_i\text{Color}_{ij} + \epsilon \\
    \alpha &\sim \mathrm{N}(0,1) \\
    \beta_1 &\sim \mathrm{N}(0,1) \\
    \beta_{2j} &\sim \mathrm{N}(\mu_\text{color}, \tau_\text{color}) \\
    \mu_\text{color} &\sim \mathrm{N}(0,1) \\
    \beta_3 &\sim \mathrm{N}(0,1) \\
    \tau^2_\text{color} &\sim \mathrm{Cauchy}(0,1) \\
    \epsilon^2 &\sim \mathrm{Cauchy}(0,1)
    \end{align*}

where \text{Cond} and \text{Color} are indicator variables defined as

    \begin{align*}
    \text{Cond} &: \{0 = \text{Neutral},\ 1 = \text{Sad}\} \\
    \text{Color} &: \{0 = \text{Red/Green},\ 1 = \text{Blue/Yellow}\}
    \end{align*}
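This isn’t the original code, but a minimal rstan sketch of the model above; the data layout (two rows per participant, one per color axis) and all names are my own assumptions:

    library(rstan)

    model_code <- "
    data {
      int<lower=1> N;                // observations: 2 per participant
      int<lower=1> P;                // participants
      int<lower=1, upper=P> id[N];   // participant index
      vector[N] cond;                // 0 = neutral, 1 = sad
      vector[N] color;               // 0 = red/green, 1 = blue/yellow
      vector[N] y;                   // normal-scores accuracy
    }
    parameters {
      real alpha;
      real beta1;
      vector[P] beta2;               // per-participant color effect
      real mu_color;
      real<lower=0> tau_color;
      real beta3;                    // the disputed interaction
      real<lower=0> sigma;
    }
    model {
      alpha     ~ normal(0, 1);
      beta1     ~ normal(0, 1);
      mu_color  ~ normal(0, 1);
      tau_color ~ cauchy(0, 1);      // half-Cauchy via the lower bound
      beta3     ~ normal(0, 1);
      sigma     ~ cauchy(0, 1);      // half-Cauchy via the lower bound
      beta2     ~ normal(mu_color, tau_color);
      y ~ normal(alpha + beta1 * cond + beta2[id] .* color
                 + beta3 * (cond .* color), sigma);
    }
    "

    fit <- stan(model_code = model_code,
                data = list(N = N, P = P, id = id,
                            cond = cond, color = color, y = y))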

The resulting posteriors are plotted below:
[Figure: posterior densities of the model coefficients]
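Plots and summaries of this kind come straight out of rstan, given the fit object sketched above:

    rstan::stan_dens(fit, pars = c("beta1", "beta3"))  # posterior densities
    post <- rstan::extract(fit)
    mean(post$beta3 < 0)  # posterior probability that the interaction is negative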
So, we have weak evidence of a general task effect of sadness, but no evidence whatsoever of the blue-specific effect claimed in the paper. Nothing to see here — move along.

References

[1] Thorstenson, C. A., Pazda, A. D., & Elliot, A. J. (2015). Sadness Impairs Color Perception. Psychological Science, 0956797615597672.