Factor analysis is (probably) better with shrinkage

Extending on my previous most-reported-summary-statistics-for-factor-analysis-favor-overfitting comments, which focused on the common summary statistics used to evaluate factor models, I thought it might be interesting to tackle the other side of the problem, which is that covariance estimates tend to be highly noisy and unreliable.

Given noisy data — the kind usually collected in the test batteries used by psychologists — it can take several hundred observations just to get a precise estimate of a single correlation. The covariance structure of the entire dataset is, obviously, out of the question. The result is that most empirical covariance matrices contain 1) spurious correlation structure, and 2) overestimates of the true (co)variances. The obvious solution to both of these problems is to shrink the estimated covariance matrix towards some simpler matrix, and perform factor analysis on the shrunken matrix.

The general idea behind shrinkage estimation is to maximize the fit to the data while simultaneously penalizing the distance between the estimate and some null value. In, say, ridge regression, this means shrinking the regression coefficients towards zero, so that given a data vector \mathbf{y}, a design matrix \mathbf{X}, and some parameters \beta to be estimated, our cost function is

    \[          f(\beta; \mathbf{y},\mathbf{X}) =                 \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda \|\beta\|^2.     \]

In other words, we want to minimize the squared error, but we also want the parameters to be as small as possible. In this case, \lambda controlles how much emphasis we place on the magnitude of the parameter estimates. Under appropriate assumptions, the ridge estimates can be shown to equal posterior modes under independent normal priors on the coefficients, where \lambda is related to the precision of the prior. In this sense, ridge regression is a “natural” default choice for regression, since non-informative priors are almost never desirable anyway.

The rationale behind shrinkage estimation is based on the bias-variance decomposition of the mean-squared error (MSE). Essentially, given an estimate \hat{\theta} of a parameter \theta, the mean squared error \mathrm{E}[(\theta - \hat{\theta})^2] can be decomposed as

    \begin{align*}           \mathrm{E}[(\theta - \hat{\theta})^2]                    &= (\mathrm{E}[\hat{\theta}] - \theta)^2 + \mathrm{Var}(\hat{\theta}) \\                   &= \text{Bias}^2 + \text{Variance}      \end{align*}

The least-squares regression estimates, for example, are unbiased, but their variance can be very large, leading to high MSE. Shrinkage estimators introduce a small amount of bias in order to (potentially) strongly reduce the variance of an estimator, resulting in a net decrease in MSE.

As an interesting sidenote, consider the estimation of a normal variance from a set of observations \mathbf{x} = (x_1,x_2,\dots,x_n), and set

    \begin{align*}           s^2_n &= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n} \\           s^2_{n-1} &= \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}      \end{align*}

Tranditionally, s^2_n is called the sample variance and s^2_{n-1} is called the population variance. This terminology is kind of silly, and is used because s^2_n (which is actually the maximum likelihood estimate) is a biased estimate of the population variance, whereas s^2_{n-1} is unbiased. But it turns out that that s^2_n is actually the better estimator in a MSE sense, though it’s possible to do even better by using n+1 in the denominator.

Anyway, back to shrinking covariance matrices. Regression coefficients can be sensibly shrunk to zero, but what constitutes a good null value for a covariance (or correlation) matrix? For factor analysis, this really depends on what you choose as your “null model”. A null model which specifies zero correlations between variables predicts a diagonal covariance structure, and so shrinking towards a diagonal matrix seems sensible. Depending on whether or not we want to shrink the variances themselves, or just the covariance structure, we could either shrink the covariance matrix \Sigma itself towards the identity (which shrinks the variances as well), or write

    \[           \Sigma = \mathrm{diag}(\sigma) \cdot \mathbf{S} \cdot \mathrm{diag}(\sigma)      \]

where \mathbf{S} is the correlation matrix and \mathrm{diag}(\sigma) is a matrix with the standard deviations on the diagonals, and then shrink \mathbf{S} towards the identity, which would shrink \Sigma towards a diagonal matrix while preserving the variances.

Alternately, we can specify a unidimensional null model, which corresponds to a correlation matrix in which all off diagonal entries are roughly similar. We can accomplish this by factorizing the covariance as above, and then shrinking the off diagonal entries in \mathbf{S} towards each other (in this case, towards their median). This is the approach I’ve chosen.

The final objective function looks something like this:

    \[\text{Estimate $\Sigma$ that}: \text{fits data well} + \text{is more unidimensional}\]

To test this, I used the same dataset as in my previous post, taken from the WISC-III. I compared the following models, roughly clockwise from the top left in order of complexity:

I compared several fit indices using various levels of shrinkage. The cross-validation estimate is the root mean squared error of the difference in off diagonal entries between the model predicted correlation matrix and the empirical (unshrunken) correlation matrix of the holdout set. CV error was estimated using 20 rounds of monte-carlo cross-validation using 20% of the data as a test set. CV error is computed relative to zero shrinkage. The results, for increasing levels of shrinkage (\lambda), are show below. Dashed lines indicate optimal shrinkage values according to cross-validation.

The results are pretty sensible. All models benefit from some level of shrinkage (as measured by CV error), with more complex models needing more. Conventional fit measures (CFI, RMSEA, and PGFI) improve with increasing shrinkage, which is to be expected since the shrunken correlation matrices have a simpler structure, and so are easier for the models to explain.

I found last time that the Weschler and Unidimensional models were by far the best, predictively speaking, whereas the Horn and Kaufman models (preferred by the CFI) tended to overfit. This is pretty much borne out by results here, which show that both models need substantial shrinkage of the covariance matrix before achieving optimal predictive performance, suggesting that the models were modeling noise in the covariance estimates.