In which I complain about factor analysis

Factor analysis, like almost all forms of data analysis, is just matrix factorization. Let \mathbf{Y}_{n,k} be a matrix containing n observations of k variables (or indicators). We suppose that the variables in \mathbf{Y} can be written as linear combinations of p < k unobserved (or latent) factors, so that

    \[      \mathbf{Y} = \mathbf{AX} + \mathbf{E} \]

where \mathbf{A}_{n,p} is a matrix of coefficients (factor loadings) and \mathbf{X}_{p,k} is a matrix of latent factors. We assume here that the columns of \mathbf{Y} are centered; otherwise we can simply include an intercept term in the model. The exact probabilistic model varies, but generally errors are assumed to be normal and uncorrelated across indicators. Depending on the research problem, a the covariance of the factors may be specified or estimated freely. The model as written is unidentified, and so it is necessary to impose some kind of constraint on the solution, such as prespecifying the values of certain loadings.

Psychologists are fond of distinguishing between so-called exploratory and confirmatory factor analysis (EFA and CFA), the former of which is generally used to discover relationships between observed variables, while the later is used to test specific models. The only real difference between the two is the number of constraints, so I won’t bother making any distinction.

As a digression, psychologists are very protective of EFA, and go to great lengths to distinguish it from principal component analysis (PCA), which they regard as boorish and somehow theoretically unjustified. In textbooks and lecture notes, it’s common to hear things like

In PCA, we get components that are outcomes built from linear combinations of the items. In EFA, we get factors that are thought to be the cause of the observed indicators

despite the fact that there is nothing in either model that actually suggests such a thing. In fact, PCA can easily be derived from a latent variable model by specifying

    \begin{align*}      \mathbf{y}_i &= \mathbf{Ax}_i + \mathbf{e}_i\\      \mathbf{e}_i &\sim \mathrm{N}(0,\sigma \mathbf{I}) \\      \mathbf{a}_i &\sim \mathrm{N}(0, \mathbf{I}) \end{align*}

The maximum likelihood solutions are then the principal components. There, done. A latent factor model giving rise to PCA. This is a special case of the more general maxim that all linear statistical models are just matrix factorizations with different constraints.

Psychometrics is fond of using factor models to test theories about the relationships between different cognitive processes. In a typical experiment, researchers will administer a variety of tasks thought to target specific cognitive abilities (e.g. working memory, attention, inhibition, etc), or to tap into different facets of personality (e.g. introversion/extroversion), and then test claims about the relationships between these facets by formalizing those relationships into a factor model by e.g. specifying that certain variables load on specific factors and not others, or specifying that certain factors are un/correlated. This is, naturally, fraught with researcher degrees of freedom, and is complicated by the fact that the methods commonly used to estimate them are finicky enough that model fitting usually involves a healthy dose of “fit then remove some data then fit then remove some data then fit then remove some data” until it works. Fear not — the researcher is not obliged to describe all of this messiness in the final publication.

The process of fitting and validating a factor model is difficult, and so I’ve developed a handy flowchart that will have you publishing in JPSP in no time!

Psychometrics is a bit of an odd child, in that it is often at the forefront of mathematical and statistical techniques in psychology (factor analysis is, in many ways, a creation of psychometrics), and yet it is painfully and offensively indifferent to the proper testing and application of those techniques. As an example, psychometrics is deeply concerned with reliable measurement (indeed, courses in psychometrics often devote in large sections to the study of reliability), and yet the overwhelming majority of research in top journals makes use of noisy measurements and small samples. They don’t care. Many psychologists build careers doing nothing but comparing factor models of insert subject matter here, and yet not one of them has discovered cross-validation. Models are compared using noisy statistics which correlate almost perfectly with model complexity. They don’t care.

As an experiment, I thought it would be interesting to take a dataset from a fairly well understood task and compare a set of models using conventional fit statistics, cross-validation, and a CV-proxy that I’ve been using for a related project.

Data, model, and methods

The data were taken from Tabachnick & Fidell’s Using Multivariate Statistics, and contain measurements taken from 175 adolescents on the Weschler Intelligence Scale for Children (WISC-III). This is very small sample, but it’s difficult to find publicly available datasets, since clinical psychologists generally don’t share their data (or their methods; if you want to even look at the contents of the latest version of the WISC, you have to pay Pearson. Ugh).

The observed variables in the WISC-III and their codes are:

  • I: Information
  • S: Similarity
  • A: Arithmetic
  • V: Vocabulary
  • C: Comprehension
  • DS: Digit Span
  • PC: Picture Completion
  • PA: Picture Arrangement
  • B: Block Design
  • O: Object Assembly
  • Cd: Coding

I consider the models described by Anderson and Dixon (1995), as well as second-order and bifactor variants. Path diagrams for the models are shown below. The second-order models are obtaining by specifying zero covariance for the first order factors, and allowing all factors to load on a second order factor g. The bifactors models are obtained by adding a general factor, onto which all observed variables load.

All models were estimated using maximum likelihood with residual factor variances constrained to unity, except for the Kaufman model, which required constraining the first loading to unity in order to converge. Ugh. Models are evaluated based on the measures described below. I was particularly interested in comparing cross validation (really, the only proper way of comparing these kinds of models) with the fit indices most commonly used in the literature.

Fit Indices

– The comparative fit index (CFI) is very commonly reported in the FA literature. The CFI describes the fit of the model in question relative to some null model, such as a model specifying zero covariance between observed variables. Letting d = \chi^2 - df, where df is the degrees of freedom of the model, the CFI is computed as

    \[ \text{cfi} = \frac{d_\text{null} - d_\text{model}}{d_\text{null}} \]

The CFI technically penalizes model complexity through the df terms, but it doesn’t really do this in any kind of principled way. Like almost every other commonly used fit index, the CFI is really only concerned with the fit of the model to the training data, making it essentially worthless for model selection, despite being widely used for this purpose. By convention, a model is considered to be “adequate” if the CFI exceeds a threshold of .95.

– The Root Mean Square Error of Approximation (RMSEA) is computed as

    \[ \text{rmsea} = \sqrt{\frac{\chi^2 - df}{df(N-1)}} \]

Like the CFI, the RMSEA is concerned only with the fit of the model to the training data, again making it worthless for model comparison. Also like the CFI, the RMSEA is generally compared to a fixed threshold of either .08 or .05.

– The Penalized Goodness-of-Fit Index is one of several attempts to fix the CFI’s (and other indices) tendency to overfit. It does this by appending a multiplicative penalty term to the fit index calculation corresponding to the ratio of null and alternative model degrees of freedom. Models with larger numbers of free parameters are thus penalized. Why anyone would go to this much trouble to avoid cross-validation, I don’t know, and the statistical properties of these penalties aren’t really very well understood. Other model comparison statistics, like AIC, have well known asymptotic interpretations that justify their use, but most of the corrections applied by indices like the PGFI are very ad hoc. I’ve included it here for consistency with Anderson and Dixon, and for comparison with cross-validation., but honestly, I don’t like these kinds of “penalty” statistics. Parsimony is overrated.

Cross Validation Error
– If a model accurately describes the covariance structure of the population, then it stands to reason that the model should be able to predict the covariance structure of a new sample, but despite being the natural way to compare factor models while controlling for model complexity, cross-validation is disturbingly absent from the FA literature. In this case, I perform 20 folds of Monte-Carlo cross validation with a test set consisting of 20% of the sample. Letting \hat{\Sigma}_\text{pred} be the correlation matrix predicted by the fitted model, and \hat{\Sigma}_\text{obs} be the observed correlation matrix in the test set, I define the cross-validation error to be

    \[ \text{cv}_\text{err} = \|\hat{\Sigma}_\text{pred} - \hat{\Sigma}_\text{obs}\|_F \]

where \|\cdot\|_F is the Frobenius norm.

Parametric Cross Validation Error
– I call this “Parametric” cross-validation for lack of a better term. Lack of open data is a serious problem in the clinical psychology literature, and the result is that most reported factor analyses report only the observed correlation matrix. Obviously, we can’t do cross validation on a single reported matrix, so I wanted to know if I could get around this problem by making a few assumptions.

Note that factor analysis traditionally assumes that the data are mutivariate normal with some correlation \Sigma, which is assumed to arise from some particular factor structure. Letting \hat{\Sigma}_\text{obs} be the observed correlation matrix from our sample of size N, we simulate k new datasets, each of size N, from \mathrm{N}(0,\hat{\Sigma}_\text{obs}), and compute correlation matrices \hat{\Sigma}_i from each of them. Factor models can then be fit to each of these simulated correlation matrices, and their predictions tested against the others. In other words, instead of binning the data into training and testing sets, we generate simulated data under the normality assumptions of the model, train on some sets, and test on others. I wanted to so if this approach would lead to qualitatively similar conclusions as “true” cross-validation.

Others not included
– There are plenty of other indices commonly used for model comparison in the clinical literature. The raw \chi^2 value is often reported, but I’ve left it out here because it’s completely uninterpretable and is a terrible way of doing inference. Seriously. Stop it. Chi-squared is old technology, like Gibbs sampling or conjugate priors or cable television. Stop giving Comcast what they want.

Others, like the various “information criteria”, such as AIC and BIC, are often used to balance fit to data and model parsimony, though I’ve left them out here because I don’t believe that parsimony is particularly important. Complexity is only bad insofar as it results in overfitting, and cross-validation takes care of that anyway. Models should be as complex as they need to be in order to accurately describe the world. Note that the AIC is actually asymptotically equivalent to leave-one-out cross-validation, but we don’t actually care about that, since we don’t really know how quickly the convergence takes place, and at any rate, most FA is done with small samples and noisy data.


Results for each model are displayed in the figure below. Second-order versions of the model described above are denoted by “2O”, while bifactor versions are denoted by “BF”. One of my favorite points of note is that the unidimensional model, despite performing worse than any other model according to the conventional fit statistics (so terribly that it would be rejected using the most common cutoffs) is, predictively, one of the best models. In other words, if you want to know how kids are going to perform on the WISC-III, go with the unidimensional model. In fact, of the five-ish models that perform “adequately” according to the CFI and RMSEA, three of them are among the worst at prediction, suggesting that both of these measures favor overfitting. The PGFI actually gets it more or less right when it comes to the unidimensional model, but still favors several models which perform poorly (like the Horn models).

The second point of note is that our “parametric cross-validation” performs just like the real thing. The numerical values are off, but the inference is identical. This is actually really cool: We can do cross-validation given only the observed sample correlation matrix.

The overall best performing model is Weschler, which decomposes the observed variables into Verbal and Performance factors. All versions of this model perform about equally, suggesting that the added complexity of the second-order and bifactor versions don’t really buy you much. Amusingly, the bifactor version, which is (predictively) the worst performing of the three, is the only one deemed “adequate” by the CFI and RMSEA.

Note that the more complex Guilford and Bannadyne models, which fit well according to the RMSEA and CFI, perform the worst predictively, suggesting that they are far too complex and are just overfitting the data.