I’ve been experimenting with techniques for robust regression, and I thought it would be a fun exercise to implement a robust variant of the simple linear regression model based on the t-distribution.
The term “outlier” is used very loosely by most people. Usually, a field has a set of popular statistical models (e.g. I have two continuous variables, so I “do regression”), and these models make some kind of normality assumption that has to be reluctantly glanced at before the model can be published. Invariably, the data don’t look quite normal, and one of two things tends to happen:
- We trust in the central limit theorem to somehow “take care of it”.
- We start pruning outliers.
Number two is problematic for a couple of reasons. First, it’s usually done haphazardly, either by eyeballing it (the intra-ocular trauma test) or by using some cutoff that is, itself, not robust to outliers (e.g. “more than 3 standard deviations from the mean”). Second, it assumes that the extreme values are erroneous and not legitimate measurements. This is especially bad when it comes to response time data, which are never normal, even in theory, and yet people will still cut off the upper tail of the RT distribution and call it data cleanup.
In general, if you have a major problem with outliers that isn’t due to measurement error, you’re probably not defining “outlier” properly. Outliers are only outliers under some particular model (e.g. if you’re modelling the data as being normal, you’ll tend to interpret extreme values as being outliers, but those values might not be “extreme” under some other distribution). A better way to deal with outliers is to use a model that accommodates them. Example:
The standard linear regression model is written

$$
y = X\beta + \epsilon,
$$

where $X$ is the design matrix, $\beta$ is a vector of parameters to be estimated, and $\epsilon$ is a vector of errors which are generally assumed to be normally distributed with mean zero.
The normal distribution has fairly thin tails, which makes vanilla regression extremely sensitive to outliers. For example, below are regression lines for two datasets that are identical except for the presence of a single outlier.
The lone outlier above causes the model to severely underestimate the true slope. If we try to extrapolate even slightly beyond the range of the data, we’re likely to be way off. The usual solution would be to delete the observation, but outliers in real datasets usually aren’t this obvious and so it’s probably better to find a solution that reduces the influence of outliers without forcing us to throw away data.
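To make this concrete, here’s a quick simulation in R (the data and the size of the outlier are my own invention, chosen only for illustration):

```r
set.seed(1)

# Simulate a clean dataset with true intercept 0 and slope 1
x = seq(from = 5, to = 10, by = .5)
y = 0 + 1*x + rnorm(length(x), mean = 0, sd = 1)
cleanFit = lm(y ~ x)

# Replace a single point with a wild observation and refit
y_out = y
y_out[length(y_out)] = y_out[length(y_out)] - 20
outlierFit = lm(y_out ~ x)

# One point is enough to drag the estimated slope well below the truth
coef(cleanFit)["x"]
coef(outlierFit)["x"]
```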
One solution to the extreme value problem is to replace the normal errors in the regression model with ones following a distribution with fatter tails, like the $t$-distribution. The familiar $t$-distribution used in hypothesis testing won’t do, since we can’t control its mean and variance, but it can easily be given location and scale parameters by writing its density function as follows:

$$
f(y \mid \mu, \sigma, \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\,\sigma\sqrt{\nu\pi}} \left(1 + \frac{1}{\nu}\left(\frac{y-\mu}{\sigma}\right)^{2}\right)^{-\frac{\nu+1}{2}}
$$
In this form, the $t$-distribution has mean $\mu$ (when $\nu > 1$) and variance $\sigma^2 \frac{\nu}{\nu - 2}$ (when $\nu > 2$). The benefit of this parametrization is that we can control the mean and variance as with the normal distribution, while the degrees of freedom parameter $\nu$ gives a simple way to control the fatness of the tails, and so the weight given to extreme values.
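As a sanity check, this density is easy to code up directly. A minimal sketch in R (the helper name `dt_ls` is mine, not from any package); with location 0 and scale 1 it should reduce to R’s built-in `dt()`:

```r
# Density of the t-distribution with location mu, scale sigma, and
# degrees of freedom nu (hypothetical helper, coded from the formula above)
dt_ls = function(y, mu = 0, sigma = 1, nu = 1) {
  gamma((nu + 1)/2) / (gamma(nu/2) * sigma * sqrt(nu * pi)) *
    (1 + ((y - mu)/sigma)^2 / nu)^(-(nu + 1)/2)
}

# With mu = 0 and sigma = 1 this matches the standard t density
dt_ls(1.5, mu = 0, sigma = 1, nu = 5)
dt(1.5, df = 5)
```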
We have two options here: we can treat the degrees of freedom $\nu$ as given and use it as a tuning parameter to control the fatness of the tails, or we can estimate $\nu$ from the data along with everything else. The second case gives the model a sort of adaptive robustness, whereas the first simplifies the model considerably but requires choosing a value of $\nu$ a priori that the researcher thinks will give an appropriate level of robustness (lower values, in theory, result in greater robustness to outliers by making for a very fat-tailed distribution).
The likelihood, with $\nu$ included, is given by

$$
L(\beta, \sigma, \nu \mid y) = \prod_{i=1}^{n} \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\,\sigma\sqrt{\nu\pi}} \left(1 + \frac{1}{\nu}\left(\frac{y_i - x_i^\top \beta}{\sigma}\right)^{2}\right)^{-\frac{\nu+1}{2}}
$$
In this case, the maximum likelihood estimates are known to have no closed form, and so to fit the model you’ll have to plug the log-likelihood

$$
\ell(\beta, \sigma, \nu) = n \log \Gamma\!\left(\tfrac{\nu+1}{2}\right) - n \log \Gamma\!\left(\tfrac{\nu}{2}\right) - n \log\!\left(\sigma\sqrt{\nu\pi}\right) - \frac{\nu+1}{2} \sum_{i=1}^{n} \log\!\left(1 + \frac{1}{\nu}\left(\frac{y_i - x_i^\top \beta}{\sigma}\right)^{2}\right)
$$

into your favorite optimizer. I’ve tried deriving expressions for the MLEs when $\nu$ is held constant, but I can’t find a closed form for these either, so it’s numerical methods all the way. This actually doesn’t work so well unless $\nu$ is fixed; the likelihood is too spiky. Without using more complicated methods (e.g. expectation maximization), it’s best to pick a value of $\nu$ a priori.
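As a rough sketch of what that looks like, here’s one way to do it in R with `optim()`, holding $\nu$ fixed (the variable names and the choice $\nu = 3$ are mine; this standardizes the residuals and reuses the built-in `dt()` rather than coding the density from scratch):

```r
set.seed(1)

# Simulated data with heavy-tailed errors
x = seq(from = 5, to = 10, by = .5)
y = 0 + 1*x + rt(length(x), df = 2)

# Negative log-likelihood for fixed nu
# par[1] = intercept, par[2] = slope, par[3] = log(sigma)
negLogLik = function(par, y, x, nu) {
  sigma = exp(par[3])  # optimize on the log scale to keep sigma positive
  resid = (y - par[1] - par[2]*x) / sigma
  # log density of a location-scale t = log dt(standardized resid) - log(sigma)
  -sum(dt(resid, df = nu, log = TRUE) - log(sigma))
}

# Fit with nu fixed a priori at 3
fit = optim(par = c(0, 0, 0), fn = negLogLik, y = y, x = x, nu = 3)
fit$par[1:2]     # intercept and slope estimates
exp(fit$par[3])  # scale estimate
```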
It’s not so difficult to program the estimation yourself, but if you’re using R the SMIR package has already done all of the work. Just install it with

```r
install.packages("SMIR")
```

and use it (with some simulated data) like this:

```r
library(SMIR)

# Simulate data
B = c(0, 1)
X = seq(from = 5, to = 10, by = .5)
E = rnorm(n = length(X), mean = 0, sd = 1)
Y = B[1] + B[2]*X + E

# Fit model
normalModel = lm(Y ~ X)
tModel = treg(normalModel, r = 1.1, verbose = F)
```
If you prefer distributions to points, it’s only slightly more difficult to make a Bayesian version of the model. I’ve included R and Stan code in the Downloads section for a simple Bayesian t-regression model with unknown $\nu$ (Stan’s sampler has no problem estimating $\nu$ along with everything else). The folder includes the .stan model file, an R function which fits the model and outputs a neat summary and some diagnostics (I haven’t really put any effort into error handling, sorry), and a short example file fitting the model to some simulated data. The priors are pretty uninformative, but you might need to change the uniform prior on $\nu$ depending on your data.
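For reference, the core of such a Stan model might look something like this (a sketch of the idea, not the exact file from the Downloads section; the prior bounds on $\nu$ are placeholders you’d adjust for your data):

```stan
data {
  int<lower=1> N;   // number of observations
  vector[N] x;      // predictor
  vector[N] y;      // response
}
parameters {
  real alpha;                   // intercept
  real beta;                    // slope
  real<lower=0> sigma;          // scale
  real<lower=1, upper=100> nu;  // degrees of freedom; bounds imply a uniform prior
}
model {
  // t-distributed errors: alpha + beta * x plays the role of mu
  y ~ student_t(nu, alpha + beta * x, sigma);
}
```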