Robust t-regression

I’ve been experimenting with techniques for robust regression, and I thought it would be a fun exercise to implement a robust variant of the simple linear regression model based on the t-distribution.

Motivation

The term “outlier” is used very loosely by most people. Usually, a field has a set of popular statistical models (e.g. I have two continuous variables, so I “do regression”), and these models make some kind of normality assumption that has to be reluctantly glanced at before the model can be published. Invariably, the data don’t look quite normal, and one of two things tends to happen:

  • We trust in the central limit theorem to somehow “take care of it”.
  • We start pruning outliers.

Number two is problematic for several reasons. The first is that it’s usually done haphazardly, either by eyeballing it (the intra-ocular trauma test) or by using some cutoff that is, itself, not robust to outliers (e.g. “more than 3 standard deviations from the mean”). Second, it assumes that the extreme values are erroneous rather than legitimate measurements. This is especially bad when it comes to response time data, which are never normal, even in theory, and yet people will still cut off the upper tail of the RT distribution and call it data cleanup.

In general, if you have a major problem with outliers that isn’t due to measurement error, you’re probably not defining “outlier” properly. Outliers are only outliers under some particular model (e.g. if you’re modelling the data as normal, you’ll tend to interpret extreme values as outliers, but those values might not be “extreme” under some other distribution). A better way to deal with outliers is to use a model that accommodates them. Example:

t-Regression

The standard linear regression model is written

    \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \]

where \mathbf{X} is the design matrix, \beta is a vector of parameters to be estimated, and \mathbf{e} is a vector of errors, which are generally assumed to be normally distributed with mean zero:

    \[ e_i \sim \mathrm{N}(0, \sigma^2) \]

The normal distribution has fairly thin tails, which makes vanilla regression extremely sensitive to outliers. For example, below are regression lines for two datasets, identical except for the presence of a single outlier:
[Figure: regression lines for two datasets, identical except for a single outlier]
The lone outlier above causes the model to severely underestimate the true slope. If we try to extrapolate even slightly beyond the range of the data, we’re likely to be way off. The usual solution would be to delete the observation, but outliers in real datasets usually aren’t this obvious and so it’s probably better to find a solution that reduces the influence of outliers without forcing us to throw away data.
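If you want to see the effect for yourself, here’s a quick way to reproduce it with simulated data (a small illustration in R, not the exact data behind the figure):

# One outlier is enough to drag the least-squares slope well off target
set.seed(1)
x <- 1:20
y <- x + rnorm(20)      # true intercept 0, true slope 1
yOut <- y
yOut[20] <- -20         # corrupt a single observation
coef(lm(y ~ x))         # slope close to 1
coef(lm(yOut ~ x))      # slope pulled far below 1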

One solution to the extreme value problem is to replace the normal errors in the regression model with errors following a distribution with fatter tails, like the \mathrm{t}-distribution. The familiar t-distribution used in hypothesis testing won’t do, since we can’t control its mean and variance, but it can easily be given location and scale parameters by writing its density function as follows:

    \[ f(x; \mu, \sigma, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}\,\sigma} \left( 1 + \frac{(x - \mu)^2}{\nu \sigma^2} \right)^{-\frac{\nu + 1}{2}} \]

In this form, the \mathrm{t}-distribution has mean \mu (when \nu > 1) and variance \frac{\sigma^2 \nu}{\nu-2} (when \nu > 2). The benefit of this parametrization is that we can control the mean and variance as with the normal distribution, while the degrees-of-freedom parameter \nu gives a simple way to control the fatness of the tails, and so the weight given to extreme values.
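As an aside, this location-scale density is easy to evaluate in R by shifting and scaling the standard t density (a small illustration; dst is just a helper name made up for this post):

# Location-scale t density: f(x; mu, sigma, nu) = dt((x - mu)/sigma, nu)/sigma
dst <- function(x, mu = 0, sigma = 1, nu = 3) {
  dt((x - mu) / sigma, df = nu) / sigma
}

# Far out in the tail, the t-distribution keeps much more mass than the normal
dst(5, mu = 0, sigma = 1, nu = 2)   # ~0.0071
dnorm(5, mean = 0, sd = 1)          # ~1.5e-06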
[Figure: t-distribution densities]

Estimation

We have two options here: we can treat the degrees of freedom \nu as given and use it as a tuning parameter to control the fatness of the tails, or we can estimate \nu from the data along with everything else. The second option gives the model a sort of adaptive robustness, whereas the first simplifies the model considerably but requires choosing a value of \nu a priori that the researcher thinks will give an appropriate level of robustness (lower values make for a fatter-tailed distribution and so, in theory, greater robustness to outliers).

The likelihood, with \nu included, is given by

    \[ \mathcal{L}(\beta,\sigma,\nu) = \left( \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}\,\sigma} \right)^n \prod_{i=1}^n \left( 1 + \frac{e_i^2}{\nu \sigma^2} \right)^{-\frac{\nu + 1}{2}} \]

In this case, the maximum likelihood estimates are known to have no closed form, and so to fit the model you’ll have to plug the log-likelihood

    \begin{align*} \ln \mathcal{L}(\beta,\sigma,\nu) &= n \left( \ln \Gamma\left(\frac{\nu+1}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right) - \ln\left(\sigma\sqrt{\pi\nu}\right) \right) \\ &\quad - \frac{\nu+1}{2} \sum_{i=1}^n \ln\left( 1 + \frac{e_i^2}{\nu \sigma^2} \right) \end{align*}

into your favorite optimizer. I’ve tried deriving expressions for the MLEs when \nu is held constant, but I can’t find a closed form for these either, so it’s numerical methods all the way. This actually doesn’t work so well unless \nu is fixed; the likelihood is too spiky. Without resorting to more complicated methods (e.g. expectation maximization), it’s best to pick a value of \nu a priori.
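For instance, here’s a bare-bones sketch of the fixed-\nu approach using R’s optim (the function names are mine, and X is assumed to include a column of ones for the intercept; treat this as an outline rather than production code):

# Negative log-likelihood of the t-regression model, with nu held fixed
tregNLL <- function(par, X, y, nu) {
  beta  <- par[-length(par)]
  sigma <- exp(par[length(par)])   # optimize log(sigma) so sigma stays positive
  e     <- y - X %*% beta
  -(length(y) * (lgamma((nu + 1) / 2) - lgamma(nu / 2) -
                   log(sigma * sqrt(pi * nu))) -
      (nu + 1) / 2 * sum(log(1 + e^2 / (nu * sigma^2))))
}

# Minimize the negative log-likelihood, starting from the OLS estimates
fitTreg <- function(X, y, nu = 3) {
  start <- c(coef(lm.fit(X, y)), 0)
  fit   <- optim(start, tregNLL, X = X, y = y, nu = nu, method = "BFGS")
  list(beta  = fit$par[-length(fit$par)],
       sigma = exp(fit$par[length(fit$par)]))
}

# Usage: fitTreg(cbind(1, x), y, nu = 3)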

Usage

It’s not so difficult to program the estimation yourself, but if you’re using R, the SMIR package has already done all of the work. Just install it with

install.packages("SMIR")

and use it (with some simulated data) like this:

library(SMIR)

# Simulate data
B <- c(0, 1)                                # true intercept and slope
X <- seq(from = 5, to = 10, by = 0.5)
E <- rnorm(n = length(X), mean = 0, sd = 1) # normal errors
Y <- B[1] + B[2] * X + E

# Fit model
normalModel <- lm(Y ~ X)
tModel <- treg(normalModel, r = 1.1, verbose = FALSE)

And to finish off, here’s what \mathrm{t}-regression with 1.1 degrees of freedom looks like when applied to the outlier data from the previous example:
[Figure: t-regression fit to the outlier data from the earlier example]

Bayesian t-regression

If you prefer distributions to points, it’s only slightly more difficult to make a Bayesian version of the model. I’ve included R and Stan code in the Downloads section for a simple Bayesian t-regression model with \nu unknown (Stan’s sampler has no problem estimating \nu along with everything else). The folder includes the .stan model file, an R function that fits the model and outputs a neat summary and some diagnostics (I haven’t put much effort into error handling, sorry), and a short example file fitting the model to some simulated data. The priors are pretty uninformative, but you might need to change the uniform prior on \sigma depending on your data.
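To give a flavor of what such a model looks like, here’s a minimal sketch written for this post (not the code from the Downloads folder), fit through the rstan package; the gamma prior on \nu and the bounds on \sigma are assumptions you’d want to adapt to your data:

library(rstan)

# A minimal Bayesian t-regression model with nu unknown (illustrative sketch)
stanCode <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;                      // intercept
  real beta;                       // slope
  real<lower=0, upper=100> sigma;  // scale; adjust the bound to your data
  real<lower=1> nu;                // degrees of freedom
}
model {
  nu ~ gamma(2, 0.1);              // weakly informative prior on nu (assumed)
  sigma ~ uniform(0, 100);         // flat prior on the scale
  y ~ student_t(nu, alpha + beta * x, sigma);
}
"

# Fit to the simulated data from the Usage section
fit <- stan(model_code = stanCode,
            data = list(N = length(X), x = X, y = Y))
print(fit, pars = c("alpha", "beta", "sigma", "nu"))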