This post describes step 1 of my quest to build a fully Bayesian general linear model for functional data. I haven’t done it yet, and any solution is likely to be very computationally expensive, but so far I’ve had a few limited successes.
The general problem is this: My own research involves tracking the position or rotation of a limb (generally, the hand) as it performs some kind of motor action under a variety of conditions. The object of interest is the entire trajectory of the hand, and thus a single data point is actually a smooth curve through some kind of space ($\mathbb{R}^3$ for position, and something like $SO(3)$ for rotation, depending on how it's parametrized). This kind of data is generally called functional data, since the data are usually assumed to lie in some Hilbert space, and most common statistical techniques for dealing with this kind of data make heavy use of functional analysis as a general theory. The problem is that we want to characterize differences between curves, but we don't want to take some kind of mass-univariate approach and just perform pointwise hypothesis tests (as is common in EEG and fMRI) because that's just silly.
The first step in any solution to this problem is to create a model for the mean of a set of functional observations, which we can then (hopefully) extend to more complex models of a mean conditional on a set of predictors. My general approach is like this:
The Model
Data and notation
The details don't matter, but for concreteness suppose that, in our experiment, subjects repeatedly reach forward and point to the center of a computer screen. We record the lateral (left-right) position of the sensor at each instant during the reach, giving a curve describing the lateral position over time. For simplicity, suppose further that all curves are of identical length and are sampled at identical time points. Generally, this is accomplished by reparametrizing the curves according to, say, proportion of total reach time, or proportion of total reach distance. However the curves are parametrized, we'll refer to the parameter as the time point.
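(As a concrete sketch, resampling each recorded trajectory onto a common grid of proportion-of-movement-time can be done with simple linear interpolation; the helper below is illustrative, not the actual preprocessing code.)

```python
import numpy as np

def resample_to_proportion(times, values, n_points=25):
    """Resample one recorded trajectory onto a common grid of
    proportion-of-total-movement-time (illustrative helper)."""
    p_raw = (times - times[0]) / (times[-1] - times[0])  # proportion of total reach time
    p_grid = np.linspace(0.0, 1.0, n_points)             # common grid shared by all curves
    return np.interp(p_grid, p_raw, values)              # linear interpolation onto the grid

# Stacking the resampled curves as rows gives the data matrix used below:
# Y = np.vstack([resample_to_proportion(tt, xx) for tt, xx in zip(raw_times, raw_positions)])
```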
Denote by $\mathbf{y}_i$ the vector containing the $i$'th curve, so that $y_{ij}$ is the value of the $i$'th curve at the $j$'th time point. Then let $\mathbf{Y}$ be the $N \times T$ matrix whose rows are the $N$ observed curves, and whose columns are the $T$ sampled time points. We denote the vector of measured time points by $\mathbf{t}$.
Modelling a functional mean
Let's use a naive approach as our starting point: Since all of our curves are sampled at identical time points, simply define

$$\hat{\mu}_j = \frac{1}{N} \sum_{i=1}^{N} y_{ij}.$$

In other words — the pointwise mean. The problem is that this implies a model that looks something like

$$y_{ij} = \mu_j + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma_j^2),$$

which has two problems: First, there are too many parameters. We have to estimate $T$ means and (if we want to build CIs or do hypothesis testing) $T$ variances. Second, it falsely assumes that all of those parameters are independent — that the mean/variance at time $t_j$ is independent of the mean/variance at time $t_{j+1}$, which is intuitively unreasonable because the curves are smooth, and so the behaviour at time $t_j$ is very similar to the behaviour at time $t_{j+1}$. We need a way of imposing some kind of principled structure on the model that reflects this.
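(In code, these naive estimates are just column-wise summaries of the data matrix; a quick NumPy sketch, assuming $\mathbf{Y}$ is the $N \times T$ matrix defined above.)

```python
import numpy as np

# Naive pointwise estimates: one mean and one variance per time point, T of each,
# all treated as independent (this is the model we're trying to improve on).
mu_hat = Y.mean(axis=0)           # pointwise mean across the N curves
var_hat = Y.var(axis=0, ddof=1)   # pointwise variances, T separate parameters
```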
Modelling the mean
First, the mean. Curves can be quite general in shape, and so we don't want to impose any kind of specific functional form on $\mu(t)$, such as a straight line or polynomial. We can, however, do a bit better than pointwise estimation, because we know that, whatever $\mu(t)$ is, it has to be relatively smooth (since the hand cannot teleport through space). There are a few ways to do this, but a simple way is to assume that $\mu(t)$ can be represented using a cubic spline basis with either relatively few basis elements, or with some kind of smoothness penalty. Neither is particularly difficult, but I opt for the former approach here out of concern for computing time.

The basic idea, in a concrete sense, is that we construct a matrix $\mathbf{B}$ whose columns are basis functions which can be summed up to approximate an arbitrary function $\mu(t)$. The coefficients of the sum are given by a parameter vector $\boldsymbol{\beta}$, so that we have

$$\boldsymbol{\mu} = \mathbf{B}\boldsymbol{\beta}.$$

This is, essentially, a regression problem where the design matrix consists of basis functions rather than measured predictors. By using a relatively small number of basis functions, or by placing the appropriate penalty on $\boldsymbol{\beta}$, we can ensure that the estimate is smooth (see a ways below for a plot of the basis functions). The best part is that, because we have framed spline fitting as a regression problem, it's no trouble at all to estimate $\boldsymbol{\beta}$ in a fully Bayesian way.
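(For concreteness, here is one way the basis matrix might be built with SciPy; the knot placement and helper name are illustrative choices, not necessarily what the code on Github does.)

```python
import numpy as np
from scipy.interpolate import BSpline

def cubic_spline_basis(t, n_basis=10, degree=3):
    """Evaluate a cubic B-spline basis at the time points t.
    Returns the T x n_basis design matrix B, so that mu = B @ beta."""
    n_interior = n_basis - degree - 1                     # interior knots implied by n_basis
    interior = np.linspace(t.min(), t.max(), n_interior + 2)[1:-1]
    knots = np.concatenate([np.repeat(t.min(), degree + 1),
                            interior,
                            np.repeat(t.max(), degree + 1)])
    # Each column is one basis function: a spline with a single unit coefficient.
    coefs = np.eye(n_basis)
    return np.column_stack([BSpline(knots, coefs[j], degree)(t) for j in range(n_basis)])

t = np.linspace(0.0, 1.0, 25)   # e.g. 25 time points on the unit interval
B = cubic_spline_basis(t)       # 25 x 10 design matrix
```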
Our model so far looks like

$$\mathbf{y}_i = \mathbf{B}\boldsymbol{\beta} + \boldsymbol{\epsilon}_i,$$

where $\boldsymbol{\epsilon}_i$ is some vector of residuals with some structure we need to specify. Note that, depending on the application, we might also add an intercept term to the model, though that will not be necessary in the simulation below.
Modelling the residual covariance
What should a residual curve look like? A naive model might specify independent normal errors, which is unreasonable after a moment's thought, because such a model predicts that the residual curves should look like white noise, which cannot be the case because, again, trajectories are smooth.
Let's start by assuming, however they might correlate, that the errors are normal, so that we have

$$\boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}).$$

The final step is then to specify the error covariance $\boldsymbol{\Sigma}$. The naive model specifies $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, but we can do better. We certainly don't want to estimate $\boldsymbol{\Sigma}$ freely, but we do want to assume that errors at nearby time points are correlated. There are a few ways to do this, but a simple and flexible way is by a smoothness-enforcing covariance kernel, such as an exponentiated quadratic
$$k(t, t') = \alpha^2 \exp\!\left(-\frac{(t - t')^2}{2\rho^2}\right) + \sigma^2 \delta_{tt'},$$

where $t$ and $t'$ are two time points. This function is maximal at $t = t'$, and drops off exponentially as $t$ and $t'$ move further apart. Setting $\Sigma_{jk} = k(t_j, t_k)$ thus specifies that observations which are nearby in time should be more highly correlated than observations which are further apart. The parameter $\alpha$ controls the standard deviation (how high or low the error curve can be), while the parameter $\rho$ is called the length scale, and controls the smoothness of the resulting curves. The parameter $\sigma$ on the diagonal ensures that the resulting covariance matrix is positive-definite, and can be thought of as a noise term.
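(In code, the kernel is essentially a one-liner over the vector of time points; the example values in the comment are purely illustrative.)

```python
import numpy as np

def exp_quad_cov(t, alpha, rho, sigma):
    """Exponentiated-quadratic covariance over the time points t,
    plus a small diagonal term sigma**2 that keeps the matrix positive-definite."""
    sq_dist = (t[:, None] - t[None, :]) ** 2
    return alpha**2 * np.exp(-sq_dist / (2.0 * rho**2)) + sigma**2 * np.eye(len(t))

# Illustrative values only, not the ones used in the simulation below:
# Sigma = exp_quad_cov(np.linspace(0, 1, 25), alpha=1.0, rho=0.3, sigma=0.05)
```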
Our model is then

$$\mathbf{y}_i \sim \mathcal{N}(\mathbf{B}\boldsymbol{\beta}, \boldsymbol{\Sigma}), \qquad \Sigma_{jk} = k(t_j, t_k),$$

along with appropriate priors. The coefficients $\boldsymbol{\beta}$ are generally easy to estimate, so it's not necessary to use anything more than a weakly regularizing prior (as a rule of thumb, take whatever you expect the parameter to be at most, add a little to it, and use that value as the standard deviation for a centered normal prior).
The parameters for the covariance function are trickier, and it's definitely worth constructing a fairly strong prior, especially for $\rho$. If the time scale consists of the unit interval (say, when parametrizing as proportion of movement time) and the curves aren't expected to cross zero more than once or (occasionally) twice, then we might expect a length scale on the order of a sizeable fraction of that interval, and so a prior concentrated around values of that magnitude allows for a sensible range of values. We would generally expect the noise term $\sigma$ to be quite small, since our measurements are very precise; still, it's important to set a boundary-avoiding prior, as otherwise the posterior might cluster around 0 and lead to problems with the positive-definiteness of the covariance $\boldsymbol{\Sigma}$.
The final model
Our final model, priors and all, is

$$\begin{aligned}
\mathbf{y}_i &\sim \mathcal{N}(\mathbf{B}\boldsymbol{\beta}, \boldsymbol{\Sigma}) \\
\Sigma_{jk} &= \alpha^2 \exp\!\left(-\frac{(t_j - t_k)^2}{2\rho^2}\right) + \sigma^2 \delta_{jk},
\end{aligned}$$

together with the priors on $\boldsymbol{\beta}$, $\alpha$, $\rho$, and $\sigma$ sketched above, where the exact parameters of the priors depend on the scale of the problem, and the prior on $\rho$ assumes that the residuals are fairly smooth on $[0, 1]$.
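(For readers who want something self-contained before digging into the code on Github, here is a minimal sketch of the model in PyMC; the prior families and scales are illustrative assumptions for a roughly unit-scaled problem, not necessarily the ones used for the results below.)

```python
import numpy as np
import pymc as pm

def build_model(Y, t, B):
    """Y: N x T data matrix, t: T time points, B: T x K spline basis."""
    sq_dist = (t[:, None] - t[None, :]) ** 2
    with pm.Model() as model:
        beta = pm.Normal("beta", mu=0.0, sigma=5.0, shape=B.shape[1])  # weakly regularizing
        alpha = pm.HalfNormal("alpha", sigma=1.0)                      # residual curve SD
        rho = pm.InverseGamma("rho", alpha=5.0, beta=2.0)              # fairly strong prior on the length scale
        sigma = pm.Gamma("sigma", alpha=2.0, beta=20.0)                # small, boundary-avoiding noise term
        mu = pm.math.dot(B, beta)                                      # mean trajectory B @ beta
        Sigma = (alpha**2 * pm.math.exp(-sq_dist / (2.0 * rho**2))
                 + sigma**2 * np.eye(len(t)))                          # exponentiated-quadratic covariance
        pm.MvNormal("Y_obs", mu=mu, cov=Sigma, observed=Y)             # one draw per row of Y
    return model

# with build_model(Y, t, B):
#     idata = pm.sample()
```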
Testing the model
We simulate a small set of functional observations sampled at 25 time points, with the true mean plotted below:
We then simulate 5 error trajectories from a Gaussian process with the covariance function specified above; the true values of $\alpha$, $\rho$, and $\sigma$ are given in the simulation code. The simulated error trajectories are plotted below:
We can then combine the two to get a set of observed trajectories.
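(A rough sketch of the simulation, reusing the kernel helper from above; the "true" mean function and the parameter values here are placeholders rather than the ones actually used.)

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 25)                 # 25 time points on the unit interval
true_mean = np.sin(np.pi * t)                 # placeholder smooth "true" mean
Sigma = exp_quad_cov(t, alpha=1.0, rho=0.3, sigma=0.05)  # placeholder parameter values

errors = rng.multivariate_normal(np.zeros(t.size), Sigma, size=5)  # 5 GP error trajectories
Y = true_mean + errors                        # observed trajectories, one per row
```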
We estimate the mean trajectory using a cubic spline basis with 10 knots.
Finally, we estimate the model using priors of the kind described above; the exact choices are in the code linked at the end.
Each posterior draw of $\boldsymbol{\beta}$ then implies a mean trajectory $\mathbf{B}\boldsymbol{\beta}$. By plotting a large number of them, we can get a good visualization of the posterior mean trajectory.
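(Concretely, the spaghetti plot can be produced along these lines, assuming the `idata` object from `pm.sample()` and the `B`, `t`, and `true_mean` arrays from the sketches above.)

```python
import matplotlib.pyplot as plt

# Flatten the chain and draw dimensions into one axis of posterior draws.
beta_draws = idata.posterior["beta"].stack(sample=("chain", "draw")).values  # K x n_draws
mean_trajectories = B @ beta_draws            # each column is one posterior mean trajectory

plt.plot(t, mean_trajectories[:, ::10], color="C0", alpha=0.05)  # thinned posterior draws
plt.plot(t, true_mean, color="C1", lw=2)                          # true mean for comparison
plt.xlabel("time (proportion of movement)")
plt.ylabel("lateral position")
plt.show()
```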
Finally, we see that posterior mean residuals do indeed resemble the assumed Gaussian process curves.
So the model does a good job even when the number of observations is quite small. The best thing about the model, though, is that it can easily be extended to more general regression/ANOVA settings, where we want to estimate a mean conditional on a set of predictors.
As usual, code for fitting the model and replicating the simulation is on Github.