Bayesian functional linear models pt.1 – Estimating a mean

This post describes step 1 of my quest to build a fully Bayesian general linear model for functional data. I haven’t done it yet, and any solution is likely to be very computationally expensive, but so far I’ve had a few limited successes.

The general problem is this: My own research involves tracking the position or rotation of a limb (generally, the hand) as it performs some kind of motor action under a variety of conditions. The object of interest is the entire trajectory of the hand, and thus a single data point is actually a smooth curve through some kind of space (\mathbb{R}^n for position, and something like S^3 for rotation, depending on how it’s parametrized). This kind of data is generally called functional data, since the data are usually assumed to lie in some Hilbert space, and the most common statistical techniques for dealing with it make heavy use of functional analysis as a general theory. The problem is that we want to characterize differences between curves, but we don’t want to take some kind of mass-univariate approach and just perform pointwise hypothesis tests (as is common in EEG and fMRI), because that’s just silly.

The first step in any solution to this problem is to create a model for the mean of a set of functional observations, which we can then (hopefully) extend to more complex models of a mean conditional on a set of predictors. My general approach is like this:

The Model

Data and notation

The details don’t matter, but for concreteness suppose that, in our experiment, subjects repeatedly reach forward and point to the center of a computer screen. We record the lateral (left-right) position of the sensor at each instant during the reach, giving a curve describing the lateral position over time. For simplicity, suppose further that all curves are of identical length and are sampled at identical timepoints. Generally, this is accomplished by reparametrizing the curves according to, say, proportion of total reach time, or proportion of total reach distance. However the curves are parametrized, we’ll refer to the parameter as the time point.

Denote by \mathbf{y}_i the vector containing the i‘th curve, so that y_{ij} is the value of the i‘th curve at the j‘th time point. Then let \mathbf{Y}_{n,k} be the matrix whose rows are the n observed curves, and whose columns are the k sampled time points. We denote the vector of measured time points by \mathbf{v}.

Modelling a functional mean

Let’s use a naive approach as our starting point: Since all of our curves are sampled at identical time points, simply define

    \[        \mu_j = \frac{1}{n}\sum_{i=1}^n y_{ij}    \]
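To make the notation concrete, here’s a minimal numpy sketch of the naive estimator (the array names Y and v and the toy shapes are mine, not from the post’s code):

    import numpy as np

    n, k = 5, 25                        # n curves sampled at k time points
    v = np.linspace(0.0, 1.0, k)        # time points (e.g. proportion of reach time)
    Y = np.zeros((n, k))                # placeholder for the observed curves

    mu_hat = Y.mean(axis=0)             # pointwise mean: one estimate per time point
    sd_hat = Y.std(axis=0, ddof=1)      # plus k separate variances to go with it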

In other words — the pointwise mean. The problem is that this implies a model that looks something like

    \begin{align*}         y_{ij} &= \mu_j + \epsilon_{ij} \\         \epsilon_{ij} &\sim \mathrm{N}(0,\sigma_j)    \end{align*}

which has two problems: First, there are too many parameters. We have to estimate k means and (if we want to build CIs or do hypothesis testing) k variances. Second, it falsely assumes that all of those parameters are independent — that the mean/variance at time j is independent of the mean/variance at time j+1, which is intuitively unreasonable because the curves are smooth, and so the behaviour at time j+1 is very similar to the behaviour at time j. We need a way of imposing some kind of principled structure on the model that reflects this.

Modelling the mean

First, the mean. Curves can be quite general in shape, and so we don’t want to impose any kind of specific functional form on \mu, such as a straight line or polynomial. We can, however, do a bit better than pointwise estimation, because we know that, whatever \mu is, it has to be relatively smooth (since the hand cannot teleport through space). There are a few ways to do this, but a simple one is to assume that \mu can be represented using a cubic spline basis with either relatively few basis elements, or with some kind of smoothness penalty. Neither is particularly difficult, but I opt for the former approach here out of concern for computing time.

The basic idea, in a concrete sense, is that we construct a matrix \mathbf{B} whose columns are basis functions which can be weighted and summed to approximate an arbitrary function \mu. The coefficients of the sum are given by a parameter vector \mathbf{a}, so that we have

    \[       \mu \approx \mathbf{Ba}    \]

This is, essentially, a regression problem where the design matrix \mathbf{B} consists of basis functions rather than measured predictors. By using a relatively small number of basis functions, or by placing the appropriate penalty on \mathbf{a}, we can ensure that the estimate is smooth (see a ways below for a plot of the basis functions). The best part is that, because we have framed spline fitting as a regression problem, it’s no trouble at all to estimate \mu in a fully Bayesian way.
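To make this concrete, here’s one way to build such a basis matrix in Python using scipy’s BSpline (a sketch; the clamped knot layout and the helper name bspline_basis are my choices, not necessarily what the post’s code does):

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_basis(v, n_knots=10, degree=3):
        """Evaluate a clamped cubic B-spline basis at the time points v."""
        # repeat the boundary knots so the basis is well behaved at the endpoints
        interior = np.linspace(v.min(), v.max(), n_knots)
        knots = np.concatenate([np.repeat(v.min(), degree),
                                interior,
                                np.repeat(v.max(), degree)])
        n_basis = len(knots) - degree - 1
        # column j is the j'th basis function evaluated at every time point
        return np.column_stack([BSpline(knots, np.eye(n_basis)[j], degree)(v)
                                for j in range(n_basis)])

    B = bspline_basis(v)                # shape (k, n_basis), so mu ≈ B @ a

A quick least-squares fit (np.linalg.lstsq(B, mu_hat, rcond=None)) is a useful sanity check that the basis is flexible enough to capture the pointwise mean before bringing in the Bayesian machinery.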

Our model so far looks like

    \begin{align*}         \mathbf{y}_{i} &= \mu + \mathbf{e}_i \\         \mu &= \mathbf{Ba}    \end{align*}

where \mathbf{e}_i is some vector of residuals with some structure we need to specify. Note that, depending on the application, we might also add an intercept term to the model, though that will not be necessary in the simulation below.

Modelling the residual covariance

What should a residual curve look like? A naive model might specify independent normal errors, but a moment’s thought shows this is unreasonable: such a model predicts that residual curves should look like white noise, which cannot be the case because, again, trajectories are smooth.

Let’s start by assuming that, however they might be correlated, the errors are normal, so that we have

    \[  \mathbf{y}_{i}|\mathbf{a},\Sigma \sim \mathrm{N}(\mathbf{Ba}, \Sigma) \]

The final step is then to specify the error covariance \Sigma. The naive model specifies \Sigma = \sigma^2 \mathbf{I}, but we can do better. We certainly don’t want to estimate \Sigma freely, but we do want to assume that errors at nearby time points are correlated. There are a few ways to do this, but a simple and flexible way is through a smoothness-enforcing covariance kernel, such as the exponentiated quadratic

    \[       f(s,t|\alpha,\rho,\sigma) = \alpha^2 \exp \left ( -\frac{1}{2\rho^2}(s-t)^2 \right ) + \delta_{s,t} \sigma^2     \]

where s and t are two time points. This function is maximal at s=t, and drops off exponentially as s and t move further apart. Setting \Sigma_{s,t} = f(s,t) thus specifies that observations which are nearby in time should be more highly correlated than observations which are further apart. The parameter \alpha controls the standard deviation (how high or low the error curve can be), while the parameter \rho is called the length scale, and controls the smoothness of the resulting curves. The parameter \sigma on the diagonal ensures that the resulting covariance matrix is positive-definite, and can be thought of as a noise term.
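In code, the kernel is only a few lines (a sketch, reusing the time points v from earlier):

    def exp_quad_cov(v, alpha, rho, sigma):
        """Exponentiated-quadratic covariance over the time points v."""
        sq_dist = (v[:, None] - v[None, :]) ** 2         # (s - t)^2 for every pair
        K = alpha**2 * np.exp(-0.5 * sq_dist / rho**2)
        return K + sigma**2 * np.eye(len(v))             # diagonal keeps K positive-definite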

Our model is then

    \begin{align*}         \mathbf{y}_{i}|\mathbf{a},\Sigma &\sim \mathrm{N}(\mathbf{Ba}, \Sigma) \\         \Sigma_{s,t} &= f(s,t|\alpha,\rho,\sigma)    \end{align*}

along with appropriate priors. The coefficients \mathbf{a} are generally easy to estimate, so it’s not necessary to use anything more than a weakly regularizing prior (as a rule of thumb, take whatever you expect the parameter to be at most, add a little to it, and use that value as the standard deviation for a centered normal prior).

The parameters for the covariance function are trickier, and it’s definitely worth constructing a fairly strong prior, especially for \rho. If the time scale is the unit interval (say, when parametrizing as proportion of movement time) and the curves aren’t expected to cross zero more than once or (occasionally) twice, then we might expect a value on the order of \rho \approx 0.3 or so, and so something like a Gamma(5,15) allows for a sensible range of values. We would generally expect the noise term \sigma to be quite small, since our measurements are very precise; still, it’s important to set a boundary-avoiding prior, as otherwise the posterior might cluster around 0 and lead to problems with the positive-definiteness of the covariance \Sigma.
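As a quick check on that choice, scipy confirms that a Gamma(5,15) prior puts its mass in roughly the right region for \rho:

    from scipy import stats

    rho_prior = stats.gamma(a=5, scale=1/15)   # shape-rate parametrization of Gamma(5,15)
    rho_prior.mean()                           # 5/15 ≈ 0.33
    rho_prior.interval(0.9)                    # central 90% interval, roughly (0.13, 0.61)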

The final model

Our final model, priors and all, is

    \begin{align*}         \mathbf{y}_{i}|\mathbf{a},\Sigma &\sim \mathrm{N}(\mathbf{Ba}, \Sigma) \\         \Sigma_{s,t} &= f(s,t|\alpha,\rho,\sigma) \\         \mathbf{a} &\sim \mathrm{N}(\cdot) \\         \alpha &\sim \mathrm{N}(\cdot) \\         \rho &\sim \mathrm{Gamma}(5,15) \\         \sigma &\sim \mathrm{Gamma}(\cdot)    \end{align*}

where the exact parameters of the priors depend on the scale of the problem, and the prior on \rho assumes that the residuals are fairly smooth on [0,1].
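For concreteness, here is how the whole model might look in PyMC (a sketch, not the post’s actual code, which is linked at the end; in particular the half-normal on \alpha and the use of PyMC’s built-in ExpQuad kernel are my assumptions):

    import numpy as np
    import pymc as pm

    def fit_functional_mean(Y, B, v):
        k = Y.shape[1]
        with pm.Model():
            a = pm.Normal("a", 0.0, 0.1, shape=B.shape[1])   # spline coefficients
            alpha = pm.HalfNormal("alpha", 0.1)              # marginal sd of the error GP
            rho = pm.Gamma("rho", 5.0, 15.0)                 # length scale
            sigma = pm.Gamma("sigma", 2.0, 10.0)             # noise term on the diagonal

            mu = pm.math.dot(B, a)                           # mean curve, length k
            cov = (alpha**2 * pm.gp.cov.ExpQuad(1, ls=rho)(v[:, None])
                   + sigma**2 * np.eye(k))

            # each row of Y is one observed curve
            pm.MvNormal("y", mu=mu, cov=cov, observed=Y)
            return pm.sample()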

Testing the model

We simulate a small set of functional observations sampled at 25 time points, with the true mean plotted below:

We then simulate 5 error trajectories from a Gaussian process with the covariance function specified above. The true values of the parameters are \alpha = .01, \rho = .3, and \sigma = .0005. The simulated error trajectories are plotted below:

We can then combine the two to get a set of observed trajectories.
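In code, the simulation might look like this (a sketch; mu_true below is a hypothetical stand-in for the true mean curve in the plots, and exp_quad_cov is the helper defined earlier):

    rng = np.random.default_rng(1)

    k = 25
    v = np.linspace(0.0, 1.0, k)
    mu_true = 0.05 * np.sin(2 * np.pi * v)                   # hypothetical true mean

    K = exp_quad_cov(v, alpha=0.01, rho=0.3, sigma=0.0005)
    E = rng.multivariate_normal(np.zeros(k), K, size=5)      # 5 GP error trajectories
    Y = mu_true + E                                          # observed curves, shape (5, 25)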

We estimate the mean trajectory using a cubic spline basis with 10 knots.

Finally, we estimate the model using the following priors:

    \begin{align*}         \mathbf{a} &\sim \mathrm{N}(0, .1) \\         \alpha &\sim \mathrm{N}(0, .1) \\         \rho &\sim \mathrm{Gamma}(5,15) \\         \sigma &\sim \mathrm{Gamma}(2,10)    \end{align*}

Each posterior draw of \mathbf{a} then implies a mean trajectory \mu. By plotting a large number of them, we can get a good visualization of the posterior mean trajectory.
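With the PyMC sketch above, that spaghetti plot is easy to produce (assuming idata is the result of fit_functional_mean):

    import matplotlib.pyplot as plt

    # flatten chains and draws into one sample axis: shape (n_basis, n_samples)
    a_draws = idata.posterior["a"].stack(sample=("chain", "draw")).values
    mu_draws = B @ a_draws                       # one mean trajectory per posterior draw

    plt.plot(v, mu_draws[:, ::20], color="C0", alpha=0.05)   # thinned posterior draws
    plt.plot(v, mu_true, color="C1", label="true mean")
    plt.legend()
    plt.show()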

Finally, we see that posterior mean residuals do indeed resemble the assumed Gaussian process curves.
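For reference, those posterior mean residuals can be computed in one line (again assuming the names from the sketches above):

    # subtract the posterior-mean trajectory from each observed curve
    resid = Y - mu_draws.mean(axis=1)            # shape (5, 25): one residual curve per row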

So the model does a good job even when the number of observations is quite small. The best thing about the model, though, is that it can easily be extended to more general regression/ANOVA settings, where we want to estimate a mean conditional on a set of predictors.

As usual, code for fitting the model and replicating the simulation is on Github.