Linear models for quantiles

Most statistical models used in experimental psychology are designed to estimate the mean of a response variable given some set of predictors. This is all well and good when errors are largely symmetric and our predictors are expected to primarily affect the location of the distribution. In practice, it’s often useful to model other properties of a response variable, such as its quantiles. As an example, consider the use of reaction times (RT’s) as an outcome measure in psychology. RT’s are generally positively skewed, and the precise shape of the distribution is known to carry useful information about underlying cognitive processes (see e.g. diffusion modeling).

Reaction times are often compared across experimental manipulations using their means, but certain manipulations are known to preferentially affect certain quantiles. For example, some priming affects are only apparent in the fastest responses, while others don’t appear unless the participant has waited longer than usual to respond. Not only can these effects disappear when averaging over the entire distribution, but they carry useful information about the phenomenon being studied. In psychology, the most common way to approach this problem is to partition the observed RT’s into quantiles and analyze the responses in each bin using standard techniques, such as regression or ANOVA. Unfortunately, the data tend to be highly variable and non-normal, especially in the most extreme quantiles, so it would be best to have a way of modeling quantiles explicitly. After working my way through a few papers on the subject, I though I’d try throwing together a simple model in R/Stan. First, some preamble:

Quantile regression

The goal of classical regression is to model the conditional mean of an outcome variable as a function of a set of predictors, so that, given a vector {\bf y} of outcomes, a design matrix {\bf X}, and a vector of unknown parameters \beta, we have (say, for a linear model)

    \[ \mathrm{E}(y_i|{\bf x}_i) = {\bf x}_i \beta \]

When the errors are normal and homoskedastic (as is usually assumed), this reduces to the standard regression model

    \[ y_i|{\bf x}_i \sim \mathrm{N}({\bf x}_i \beta, \sigma^2) \]

which is usually estimated by minimizing the sum of squared errors, so that

    \[ \hat{\beta} =          \underset{\beta}{\text{arg min}} \sum_{i=1}^n (y_i - {\bf x}_i\beta)^\top                     (y_i - {\bf x}_i\beta)\]

Similarly, a regression model for the q‘th quantile is estimated by minimizing a loss function based on

    \[ p_q(x) = \frac{|x| + (2q-1)x}{2} \]

so that the estimates are given by

    \[ \hat{\beta}(q) =          \underset{\beta}{\text{arg min}} \sum_{i=1}^n p_q(y_i - {\bf x}_i\beta) \]

Like most useful estimators, it turns out that \hat{\beta}(q) has a Bayesian interpretation as the posterior mode under flat priors when the likelihood is an asymmetric Laplace distribution (ALD), which has a density given by

    \[ f_q(x) = q(1-q)e^{-p_q(x)} \]

so a regression model for the q‘th quantile of y gives the likelihood function

    \[ \mathcal{L}_q(\beta) = q^n(1-q)^n e^{-\sum_{i=1}^n p_q(y_i - {\bf x}_i^\top \beta)} \]


    \[ \log{\mathcal{L}_q(\beta)} = n\log{q} + n\log{(1-q)} -          \sum_{i=1}^n p_q(y_i - {\bf x}_i^\top\beta)\]

Note that the likelihood is more of a computational convenience than an attempt at an accurate model (i.e. the data probably aren’t generated from an ALD, but assuming that they are produces useful estimates), so any posterior distribution computed from the likelihood probably shouldn’t be interpreted as literally describing our certainty about the quantile. Also, especially if the posterior is skewed (as is more likely with more extreme quantiles), the posterior mode is a better estimate than the mean in this particular model.

Quantile regression in Stan

I put together a simple version of the model in Stan, with weakly informative priors on \beta (though they should be changed to suit the research problem):

functions {
    real q_loss(real q, real u){
        return .5 * (fabs(u) + (2*q - 1)*u);

    real log_ald(real q, real u){
        return log(q) + log(1-q) - q_loss(q, u);

data {
    int N;                 // Number of observation
    int M;                 // Number of predictors
    int K;                 // Number of quantiles
    vector[K] q;           // Quantiles to be estimated
    vector[N] y;           // Response variable
    vector[M] X[N];        // Design matrix

parameters {
    vector[M] beta[K];

model {
    for (i in 1:N){
        for (j in 1:K){
                    y[i] - dot_product(beta[j], X[i])));
    for (i in 1:K){
        beta[i] ~ normal(0,10);

I tested it first on a very simple dataset consisting of 100 draws from a standard normal distribution. The model was used to estimate q = (.3, .5, .7). The true quantiles, as well as posterior distributions for the estimates, are displayed below:


Notice that the upper (lower) quantiles tend to be over (under) estimated. The sample quantiles become more and more variable as we move away from the median, so we generally need large samples.

As a slightly more interesting example, consider 1000 observations drawn from the model y_i \sim \mathrm{N}(0, \frac{3}{1000}x_i + 1), so that the error variance increases as x increases, from 1 when x = 0 to 4 when x = 1000. We want to estimate three quantiles — .3, .5, and .7 — using the model q(y_i) = \alpha + \beta x_i (in other words, we assume that each quantile varies as a linear function of x). The data, along with the true and estimated (posterior mode) quantiles, are plotted below:


Of course, this all started as a way to analyze reaction times, so as a final test, we simulate 200 reactions times from each condition in a 2^2-factorial design by drawing from an ex-Gaussian distribution (the convolution of a normal distribution with an exponential distribution — a common model for reaction times). Let the factors U and V denote the presence or absence of some manipulation, then the RT’s are generated as follows:

    \begin{align*}           \text{UV absent} &\sim \mathrm{exGauss}(.5, .1, .1) \\           \text{U present} &\sim \mathrm{exGauss}(.5, .1, .2) \\           \text{V present} &\sim \mathrm{exGauss}(.65, .1, .1) \\           \text{UV present} &\sim \mathrm{exGauss}(.65, .2, .2)      \end{align*}

Just to be clear about what’s happening here: U results in a longer upper tail on the RT distribution, which should result in larger upper quantiles. V shifts the entire distribution upward, and both together result in both effects together plus greater overall variability. The model includes both factors, as well as an interaction term, giving

    \[ q(y_i) = \alpha + \beta_1u_i + \beta_2v_i + \beta_3u_iv_i + \epsilon \]

where once again we model the .3, .5, and .7 quantiles. The RT distributions, along with the true (solid) and estimated (dashed) quantiles, for each condition are plotted below:


The posterior distributions for each parameter are plotted below, along with the true values (dashed lines):


Note that, although the model systematically under (over) estimates the true lower (upper) quantiles, it actually does a very good job of estimating the direction and magnitude of the change in each quantile across conditions.

Unfortunately, the amount of data required to get good estimates of the more extreme quantiles makes the model difficult to use. A hierarchical version would probably work much better in practice, since the shrinkage would help you. Based on my simulations, anything less than 200 observations per condition produces estimates which are too variable to be useful.