Echo-state networks in R

Echo-state networks (ESN’s) are a class of neural network designed to work with temporal data. I normally don’t like neural networks, since they rarely actually tell you anything interpretable about your data (and they pretty much just do regression anyway), but ESN’s are surprisingly easy to implement, so I thought I’d try my hand at one as part of some experimenting that I’ve been doing with fMRI time series.

Echo-state networks

The ESN architecture is depicted in the figure below:
esn_architecture
There are two interesting features here, compared to the standard multi-layer perceptron. The first is that the output layer receives connections directly from the input layer. The second is that the hidden layer (called a reservoir) has recurrent connections. These recurrent connections give the network a memory which allows it to effectively learn temporally organized data.

Mathematically, the network is just linear algebra, as is shown in the figure below
esn_math
where \mathbf{u}_t is the input at time t, \mathbf{x}_t is the reservoir activation, \mathbf{v}_t is the output, and the \mathbf{W}‘s are the network weights (initialized randomly).

Given an input \mathbf{u}_t, the input \hat{\mathbf{x}}_t to the reservoir is then given by a combination of the input activation and the previous reservoir activation \mathbf{x}_{t-1} according to

    \[ 		\hat{\mathbf{x}}_t = \mathrm{tanh} 				( \mathbf{W}_{in}[1;\mathbf{u}_t] + \mathbf{W}_{res}\mathbf{x}_{t-1}) 	\]

(or some other transfer function), where [;] is columnwise concatenation (the added row of ones serves as a bias). The activation of the reservoir is then a weighted average of this input and the previous activation, so that

    \[ 		\mathbf{x}_t = (1 - \alpha)\hat{\mathbf{x}}_t + \alpha 	 				\mathbf{x}_{t-1} 	\]

where \alpha is a leak-rate parameter which determines the weight given to past activation.

Once the reservoir activation is computed, computing the output is just a matter of passing the input and reservoir layers through the output weights

    \[ 		\mathbf{v}_t = \mathbf{W}_{out} [1; \mathbf{u}_t;  				\mathbf{x}_t] 	\]

Training

This is where it gets easy. Working with perceptrons, we have separate weights for the hidden and output layers, and must train each of them separately using (usually) backpropagation. With an ESN, we train only the output weights, leaving everything else fixed.

Let \mathbf{X} be the matrix whose columns are the vectors [1; \mathbf{u}_t; \mathbf{x}_t], and let \mathbf{Y} be the matrix whose columns are the desired output activation for each time step. Then the usual way to train the network is simply by regression, so that the trained output weights solve

    \[ 		\mathbf{Y} = \mathbf{W}_{out}\mathbf{X} + \mathbf{E} 	\]

The number of weights is potentially very large, and the columns of \mathbf{X} very highly correlated, so it’s customary to use some kind of regularization to ease the fitting process and keep the weights at reasonable values. This is usually done with ridge regression, which adds a penalty term to the least squares equation corresponding to the l^2-norm of the parameter vector. The details of ridge regression are easily worth another post, but the basic properties are:

  • While least squares can fail because of the matrix \mathbf{X^TX} being non-invertible (or nearly so), the matrix (\mathbf{X^TX} + \lambda \mathbf{I}) used in ridge regression is always invertible, which means that ridge regression works well with highly correlated data.
  • The penalty on the norm of the parameter vector shrinks the estimates towards zero, which has the dual effect of introducing bias and reducing the variance of the estimates. The bias/variance tradeoff is the subject of entire papers, but gist of the penalty is that it ends up reigning in some of the extreme values that can pop up when dealing with correlated data.
  • As an interesting side note, ridge regression is intrinsically Bayesian. The ridge estimates can be shown to equal the posterior means in a model with a normal prior on the coefficient vector.

Using this approach, the output weights are given by

    \[ 		\mathbf{W}_{out} = \mathbf{YX^T}(\mathbf{XX^T} + \lambda 	 				\mathbf{I})^{-1} 	\]

where the regularization factor \lambda can either be estimated along with everything else, or chosen a priori.

It’s also possible to do “online” learning by adjusting the output weights in the direction which minimizes the squared error of the output at each timestep, but I don’t really have any need for this. Something to keep in mind though.

Building the network

The training process really only involves passing the input through the network and then doing regression, and so the bulk of the work goes into designing the network itself. More sophisticated methods will actually optimize the design of the network as part of the training process, but we’ll take the easy route and fix the design of the network a priori.

The “free parameters” in the design process are: the size of the reservoir, the design of the weight matrices \mathbf{W}_{in} and \mathbf{W}_{res}, and the selection of the leak parameter \alpha.

The size of the reservoir presents the same issue as the size of the hidden layer in a perceptron: It has to be big enough to accurately model the relationship between input and target, but small enough to avoid overfitting. There are many rules of thumb when it comes to the size of hidden layers, and most of them are nonsense, so I’m assuming that the same is true of reservoirs in ESN’s. The best way to deal with this is just to work with test data and play with the design of the network until it works.

The updating equations for the reservoir approximate the differential equation

    \[ 		\dot{\mathbf{x}} = \mathrm{tanh}(\mathbf{W}_{in}[1;\mathbf{u} 			] + \mathbf{W}_{res}\mathbf{x}) - \mathbf{x} 	\]

and so obviously (if you’re a fan of differential equations), the properties of \mathbf{W}_{res} are going to play a big role in the dynamics of the reservoir. Generally, both \mathbf{W}_{in} and \mathbf{W}_{res} are taken to be sparse (i.e. they mostly consist of zeros), but for \mathbf{W}_{res} we also have to worry about its eigenvectors/values, since the above system can potentially be chaotic, which severely limits the ability of the network to generalize beyond the training data, since very similar data could then give rise to vastly different network behaviour. There’s no way to guarantee that the system is well-behaved, but the likelihood can be subtantially increased if the eigenvalues of the matrix are sufficiently small (say, less than one), so the usual practice is to generate a sparse matrix and then divide it by its largest (in absolute value) eigenvalue.

The leak parameter \alpha is sometimes estimated along with the output weights, but I haven’t figured out how to do this, so the easier way is just to set a value in advance. This will depend on exactly how much temporal structure there is in the data (i.e. how many time steps does the network have to remember in order to do prediction?)

A simple example in R

As a simple example, I built a network that receives as input a sin wave with period 2\pi, and produces a cos wave with a shorter period.
input_target
This is a nice example, since a perceptron would fail here. Notice that that the value 1 occurs twice in the input, but is associated with a different output value each time, so the temporal structure of the data is important.

After setting up the inputs/targets, I built a network with 100 reservoir nodes, a leak rate of .5, and weights with 5% connectivity (this is actually pretty high). Here’s the output after passing the input through the untrained network.
input_untrainedOutput
Finally, after training with ridge regression
input_trainedOutput
Not perfect; the results would surely be improved with more sophisticated methods of choosing the leakage and regularization factors, and more experimentation with the reservoir, but this isn’t bad for arbitrarily chosen values.